Wikidata:Contact the development team/Query Service and search/Archive/2020/04

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Lag for 3 servers is growing without limit today?

From this Grafana plot it appears that wdqs1005 was taken out for a while yesterday, and since it came back online it has been keeping up very well with updates. But that is leaving wdqs1004, 1006 and 1007 in the dust, as their lags have been growing steadily, now up to over 3 hours. Given the "median" maxlag approach, and that wdqs1005 and the three wdqs200* servers are all fine, bot edits are unconstrained at the moment, so the lag will likely continue to grow. What can be done about this? ArthurPSmith (talk) 13:51, 23 March 2020 (UTC)
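For context: maxlag-aware clients send a maxlag parameter with every API request and back off while the reported lag exceeds it, so with the median-based calculation nothing throttles until a majority of the servers fall behind. An illustrative request (the title is arbitrary):

    https://www.wikidata.org/w/api.php?action=query&titles=Q42&format=json&maxlag=5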

Somehow it stabilized and then started to improve (perhaps because the edit rate slowed down). But as of about 3 hours ago today, wdqs1005 and wdqs1006 seem to have switched places: now wdqs1006 has very low lag, while wdqs1005 (and still 1004 and 1007) is growing. Some fiddling with load balancers going on, maybe? ArthurPSmith (talk) 18:29, 24 March 2020 (UTC)
I'm keeping an eye on those servers. Repooling wdqs1006 seemed to have helped a bit, but not enough. We are working on a patch to disable the cleanup of values, which might help to improve update performance (at the cost of some additional disk usage). --GLederrey (WMF) (talk) 12:05, 2 April 2020 (UTC)

Federated query to the categories namespace

Hi, Would it be possible to enable federated queries from WDQS's main namespace to the categories namespace endpoint https://query.wikidata.org/bigdata/namespace/categories/sparql? --Dipsacus fullonum (talk) 19:12, 9 April 2020 (UTC)

The categories endpoint is already white-listed from https://query.wikidata.org/; if you are experiencing issues, could you paste the query that is not working as expected? DCausse (WMF) (talk) 14:32, 14 April 2020 (UTC)
Thank you, DCausse (WMF). I never tried any query because the categories endpoint isn't on the list of available endpoints at mw:Wikidata Query Service/User Manual/SPARQL Federation endpoints. I will try it tomorrow. --Dipsacus fullonum (talk) 14:56, 14 April 2020 (UTC)
It was indeed missing from that list; fixed, thanks for bringing this up. DCausse (WMF) (talk) 15:09, 14 April 2020 (UTC)
I have now tried a federated query to the categories endpoint, and the federation works fine. But unfortunately I found out that data is missing in the categories namespace, an issue already tracked at phab:T246568. --Dipsacus fullonum (talk) 21:29, 14 April 2020 (UTC)
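For reference, a minimal federated query of this kind might look like the following (a sketch: the mediawiki: prefix declaration and the category page IRI are illustrative, and the prefix may already be predefined in WDQS):

    PREFIX mediawiki: <https://www.mediawiki.org/ontology#>
    SELECT ?page
    WHERE
    {
      SERVICE <https://query.wikidata.org/bigdata/namespace/categories/sparql>
      {
        # Pages directly contained in the given English Wikipedia category.
        ?page mediawiki:isInCategory <https://en.wikipedia.org/wiki/Category:Physics> .
      }
    }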

It would be nice having a simple way to get truthy statements in WDQS

It is often the case that you need more information about a statement than a simple value. That can be e.g. qualifiers, references, value precision, value unit, or other things. In these cases you have to use the p: prefix instead of wdt:, but you often still want to access only truthy statements. It is possible to add a test for truthiness, like this:

    ?item p:P1532 ?statement.
    ?statement ps:P1532 ?represents.
    ?statement pq:P582 ?end_time.

    # Select only truthy statements
    ?statement wikibase:rank ?rank.
    FILTER (?rank = wikibase:PreferredRank ||
            ?rank = wikibase:NormalRank && NOT EXISTS { ?item p:P1532/wikibase:rank wikibase:PreferredRank. })

but that is cumbersome, inefficient, and easy to forget. I would really like to have a prefix (maybe pt:) which would give full statements like p:, but only select truthy statements like wdt: does, so that the example above could be reduced to:

    ?item pt:P1532 ?statement.
    ?statement ps:P1532 ?represents.
    ?statement pq:P582 ?end_time.

Please consider that. --Dipsacus fullonum (talk) 11:32, 13 April 2020 (UTC)

?statement a wikibase:BestRank can be used for this. --Matěj Suchánek (talk) 11:59, 13 April 2020 (UTC)
Thank you, Matěj Suchánek. I had totally overlooked the existence of that type. It will do nicely. --Dipsacus fullonum (talk) 12:11, 13 April 2020 (UTC)
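For reference, the filtered example above rewritten with this type (a sketch):

    ?item p:P1532 ?statement.
    ?statement a wikibase:BestRank;    # truthy (best-rank) statements only
               ps:P1532 ?represents;
               pq:P582 ?end_time.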

Is there a way to get the OWL file for wikidata?

The only thing that I have found so far is http://wikiba.se/ontology-1.0.owl. Is there a different URL / method to retrieve / navigate the ontology?

-- Helt cs (talk) 15:57, 14 April 2020 (UTC)

This file describes the ontology of the RDF model used by Wikibase (the software running Wikidata); it contains nothing Wikidata-specific. If you are looking for an ontology that is related to the content stored in Wikidata, I'd suggest asking on the Wikidata:Project chat. DCausse (WMF) (talk) 07:19, 15 April 2020 (UTC)
Okay thanks. Will ask there. --Helt cs (talk) 13:15, 15 April 2020 (UTC)

Blank node deprecation in WDQS & Wikibase RDF model

[Also in email reply to posting.]

I believe that https://phabricator.wikimedia.org/T244341#5889997 is inadequate for determining that blank nodes are problematic. First, the fact that determining isomorphism in RDF graphs with blank nodes is non-polynomial is a red herring: if the blank nodes participate in only one triple then isomorphism remains easy. Second, the query given to remove a SomeValue snak is incorrect in general - it will remove all triples with the blank node as object. (Yes, if the blank nodes found are leaves then no extra triples are removed.) A simpler DELETE WHERE will have the seemingly desired result.
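For illustration, one reading of the simpler deletion described above (a sketch: wd:Q42 and P570 are arbitrary, the full DELETE/WHERE form is shown because the DELETE WHERE shorthand cannot carry a FILTER, and WDQS-style prefixes are assumed):

    # Remove only the SomeValue (blank node) snak, matched through a variable.
    DELETE { ?st ps:P570 ?value . }
    WHERE
    {
      wd:Q42 p:P570 ?st .
      ?st ps:P570 ?value .
      FILTER isBlank(?value)
    }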

This is not to say that blank nodes do not cause problems. According to the semantics of both RDF and SPARQL, blank nodes are anonymous, so to repeatedly access the same blank node in a graph one has to access the stored graph using an interface that exposes the retained identity of blank nodes. It looks as if the WDQS is built on a system that has such an interface. As the WDQS already uses user-visible features that are not part of SPARQL, adding (or maybe even only utilizing) a non-standard interface that is only used internally would not be a problem.

One problem when using generated URLs to replace blank nodes is that these generated URLs have to be guaranteed stable and unique (not just stable) for the lifetime of the query service. Another problem is that yet another non-standard function is being introduced, pulling the RDF dump of Wikidata yet further from RDF.

So this is a significant change as far as users are concerned that also has potential implementation issues. Why not just use an internal interface that exposes a retained identity for blank nodes?

Peter F. Patel-Schneider (talk) 18:17, 16 April 2020 (UTC)

Thanks for the feedback,
perhaps I should have started by explaining the problem statement and its context. We are experiencing severe performance issues in the process that keeps Wikidata and the triple store behind WDQS synced. These performance issues cause edits on Wikidata to be throttled. While reviewing the way we do updates on the store, we decided to move most of the synchronization/reconciliation process out of the triple store, with the objective of sending only the minimal amount of information needed to mutate the graph with a set of trivial operations (ADD/REMOVE triples). This is where blank nodes are problematic. I agree with you that by making some assumptions about the current Wikibase RDF model a complex isomorphism check is not needed. But even though we might be able to write a relatively simple diff algorithm, we still need to apply this diff to the triple store, and that is where the issue arises. There is no way to mutate a graph involving blank nodes using a set of trivial INSERT DATA/DELETE DATA operations.
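To illustrate the kind of trivial operations meant here (a sketch with made-up triples; WDQS-style prefixes assumed):

    # A diff-based update ships only the changed triples:
    DELETE DATA { wd:Q42 wdt:P1082 "100000"^^xsd:decimal . } ;
    INSERT DATA { wd:Q42 wdt:P1082 "100001"^^xsd:decimal . }
    # DELETE DATA forbids blank nodes entirely, which is why SomeValue
    # snaks (currently blank nodes) cannot be removed this way.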
The delete queries you mention were a way to illustrate that SPARQL DELETE statements have to use workarounds to clean up blank nodes, since blank nodes are explicitly forbidden and impossible to use in such statements due to their nature (note that SomeValue snaks are only used as objects).
About generated IRIs, the plan is to first label blank nodes with stable ids (using the same technique, and with the same guarantee of uniqueness, as used for values and references). And then, in another step (the fourth), eventually change the RDF output to directly emit IRIs for SomeValue snaks. The intent is indeed to make them stable and unique (on a best-effort basis).

pulling the RDF dump of Wikidata yet further from RDF

Could you elaborate on this? I think the issues around the switch to IRIs have been well summarized by Markus Krötzsch in his comment on this same ticket, quote:
  1. confusing a placeholder "unspecified" IRI with a real IRI that is expected in normal cases (imagine using a FILTER on URL-type property values),
  2. believing that the data changed when only the placeholder IRI has changed (imagine someone deleting and re-adding a qualifier with "unspecified" -- if it's a bnode, the outcome is the same in terms of RDF semantics, but if you use placeholder IRIs, you need to know their special meaning to compare the two RDF data sets correctly)
  3. accidental or deliberate uses of placeholder IRIs in other places (imagine somebody puts your placeholders as value into a URL-type property)
I think only point 1 applies; for 2, the same IRI will be generated in such a scenario, and 3 seems unlikely. On OWL, Markus adds:

But it does put the data outside of OWL, which does not allow properties to be for literals and IRIs at the same time.

For this I believe that reverting the IRIs to blank nodes would remain an easy step to add to any import process that requires strict OWL semantics.

So this is a significant change as far as users are concerned that also has potential implementation issues.

Could you elaborate more on this, especially if one of your use cases would be affected? This might help to determine how significant it is compared to the current problem statement.

Why not just use an internal interface that exposes a retained identity for blank nodes?

It is not clear to me what you are referring to, but perhaps it relates to the told blank nodes feature provided by Blazegraph. I want to mention that I tried this approach without much success (maybe related to the difficulties encountered in the last two comments of BLZG-1915 and BLZG-2044). Given that SYSTAP stopped all active maintenance on Blazegraph, it did not seem wise to me to invest in a feature that does not seem fully finished/tested/implemented. Maybe others who have had more success could share their stories and code? DCausse (WMF) (talk) 14:07, 17 April 2020 (UTC)
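As far as I understand, told blank nodes in Blazegraph are toggled through a namespace property along the lines of the following (the exact property name should be checked against the Blazegraph documentation):

    com.bigdata.rdf.store.AbstractTripleStore.storeBlankNodes=true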
Yes, I think that an accurate description of the problem would have been better. Peter F. Patel-Schneider (talk) 15:28, 17 April 2020 (UTC)
Yes, using placeholder IRIs for what should be blank nodes is one change that would pull the dump yet further from RDF, as identified by Markus in his first point. Consumers of the RDF dump (and RDF derived from the dumps) have to have special understanding of these IRIs, no matter where they occur. After all, a later process might take one of these IRIs and use it somewhere else, as alluded to by Markus in his third point. I believe that Markus's second point is relevant, as well. If a blank node label changes between versions of an RDF graph then there is no implication that something real has changed. But if an IRI has changed, then there is a stronger implication that this is a real change. So deleting a PropertySomeValueSnak and then inserting the same Snak will look different from deleting a regular Snak and then inserting it. That is, unless the IRI is the same between the old PropertySomeValueSnak and the new one. If consumers need or want to transform these IRIs back to blank nodes, they will need to do this for all the RDF they ingest, not just the Wikidata RDF dump.
My use case is to build a knowledge base from Wikidata and other sources for use within my company. There are a lot of modifications required to build a knowledge base from Wikidata. Most of them are related to the poor overall organization of information in Wikidata. Having to look out for and reverse-translate these IRIs is yet another modification, albeit not a major one. Peter F. Patel-Schneider (talk) 15:52, 17 April 2020 (UTC)


So as far as I can tell here is the situation:

  1. The WDQS update process is (sometimes?) experiencing performance problems, which have the effect that changes to Wikidata are throttled.
  2. The presence of blank nodes in the Wikidata RDF dump, arising from PropertySomeValueSnaks (and elsewhere?), is a problem for improving this performance.
  3. The WDQS uses Blazegraph, which is no longer maintained. Changes to Blazegraph are not possible (not even trivial ones?).

The proposed solution is to:

  1. Change the Wikidata RDF dump to eliminate the blank nodes arising from PropertySomeValueSnaks.
  2. Change the WDQS update process to a faster one that is not able to handle the blank nodes that currently arise from PropertySomeValueSnaks.
  3. Change Blazegraph to support a new way of identifying the new mapping of PropertySomeValueSnaks in the Wikidata RDF dump. (What happened to the other SPARQL changes that were noted in the phabricator discussion?)

Note that the output of the WDQS will change because of the change to the Wikidata RDF dumps.

I understand the problems of having to deal with unmaintained software. However, it looks to me as if the solution requires changes to Blazegraph anyway. I also do not understand why it is difficult to preserve blank node IDs in Blazegraph. (Of course there may be reasons for this to be difficult that I do not know of.) It would not even be necessary to implement these changes to Blazegraph in general - only the changes necessary to make the WDQS update process work are needed. (Yes this is far from ideal, but the overall situation here is far from ideal.)

I also do not understand why these changes to the WDQS require breaking changes to the Wikidata RDF dump. It seems to me to be easy to keep the Wikidata RDF dump unchanged (except for the labels of blank nodes in it). Just turn the blank nodes, which will have unique and stable labels, into IRIs with the IRI based on the blank node label during the WDQS updating process. This would eliminate the more far-reaching breaking change and keep the only breaking change to the WDQS. (I see that you have actually proposed this.)

And as far as that goes, it appears to me that it is possible to use an update process that does not internally tolerate blank nodes without any breaking changes to the WDQS interface and without any significant changes to Blazegraph. First make the above changes. Then translate isBlank() (and some other SPARQL functions) in SPARQL queries to a new function that checks for either a blank node or one of the special IRIs resulting from PropertySomeValueSnaks. (This doesn't seem to be significantly different from the new function in the proposed changes.) Finally transform the special IRIs in SPARQL results back to blank nodes. This can be done in the WDQS interface itself. (Federated queries might also require some work.)
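To make the suggested translation concrete (a sketch; the .well-known/genid prefix is hypothetical here and only for illustration):

    # The user writes:
    FILTER isBlank(?value)
    # The service would transparently rewrite this to something like:
    FILTER (isBlank(?value) ||
            STRSTARTS(STR(?value), "http://www.wikidata.org/.well-known/genid/"))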

Of course making changes to Blazegraph would probably result in performance benefits, but it seems to me that a non-breaking change is to be preferred over a breaking change and if a breaking change is needed it is better to break as little as possible.

Peter F. Patel-Schneider (talk) 16:25, 17 April 2020 (UTC)

You summarized the problem pretty well. If I try to sort the problems raised so far by severity/annoyance, together with their possible responses, it would look like this as I understand it:
  • Queries using isBlank() will be broken
    • The suggested plan: mitigate the issue by introducing a new function wikimedia:isSomeValue() for explicitly filtering SomeValue snaks (a usage sketch follows after this list)
    • You propose: to adapt the SPARQL engine to consider these SomeValue IRIs as blank nodes and thus adapt isBlank() to return true when encountering such nodes.
  • Conflating classic IRIs with SomeValue IRIs (use of isURI/isIRI)
    • The suggested plan: nothing; queries using isIRI/isURI risk conflating SomeValue IRIs with ordinary ones and would thus have to be verified.
    • You propose: adaptation of the SPARQL engine so that isIRI/isURI returns false when encountering SomeValue IRIs.
  • Consumers of WDQS results will have to understand the meaning of these SomeValue IRIs:
    • The suggested plan: nothing, consumers explicitly relying on the presence of blank nodes in the SPARQL results will have to be adapted.
    • You propose: if I understood correctly, you suggest translating the results emitted by the triple store and reverting these IRIs back to blank nodes.
  • This takes the wikibase dump further away from RDF
    • The suggested plan: (while not strictly required to address the problem statement) it was added to limit the divergences between the dumps and WDQS results. If we were using well-known IRIs (RDF 1.1, section 3.5) as proposed in the next comment below, would this allow us to remain RDF-compliant and help this proposal?
    • You suggest: not applying step 4. I'm not against stopping before step 4 and leaving the RDF dumps with stable and unique blank node labels at the cost of another divergence between the dumps and WDQS.
  • This takes the DUMP and WDQS output out of OWL
    • On this point I'm not knowledgeable enough to judge how important this is, how much OWL is expected to support RDF 1.1, and whether using well-known IRIs could help here.
  • believing that the data changed when only the placeholder IRI has changed,
    • The way we plan to generate the IRIs will prevent this: the generated ID is like a coordinate and thus cannot be different if removed and re-added at the same place.
  • accidental or deliberate uses of placeholder IRIs in other places (imagine somebody puts your placeholders as value into a URL-type property)
    • If we decide to use well-known IRIs as suggested in "Replacing Blank Nodes with IRIs", then Wikibase will indeed have to prevent such IRIs from being added to the system manually.
I hope I did not forget anything.
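A usage sketch for the proposed function, under the name given in the plan above (hypothetical until actually deployed; the property choice is arbitrary):

    SELECT ?human ?dateOfDeath
    WHERE
    {
      ?human wdt:P31 wd:Q5 ;
             wdt:P570 ?dateOfDeath .
      # Keep only "unknown value" dates of death.
      FILTER wikimedia:isSomeValue(?dateOfDeath)
    }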
Your message seems to also ask the following question: given that Blazegraph has to be adapted anyway, why not adapt it in such a way that this modification remains transparent to users?
I think the changes are different in nature: changing the SPARQL engine as suggested requires some research, even if at a glance changing the behavior of isBlank/isIRI and transforming the results sounds sufficient. But how do we determine that there are no other intricacies that would in the end result in a breaking change anyway? Such a breaking change would remain invisible, which I believe is a worse outcome than an explicit breaking change.
If I try to sum up the possible outcomes of this discussion, they would be:
  • We abandon this plan
    • It will undermine our ability to address the performance issues as stated in the problem statement
  • We move forward with this plan with possibly some amendments
  • We investigate other strategies that do not involve a breaking change
    • It somewhat translates into abandoning the current plan and thus increases the risk of undermining our ability to address the performance issues stated in the problem statement
    • Investigate other strategies (even if I'm doubtful that we could find a workable solution in a reasonable amount of time):
      • Use IRIs internally but hide them from WDQS clients by adapting Blazegraph, making sure that no subtle breaking changes are introduced
      • Fix/adapt Blazegraph so that its told blank nodes feature is better integrated with the rest of the stack
DCausse (WMF) (talk) 12:41, 20 April 2020 (UTC)

It's worth noting in this context that the "RDF 1.1 Concepts and Abstract Syntax" specification explicitly discusses the possibility of "Replacing Blank Nodes with IRIs" (Section 3.5); however, it is stated that such "skolemization" of blank nodes should be done by using a ".well-known" (per RFC8615) IRI under the registered name "genid", i.e. an IRI of the form

http://example.com/.well-known/genid/<unique_identifier>

or

https://example.com/.well-known/genid/<unique_identifier>

Otherwise, the introduced "Skolem" IRI may not be recognizable as having been introduced solely to replace RDF blank nodes.
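Applied to Wikidata, such a Skolem IRI would presumably take a form like (the identifier shown is hypothetical):

    http://www.wikidata.org/.well-known/genid/a8a1f1b201cc41b86b4f4cbbb6f31bf5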

Thank you, this is something we should consider indeed. I added it as a possible amendment to the discussion above. DCausse (WMF) (talk) 12:41, 20 April 2020 (UTC)

Lag is going up (April 18)

@DCausse (WMF), Lucas Werkmeister (WMDE), GLederrey (WMF):

Maybe you already got an automated email about the rising lag ... if not, please have a look. --- Jura 11:53, 18 April 2020 (UTC)

Yes, @Adam Shorland (WMDE): has taken action to work around the lag by switching 100% of the traffic to the codfw datacenter, to help the machines in the eqiad DC catch up. DCausse (WMF) (talk) 15:06, 18 April 2020 (UTC)
Thanks to both of you. I noticed it went down again. --- Jura 15:15, 18 April 2020 (UTC)

SPARQL GUI for test.wikidata.org

As we know, query.wikidata.org is the SPARQL GUI for www.wikidata.org. Is there any SPARQL GUI for test.wikidata.org? Gbergamin (talk) 08:37, 21 April 2020 (UTC)

Hello @Gbergamin:, no, there's no SPARQL GUI for test.wikidata.org. Is there something specific you'd like to do? Lea Lacroix (WMDE) (talk) 11:56, 21 April 2020 (UTC)

Unit = P199 (April 28)

SELECT (COUNT(*) as ?count) { ?s wikibase:quantityUnit wd:P199 }

Try it!

While looking for something else, I just noticed that the Query Service still has both

  • wikibase:quantityUnit wd:Q199 and
  • wikibase:quantityUnit wd:P199

for the quantity values of some statements created today. 1 (Q199) is used when a quantity has no unit.

What was the phab ticket for fixing this? Is it still open? --- Jura 11:10, 28 April 2020 (UTC)

@Jura1: Yes, it is Phab:T230588 and it is open and has high priority. --Dipsacus fullonum (talk) 11:15, 28 April 2020 (UTC)
The ticket mentions a re-import that should solve it, but the statement with P199 is one created today.
So there is probably another bug in Wikibase.
Hopefully, it's limited to quantityUnit triples, and only to those without a "real" unit. --- Jura 11:21, 28 April 2020 (UTC)
@Jura1: Most but not all cases with properties as units are with P199. Here is a count:
SELECT ?unit (COUNT(?unit) AS ?count)
WHERE
{ 
  ?value wikibase:quantityUnit ?unit .
  ?unit a wikibase:Property.
}
GROUP BY ?unit
Try it!
--Dipsacus fullonum (talk) 11:32, 28 April 2020 (UTC)
Correction: the query above only finds those cases where the wrong unit coincides with a real property. In most cases it will not. I will try to make a query for all cases, if possible within the timeout. --Dipsacus fullonum (talk) 11:47, 28 April 2020 (UTC)
SELECT ?unit (COUNT(*) as ?count)
WHERE { ?s wikibase:quantityUnit ?unit }
GROUP BY ?unit
Try it!

finds all units, including 28 statements/1688 triples.

SELECT ?calendarmodel (COUNT(*) as ?count)
WHERE { ?s wikibase:timeCalendarModel ?calendarmodel }
GROUP BY ?calendarmodel
Try it!

gives 13879 + 305. --- Jura 12:06, 28 April 2020 (UTC)

Here is a version which tells what the real/intended unit is. It also finds 5 items (units) with no wikibase:sitelinks predicate (which all items should have), and 17 uses of the IRI wd:undefined, which is unknown to me:
SELECT ?unit ?count ?real_unit ?real_unitLabel
WITH
{
  SELECT ?unit (COUNT(?unit) AS ?count)
  WHERE
  { 
    ?value wikibase:quantityUnit ?unit .
  }
  GROUP BY (?unit)
} AS %get_all_units
WHERE
{
  INCLUDE %get_all_units
  OPTIONAL { ?unit wikibase:sitelinks ?sitelinks . }
  FILTER (! BOUND(?sitelinks))
  BIND (IRI(REPLACE(STR(?unit), "P", "Q")) AS ?real_unit)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}
Try it!
--Dipsacus fullonum (talk) 12:20, 28 April 2020 (UTC)
PS. The items without wikibase:sitelinks predicate are redirects. --Dipsacus fullonum (talk) 12:23, 28 April 2020 (UTC)
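For reference, redirected items carry an owl:sameAs triple in the RDF data, so the redirected "units" found above can be listed with a query along these lines (a sketch):

    SELECT DISTINCT ?unit ?target
    WHERE
    {
      ?value wikibase:quantityUnit ?unit .
      ?unit owl:sameAs ?target .    # present only for redirected entities
    }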
Thanks for looking into this; just a quick update here: the count query on ?s wikibase:quantityUnit wd:P199 returns 0 results on the test host where we launched the reload. We'll update the other production machines ASAP, sorry for the inconvenience. DCausse (WMF) (talk) 18:44, 28 April 2020 (UTC)
SELECT * 
WHERE
{
    wd:Q163320 p:P1106 / psv:P1106 ?a .
    ?a wikibase:quantityUnit ?unit .
}
Try it!

@DCausse (WMF): the reload is just between Wikidata and the Query Service? The above statements were created today and include a P199. --- Jura 18:52, 28 April 2020 (UTC)

@Jura1: the line wdv:b12dd47b72560e97123004e32b3726e6 wd:P199 seen on the query service is a bit misleading: values (like wdv:b12dd47b72560e97123004e32b3726e6), being shared between multiple statements, are not necessarily new. This line is still an artifact of the initial bug. I pasted the output of the same query run on the server that has been reloaded here: phab:P11067. The culprit `P199` is no longer there; hopefully all these incoherent values will go away after the reload. DCausse (WMF) (talk) 19:23, 28 April 2020 (UTC)
Oh, interesting discovery. At least now I understand why you insist on reloading. Thanks for explaining it. --- Jura 19:28, 28 April 2020 (UTC)