Wikidata talk:SPARQL query service/query optimization


hint:Query hint:optimizer "Runtime" . hint:Query hint:maxParallel 50

Can I ask for some guidance on when it is appropriate to deploy hint:Query hint:optimizer "Runtime" . hint:Query hint:maxParallel 50 . and when not? And how many? Blazegraph documentation is delphic. [1] --Tagishsimon (talk) 16:25, 26 March 2019 (UTC)
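For context, these hints sit at the top of the WHERE clause. A minimal sketch, assuming the standard WDQS prefixes; the triple pattern (house cats, Q146) is purely illustrative:

```sparql
# Illustrative only: the hints go at the top of the WHERE clause.
SELECT ?item ?itemLabel WHERE {
  hint:Query hint:optimizer "Runtime" .   # use the runtime (sampling) optimizer
  hint:Query hint:maxParallel 50 .        # raise the parallelism limit
  ?item wdt:P31 wd:Q146 .                 # instance of house cat
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
```

The hints apply to the whole query, so their position within the WHERE clause does not matter, but placing them first makes them easy to spot.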

Some rules of thumb regarding optimization

At the "Blazegraph Alternatives" feedback session on 13 April 2022 (etherpad), the topic of query optimization was raised. Participants noted that query hints can be useful but are often overused; indeed, when the query logs were investigated, it was discovered that the use of query hints sometimes made performance worse!

So, in an effort to provide some clarity, here are some specific suggestions for writing SPARQL queries:

  • Always place the most selective triple patterns as early in the query as possible
    • The goal is to bind as many of the variables (those parts of a triple pattern starting with "?") as soon as you can
    • Binding a variable means associating it with one or more IRIs or literal values
    • A triple pattern that uses the IRI of a specific instance (for example, wd:Q42 for Douglas Adams) is very selective, and having both an IRI and specific predicate is extremely selective
      • For example, wd:Q42 wdt:P937 ?o retrieves the instance that is the work location of Douglas Adams (which is London)
    • Specifying a literal as the object of a query triple is even more selective than an IRI, but remember to use language tags or define the data type of the literal
      • For example, ?item rdfs:label "Orangey"@en returns the instance, Q677525 (a cat)
      • As another example, ?item wdt:P569 "1952-07-01"^^xsd:dateTime returns 143 individuals whose date of birth is 1 July 1952
    • Typically (although not always), subjects are more selective than objects, and predicates are the least selective
    • Of all the predicates, P31 (instance of) and P279 (subclass of) are the least selective
  • When evaluating a set of triple patterns in a query (the triples separated by ";" or "."), the goal is to bind as many of the variables in the pattern as early as possible
    • So, organize the triples to make this as easy as possible
    • Subject-subject joins are typically more selective than subject-object or object-object joins
      • An example of a subject-subject join is ?item wdt:P19 ?loc1 ; wdt:P20 ?loc2 . (asking for an instance's place of birth and death)
      • An example of an object-object join is ?item1 wdt:P19 ?loc . ?item2 wdt:P20 ?loc . (asking for individuals who were born in, and individuals who died in, the same location; note that these need not be the same individual)
    • As noted above, subject-subject joins are the most selective
  • Try first to execute a query without any hints, then add the hints and see if the performance is better or worse  – The preceding unsigned comment was added by AWesterinen (talk • contribs) at 19:42, 17 April 2022‎ (UTC).
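Putting these rules of thumb together, here is a hedged sketch of ordering triple patterns from most to least selective; the items and properties reuse the examples above:

```sparql
# Illustrative ordering of triple patterns, most selective first.
SELECT ?item ?birthPlace WHERE {
  ?item wdt:P569 "1952-07-01"^^xsd:dateTime .  # typed literal object: very selective
  ?item wdt:P19 ?birthPlace .                  # subject-subject join on the bound ?item
  ?item wdt:P31 wd:Q5 .                        # P31 (instance of human): least selective, last
}
```

Each pattern after the first only has to be checked against the individuals already bound by the date-of-birth literal, keeping the running solution set small.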
@AWesterinen: We start to cover this at Wikidata:SPARQL_query_service/query_optimization#Automatic_optimisation_--_background_overview a bit, but you're right, it could be expanded. An important point though is that Blazegraph does most of these join-order optimisations automatically, and will happily override the order it's given (if it's not told not to). While it's not a bad idea for the user to be conscious of trying to keep the size of the solution-set growing as little as possible as the query progresses, most of the time this isn't something they need to worry about, as the optimiser should handle it.
The really valuable thing, perhaps, is to get a sense of when the optimiser may tend to get it wrong.
For example, a classic case has been any query which involves finding out which properties are most used in a particular context (often an intensive query in itself), and then giving the names of them. Such a query will usually include a line like ?prop_wd wikibase:directClaim ?prop_wdt, to map the wdt: form of a property (the form used in statements) to the wd: form, for which the label is defined.
The problem here is (or at least has been until recently) that because there are comparatively few properties, and so comparatively few wikibase:directClaim statements, the optimiser tends to schedule this triple pattern first (as the most selective), when actually most of these mappings will never be needed at all, since only a fraction of the full set of properties will end up in the final report. The effect of doing this look-up early is to blow up the size of the running solution set, as it is made to include a Cartesian product between what's needed and a lot of extra rows, most of which are completely unnecessary.
This is why older queries doing this have tended to include the hint hint:Query hint:optimizer "None" to turn the optimiser off -- but yes, that means the user then really needs a very good awareness of what they are doing.
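To make the shape of such a query concrete, here is a hedged sketch of the pattern described, with the optimiser turned off so the written order is respected. The context (properties used on house cats, Q146) is purely illustrative:

```sparql
# Illustrative: which properties are used on house cats, with labels.
SELECT ?prop_wd ?prop_wdLabel ?uses WHERE {
  hint:Query hint:optimizer "None" .          # respect the written order
  {
    SELECT ?prop_wdt (COUNT(*) AS ?uses) WHERE {
      ?item wdt:P31 wd:Q146 .                 # instance of house cat
      ?item ?prop_wdt ?value .
    } GROUP BY ?prop_wdt
  }
  ?prop_wd wikibase:directClaim ?prop_wdt .   # map wdt: back to wd: for labelling
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?uses)
```

With the optimiser off, the counting subquery runs first, so the directClaim mapping is only applied to the properties that actually appear in the results.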
More recently, queries with this kind of issue -- eg at query of the week -- have instead tended to use named subqueries to control the order of execution (ie Blazegraph's WITH { <subquery> } AS %set). This makes for a much more understandable and obvious way to ensure some intermediate result is materialised first (eg here, the properties most commonly used in some context), before those results are then decorated or used further in some way (eg here, by mapping to the wd: form to add labels).
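A minimal sketch of the named-subquery shape, again using an illustrative properties-in-a-context report (the items and properties are assumptions, not taken from any specific query of the week):

```sparql
# Illustrative: a properties-in-a-context report via a Blazegraph named subquery.
SELECT ?prop_wd ?prop_wdLabel ?uses
WITH {
  SELECT ?prop_wdt (COUNT(*) AS ?uses) WHERE {
    ?item wdt:P31 wd:Q146 .                   # instance of house cat
    ?item ?prop_wdt ?value .
  } GROUP BY ?prop_wdt
} AS %props
WHERE {
  INCLUDE %props .                            # the named set is materialised first
  ?prop_wd wikibase:directClaim ?prop_wdt .   # only now map wdt: to wd: and label
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?uses)
```

The WITH block is evaluated once before the main WHERE clause, and INCLUDE pulls its results in wherever they are needed.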
If one were to assemble the last few years' worth of queries from query-of-the-week, I think you would find the proportion using named subqueries for one purpose or another would perhaps be surprisingly large. I was a bit surprised therefore to see that this didn't seem to be one of the rows being looked for and assessed in the Blazegraph replacement assessments? IMO it is a very valuable feature, that we should be presenting to the candidate teams as a thing we are very definitely looking for. Jheald (talk) 23:28, 18 April 2022 (UTC)
I'm genuinely curious: is there anything named subqueries can do that ordinary subqueries cannot, besides allowing for efficient reuse of results? The only truly useful thing I've found this does is preventing the inefficiencies of the label service, but a quick test suggests the same can be done with a regular subquery. And if ordering of operations is an issue, can't the optimizer be coaxed either by using nested subqueries or by turning the optimizer off altogether? It might look less visually pleasing, though. Infrastruktur (talk) 15:13, 19 April 2022 (UTC)
@Infrastruktur: The "efficient reuse of results" can make or break whether a query completes in 60 seconds, if the same intermediate results set is re-used in more than one place.
Beyond that IMO they make for queries that are far more readable, with a far more intuitively understandable flow, than when arranged in nested form. I'm also not sure if nested subqueries would definitely be run first, in the way that seems to be guaranteed for named subqueries.
Also of course there is the issue that we have a large number of existing queries that use named subqueries; so, at the very least, we would need a pre-processor able to rewrite them into nested form. Jheald (talk) 18:03, 19 April 2022 (UTC)
It is regrettable that geeks like us don't have a forum in which we can discuss advanced topics. Some people might be satisfied by entering simple data, but we are the kind of people who prefer to solve this week's crossword puzzle. Not unlike addicts to crack-cocaine, I imagine. Infrastruktur (talk) 21:11, 19 April 2022 (UTC)
:-) -- but I do find that twitter, the WD group on telegram, or pages like WD:RAQ, are all quite good forums. Jheald (talk) 23:52, 19 April 2022 (UTC)