Wikidata:SPARQL query service/оптимізація запиту

From Wikidata

< Wikidata:SPARQL query service

Jump to navigation Jump to search

This page is a translated version of the page Wikidata:SPARQL query service/query optimization and the translation is 28% complete.

Оптимізація запитів намагається знайти найефективніший спосіб виконання заданого запиту серед різних можливих планів виконання запиту. Blazegraph має вбудований оптимізатор запитів, який часто працює добре. Однак іноді оптимізація виконується не так успішно; у таких випадках запити, можливо, доведеться оптимізувати вручну.

Стратегії оптимізації

a satanic altar, symbolizing black magic — Оптимізація запитів (символічне зображення)

Фіксовані значення та діапазони

Searching for fixed values, e. g. wdt:P31 wd:Q5, is the cheapest option in the query service, but searching for ranges of values is also usually pretty efficient, and often preferable to other options. For example, to look for people born in 1978, FILTER(YEAR(?dateOfBirth) = 1978) is not nearly as efficient as the following:

SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P569 ?dateOfBirth.
  FILTER("1978-00-00"^^xsd:dateTime <= ?dateOfBirth &&
         ?dateOfBirth < "1979-00-00"^^xsd:dateTime)
}

You can further optimize this by informing the query service that wdt:P569 is a range-safe predicate:

SELECT ?item ?dateOfBirth WHERE {
  ?item wdt:P31 wd:Q5;
        wdt:P569 ?dateOfBirth. hint:Prior hint:rangeSafe true.
  FILTER("1978-00-00"^^xsd:dateTime <= ?dateOfBirth &&
         ?dateOfBirth < "1979-00-00"^^xsd:dateTime)
}

This tells the optimizer that ?dateOfBirth doesn’t mix dates, strings, integers or other data types, which simplifies the range comparison. This is almost always true on Wikidata, so the main reason not to use this optimizer hint all the time is that it’s a bit inconvenient. (It’s probably also not a good idea to use it when you’re working with unknown values.)

Шляхи властивості

The optimizer generally assumes property paths to be more expensive than simple value matches, and usually that’s correct: try to add some simple triples to reduce the amount of values that have to be checked against the path.

However, sometimes paths can be fairly efficient, and it’s good to inform the optimizer of that fact. For example, consider the following query for all municipalities in the German state of Lower Saxony:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31/wdt:P279* wd:Q262166;
        wdt:P131+ wd:Q1197.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

As written, this will time out because the query service will attempt to start with all instance of (P31)municipality in Germany (Q262166) (including subclasses) before limiting them to located in the administrative territorial entity (P131)Lower Saxony (Q1197) (recursively). It’s much more efficient to do it the other way around:

SELECT ?item ?itemLabel WHERE {
  hint:Query hint:optimizer "None".
  ?item wdt:P131+ wd:Q1197;
        wdt:P31/wdt:P279* wd:Q262166.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Tracked in Phabricator
Task T168973

You can also tell the optimizer in which direction to traverse a path. For example, say you want to test whether Moby-Dick (Q174596) is a creative work (Q17537576):

ASK {
  wd:Q174596 wdt:P31/wdt:P279* wd:Q17537576.
}

This takes several seconds because the optimizer decides to start at creative work (Q17537576) and walk the subclass of (P279) links backwards. You can tell it to go the other way around, starting at Moby-Dick (Q174596), following one instance of (P31) link and then walking the subclass of (P279) link forwards:

ASK {
  wd:Q174596 wdt:P31/wdt:P279* wd:Q17537576.
  hint:Prior hint:gearing "forward".
}

This is much more efficient. To go the other way, use "reverse" instead of "forward".

Зворотні шляхи властивості

Інша техніка для спонукання оптимізатора до надання результатів полягає в інвертуванні шляху властивості, що використовується в запиті; ймовірно, це особливо корисно при роботі з тайм-аутами звітів, що мають конструкції wdt:P31/wdt:P279*. Розглянемо такі приклади: перший перевищує ліміт часу, а другий працює швидко:

# Museums in Northern Ireland - TIMES OUT
SELECT distinct ?museum ?museumLabel WHERE {
  ?museum wdt:P131* wd:Q26 .
  ?museum wdt:P31/wdt:P279* wd:Q33506 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

# Museums in Northern Ireland - SUCCEEDS
SELECT distinct ?museum ?museumLabel WHERE {
  ?museum wdt:P131* wd:Q26 .
  wd:Q33506 ^wdt:P279*/^wdt:P31 ?museum .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Додаткову інформацію про шлях властивості можна знайти в стандарті SPARQL.

Налаштування порядку

Ми вже кілька разів вимикали оптимізатор, коли він намагався неправильно впорядкувати запит. Ви також можете застосувати детальніший контроль до оптимізатора, повідомляючи йому, яке об’єднання має виконуватися першим чи останнім, використовуючи підказки hint:Prior hint:runFirst true або hint:Prior hint:runLast true.

Важливо пам’ятати, що вони контролюють перше або останнє приєднання (join), що відповідає більш-менш . між двома трійками – не трійці безпосередньо. Зокрема, ви не можете розміщувати цю підказку після першої трійки в групі або (я думаю) після трійки, що містить шлях властивості. Дивіться також розділи про іменовані підзапити нижче, щоб дізнатися про додаткові параметри керування порядком виконання запитів.

Затримка виклику служби міток

The label service (SERVICE wikibase:label { … }) can make queries much less efficient – when working with an expensive query, it’s often a good idea to disable the service at first, try to optimize the query as much as possible, and only then turn it back on.

With some luck, the query will be efficient enough that the query service doesn’t cause it to time out now.

If the query works without the label service but results in a timeout when it is added, it can help to extract the bulk of the query into a subquery, and only apply the label service at the very end. You can do this with a regular SPARQL subquery:

SELECT ?foo ?fooLabel ?bar ?barLabel … WHERE {
  { # Wrapper for label-less subquery
    SELECT ?foo ?bar … WHERE {
      …
    }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

This can be particularly important if you are using LIMIT to get back a few quick results from a query that would otherwise be very expensive.

The label service tries to run last, which means Blazegraph tries to materialise all the results of a query before moving on to adding the labels, only then applying the LIMIT directive.

This can be avoided if the expensive part of the query is put into a subquery, as above, and the LIMIT is placed on the subquery, so the labels are looked up only for those few results.

Another way to reduce usage of the label service is using rdfs:label. To get all proteins and their labels:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" . }
}

will timeout because 1 million hits the limit for the label service. Instead get all proteins with labels:

SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q8054 .
  ?item rdfs:label ?itemLabel. FILTER( LANG(?itemLabel)="en" )
}

Note that this will output only those items that have an English label, though.

Іменовані підзапити

Як зазначено нижче, оптимізатор іноді намагається оптимізувати весь запит за допомогою підзапитів, тому розміщення частини запиту в підзапиті не завжди допомагає. Ви можете виправити це, використовуючи розширення BlazeGraph, яке називається іменованими підзапитами (named subqueries):

SELECT ?item ?itemLabel… 

WITH {
  SELECT ?item … WHERE {
    … 
  }
} AS %results 

WHERE {
  INCLUDE %results.
    … 
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Іменовані підзапити гарантовано виконуватимуться лише один раз, відокремлено, і саме цього ми тут і хочемо.

Подальше використання іменованих підзапитів

The above example only uses one named subquery, but you can use any number of them. This can be useful to impose a general order or structure on the query, while still letting the optimizer rearrange the instructions within each subquery as it knows best.

You can also disable the optimizer only in one named subquery, while still letting it optimize the rest of the query: for this, specify hint:SubQuery hint:optimizer "None" in that subquery instead of the usual hint:Query hint:optimizer "None".

See User:TweetsFactsAndQueries/Queries/name phrases for an example using this technique.

Named subqueries are also useful when you want to group results multiple times, or want to use the same subquery results more than once in different contexts – you can INCLUDE the same named subquery into multiple other ones, if necessary (and it will be run only once). See User:TweetsFactsAndQueries/Queries/most common election days for an example of this.

Пошук міток

If you want to find all items that have a certain word, say "Frankfurt" in a label, you could try a query like this:

SELECT ?item ?label
WHERE
{
  ?item rdfs:label ?label.
  FILTER CONTAINS(?label, "Frankfurt")
}

But that query will timeout because the query engine would have to the read hundreds of millions of labels to find the word. But Wikidata's normal search function have all words in labels indexed and can make the search very fast. It is possible for a query to access Wikidata's search function with the MediaWiki API Query Service. The query below also finds items with the word "Frankfurt" in a label but uses the MediaWiki API, and can therefore run without timeout:

SELECT ?item ?label
WHERE
{
  SERVICE wikibase:mwapi
  {
    bd:serviceParam wikibase:endpoint "www.wikidata.org";
                    wikibase:api "Generator";
                    mwapi:generator "search";
                    mwapi:gsrsearch "inlabel:Frankfurt";
                    mwapi:gsrlimit "max".
    ?item wikibase:apiOutputItem mwapi:title.
  }
  ?item rdfs:label ?label.
  FILTER CONTAINS(?label, "Frankfurt")
}

The MWAPI search can also include search for labels in selected languages only, and search statements and qualifiers and more. Please see mw:Help:Extension:WikibaseCirrusSearch for a description of the search options.

За можливості використовуйте COUNT(*), а також швидкі підрахунки діапазону

Якщо ви хочете дізнатися кількість осіб у Вікіданих, ви можете зробити такий запит:

SELECT (COUNT(?item) AS ?count)
{
  ?item wdt:P31 wd:Q5 .
}

Він виглядає просто, але для виконання потрібно багато часу. Механізм SPARQL має перевіряти, чи зв’язана ?item і чи є значення без помилки понад 8 мільйонів разів, тому виконання запиту зазвичай займає 20–30 секунд. Але ми знаємо, що ?item завжди зв’язана і що значення завжди є елементами, тому краще просто підрахувати кількість результатів без умов за допомогою такого запиту:

SELECT (COUNT(*) AS ?count)
{
  ?item wdt:P31 wd:Q5 .
}

Тут рушію не треба перевіряти кожен результат, щоб зробити підрахунок, а для виконання запиту потрібно менше секунди.

Хоча підрахунок кількості результатів швидший, ніж підрахунок кількості певних зв’язків загалом, причина, чому цей приклад є особливо швидким, полягає в тому, що оптимізатор перепише підрахунок через один триплетний шаблон для використання швидкого підрахунку діапазону. Ця оптимізація не спрацює, якщо ви спробуєте підрахувати набір, який складається з більш ніж одного триплетного шаблону.

Сканування унікальних термів, оптимізація групування та підрахунку

If you try to list unique predicates like so, the operation will time out.

SELECT ?p
WHERE { [] ?p [] . }
GROUP BY ?p

However, if you alter this query slightly, it will be rewritten by the optimizer to use a distinct term scan, which is quite fast. This only works on a single triple pattern.

SELECT DISTINCT ?p
WHERE { [] ?p [] . }

There is also a simple group by and count optimization that will combine both the distinct term scan and the fast range count optimizations, the latter for every predicate. This is a lightningly fast way to get counts for all the different predicates. If you need only counts for a subset of the predicates it will most likely be fastest to start with this, wrap it unchanged in a named subquery, then reduce the set down to what you want.

SELECT ?p (COUNT(?p) AS ?count)
WHERE { [] ?p [] . }
GROUP BY ?p

This optimization should work for any component of the triple.

SELECT ?o (COUNT(*) as ?count)
WHERE { ?s wdt:P31 ?o }
GROUP BY ?o

Count sizes of huge subsets intersections with WikibaseCirrusSearch

Blazegraph often fails when counting sizes of intersections of huge subsets, for example, it is impossible to count number of humans without gender property set with only SPARQL-based optimizations. However with MWAPI and WikibaseCirrusSearch it works extremely fast.

SELECT ?label ?count WHERE {
  {
    BIND("People without gender" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 -haswbstatement:P21" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  } UNION {
    BIND("Male" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 haswbstatement:P21=Q6581097" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  } UNION {
    BIND("Female" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 haswbstatement:P21=Q6581072" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  } UNION {
    BIND("Other" AS ?label)
    SERVICE wikibase:mwapi {
      bd:serviceParam wikibase:endpoint "www.wikidata.org";
                      wikibase:api "Search"; wikibase:limit "once" ;
                      mwapi:srsearch "haswbstatement:P31=Q5 haswbstatement:P21 -haswbstatement:P21=Q6581097 -haswbstatement:P21=Q6581072" ;
                      mwapi:srlimit "1" ; mwapi:srprop "" ; mwapi:srsort "none" ; mwapi:srnamespace "0" .
      ?count wikibase:apiOutput '//searchinfo[1]/@totalhits'.
    }
  }
}

Служби

GAS Service

The Gather, Apply, and Scatter (GAS) service provides graph traversal, graph mining, and similar classes of algorithms for SPARQL. In practical terms, it enables a series of relations to be followed through the graph; for instance the chain of father, grandfather, great-grandfather &c - tracing the father (P22) line - of a subject item, or a chain of connected railway stations through adjacent station (P197) statements. It is documented here & here.

In essence, GAS takes two parameters: gas:in wd:Qnnn - the node or item from which to start, and gas:linkType wdt:Pnnn - the property to be followed. It outputs gas:out - the next item; gas:out1 - the number of hops from the input item and gas:out2 - the item immediately preceding the gas:out item in the queried chain.

# Male lineage of Elizabeth II (Q9682)
SELECT ?father ?fatherLabel ?child ?childLabel ?depth WHERE
 {
  SERVICE gas:service {
       gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ; gas:in wd:Q9682 ; gas:linkType wdt:P22 ; gas:out ?father ; gas:out1 ?depth ; gas:out2 ?child . }

SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} order by ?depth

# UK railway stations connected from Topsham railway station (Q1990205) 
#defaultView:Map{"hide":["?cds","?line","?layer"],"layer":"?trackLabel"}
SELECT ?station ?stationLabel ?cds ?line ?trackLabel ?track ?image {
{SELECT * {
  SERVICE gas:service {
       gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ; gas:in wd:Q1990205 ; gas:linkType wdt:P197 ; gas:out ?station ; gas:out1 ?depth ; gas:out2 ?pred . }
  FILTER(?station != wd:Q1990205)
                      ?station wdt:P625 ?cds ;
                               wdt:P17 wd:Q145 .
                             
  ?station p:P197 ?adjST .
  ?adjST ps:P197 ?adj ;
                pq:P81 ?track .

  ?adj p:P625/psv:P625/wikibase:geoLatitude ?lat1 ; p:P625/psv:P625/wikibase:geoLongitude ?lon1 .
  ?station p:P625/psv:P625/wikibase:geoLatitude ?lat2 ; p:P625/psv:P625/wikibase:geoLongitude ?lon2 .
  BIND(CONCAT('LINESTRING(', STR(?lon1), ' ', STR(?lat1), ',', STR(?lon2), ' ', STR(?lat2), ')') AS ?str) . BIND(STRDT(?str, geo:wktLiteral) AS ?line)}
 
}
  OPTIONAL { ?station wdt:P18 ?image .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}

It is possible to look at inverse paths to a subject item, for instance, people in a matrilineal line descended from Elizabeth II (Q9682), by adding the parameter gas:traversalDirection "Reverse".

# Female lines descended from Elizabeth II (Q9682)
SELECT ?child ?childLabel ?mother ?motherLabel ?depth WHERE
 {
  SERVICE gas:service {
       gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ; gas:in wd:Q9682 ; gas:linkType wdt:P25 ; gas:traversalDirection "Reverse"; gas:out ?child ; gas:out1 ?depth ; gas:out2 ?mother . 
  }

SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} order by ?depth

It is also possible to follow both directions, from and to the item, using gas:traversalDirection "Undirected", although as in the example below, this results in confusion in the parent & child columns.

# Female line leading to, and descended from, Elizabeth II (Q9682)
SELECT ?child ?childLabel ?mother ?motherLabel ?depth WHERE
 {
  SERVICE gas:service {
       gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ; gas:in wd:Q9682 ; gas:linkType wdt:P25 ; gas:traversalDirection "Undirected"; gas:out ?child ; gas:out1 ?depth ; gas:out2 ?mother . 
  }

SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
} order by ?depth

The documentation discusses some additional parameters and GAS algorithms.

Автоматична оптимізація -- попередній огляд

The basic approach to query optimisation is to try to get the solution set to be as small as possible as soon as possible in the query execution, to make the work needed for subsequent joins as small as it can be.

A subsidiary goal can be to maximise the possibility of pipelining and parallelisation. This is what Blazegraph's built-in query optimiser tries to achieve.

For example, here is a query to return a list of U.S. Presidents and their spouses:

SELECT ?pres ?presLabel ?spouse ?spouseLabel WHERE {
   ?pres wdt:P31 wd:Q5 .
   ?pres wdt:P39 wd:Q11696 .
   ?pres wdt:P26 ?spouse .
   SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
   }
 }

The details of how Blazegraph has executed a particular query (and why) can be obtained via the query engine's EXPLAIN mode. This can be done by replacing the following at the start of the query-editor URL,

https://query.wikidata.org/#

with

https://query.wikidata.org/bigdata/namespace/wdq/sparql?explain&query=

to send the query to the SPARQL endpoint directly (not via the query editor), with an extra explain& in the URL before the query= string.

The resulting URL, sending the query above to the endpoint directly, with the 'explain' option, produces this report on the query and its optimisation.

By default, a query would execute from top to bottom, with solution-set joins made in the order given. This is shown as the "Original AST" plan in the report. However, the query engine estimates (as of September 2015) that there are

2,900,146 humans in the database with instance of (P31) human (Q5),
76 U.S. presidents (including fictional ones!) with position held (P39) President of the United States (Q11696), and
38,836 triples with the predicate spouse (P26).

It therefore concludes that the most efficient way to run the query, rather than the order specified, is instead to first find the presidents (76), then see how many of them have spouses (51 solutions), then see how many are human (47 solutions), before sending this solution set to the labelling service. This rearrangement is typical of successful query optimisation.

Запит із труднощами

The following query attempts to find all statements referenced to the Le Figaro website, and return their subject, predicate and object. However as written it times out.

SELECT ?statement ?subject ?subjectLabel ?property ?propertyLabel ?object ?objectLabel ?refURL WHERE {
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?refURL 
  FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) .       
  ?subject ?p ?statement .
  ?property wikibase:claim ?p .
  ?property wikibase:statementProperty ?ps .
  ?statement ?ps ?object
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}

The reason for this can be investigated by looking at the explain report for the query.

It appears that the query optimiser is seduced by the low cardinality of the statements ?property wikibase:claim ?p and ?property wikibase:statementProperty ?ps, with only 1756 properties that take items as their objects, without anticipating that (i) the second condition will do nothing to reduce the solution set from the first condition; and (ii) these conditions will do very little to reduce the cardinality of the statement ?subject ?p ?statement (764,223,907), which the optimiser proposes to join before looking at ?statement prov:wasDerivedFrom ?ref and then ?ref pr:P854 ?refURL.

The query optimiser may also be trying to defer testing the statement FILTER (CONTAINS(str(?refURL),'lefigaro.fr')) for as long as possible, as this requires materialisation of actual string values, rather than proceeding solely with joins based on whether items exist in statements or not.

Update: as for 2020, there are more than 47000000 url references in Wikidata. As Blazegraph in current configuration does not perform full-text indexing for text values, there is no way to create quick "native" query without adding extra filters (e. g. "find all statements referenced to the Le Figaro, but only for presidents"). Luckily, there is an out-of the box solution, which involves MWAPI. Note, that this query is slightly different from original request: it returns links for domains matching *.lefigaro.fr, it won't capture links like http://web.archive.org/web/20200904002919/http://www.lefigaro.fr/. This query looks like this:

SELECT ?statement ?subject ?subjectLabel ?property ?propertyLabel ?object ?objectLabel ?refURL WITH
{
  SELECT DISTINCT ?subject WHERE {
    {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org"; wikibase:api "Generator"; mwapi:generator "exturlusage"; mwapi:geulimit "500"; mwapi:geuquery "*.lefigaro.fr"; mwapi:geuprotocol "http"; mwapi:geunamespace "0" .
        ?title wikibase:apiOutput mwapi:title.
      }
    } UNION {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org"; wikibase:api "Generator"; mwapi:generator "exturlusage"; mwapi:geulimit "500"; mwapi:geuquery "*.lefigaro.fr"; mwapi:geuprotocol "https"; mwapi:geunamespace "0" .
        ?title wikibase:apiOutput mwapi:title.
      }
    }
    BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?subject)
  }
} AS %items {
  INCLUDE %items .
  
  hint:Query hint:optimizer "None".
  
  ?subject ?p ?statement .
  ?statement prov:wasDerivedFrom/pr:P854 ?refURL .
  FILTER (CONTAINS(str(?refURL), 'lefigaro.fr')) .
  
  ?property wikibase:claim ?p .
  ?property wikibase:statementProperty ?ps .
  ?statement ?ps ?object .
  
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

Інші режими оптимізації

Blazegraph offers a handful of other optimisation directives including the more expensive "runtime optimisation", based on using sampling to more precisely estimate the likely development in the size of the solution set by a particular sequence of joins; and a directive to run a particular join first.

However, examining the corresponding query report, it appears that runtime optimisation has proceeded with the same query sequence as the static optimisation, and similarly times out.

As for the "runFirst" directive, I have not yet been able to work out how to apply it.

Other queries the optimiser has difficulty with

Performance falls off a cliff, when the number in a group of interest exceeds a certain threshold

This arose in connection with a query to find humans with apparently matching dates of birth and death, for humans from a particular country

France (N=125,223): tinyurl.com/obxk9av -- query report (successful)

Germany (N=194,098): tinyurl.com/pwthc9d query report (times out)

What goes wrong here occurs when the number of items with the given value of country of citizenship (P27), eg 194,098 for Germany, exceeds the items (time-value nodes in fact) with wikibase:timePrecision "11"^^xsd:integer (169,902).

For France, with N=125,223, there is a steady fall-off in the size of solution set: requiring a date of death, then day-specific precision, then a date of birth, then day-specific precision, knocks this number down to 48,537, which is then reduced to 10 by requiring the date of birth and the date of death to match (a very quick operation).

However for Germany, with N=194,098, the query optimiser attempts to start with the 169,902 date-nodes with day-specific dates; but the query then explodes when the system tries to match these to all the people with such day-specific birthdates (1,560,806), a number which reduces when the nationality filter kicks in; but which still eats up the available query time.

(In fact, having looked at the performance data and seen just how fast Blazegraph can do all-against-all matching for really very big sets, it turns out that (with a query rewrite) the system is fast enough to run this kind of query for everything -- see tinyurl.com/pn9xyye -- by matching on date-nodes first, which knocks down the size of the solution set, and not checking until after that the requirement that the date be day-specific).

Adding a names and name-labels wrapper to a query for life events makes it time out

Tracked by Blazegraph

Adding an apparently trivial wrapper, to find the name and name-label, causes the query in the immediately previous section to time out, even when without the wrapper it ran with no difficulty.

Query as presented: tinyurl.com/p8wzb74 -- query report

Hand-ordered query: tinyurl.com/o7z95vb -- query report (successful)

Query without the name lookup: tinyurl.com/obxk9av -- query report (successful)

With the outer-layer in place, the inner layer now starts (even for France) by extracting the birth-nodes that are connected to birth-dates, and then keeping those which have precision=11.

The situation can be fixed by turning off the query optimiser and hand-ordering the query.

It seems this may be a general feature of queries with sub-select clauses: both here and in the next query the optimiser seems to want to start the sub-query with a requirement that includes one of the variable that will be projected to the outer layer.

Adding a property-labels wrapper to a query for properties makes it time out

Query as presented: tinyurl.com/oex59qr -- query report

Hand-ordered query: tinyurl.com/ppkmb3v -- query report (successul)

Query without property-label lookup: tinyurl.com/qe4ukbx -- query report (successful)

Same as the above: for the query including property look-ups, the optimiser apparently wants to start with all triples ?a ?p ?b and then join ?b wdt:P31 wd:Q4167410, rather than the other way round (which is what it manages to successfully do for the smaller query with no property look-up).

This doesn't appear to an artifact caused by BIND, nor the order the clauses are presented -- even if these are changed, the query still runs slow with the built-in optimiser

Alt query: tinyurl.com/q97a8sh -- query report (still fails)

It is still preferring to start with the assertion ?a ?p ?b, even though this is less specific -- apparently because ?p is one of the variables projected out of the inner SELECT statement.

Запити з FILTER NOT EXISTS

Для нормальної роботи запити з «filter not exists» часто потрібно переписувати.

The “filter not exists” clause seems to often be inefficient for unclear reasons. For example a query like this one, featured articles in frwiki which does not exists at all on enwiki times out expressed as a query with “filter not exists”

select ?articleFr ?item ?itemLabel ?some ?someLabel {
    select * {?articleFr schema:about ?item ;
             wikibase:badge ?some ;
             schema:isPartOf <https://fr.wikipedia.org/>
  
    filter not exists {
      ?articleEn schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/>
    }
   }
}

however are fine just rewriting using an “optional”, with filtering on a unbound variable

select ?articleFr ?item ?itemLabel ?some ?someLabel {
  {
    select * {?articleFr schema:about ?item ;
             wikibase:badge ?some ;
             schema:isPartOf <https://fr.wikipedia.org/>
  
      optional { ?articleEn schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> .
      }
      filter (!bound(?articleEn))
   }
  }
}

Using MINUS() also works well in this instance. These three methods of removing results all does something similar but goes about it differently. There is not one method which is consistently better/worse, they will all perform differently given the circumstances. In other words you ought to try them all if you are serious about performance.

Rules of Thumb

Here's my explanation of the situation:

FILTER NOT EXISTS needs to check a condition on each solution of the subquery. These conditions are checked in sequence, not by join
OPTIONAL fetches an extra triple pattern related to each solution. This is done by join, i.e. all at once. The FILTER NOT BOUND then can quickly check each row of this extended resultset
MINUS computes the second subquery, then finds the set difference of the first and second subqueries. If both are relatively small, the difference can be computed quickly

So a rule of thumb is:

Use FILTER NOT EXISTS when the negative condition to check produces a huge resultset, so the NOT EXISTS may be less expensive
Use OPTIONAL... FILTER NOT BOUND when the negative condition extends the positive resultset with only a few more rows (or keeps the same rows)
Use MINUS when the negative condition, taken on its own, produces a small resultset

Робота з кешованими запитами

This post may be useful for folks who want to makes sure queries are freshly executed:

https://byabbe.se/2018/01/26/cache-busting-wikidata-sparql-queries

Короткий переказ: щоб вимкнути кешування запитів, додайте заголовок cache-control: no-cache. Крім того, ви можете просто замінити метод GET на POST. Результати запитів POST ніколи не кешуються: якщо ви хочете отримати кешовану відповідь, перемкніться з POST на GET.

Більше... ?

Додайте сюди інші випадки

Retrieved from "https://www.wikidata.org/w/index.php?title=Wikidata:SPARQL_query_service/query_optimization/uk&oldid=2047097059"

Wikidata:SPARQL query service