Wikidata talk:SPARQL query service/WDQS graph split

What's a graph?

Are graphs synonymous with the SQL database, or are they something else? Thanks. –Novem Linguae (talk) 13:15, 8 February 2024 (UTC)[reply]

They are not entirely synonymous. The full Wikidata graph has its primary storage in the MySQL/MariaDB database behind the Wikidata wiki. (I'm using "graph" a bit loosely in this context, as the SQL database does not store the data in a form that is easily transformed into a graph.) In the current state, that SQL representation of the data is loaded into a Blazegraph RDF store to allow querying. So at the moment, there is a strong correlation between the SQL graph and the RDF / SPARQL graph (the reality is slightly more complex if we talk about Lexemes as well, but let's ignore this for the moment). The proposal here is to keep the SQL part of this untouched, but to split the RDF / SPARQL side into distinct subgraphs: one for scholarly articles and one for the main Wikidata graph. I hope that clarifies the question. GLederrey (WMF) (talk) 13:31, 8 February 2024 (UTC)[reply]
Got it. So SPARQL is a service that has converted the SQL database into an w:RDF store? Soon to be two RDF stores? –Novem Linguae (talk) 13:42, 8 February 2024 (UTC)[reply]
Correct! GLederrey (WMF) (talk) 14:36, 8 February 2024 (UTC)[reply]

technical structure of the split

Do I understand correctly that the way you are splitting is by the subject item in a triple? So ?a ?b ?c is in the scholarly subgraph if ?a is an instance of scholarly article, otherwise it's in the main subgraph? ArthurPSmith (talk) 19:36, 9 February 2024 (UTC)[reply]

@ArthurPSmith: At the moment of writing subclasses of scholarly articles are included. Proof: https://query-scholarly-experimental.wikidata.org/#select%20%3Finst%20%3Flabel%20%3Fcount%0Awith%20%7B%0A%20%20select%20%3Finst%20%28count%28%2a%29%20as%20%3Fcount%29%20where%20%7B%0A%20%20%20%20%5B%5D%20wdt%3AP31%20%3Finst%20.%0A%20%20%7D%20group%20by%20%3Finst%0A%7D%20as%20%25i%0Awhere%20%7B%0A%20%20include%20%25i%0A%20%20service%20%3Chttps%3A%2F%2Fquery.wikidata.org%2Fsparql%3E%20%7B%0A%20%20%20%20%3Finst%20rdfs%3Alabel%20%3Flabel%20.%20filter%28lang%28%3Flabel%29%3D%22en%22%29%0A%20%20%7D%0A%7D%20order%20by%20desc%28%3Fcount%29%0A
There seem to be some bad subclasses listed. Errors in the classification will obviously cause items to end up in the wrong graph, removing results from queries. Since it takes a month to initially populate a graph, this issue could be serious if left unaddressed. Infrastruktur (talk) 23:40, 12 February 2024 (UTC)[reply]
I'm not sure whether items that are instances of, say, "scholarly article" and "parchment" end up in one of the graphs or get duplicated into both; the latter seems safer. Infrastruktur (talk) 23:55, 12 February 2024 (UTC)[reply]
Guess they are duplicated: https://query-main-experimental.wikidata.org/#select%20%3Fitem%20%3Flabel%20%0Awhere%20%7B%0A%20%20bind%28wd%3AQ24669646%20as%20%3Fitem%29%0A%20%20service%20%3Chttps%3A%2F%2Fquery.wikidata.org%2Fsparql%3E%20%7B%0A%20%20%20%20%3Fitem%20rdfs%3Alabel%20%3Flabel%20.%20filter%28lang%28%3Flabel%29%3D%22en%22%29%0A%20%20%7D%0A%7D Infrastruktur (talk) 00:25, 13 February 2024 (UTC)[reply]
Please disregard. My conclusions were incorrect. Infrastruktur (talk) 07:08, 13 February 2024 (UTC)[reply]
Hi @ArthurPSmith, good to hear from you! And thanks @Infrastruktur for the cool queries.
The underlying transformations are composed of several parts, but I'll try to focus on the essential pieces.
There are routines that pull the Wikidata dumps into HDFS, producing about 15 billion quads for the full graph - context, subject, predicate, object.
Those quads undergo extraction in ScholarlyArticleSplitter.scala to create the split. I'll see if we can get some time on a Meet to go through it, but I'll try to summarize here what happens.
Scholarly graph:
1. Find subjects whose P31 is a scholarly article (Q13442814) and find all quads whose context matches those subjects.
2. Find the references for the elements in 1.
3. Get the values for the elements in 1 and 2.
4. Add those together to produce the scholarly graph.
Main ("non-scholarly") graph:
5. From the full graph, subtract the items identified in 1.
6. Then remove from 5 the references and values that are only attached to the scholarly graph, but keep any other references or values - some references and values are used in both graphs. ABaso (WMF) (talk) 03:34, 13 February 2024 (UTC)[reply]
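The six steps above can be sketched as set algebra. This is a simplified, hypothetical model in Python: the quad layout, the way references and values hang off item quads, and all names are illustrative assumptions; the real ScholarlyArticleSplitter.scala is a Spark job over billions of quads, not in-memory sets.

```python
# Hypothetical sketch of the scholarly/main split as set algebra.
# A quad is (context, subject, predicate, object); reference and value
# nodes are modelled as contexts that appear as objects of item quads.

def split_graph(quads, scholarly_items):
    # Step 1: all quads whose context is a scholarly item.
    core = {q for q in quads if q[0] in scholarly_items}
    # Steps 2-3: references and values used by those quads, modelled
    # here as quads whose context appears as an object of a core quad.
    used_by_scholarly = {o for (_c, _s, _p, o) in core}
    aux = {q for q in quads if q[0] in used_by_scholarly}
    # Step 4: the scholarly graph is the union of both sets.
    scholarly = core | aux
    # Step 5: the main graph starts as everything minus the scholarly
    # items' quads.
    main = quads - core
    # Step 6: drop references/values attached only to the scholarly
    # side; shared ones (also used from a main-graph quad) are kept.
    used_by_main = {o for (_c, _s, _p, o) in (main - aux)}
    main = {q for q in main if q not in aux or q[0] in used_by_main}
    return scholarly, main
```

Under this toy model, a reference node used by both a scholarly and a non-scholarly item ends up in both outputs, matching the step 6 remark that some references and values are used in both graphs.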
@ABaso (WMF): Thanks. By "values" (item 3) where the value is an item, are any triples/quads associated with that item also included, or only the item id/URI? I'm assuming properties are fully present in both graphs. Are lexemes only in the main graph? And do you understand what's going on with the timeouts that I and others seem to be seeing on this (see the Phabricator task comments from the last few days)? Is that something that can be fixed, or are we not querying correctly? ArthurPSmith (talk) 14:57, 13 February 2024 (UTC)[reply]
@ArthurPSmith Thanks. I'll try to respond piece by piece here, and appreciate your help getting down to specifics.
I think this may help in part on the question on triples (in HDFS, quads) associated with a value - here are a couple examples of what you might see in the scholarly graph:
https://w.wiki/9AFq
https://w.wiki/9AG2
Would you have an example in mind, though?
Regarding properties, is the question about whether the same set of wikibase:propertyType predicate triples exist in both graphs? If so, yes, that's the case, and the actual triples employed for a given property depend on item-to-property assignments. This said, is there a different aspect of this question to consider?
I'm seeing lexemes only in the main graph, at least based on looking for triples with a predicate of ontolex:lexicalForm.
For the timeouts: I haven't looked closely yet, but I think this may be the result of the summed times exceeding 60 seconds (the base timeout value on the given endpoint), particularly when BlazeGraph executes the two parts sequentially. One way to mitigate this could be to bump the timeout on these experimental endpoints up to 2 minutes for the experimental period. That way, if one side takes one second and the other takes 60 seconds, it's okay for now. We would probably want to examine this more closely later, as we wouldn't really want to just allow queries directed at one graph to become more time-expensive. I have also heard that there can be challenges when the results from a federated target are too big, and it's possible that what we're seeing is a symptom of that manifesting in the merging of results between the graphs for federated queries. Are folks seeing that, for a federated query, the queries issued in isolation against their specific graph each take under 60 seconds? ABaso (WMF) (talk) 23:23, 13 February 2024 (UTC)[reply]
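A toy budget model of that "summed times" hypothesis (the 60-second base timeout is from this thread; the 1 s / 60 s split is the hypothetical example given above):

```python
# Hypothetical: in a federated query the orchestrating endpoint waits
# for the remote SERVICE block, so the two durations add up against a
# single timeout budget.
BASE_TIMEOUT_S = 60       # current base timeout on the endpoint
PROPOSED_TIMEOUT_S = 120  # proposed bump for the experimental period

remote_service_s = 60     # time spent inside the SERVICE block
local_work_s = 1          # local join/filter time on the other graph

total_s = remote_service_s + local_work_s  # sequenced, so times add up
print(total_s > BASE_TIMEOUT_S)       # 61 s exceeds the 60 s budget
print(total_s <= PROPOSED_TIMEOUT_S)  # but fits in the proposed 2 min
```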
@ABaso (WMF): thanks - some comments, a little out of order:
  • on properties - I guess my main concern was with the auto-complete in finding properties by name, but it looks like auto-complete is searching both graphs for properties and items so no issue there.
  • for an example - say I want a list of all the authors (Wikidata ID and English name) on a particular paper. On the scholarly graph, if I do this - https://w.wiki/9AcH - I get the IDs as labels. If I query for any triple with an author as subject I get nothing. So authors are present in the scholarly graph only as item IDs, with no statements of their own. The same query on the main graph gives nothing (that's what I would expect). If I try the same query using federation - https://w.wiki/9AcP - the labels work, and the list of author items is good (and very quick - 203 ms). So great. But then what if I want some other data on these authors? If I try this:
select ?author ?b ?c WHERE {
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    SELECT ?author WHERE {
      wd:Q22683203 wdt:P50 ?author .
    }
  }
  ?author ?b ?c .
} LIMIT 1
I get a timeout. Am I doing federation wrong, or is something broken here? ArthurPSmith (talk) 15:38, 14 February 2024 (UTC)[reply]
It's a join-order thing. Add a hint after the triple pattern: ?author ?b ?c . hint:Prior hint:runLast true . Infrastruktur (talk) 16:05, 14 February 2024 (UTC)[reply]
Like this?
SELECT ?author ?b ?c
WHERE {
  ?author ?b ?c . hint:Prior hint:runLast true .
  SERVICE <https://query-scholarly-experimental.wikidata.org/sparql> {
    SELECT ?author
    WHERE {
      wd:Q22683203 wdt:P50 ?author .
    }
  }
}
ABaso (WMF) (talk) 16:51, 15 February 2024 (UTC)[reply]
Ok that seems very non-obvious - it does work though. The hints are Blazegraph specific I guess. Can we document/explain this somewhere? ArthurPSmith (talk) 19:28, 15 February 2024 (UTC)[reply]
Good talking with you today @ArthurPSmith! Here's the wiki page (GitHub) on BlazeGraph SPARQL query hints. Thinking of forward compatibility with any new engines, suffice it to say we want to be mindful of too much BlazeGraph-iness in queries, but hope this helps! ABaso (WMF) (talk) 21:08, 15 February 2024 (UTC)[reply]
By the way, I noticed that one may need to scroll the table on that wiki page in order to see the datatypes and default values for those query hints. ABaso (WMF) (talk) 21:13, 15 February 2024 (UTC)[reply]

statements for a single property are also split. Query doesn't work

This query is one that requires information from both WDQS instances. I just had a go at getting the information together, but found that the needed statements are quite fragmented across the two WDQS instances. Results from Finn and me confirm that statements for a single property are split over the two WDQS instances. I find it quite unpredictable which statements are split, and that means that many statements need to be queried at both instances. I got close to a solution, but in the last step the venue info (published in (P1433)) turns out to be split too, and then I still get the timeout :/ Egon Willighagen (talk) 11:34, 10 February 2024 (UTC)[reply]

I created a query to demonstrate this: https://w.wiki/98km And I wrote up some context on why this has massive implications for any tool using the SPARQL endpoint here: https://github.com/WDscholia/scholia/issues/2423#issuecomment-1937489201 Egon Willighagen (talk) 09:33, 11 February 2024 (UTC)[reply]
@Egon Willighagen Thanks for this feedback, we'll evaluate it as we go on with the tests. Sannita (WMF) (talk) 12:26, 12 February 2024 (UTC)[reply]
@Sannita (WMF), one thing you could perhaps try... let's say one split contains everything "P31 Q13442814"; what if you keep all statements from/to these items in both splits? I wonder if that would solve at least some of the problems. (Yes, I understand that the split is then also less efficient.) -- Egon Willighagen (talk) 17:28, 12 February 2024 (UTC)[reply]
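A hypothetical Python sketch of that suggestion (the triple model and all names are mine, purely for illustration): every statement that points into or out of a scholarly item is copied to both splits, so a lookup like the published in (P1433) one above would succeed on either endpoint, at the cost of storing the overlap twice.

```python
# Hypothetical model of the suggestion: triples are (subject,
# predicate, object); anything touching a scholarly item lands in
# both splits instead of only one.
def split_with_overlap(triples, scholarly_items):
    # Statements whose subject OR object is a scholarly item.
    touching = {t for t in triples
                if t[0] in scholarly_items or t[2] in scholarly_items}
    scholarly = {t for t in triples if t[0] in scholarly_items} | touching
    main = {t for t in triples if t[0] not in scholarly_items} | touching
    return scholarly, main
```

The trade-off is exactly the one acknowledged above: the duplicated "touching" set shrinks the space savings of the split, in exchange for making single-endpoint queries about scholarly items' neighbours work again.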
@Egon Willighagen thanks for the analysis and exploration here. I'm going to see if we can find some mutually available time to discuss a bit further and see if there are some particular "quick wins" we might consider. I'll follow up by email soon. ABaso (WMF) (talk) 04:03, 13 February 2024 (UTC)[reply]

has the decision to remove scholarly articles been made now?

Guillaume Lederrey writes in the email earlier this week the following: Scholarly articles is the only split we know of that would reduce the graph size sufficiently. We can work together on providing support for a migration, on reviewing the rules used for the graph split, but we can’t just ignore the problem and continue with a WDQS that provides transparent access to the full Wikidata graph. To me this sounds like the choice has now been made that scholarly publications will be removed from the main WDQS. Can this be clarified? Egon Willighagen (talk) 11:39, 10 February 2024 (UTC)[reply]

@Egon Willighagen Right now, we are still experimenting with the split of the graph, and no final decision has been made yet. Unfortunately, as Guillaume said, we haven't really seen another meaningful way to split it that fulfills all the requirements we have set. We acknowledge this whole situation is unfortunate; however, we want to stress that it is dictated by the necessity of finding a solution, and will affect only how we model and query data, not the data itself. Sannita (WMF) (talk) 10:57, 13 February 2024 (UTC)[reply]

Performance of scholarly test server?

I was hoping to do performance comparisons at some point but I don't think that's going to be possible - I was looking for a certain sample of article items and ran this query:

select ?item ?stated_as WHERE {
  ?item p:P50 [ps:P50 ?author; pq:P1932 ?stated_as] .
  FILTER(wikibase:isSomeValue(?author))
} LIMIT 10

But it timed out on https://query-scholarly-experimental.wikidata.org/. It ran in 4.2 seconds on the live WDQS. Why would the scholarly test server have more trouble with this? Anyway, it seems the systems are not going to be comparable performance-wise, unless there's something else going on here? ArthurPSmith (talk) 15:04, 4 April 2024 (UTC)[reply]

@ArthurPSmith thanks for pointing this out, I'm still not sure why this query is having issues on the scholarly endpoint. So far I can see two possible reasons:
  • wikibase:isSomeValue requires some optimizations to be set up in Blazegraph, and for some reason we might have forgotten to enable them?
  • papers that have a P50 with novalue are much less frequent in this subgraph, possibly forcing Blazegraph to scan for longer before being able to collect 10 items?
I'll do some more research on this. DCausse (WMF) (talk) 19:50, 4 April 2024 (UTC)[reply]
Oh, the frequency may well be very low, that's a reasonable possibility, I hadn't thought of that. ArthurPSmith (talk) 17:03, 5 April 2024 (UTC)[reply]
I did run the following query:
select (COUNT(*) AS ?c) WHERE {
  ?item p:P50 [ps:P50 ?author; pq:P1932 ?stated_as] .
  FILTER(wikibase:isSomeValue(?author))
}
to force finding all the solutions.
It took:
  • 2 minutes on the scholarly graph finding 22 solutions
  • 4 minutes on the full graph finding 796 solutions
So indeed I believe that, counterintuitively, such a query with a LIMIT may allow Blazegraph to return quicker on the bigger full graph, given the higher probability of collecting 10 solutions within the allowed 60 seconds. When I ask Blazegraph to collect from the full graph the same proportion of its solutions that the scholarly graph was asked for (~50%, i.e. LIMIT 400), it times out as well.
While I have it and in case it might be helpful, here is the full list of solutions from both graphs:
DCausse (WMF) (talk) 08:42, 19 April 2024 (UTC)[reply]
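For what it's worth, the intuition above can be put into rough numbers, assuming matches are spread uniformly through each scan (a simplification) and using the COUNT(*) timings and solution counts from this thread:

```python
# Back-of-the-envelope: with a LIMIT of k and n evenly spread matches,
# a scan can stop after roughly k/n of the full scan time.
full_scan_s = {"scholarly": 2 * 60, "full": 4 * 60}  # COUNT(*) timings
solutions = {"scholarly": 22, "full": 796}
limit = 10

estimate_s = {g: full_scan_s[g] * limit / solutions[g] for g in solutions}
# Full graph: ~3 s to collect 10 matches, well under the 60 s timeout.
# Scholarly graph: ~55 s, right at the limit, so any skew in where the
# matches sit pushes it over.
print(round(estimate_s["full"]), round(estimate_s["scholarly"]))
```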

QLever SPARQL engine

Hello, has it been considered to (also) use QLever as SPARQL engine in order to improve performance?

In my experience, it is possible to query the Wikidata database with QLever without the need to slice/split queries into smaller parts. However, the data on QLever is currently only updated once a week (which might be sufficient for a lot of queries). M2k~dewiki (talk) 15:51, 18 April 2024 (UTC)[reply]

I just found
regarding this issue. M2k~dewiki (talk) 15:55, 18 April 2024 (UTC)[reply]
@M2k~dewiki Yes, QLever is definitely on our radar as a possible replacement for Blazegraph to serve WDQS. I would encourage users to test the alternatives listed in Wikidata:SPARQL_query_service/WDQS_backend_update/WDQS_backend_alternatives and provide feedback that might help improve their solutions. We don't have immediate plans to host our own QLever instance, but we are hoping that in the near future they can work on adding support for real-time updates, which is one of the key features it is currently lacking. DCausse (WMF) (talk) 07:45, 19 April 2024 (UTC)[reply]
QLever and Virtuoso seem like the best potential replacements so far; would love to see each of them improve their support for this particular use case, and a re-evaluation along the lines of the eval from a few years back. Sj (talk) 15:13, 16 May 2024 (UTC)[reply]

Parallelizing evals of other graph dbs?

Having a separate specialized graph for citations may also make operational and practical sense in general, and fits into the modern world that already has lots of large specialized open graph dbs.

However, a lot of this work and its assessment seems to be performance evals: defining a benchmark of common queries to support, and defining what sort of graceful failure is sufficient for core use cases. This is comparable to some of the work of assessing other graph dbs for Wikidata, and aiui we are committed to migrating away from Blazegraph, no matter what.

Sannita wrote "we are focusing our work on this split, as we anticipate that it will buy us time to make better decisions about which alternative to move to. Reducing the size of the graph also makes the replacement easier in the future."

Does this mean no progress will be made this year on blazegraph migration planning?

  1. How much time does the split buy us? One of the goals for success is the ability to reload the graph from scratch in 10 days, but I believe an earlier scaling update said that an attempt last year took 3 months?
  2. When would evaluations of migration paths start up again? After the ~1yr of testing and realizing this split?
  3. Would we now compare the performance of blazegraph replacements for both single graphs and federation?
Proposal
revisit the conversations with other graph db providers, noting this split and its motivations, and invite them to demonstrate how they would handle a benchmark of queries, including federated + unfederated approaches to citations. In the course of doing this, rerun the evaluation matrix (and other benchmarks that have emerged since then). Maybe even allocate funds for honoraria to the graphdb teams we're evaluating to help them host working demos of their db running a WD or WikiCite snapshot.
Their input during this split experiment would also provide helpful input into how they would tackle the exact scaling issues we are facing. Sj (talk) 16:36, 16 May 2024 (UTC)[reply]
We expect that the graph split will buy us 5 years of growth at the current rate. So we should have enough time to prepare for the next steps before the situation is critical. The main issue with reloading the graph from scratch is that as the graph grows, the reload fails and starts taking exponentially more time to complete, because we need to restart from scratch multiple times. This seems to be addressed with the current graph split experiment.
We want to complete the graph split first, so that we keep focus and make progress. It is very unlikely that we will start working on migrating away from Blazegraph in the next year. When we start that work, we will need to decide on the criteria for that migration, including if we want to continue with split graphs or evaluate the possibility of having a single graph. My current intuition based on the previous evaluation of Blazegraph alternatives is that a split graph still makes sense with a different backend, as it allows us to look at more alternatives and provides more options in terms of scalability. But that will need to be reviewed once we start the migration project.
GLederrey (WMF) (talk) 08:44, 17 May 2024 (UTC)[reply]
+1 on both. In my opinion the main benefit of this proposed process is that it's reversible, unlike some options discussed for the past 6 or so years. However we can fully take advantage of this reversibility only if we have some shared understanding of what it would take for us to decide whether to continue with split graphs or not.
Ideally we'd have some assessment criteria which the Scholia community and other scholarly graph users would agree on, as well as a process to produce an assessment. Otherwise in x years we'll be in the same situation where people feel WMF makes a decision unilaterally and externalise the costs onto others (or is it easier to automatically rewrite federated queries back to a single graph model?).
For example, the WMF may set aside some funding to outsource some alpha testing to the developers of the three most promising blazegraph alternatives, where they'd run the scholarly graph on a test server, help users rewrite the queries as needed, assess the performance, introduce any necessary fixes. Unlike with the primary graph, it could be relatively easy to test mirrors with a significant portion of the real workload, especially if Scholia maintainers are onboard. Nemo 09:31, 19 May 2024 (UTC)[reply]

WikiCite collaborations and OpenAlex

The SPARQL endpoint split is obviously painful for the Scholia and WikiCite community, but I'm very curious to see what opportunities this opens up. For example, it will become easier to request funding specifically for the separate endpoint (or for a mirror which would offer different features). On this note, I see that OpenAlex has already been used to provide such a graph (SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples). https://semopenalex.org/resource/semopenalex:About is quite cute too. Can't wait to see the possibilities of federated queries as first class citizens! Nemo 09:14, 19 May 2024 (UTC)[reply]