Wikidata:SPARQL query service/WDQS backend update/March 2023 scaling update

Dear Wikidata community members,

Thank you for the continued conversations about the challenges Wikidata Query Service (WDQS) is facing. For the past year or so, we have shared these challenges and our thoughts and plans to address them. We have made much progress together, but have more to do, and there are no quick and easy answers.

As many of you know, the Wikimedia Foundation (WMF) and Wikimedia Deutschland e.V. (WMDE) have been partnering on Wikidata since the project debuted in 2012. Wikidata and the Wikibase Ecosystem strive to collect and organize the data that shapes humanity’s understanding of the world, and our partnership is important because Linked Open Data is essential to the Free Knowledge Movement.

WDQS is key to fulfilling Wikimedia’s Linked Open Data vision. It is foundational infrastructure for understanding and maintaining Wikidata’s data, as well as for reusing that data inside and outside the Wikimedia movement.

Over the past couple of years, much work has gone into developing a plan to scale WDQS to better support current and future Wikidata use. As we’ve improved technology in other areas, we’ve taken a step back to look at all the solutions and take advantage of what’s available, while balancing all Wikimedia community requests and projects. Below is the current state of WDQS and a summary of the highest-impact suggestions that WMF and WMDE are jointly exploring now, a few of which are already in progress.

Blazegraph Scaling

Our graph backend, Blazegraph, is end-of-life software and poses substantial operational risks to us. Wikidata’s triple count has also grown from ~13 billion in August 2021 to ~14.3 billion, increasing concerns about maxing out our system’s operational capacity, given how Blazegraph allocates memory to write data.

As discussed in previous updates, we have identified four options for alternative graph backends, along with guidelines for experimenting with them and insights for reviewing the architecture and choosing the data model that best supports what editors and reusers are querying.

Community member and reuser experience

We know WDQS users face several pain points:

  • Reasonable queries fail and time out. Graph growth causes more queries to time out over time.
  • In a small number of cases, the data in Blazegraph diverges from the data in Wikidata. The database reloading process that corrects for this divergence has become increasingly likely to fail, owing to what is likely a bug in Blazegraph itself.
  • Wikidata edits lag before they are reflected in WDQS.
    • This has drastically improved with the WDQS Streaming Updater: over the last 30 days, latency was under 10 minutes 98% of the time, and the median latency was approximately 2 minutes.

Equitable service also continues to be a top priority. We want to ensure that the public endpoint is available for anyone who needs it, not just a small number of users with a high volume of requests or complex queries. As with all databases, query analysis informs resource management decisions, and we’ll continue to learn more about what is important to editors and reusers (see also Wikidata Basic Analysis).

Features and improvements

The team has been looking at different ways to address growth in edits, queries, and data size beyond what the current system can support, examining ways to stabilize WDQS into the future. Some improvements, such as the Wikibase REST API, are in development. For other suggested solutions, we are planning experiments over the next 6 months to understand how each solution helps editors and reusers, and we will use that research to develop technical specifications to guide development. Each of the solutions included here came up in prior WMF & WMDE community conversations.

User needs

Move Some Traffic to Other Existing and New APIs. Some of the traffic to WDQS is for queries that are better served by other systems, such as querying data on individual Items. We need to improve existing APIs (e.g., Search) and build new APIs (e.g., the Wikibase REST API), and then encourage editors and reusers to use the APIs that best meet their needs. The REST API is being developed by the WMDE Wikibase Product Platform team. More about methods to access Wikidata: Wikidata:Data access.
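
As a sketch of the difference, a single-Item lookup needs no SPARQL at all. The /v0/ route and response shape below are assumptions based on the REST API as deployed at the time of writing; check the API documentation for current details:

    import requests

    # Fetch one Item directly instead of sending a query to WDQS. The /v0/
    # route is an assumption based on the REST API at the time of writing.
    url = "https://www.wikidata.org/w/rest.php/wikibase/v0/entities/items/Q42"
    resp = requests.get(url, headers={"User-Agent": "wdqs-example/0.1"})
    resp.raise_for_status()
    item = resp.json()
    print(item["labels"]["en"])  # label lookup without a SPARQL round trip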

Decouple SERVICES from SPARQL. The 60-second timeout limits how long a SPARQL query can run against Blazegraph. Running a SPARQL query does not involve fetching anything from Wikidata directly; however, the "SERVICE" clauses that can be added to a query, for example to fetch labels for the Items in the result, must complete within this same timeout. We are exploring restructuring the SPARQL SERVICES as microservices that run after the WDQS results are returned, or that can be called on demand. This would decrease the time a query using such a SERVICE spends inside the timeout window.
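
A minimal sketch of the decoupling idea, assuming the standard public endpoints: run the SPARQL query without a label SERVICE, then fetch labels in a separate step of the kind such a microservice could take over:

    import requests

    UA = {"User-Agent": "wdqs-example/0.1"}

    # Step 1: run the SPARQL query WITHOUT the label SERVICE, so the full
    # 60-second budget goes to the graph query itself.
    query = "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 50"
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query, "format": "json"}, headers=UA)
    qids = [b["item"]["value"].rsplit("/", 1)[-1]
            for b in r.json()["results"]["bindings"]]

    # Step 2: fetch labels afterwards from the regular Wikidata API -- the
    # post-processing a decoupled label service could perform on demand.
    r = requests.get("https://www.wikidata.org/w/api.php",
                     params={"action": "wbgetentities", "ids": "|".join(qids),
                             "props": "labels", "languages": "en",
                             "format": "json"}, headers=UA)
    for qid, ent in r.json()["entities"].items():
        print(qid, ent.get("labels", {}).get("en", {}).get("value"))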

Introduce Upper Limits. To prevent commercial and other high-volume users from potentially overwhelming the query service, we can handle high-volume users differently, addressing a long-term need for the query service. The goal is to keep large volumes of complex queries written by any one user from blocking others. It will be possible to tune limits by user or use case in a way that doesn’t prevent high-volume users from interacting with the service, but helps us achieve our goals for equitable service.
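
One way such tunable limits could work, sketched as a simple token bucket; the class, the user categories, and the numbers are invented for illustration, not a description of the actual mechanism:

    import time

    # Hypothetical sketch: high-volume users drain their own token bucket
    # rather than everyone's shared capacity.
    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate, self.capacity = rate_per_sec, burst
            self.tokens, self.updated = float(burst), time.monotonic()

        def allow(self) -> bool:
            # Refill proportionally to elapsed time, capped at the burst size.
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

    # Limits tuned per use case (numbers invented for illustration):
    buckets = {"anonymous": TokenBucket(0.5, 5), "bot": TokenBucket(5, 50)}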

Warning About Ill-suited Queries for WDQS. People are using WDQS for things for which it is not well suited. If we can automatically detect ill-suited queries, warn people, and offer them better alternatives, editors and reusers will have a better experience. The warning can include specific error codes with some guidance. We would pre-validate queries, re-route ill-suited ones away from SPARQL, and block them from entering the queue, reducing load on WDQS.
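
A hypothetical sketch of such pre-validation: a heuristic that spots one class of ill-suited query (fetching all statements on a single fixed Item) and suggests an alternative. The pattern, the helper, and the error code are invented for illustration:

    import re

    # Hypothetical heuristic: a query whose only graph pattern is a lookup on
    # one fixed Item is better served by the REST or Action APIs than by WDQS.
    SINGLE_ITEM = re.compile(
        r"^\s*SELECT\s+.+WHERE\s*\{\s*wd:Q\d+\s+\?p\s+\?o\s*\.?\s*\}\s*$",
        re.IGNORECASE | re.DOTALL)

    def suggest_alternative(sparql: str) -> str | None:
        """Return a warning for ill-suited queries, else None."""
        if SINGLE_ITEM.match(sparql):
            # Error code invented for illustration.
            return ("WDQS-W001: this looks like a single-Item lookup; "
                    "consider the Wikibase REST API instead.")
        return None

    print(suggest_alternative("SELECT ?p ?o WHERE { wd:Q42 ?p ?o }"))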

Data management

Reduce Redundant Data. In work already in progress at WMDE to reduce data, the team is introducing the "mul" language code and will later look at introducing a Lua module to get inverse relations. Reducing redundant data means fewer errors caused by storing and managing the same data several times, and less maintenance work for editors. Once deployed, the Wikidata Team will work with the community to remove redundant data, e.g., the name of a person stored in 300 languages.
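
For example, here is the kind of redundancy the "mul" code removes; the structure follows Wikibase label JSON, and the values are illustrative:

    # Before: the same name stored once per language.
    labels_before = {
        "en": "Marie Curie", "de": "Marie Curie", "fr": "Marie Curie",
        # ... repeated for hundreds of languages ...
    }
    # After: one "mul" (multiple languages) entry, with overrides only where
    # a language genuinely differs.
    labels_after = {
        "mul": "Marie Curie",
        "ru": "Мария Кюри",  # language-specific override
    }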

Build out the Wikibase Ecosystem. Part of the data that is currently in Wikidata would be better served in dedicated Wikibase instances within the larger Wikibase Ecosystem. Building out the Wikibase Ecosystem will enable us to focus Wikidata more on general-purpose data and lexicographical data, reducing the amount of data Blazegraph has to handle, while still keeping this data open, accessible, and queryable. We estimate that approximately 40% of the data in Wikidata would be moved to a different Wikibase instance. This important work is in progress and is being led by the WMDE Wikibase teams: see m:LinkedOpenData/Strategy2021/Wikibase.

"Splitting" the Graph. Given the information that Wikidata is desired to contain, there are no current, suitable open source graph databases that can store it as a single graph and also guarantee ease of maintenance. Splitting the graph is 'unavoidable' partially because of the ambition of how much editors want to add to Wikidata. Both this expectation and the technical infrastructure must change. The data would be split between two or more instances of Blazegraph. There are different logical ways to split the graph, such as by language, or topic or truthy graph versus full graph (reified). Next, we need to determine the benefits and drawbacks of the various ways to split the graph. Analysis and approach in the disaster recovery playbook is one option to inform our decisions.

Make it Easier for People to Run Their Own Query Service. Some people and organizations that currently use the public endpoint on a larger scale should instead be empowered to run their own query service instance, where they neither impact the service for everyone else nor are restricted by the timeouts and other limits of the public endpoint. This is already possible today, but cumbersome. We can make it easier and encourage it more.
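
As a sketch of what this looks like for a reuser, the same client code can target a self-hosted instance simply by swapping the endpoint URL; the localhost path below is Blazegraph's default in the WDQS distribution, so adjust it to your own setup:

    import requests

    PUBLIC = "https://query.wikidata.org/sparql"
    SELF_HOSTED = "http://localhost:9999/bigdata/namespace/wdq/sparql"

    def run(query: str, endpoint: str = SELF_HOSTED) -> dict:
        # A self-hosted endpoint is not bound by the public 60 s timeout.
        r = requests.get(endpoint, params={"query": query, "format": "json"},
                         headers={"User-Agent": "wdqs-example/0.1"})
        r.raise_for_status()
        return r.json()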

Long-term Considerations

Other recommendations to continue exploring together cover both user experience and Wikidata data management.

Make it Easier to Work with Dumps. A number of queries currently sent to WDQS would be better served by downloading the data dumps and running analysis on them instead of using the public endpoint. Due to their size, however, the dumps are becoming harder to work with. We can provide subset dumps, so editors and reusers can download a significantly smaller amount of data that contains what they need.
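
Until subset dumps exist, producing a subset locally is already possible by streaming the full JSON dump; a minimal sketch, assuming the published latest-all.json.gz layout of one entity per line:

    import gzip
    import json

    # The dump is one large JSON array with one entity per line; lines end
    # with "," except the opening "[" and closing "]". Adjust the path to
    # your local copy.
    with gzip.open("latest-all.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if line in ("[", "]", ""):
                continue
            entity = json.loads(line)
            # Keep only humans (P31 = Q5) as an example subset criterion.
            claims = entity.get("claims", {}).get("P31", [])
            if any(c.get("mainsnak", {}).get("datavalue", {})
                    .get("value", {}).get("id") == "Q5" for c in claims):
                print(entity["id"])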

Scheduled Bulk Queries. For high-volume datasets, we can introduce tools that produce the dataset on a scheduled cadence rather than in real time. This allows us to serve requests that may take longer than 60 seconds to process and to build mechanisms that optimize the handling of large requests. Editors and reusers can submit a query, which is then programmatically slotted into the queue to run at optimal times, with results returned via the UI, SFTP, or other channels editors and reusers want. If a request causes issues, we can build in mechanisms to cancel it.
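
A hypothetical sketch of that flow; the queue shape, names, and delivery step are invented for illustration:

    import heapq
    import time

    # Queries are queued with a preferred run time and executed off-peak
    # rather than immediately.
    jobs: list[tuple[float, str, str]] = []  # (run_at, user, sparql)

    def submit(user: str, sparql: str, run_at: float) -> None:
        heapq.heappush(jobs, (run_at, user, sparql))

    def run_due(now: float | None = None) -> None:
        now = time.time() if now is None else now
        while jobs and jobs[0][0] <= now:
            run_at, user, sparql = heapq.heappop(jobs)
            print(f"running job for {user}: {sparql[:40]}...")
            # ...execute without the 60 s interactive timeout, then deliver
            # results via the UI, SFTP, or another agreed channel...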

Next Steps

As mentioned, over the last several months we’ve taken a step back to assess all options to improve the Wikidata editor and reuser experience. During this fiscal year, we were unable to move forward with experimentation on Blazegraph alternatives, as engineers worked on other critical tasks.

We will continue the work already underway for each of the solutions outlined above and move forward with exploration on the others over the coming months. The WDQS capacity conversation also continues at WMF as we’ve begun annual planning for next fiscal year, which you can read more about here.

We will also be hosting two office hour sessions on Jitsi (https://meet.jit.si/WDQSOfficeHour) to directly address your comments and questions with the team.

The first session will take place on March 27, 2023, at 17:00 UTC. Additionally, to accommodate those in the ESEAP area, we have scheduled a second office hour on March 28, 2023, at 8:00 UTC. We hope you will be able to join us during one of these sessions to discuss this important topic further.

Thank you for reading, and I look forward to more conversations about progress and impact for Wikidata community editors and reusers.

Best,

Shari Wakiyama, WMF Director of Product Management
the WMF Search team & WMDE