User:ASarabadani (WMF)/Growth of databases of Wikidata

This page tries to propose a more sustainable growth path for Wikidata with regard to its core database.

Problem statement

Wikidata's core database has been growing extremely fast since its inception. For example, it is 1.5x bigger than the database of English Wikipedia while being a decade younger. With the current projections and without interventions, it will soon become unstable and later completely unusable, leaving all of Wikidata without a way forward.

After many interventions in the past (wb_terms refactor, pagelinks refactor, MCR, etc.), it currently (September 2024) takes up 1.5TB on each database host (and there are around 40 replicas). Storage is not the issue here: we can add bigger and bigger disks, but we cannot expand the memory further, reads from disk are around 1,000 times slower than reads from memory, and s8 handles around 100,000 queries every second (~2,000 queries per second per replica). These numbers mean that past a certain critical size, everything (Wikidata and all other Wikimedia projects) will go down: first only during spikes, then with longer and longer outages.

Measuring Impending Doom

Comparing innodb_buffer_pool_read_requests with innodb_buffer_pool_reads will give some of the pre-spike hints that more and more reads are coming from disk.

If improved instrumentation is required to sufficiently identify InnoDB buffer pool strain, please highlight it or create a MariaDB feature request.
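
As a rough illustration (a minimal sketch; the counters are standard InnoDB status variables, but the exact form of the check is just an example), the ratio can be read directly on a replica:

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';

-- Or as a single number: the fraction of logical read requests that had
-- to be satisfied from disk. A ratio creeping up from ~0 is an early
-- sign that the working set no longer fits in the buffer pool.
SELECT
  (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads')
  /
  (SELECT VARIABLE_VALUE FROM information_schema.GLOBAL_STATUS
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') AS disk_read_ratio;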

Biggest tables

As of September 2024, this is the list of the biggest tables:

root@db2181:/srv/sqldata/wikidatawiki# ls -sSh | head --lines 15
total 1.5T
 347G revision.ibd
 249G wbt_item_terms.ibd
 143G pagelinks.ibd
 121G content.ibd
  89G slots.ibd
  87G text.ibd
  80G change_tag.ibd
  71G comment.ibd
  51G wbt_text_in_lang.ibd
  47G wbt_term_in_lang.ibd
  25G page_props.ibd
  25G wbt_text.ibd
  17G cu_changes.ibd
  17G wb_changes_subscription.ibd

This provides a better picture. For example, removing statements from items won't help much (they are not stored in the core database; they are stored in our blob store, which can scale horizontally). Revision storage (the revision + content + slots + text tables) takes up around half of the database. Adding a lot more edits will contribute to further instability and outages.
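
For a rough SQL cross-check of the "around half" estimate (a sketch using logical sizes from information_schema, so the numbers differ somewhat from the on-disk .ibd sizes above):

-- Share of the four revision-storage tables in the whole wikidatawiki schema.
SELECT
  ROUND(SUM(IF(table_name IN ('revision', 'content', 'slots', 'text'),
               data_length + index_length, 0)) / POW(1024, 3)) AS revision_storage_gib,
  ROUND(SUM(data_length + index_length) / POW(1024, 3)) AS total_gib
FROM information_schema.TABLES
WHERE table_schema = 'wikidatawiki';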

On top of that, most of Wikidata's data is concentrated in a few extremely large tables. This makes maintenance quite difficult (for example, an alter table on the revision table currently takes two to three days).

Solutions

Short-term interventions

MCR clean up (phab:T183490)

This is ongoing and will essentially empty the text table, saving around 80GB.

Moving term store (wbt_* tables) to its own dedicated cluster (phab:T351802)

This will move around 400GB from the current core cluster to its own cluster. This means any WMCS tool relying on the term store needs to be updated: such tools won't be able to join the term store with core tables anymore, as the two will reside on physically separate servers.
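
As a hypothetical illustration of what breaks (the query and the assumption that items can be matched on page_title are just an example): a tool that today finds items without any terms via a single join would, after the split, have to query the core replica and the term-store replica separately and combine the results in application code.

-- Only works while page and wbt_item_terms live on the same server.
SELECT p.page_title
FROM page p
LEFT JOIN wbt_item_terms it
  ON it.wbit_item_id = CAST(SUBSTRING(p.page_title, 2) AS UNSIGNED)
WHERE p.page_namespace = 0
  AND it.wbit_id IS NULL
LIMIT 10;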

Long-term solutions

Revision storage

We need to do something about the revision storage. MediaWiki core treats tables such as revision as a "ledger" and keeps everything forever. We need to ask ourselves: do we need to keep every edit in Wikidata forever?

Potential solutions include removing revision rows for bots that have made many consecutive edits in a short period of time (keeping only the first and the last ones), and/or setting up a dedicated "archive" cluster and moving any revision older than 2020 there. This means many, many places in MediaWiki and in tools will need to change, and at least months of engineering work will be needed (and hardware needs to be budgeted, bought, racked up, etc.). Not to mention complexities such as how to treat the comment table, actor table, etc.
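
As a purely illustrative sketch of the first idea (the real logic would also need to check that the actor is a bot and that the edits are close together in time), the "middle" revisions of an unbroken run by the same actor on one page can be found with window functions:

-- Revisions whose direct predecessor and successor on the same page were
-- made by the same actor, i.e. candidates for pruning under a
-- "keep only the first and last edit of a run" policy.
SELECT rev_id
FROM (
  SELECT rev_id,
         rev_actor,
         LAG(rev_actor)  OVER (PARTITION BY rev_page ORDER BY rev_timestamp, rev_id) AS prev_actor,
         LEAD(rev_actor) OVER (PARTITION BY rev_page ORDER BY rev_timestamp, rev_id) AS next_actor
  FROM revision
  WHERE rev_page = 12345   -- one example page
) runs
WHERE rev_actor = prev_actor
  AND rev_actor = next_actor;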

One low-hanging fruit is to reduce the speed of growth. Having a policy forcing bot operators (or gadgets) to make as few edits as possible would have a major impact. It doesn't even need to be a single edit per action: just going from 20 edits to create an item down to 2 would reduce the growth caused by that bot by 90%. Additionally, the UI of Wikidata itself currently doesn't batch edits. Fixing that would help a lot as well, since a lot of edits happen through the UI (75M edits as of September 2024 according to Special:Tags).

Relevant MariaDB Features

  • ALTER ONLINE TABLE - MDEV-16329 Engine-independent online ALTER TABLE (added in 11.2) - will at least keep the table available during the change.
  • Dropping old data could be as simple as partitioning the table and dropping the partition holding the old data (see the sketch below).
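
A minimal sketch of the second point, on a hypothetical archive table (not an existing MediaWiki table; the year column is assumed to be filled at load time so that it can be part of the primary key, as partitioning requires):

CREATE TABLE revision_archive (
  rev_id        BIGINT UNSIGNED NOT NULL,
  rev_page      BIGINT UNSIGNED NOT NULL,
  rev_timestamp BINARY(14) NOT NULL,
  rev_year      SMALLINT UNSIGNED NOT NULL,  -- derived from rev_timestamp when loading
  PRIMARY KEY (rev_id, rev_year)
)
PARTITION BY RANGE (rev_year) (
  PARTITION p2018 VALUES LESS THAN (2019),
  PARTITION p2019 VALUES LESS THAN (2020),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Getting rid of a whole year of old data then becomes a cheap metadata
-- operation instead of a massive DELETE:
ALTER TABLE revision_archive DROP PARTITION p2018;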

Speeding up ALTER ONLINE TABLE might be possible if there are multiple partitions in the table. I don't recall an existing feature request for this.

I'm wondering how to get a better share of current data in the buffer pool. An assumption is that, with an InnoDB page size of 16k, old data and new data could end up in the same 16k page. This equates to a bit of waste in the buffer pool once the page is loaded. I'm not sure how to validate this yet.

There may be scope for system-versioned tables, with partitions keeping historical data in a different partition, ensuring it is not on the same InnoDB page as current data. A minimal implementation would just delete the row, knowing that it is saved as a version, and require that retrieval use the `AS OF TIMESTAMP` syntax. I haven't looked closely at the structure/implementation to see how easy it would be to support this SQL standard, of which MariaDB is one of the few implementers.
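
A minimal sketch of that idea on a made-up table (names are illustrative, not an existing schema): history rows live in their own partition, a delete only moves the row there, and old versions stay queryable with the standard temporal syntax.

CREATE TABLE term_demo (
  term_id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  term_text VARBINARY(255) NOT NULL
) WITH SYSTEM VERSIONING
PARTITION BY SYSTEM_TIME (
  PARTITION p_hist HISTORY,
  PARTITION p_cur  CURRENT
);

-- Deleting a row only moves it into the history partition...
DELETE FROM term_demo WHERE term_id = 42;

-- ...and it remains retrievable afterwards:
SELECT * FROM term_demo
FOR SYSTEM_TIME AS OF TIMESTAMP '2024-09-01 00:00:00'
WHERE term_id = 42;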

Scholarly articles

This has been the source of many issues (from WDQS, to revision and page table growth, to term store size, to pagelinks, to many other aspects). Setting up a dedicated wiki (such as "Wikicite") with its own entity type ("Citation", so C1, C2, etc.) would allow better use of this data across Wikimedia projects and would allow us to split the data in a more horizontal way. Of course, this requires a lot of engineering work.

Small ideas

  • The change_tag table is overly big (80GB, comparable with the revision table of English Wikipedia) because every edit done by OAuth tools adds a row (see Special:Tags). The question is: do we really need this? Can we at least remove old rows?
  • Normalizing page_props (the pp_property column) would free up a little bit of space (see the sketch after this list).
  • pagelinks in Wikidata is useful, but its use cases can also be served in other ways. For example, we could stop storing pagelinks entries for properties used in items. That would cut the size of that table to roughly one third of its current size.
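
A hypothetical sketch of the page_props idea above (table and column names are made up for illustration): store each distinct property name once and reference it by a small integer id.

CREATE TABLE page_props_name (
  ppn_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ppn_name VARBINARY(60) NOT NULL,
  UNIQUE KEY ppn_name (ppn_name)
);

-- page_props rows would then carry a 2-byte pp_property_id instead of
-- repeating the property name in every row, e.g.:
-- ALTER TABLE page_props ADD COLUMN pp_property_id SMALLINT UNSIGNED NOT NULL;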