User:Amir Sarabadani (WMDE)/Database normalization

From Wikidata
Jump to navigation Jump to search

What is Database normalization?

[edit]

Database normalization is a general term that is used to improve design of tables in databases but in here I mean something more specific that is part of the general term. By normalization I mean avoiding repeating strings in databases.

For example, this is a table that's not normalized:

revision
rev_id rev_user
123 Foo

This table is saying edit 123 was made by User:Foo but what if User:Foo makes millions of edits? Then the table grows too large and we will repeat "Foo" millions of times. By normalizing it, we split this table into two:

revision_normalized
rev_id rev_user_id
123 1
user
user_id user_name
1 Foo

In order to get the new data, you basically need to join these two tables on "user_id = rev_user_id".

Why it's being done?

[edit]

Tables across mediawiki and its extensions are being normalized, creation of "actor" table and "comment" table are being done to normalize "revision" table. "change_tag" table has been already normalized and as the most recent work, "wb_terms" is now normalized and replaced with six tables (all starting with wbt_). This won't stop here, more tables are going to change. Tables like "templatelinks", "categorylinks", etc. also need normalization.

This brings the question that why there is so much investment in changing tables and breaking workflows and tools, while the resource can be used on something else. The answer lies in two parts:

DRY

DRY (Don't repeat yourself) is a principle in software engineering that's proven essential in database design. Why? First of all, look at the example above, What if the user_name changes (renaming users)? Then you need to change millions of lines to change "Foo" to "Bar" but if you do it with the normalized version, you're only going to change one row which is going to be literally millions of times faster. Normalizing actor table allowed us to be able to rename users in blink of an eye now.

Storage

Strings are larger than numbers so they take more space in disk, memory, network I/O and so on. To understand the impact, the normalized version of wb_terms is ten times smaller than wb_terms. Normalized version of "change_tag" is half of its original version. How improving storage helps? Let's break it down:

  • Disk: Databases in Wikipedia are running out of space. Currently, the new host for master of Wikidata's database is at 70% (the old hosts were around 85%) and when it reaches 100%, we can't add anything to Wikidata anymore. The whole things breaks but with replacing wb_terms we will be at 40% disk usage. Also keep in mind backups that have the whole database in different points in time also will run out of disk space sooner.
  • Memory: For databases to be able to respond to queries in a timely manner, they load as much as possible of the database rows into memory. In our infrastructure, it's called "InnoDB buffer pool" (technical detail: Our storage engine is InnoDB) and it's 350GB. wb_terms table is 1TB while the new term store is around 100GB. InnoDB buffer pool efficiency (i.e. the ratio of how much data it reads from the memory instead of disk) is at 99.99% currently. One outage in Wikipedia happened because this number for Wikidata's database dropped to 99.9% (one nine drop) because reading from disk is ten times slower than reading them from memory. If we don't normalize the tables, eventually we exhaust this cache and everything will go down. Buying disk is cheap but buying bigger memory is not (if not impossible). Our database hosts have around 512GB for memory and we basically can't go higher than that. To put it simply, we have no choice.
  • Network I/O: In database, we have a "master" database that is the source of truth which replicates to lots of replicas (See https://dbtree.wikimedia.org), People can read from replicas but write has to happen only on master. The replication of changes from master to replicas happen using network. The flow of changes happening on Wikidata is so big that this replication takes a lot of network data flow. Making this tables smaller reduces this flow size allowing us to have more replicas (yay more people being able to read Wikidata and Wikipedia) and also reduces the replication lag (less outdated data).

My queries are slow, What should I do?

[edit]

Normalized tables are by design slower to read specially since using indexes on several tables is not possible but if you apply these hints here, it might even gets faster than non-normalized version:

Avoid joins if possible

That seems counter-intuitive but it's proven to work if done right. Its technical term is "join decomposition" and it basically means do multiple queries instead of one with joins. One reason is that MySQL caches query responses based on hash of the query and if you query the same thing over and over again, it's going to be faster, so you can easily do several query instead of one to get the mapping of for example user name to user id.

The rule of thumb to do join decomposition is that if the query with join(s) returns drastically lower number of rows than the several queries that are not joined, you shouldn't do it otherwise it's better to do multiple queries instead of join.

For example, if you want to query revision table to get the user name of the person who did the edit id 123 and you get the user id instead of user name and you want to fix this, instead of join two huge tables of revision and user, query revision table for rev_id = 123 and get the user_id value (let's say it's 1), then query user table with user_id = 1 and get the username. You can also sometimes cache the mapping (mapping of user_id to user_name is not going to change much) which brings us to the next point.

Cache mappings

If you're doing a query several times and you know it's not going to change much, just put the result of the query in a cache, if you can use key value stores like redis, great but if it's not possible, you can just put in a file on disk and read it from the file than making request to the database. This will be very useful if you combine it with join decomposition (the above hint) and hold the mapping.

It will get better once people start to use it

Remember, when the table is not used by people, it'll get kicked out of "InnoDB buffer pool" (the cache in database, see above) but once it's being used frequently, it'll reside in memory and it'll be faster. Actually, because normalized versions are smaller, you will have higher chance of hitting the cache for really large tables.