Wikidata:WikidataCon 2017/Notes/Wikidata Under the Hood

Title: Wikidata Under the Hood

Note-taker(s): Lucas; Geertivp

Speaker(s)

Name or username: Daniel Kinzler

Abstract

  • Storage as one wiki page per subject.
  • Storage as JSON.
  • The API.
  • Change propagation.
  • Lua interface

Collaborative notes of the session

User sees bad data

Clicks on edit

  • Browser shows Wikidata UI
  • Wikidata JS talks to Wikidata API, which uses and edits the Wikidata Item in Wikidata Storage
  • Change Dispatching writes to Job Queue. Does not actually contain the data – client pulls that from Wikidata Storage later

Example scenario: bad information (“Douglas Norman Adams”) in Wikipedia infobox.

User edits “birth name” statement on Wikidata.

using API calls for parsing and formatting (e. g. for dates)

wbparsevalue (datatype + values → JSON / error)

wbformatvalue (datavalue + datatype → wikitext / plaintext / HTML) – e. g. in case of date, HTML ends up in “expert” and plain text in text area
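
For illustration, a rough sketch of calling the two modules above from a script (assuming Python with the requests library; parameter details are simplified and may differ from the live API):

```python
import json
import requests

API = "https://www.wikidata.org/w/api.php"

# Parse user input ("1952-03-11") into a Wikibase "time" datavalue.
parsed = requests.get(API, params={
    "action": "wbparsevalue",
    "datatype": "time",
    "values": "1952-03-11",
    "format": "json",
}).json()
value = parsed["results"][0]["value"]
# value looks roughly like:
# {"time": "+1952-03-11T00:00:00Z", "precision": 11,
#  "calendarmodel": "http://www.wikidata.org/entity/Q1985727", ...}

# Format the datavalue back into plain text for display.
formatted = requests.get(API, params={
    "action": "wbformatvalue",
    "datavalue": json.dumps({"type": "time", "value": value}),
    "datatype": "time",
    "generate": "text/plain",
    "format": "json",
}).json()
print(formatted["result"])  # e.g. "11 March 1952"
```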

User saves edit. API module: wbsetclaim (could also use some others)

validates the edit and stores the new version
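
A minimal sketch of the kind of wbsetclaim call made for the birth-name fix above (authentication is omitted and the statement GUID is a placeholder; a real edit takes the GUID from the existing item JSON):

```python
import json
import requests

API = "https://www.wikidata.org/w/api.php"
session = requests.Session()
# (authentication omitted -- a real script would log in first)

# Fetch a CSRF token for the edit.
token = session.get(API, params={
    "action": "query", "meta": "tokens", "format": "json",
}).json()["query"]["tokens"]["csrftoken"]

# The corrected "birth name" (P1477) statement on Douglas Adams (Q42).
# The statement GUID below is a placeholder, not a real statement ID.
claim = {
    "id": "Q42$00000000-0000-0000-0000-000000000000",
    "type": "statement",
    "rank": "normal",
    "mainsnak": {
        "snaktype": "value",
        "property": "P1477",
        "datavalue": {
            "type": "monolingualtext",
            "value": {"text": "Douglas Noël Adams", "language": "en"},
        },
    },
}

# wbsetclaim validates the claim and stores a new revision of the whole item.
response = session.post(API, data={
    "action": "wbsetclaim",
    "claim": json.dumps(claim),
    "summary": "fix birth name",
    "token": token,
    "format": "json",
})
print(response.json())
```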

full item is stored as a single JSON structure, one full JSON blob per wiki revision. the database and the rest of MediaWiki don’t know much about the internal structure

(some information is extracted into other tables, e. g. for “what links here”)
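
For example, the current JSON blob of an item can be fetched through Special:EntityData; what comes back is essentially the structure stored in the revision (a sketch using Python and requests):

```python
import requests

# Special:EntityData serves the current revision's JSON blob for an item.
item = requests.get(
    "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
).json()["entities"]["Q42"]

print(item["labels"]["en"]["value"])   # "Douglas Adams"
print(len(item["claims"]["P1477"]))    # birth-name statements, stored inline in the blob
```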

one implication of this way of storing data: cannot store huge amounts of data for a single item (e. g. very fine-grained population data or weather data over time)

there is a hard limit on the byte size of a single revision, inherited from MediaWiki ($wgMaxArticleSize, 2 MB by default)

change notifications (a bit complicated because the problem is hard to scale)

check which client wikis need to be notified about a change

write entry to client’s job queue to process the change (change dispatching)

if a client is busy, it can take a while until the change is processed (dispatch lag)

usually not more than a few minutes

two-step process: Wikidata writes notification job to client, then that notification job creates more jobs when it runs

purge: purge cache of pages that need to be recreated

the “notification” job also checks if pages actually need to be purged: track which parts of the item are used and if they’re affected by the change (diff)

we’re working on more fine-grained usage tracking

log: write the change to recent changes and watchlists

more eyes on the change hopefully results in better chance of detecting vandalism

requires the change to appear in recent changes promptly

if the job queue is lagging, the entry shows up far back in recent changes, between entries patrollers have already seen ⇒ likely won’t be seen

clarification: “dispatch lag” is just the lag until Wikidata enters the notification job into the job queue. does not include the job queue lag for the “notification” job or the “purge” and “log” jobs it creates in turn
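
To summarize the pipeline, a schematic sketch of the two-step dispatch (not the actual Wikibase code; every name here is hypothetical):

```python
# Schematic sketch of the two-step dispatch described above.
# Not the actual Wikibase implementation; all names are made up for illustration.

def dispatch_change(change, client_wikis, subscriptions):
    """Step 1, on Wikidata: write one notification job per interested client.
    The job carries only change metadata, not the item data itself."""
    for wiki in client_wikis:
        if change.entity_id in subscriptions[wiki.name]:
            wiki.job_queue.push(("wikibase-notify", change.metadata()))

def run_notification_job(wiki, change_metadata):
    """Step 2, on the client wiki, possibly minutes later (dispatch lag + job queue lag)."""
    for page in wiki.pages_using(change_metadata["entity_id"]):
        # Only purge pages whose tracked usage is actually affected by the diff.
        if wiki.usage_affected(page, change_metadata):
            wiki.job_queue.push(("purge", page))
    # Record the change in recent changes / watchlists so more eyes can spot vandalism.
    wiki.job_queue.push(("log-to-recent-changes", change_metadata))
```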

Wikipedia receives the change notification and re-renders the page, using Lua to access the item

Lua modules live on the client, under control of the community.

disadvantages: need to be synchronized, fixed, etc. on each wiki

central Lua modules would be nice, have been discussed for a long time, but not there yet (many difficulties)

the Lua modules in turn use a Wikibase Lua library (module: on-wiki, library: provided by Wikibase extension)

the Wikibase Lua library directly talks to the database (with caching – don’t load the same item more than once)

Wikibase always returns list of values for a property, since there can be more than one (e. g. date of birth in two calendar models)
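
The Lua library only runs inside MediaWiki, but the same “always a list of values” behaviour can be seen through the API (a sketch; wbgetclaims is the module named in the API, the rest is illustrative):

```python
import requests

# Look up all statements for one property of an item.
claims = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "wbgetclaims",
    "entity": "Q42",
    "property": "P569",   # date of birth
    "format": "json",
}).json()["claims"]

# Even for a property with a single value, the result is a list of statements.
for statement in claims.get("P569", []):
    value = statement["mainsnak"]["datavalue"]["value"]
    print(value["time"], value["calendarmodel"])
```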

example templates completely populated from Wikidata: infobox telescope; authority control template on enwiki and dewiki

Questions / Answers

Q: who validates that edits are correct?

A: same as on Wikipedia: community looks at recent changes / etc. no automated checking by the software.

Q: is SPARQL deliberately not included in this talk?

A: yes, not relevant for this cycle (updating and rendering new information). Would be on same level as user interface and API:

mapping from internal datamodel to RDF, stored in a triplestore (BlazeGraph)

process looks at recent changes and updates the data graph

Q: when I’m writing a query and then doing some edits with QuickStatements, how long does it take until SPARQL updates? up to a few minutes

A: race condition between change dispatching and the query service. also, the query service has an indicator in the lower right corner showing when the data was last updated, and results are cached for five minutes

more about the meta information: some meta info is also in RDF (e. g. number of statements), but it’s possible that this data is updated in RDF before it’s actually been updated in Wikidata, so the RDF update uses old data.

job queue is not actually a queue. order of jobs is not completely random, but not fully in-order either. so this race condition is hard to solve.
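
One way to check the update lag yourself is to ask the query service when its data was last updated (a sketch; the wikibase:Dump / schema:dateModified pattern is a commonly used convention on the endpoint and may change):

```python
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

# Ask the query service for its last-update timestamp.
query = """
SELECT ?lastUpdate WHERE {
  wikibase:Dump schema:dateModified ?lastUpdate .
}
"""
resp = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                    headers={"User-Agent": "lag-check-example/0.1"})
bindings = resp.json()["results"]["bindings"]
print(bindings[0]["lastUpdate"]["value"])   # e.g. "2017-10-29T10:15:00Z"
```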

Q: is provenance information in SPARQL?

A: items have a revision history like any other wiki page, but the edit history is not in SPARQL. statements should have provenance (references), and those are in SPARQL. the edit history should not be very meaningful; authority should come from the reference, not the editor

Q: is it possible to see the history of a statement?

A: you can see the history of an item, but there is no nice interface for the evolution of a single statement yet.

RDF is currently just a snapshot of the current state of the data, no history (you can see the RDF representation by adding .rdf to the concept URI)
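
For example (the concept URI with a format suffix, e. g. http://www.wikidata.org/entity/Q42.rdf, redirects to Special:EntityData, which also serves Turtle and other serializations):

```python
import requests

# Fetch the current RDF snapshot of an item as Turtle.
rdf = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl").text
print(rdf[:500])
```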

Q: more about the technical validation?

A: Wikidata wants to store information as given in source. e. g. length can be stored in feet, not converted by Wikibase (but now converted when exported to RDF). we only normalize syntax (e. g. different ways to write numbers).

for dates, calendar model is part of the input (a Q-ID; we have a whitelist of supported calendars)

another layer: constraint checks. constraints maintained by community (e. g. certain properties should only have one value)

originally only a template on property talk pages and a bot that runs daily to check constraints

better integration now: enable the checkConstraints gadget to check constraints on everything you view

would-be-nice: check constraints before saving statement

aside: you can abuse AbuseFilter for this purpose

you can save that the father of something is a horse, but it will show up as a constraint violation to you and other users

constraints can also have exceptions (cat as mayor)
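
The gadget’s checks can also be requested programmatically via the wbcheckconstraints module provided by the WikibaseQualityConstraints extension (a sketch; the exact response structure may vary between versions):

```python
import json
import requests

# Run the same constraint checks the checkConstraints gadget shows, for one item.
result = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "wbcheckconstraints",
    "id": "Q42",
    "format": "json",
}).json()

# The response lists a check result per statement, each with a status such as
# "compliance" or "violation"; print a slice of it for inspection.
print(json.dumps(result["wbcheckconstraints"]["Q42"], indent=2)[:1000])
```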

Q: when you add a sex or gender and add “man/woman” instead of “male/female”, why can’t you save it?

A: if it’s a constraint check, it should only happen after save. Might be something magical by the community (by the power of MediaWiki). could be an AbuseFilter, some custom JavaScript, …

modeling gender is a good example of Wikidata’s flexibility – look at Chelsea Manning, for example