User:Multichill/Guide to data importing


Data importing or data ingestion is the creation or expansion of multiple items in an automated manner. This guide aims to explain how to do this.

Before you start: licensing and permissions

Before you even think about data importing, be 100% sure that the metadata is released under Creative Commons Zero (CC0).

But my data is published under a CC-BY-SA license! Then we can't use it in an automated mass import, because that license is incompatible with the license of Wikidata. It's a restrictive license for data and goes against our mission of providing free, reusable data.

For data in databases in the USA we have a similar copyright situation as for reproductions of 2D artworks: it's (probably) in the public domain. See the Wikimedia Foundation's preliminary perspective on the legal issue of database rights. So from a legal standpoint we're probably safe to copy data from a museum in the US, just like we can copy the reproductions of old paintings. Just like with copying reproductions, we would rather do it together with the GLAM than without their knowledge and consent. See also this overview of case studies of working together with GLAMs.

Examples

Throughout this guide I'll use two examples: one Rijksmonument and one painting in the Rijksmuseum. Both have data available in JSON format. This could be any other format like CSV, TSV, XML, etc., as long as it is record based. Some example code is available in git.

Rijksmonument

Grote Kerk (Q1545193) from https://tools.wmflabs.org/heritage/api/api.php?action=search&format=json&srlang=nl&srcountry=nl&srid=19264 :

Painting

The Continence of Scipio (Q17340637) from https://www.rijksmuseum.nl/api/en/collection/SK-A-4690?key=secret&format=json :

Analyze the data

Take one or more records and see if you can work out what all the fields are about. Look for a field that can function as a primary key and a field that can be used for the label. You don't have to use all fields in the original record and you don't have to understand all of them from the start.
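A quick way to shortlist primary key candidates is to check which fields are filled in and unique across a handful of records. The sketch below is only an illustration: the function name `candidate_keys` is made up for this guide, and the second sample record is invented (only the id "19264" and the name "Grote Kerk" come from the example above). A field that passes on a small sample can still turn out not to be unique on the full dataset, so treat the result as a hint.

```python
def candidate_keys(records):
    """Return field names that are present, non-empty and unique in every record."""
    if not records:
        return []
    # Only fields that appear in every record can serve as a key.
    common = set(records[0])
    for record in records[1:]:
        common &= set(record)
    keys = []
    for field in sorted(common):
        values = [record[field] for record in records]
        # Skip fields with empty values or nested values (lists, dicts).
        if any(v in ("", None) or isinstance(v, (list, dict)) for v in values):
            continue
        # A key candidate has as many distinct values as there are records.
        if len(set(values)) == len(values):
            keys.append(field)
    return keys


# Invented sample: the second record is hypothetical.
records = [
    {"id": "19264", "name": "Grote Kerk", "country": "nl", "adm3": ""},
    {"id": "19265", "name": "Toren", "country": "nl", "adm3": ""},
]
print(candidate_keys(records))
```

Here "name" also passes the test, which shows why a small sample is not enough: names are rarely unique across a whole country.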

Rijksmonument

  • "country" - Every Rijksmonument is in the Netherlands so we can just add country (P17) -> Netherlands (Q55) to every item
  • "lang" - We can use this to set the label in the right language
  • "id" - Can be used in Rijksmonument ID (P359). Dedicated property for unique key
  • "adm0" - Same as country
  • "adm1" - One of the provinces of the Netherlands
  • "adm2" - One of the municipalities of the Netherlands
  • "adm3" - Probably empty
  • "adm4" - Probably empty
  • "name" - Could be used for label
  • "address" - Can be used in street address (P969)
  • "municipality" - Same as adm2
  • "lat" & "lon" - Can be used in coordinate location (P625)
  • "image" - Can be used in image (P18)
  • "commonscat" - Can be used in Commons category (P373)
  • "source" - Permalink to the source, not used
  • "monument_article" - Wikipedia article. Mostly empty
  • "registrant_url" - URL to the source, can be used as reference
  • "changed" - Not used
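The field analysis above can be written down as a small mapping function. This is a minimal sketch, not the code from the git repository: the function name `monument_to_statements`, the shape of the returned dictionary and all sample values except the id "19264" and the name "Grote Kerk" are assumptions made for illustration. It only builds a plain Python structure; actually writing to Wikidata is a separate step.

```python
def monument_to_statements(record):
    """Map one monument record to a label, statements and a reference URL."""
    statements = {
        "P17": "Q55",                # country -> Netherlands, same for every record
        "P359": str(record["id"]),   # Rijksmonument ID, the unique key
    }
    # Optional fields are only added when they are filled in.
    if record.get("address"):
        statements["P969"] = record["address"]
    if record.get("lat") is not None and record.get("lon") is not None:
        statements["P625"] = (float(record["lat"]), float(record["lon"]))
    if record.get("image"):
        statements["P18"] = record["image"]
    if record.get("commonscat"):
        statements["P373"] = record["commonscat"]
    return {
        # "lang" tells us in which language the "name" field is written.
        "label": {record["lang"]: record["name"]},
        "statements": statements,
        "reference_url": record.get("registrant_url"),
    }


# Hypothetical sample record: address, coordinates and URL are placeholders.
sample = {
    "country": "nl", "lang": "nl", "id": "19264", "name": "Grote Kerk",
    "address": "Example street 1", "lat": 52.0, "lon": 5.0,
    "image": "", "commonscat": "",
    "registrant_url": "https://example.org/19264",
}
result = monument_to_statements(sample)
print(result["statements"])
```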

Painting

The metadata from the Rijksmuseum is a bit more extensive. Some selected fields:

  • "objectNumber" - Can be used in inventory number (P217). It's not a dedicated property so to make it unique we add collection (P195) -> Rijksmuseum (Q190804) to the item and as qualifier
  • "language" - We can use this to set the label in the right language. The Rijksmuseum API can serve Dutch (nl) and English (en)
  • "title" - Could be used for label
  • "physicalMedium" - Can be used in instance of (P31)
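For the painting, the interesting part is that the inventory number only becomes unique together with the collection. A minimal sketch of that mapping follows, assuming the selected fields have already been pulled out of the API response into a flat dictionary (the real response is nested). The function name `painting_to_statements` and the output structure are made up for this example, and "physicalMedium" is left out because turning its free text into an item needs a separate lookup.

```python
RIJKSMUSEUM = "Q190804"


def painting_to_statements(record):
    """Map the selected painting fields to a label and qualified statements."""
    return {
        "label": {record["language"]: record["title"]},
        "statements": [
            # collection (P195) -> Rijksmuseum on the item itself
            {"property": "P195", "value": RIJKSMUSEUM, "qualifiers": {}},
            # inventory number (P217), qualified with the collection
            # so that the number is unique within that domain
            {
                "property": "P217",
                "value": record["objectNumber"],
                "qualifiers": {"P195": RIJKSMUSEUM},
            },
        ],
    }


painting = {
    "objectNumber": "SK-A-4690",
    "language": "en",
    "title": "The Continence of Scipio",
}
mapped = painting_to_statements(painting)
print(mapped["statements"][1])
```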

Find unique key

For every data import we need to find a unique key. This way we can update existing items instead of creating duplicate items on each run. We have two types of unique keys:

  • Dedicated properties: Properties that are unique in themselves
  • Combined properties: A property that is only unique within a certain domain. This property needs to be qualified by its domain

Wikidata Query can be used to make a lookup table with <key> -> <wikidata item>
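Once the query output is available as (key, item) pairs, the lookup table is just a dictionary, and it decides for each source record whether to update or create. The sketch below assumes the pairs have already been fetched; `build_lookup` and `plan_import` are names invented for this guide, and the record with id "99999999" is a made-up example of something not yet on Wikidata.

```python
def build_lookup(pairs):
    """Turn (key, item) pairs into a key -> item dictionary."""
    lookup = {}
    for key, item in pairs:
        # Keep the first item seen for a key; duplicates need manual cleanup.
        lookup.setdefault(key, item)
    return lookup


def plan_import(records, lookup, key_field):
    """Split records into existing items to update and new items to create."""
    to_update, to_create = [], []
    for record in records:
        item = lookup.get(str(record[key_field]))
        if item is None:
            to_create.append(record)
        else:
            to_update.append((item, record))
    return to_update, to_create


lookup = build_lookup([("19264", "Q1545193")])
records = [{"id": "19264"}, {"id": "99999999"}]  # second record is hypothetical
to_update, to_create = plan_import(records, lookup, "id")
print(to_update)
print(to_create)
```

Running the same import twice should then only produce updates, because every record created in the first run shows up in the lookup table of the second run.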

Rijksmonument

Each Rijksmonument (Q916333) has a unique identifier. This identifier is stored in the dedicated property Rijksmonument ID (P359). For our example Rijksmonument ID (P359) is "19264". Each Rijksmonument should have exactly one identifier and each identifier should only be used on one item. Each night a constraint violation report is run to report on any items that don't conform to these and other constraints. Wikidata Query output for lookup table.
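The two constraints (one identifier per item, one item per identifier) are easy to check locally before an import. This is only a sketch of the idea and not how the nightly constraint reports are implemented; the function name `constraint_violations` is invented, and the items "Q_A", "Q_B" and "Q_C" are placeholders rather than real item IDs.

```python
from collections import defaultdict


def constraint_violations(pairs):
    """Find items with several identifiers and identifiers used on several items."""
    ids_per_item = defaultdict(set)
    items_per_id = defaultdict(set)
    for item, identifier in pairs:
        ids_per_item[item].add(identifier)
        items_per_id[identifier].add(item)
    # Single value violation: one item carries more than one identifier.
    multiple_ids = {i: sorted(v) for i, v in ids_per_item.items() if len(v) > 1}
    # Unique value violation: one identifier is used on more than one item.
    shared_ids = {k: sorted(v) for k, v in items_per_id.items() if len(v) > 1}
    return multiple_ids, shared_ids


pairs = [
    ("Q1545193", "19264"),
    ("Q_A", "1"), ("Q_A", "2"),   # placeholder item with two identifiers
    ("Q_B", "3"), ("Q_C", "3"),   # placeholder items sharing one identifier
]
multiple_ids, shared_ids = constraint_violations(pairs)
print(multiple_ids)
print(shared_ids)
```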

Painting

Each painting in the Rijksmuseum has a unique object identifier. This identifier is only unique within the Rijksmuseum collection. This identifier is stored in the combined property inventory number (P217). It's not a dedicated property, so to make it unique we add collection (P195) -> Rijksmuseum (Q190804) as qualifier. For our example inventory number (P217) is "SK-A-4690". For these combined properties we don't have constraint violation reports (yet). Wikidata Query output for lookup table.
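For a combined property the lookup query has to go through the statement and check its qualifier. The sketch below builds such a query as SPARQL, which is an assumption on my part: this guide only refers to Wikidata Query, and the SPARQL form relies on the `p:`, `ps:`, `pq:` and `wd:` prefixes that the Wikidata Query Service predefines. The function name `qualified_key_query` and the variable names `?item` and `?key` are made up for this example.

```python
import re


def qualified_key_query(prop, qualifier, qualifier_value):
    """Build a SPARQL query returning items and keys for a qualified property."""
    checks = ((prop, r"P\d+"), (qualifier, r"P\d+"), (qualifier_value, r"Q\d+"))
    for value, pattern in checks:
        # Refuse anything that is not a plain property or item ID.
        if not re.fullmatch(pattern, value):
            raise ValueError(f"not a valid entity ID: {value!r}")
    return (
        "SELECT ?item ?key WHERE {\n"
        f"  ?item p:{prop} ?statement .\n"
        f"  ?statement ps:{prop} ?key ;\n"
        f"             pq:{qualifier} wd:{qualifier_value} .\n"
        "}"
    )


# inventory number (P217) qualified with collection (P195) -> Rijksmuseum (Q190804)
query = qualified_key_query("P217", "P195", "Q190804")
print(query)
```

Only inventory numbers whose statement carries the Rijksmuseum qualifier are returned, so the same number used by another museum does not end up in the lookup table.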

Clean up existing items

You figured out the data, you have some sort of unique identifier, you want to start importing. Hold on. Before you do that you should first clean up existing items so that you don't import duplicates or add to an already messy subset. This generally consists of several steps:

  1. Try to find all in-scope items and add statements to put them in your view
  2. Add identifiers to all in-scope items
  3. Do other cleanup on existing items (optional)
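Steps 1 and 2 come down to comparing two sets: the items that are in scope and the items that already have an identifier. A minimal sketch, assuming both have already been collected; the function name `cleanup_report` is invented and "Q_X" and "Q_Y" are placeholders, not real item IDs.

```python
def cleanup_report(in_scope_items, lookup):
    """Compare in-scope items with the identifier lookup table."""
    in_scope = set(in_scope_items)
    identified = set(lookup.values())
    return {
        # In scope, but step 2 is not done yet for these items.
        "missing_identifier": sorted(in_scope - identified),
        # Has an identifier, but step 1 did not put it in the view yet.
        "identifier_but_out_of_scope": sorted(identified - in_scope),
    }


report = cleanup_report(
    ["Q1545193", "Q_X"],                   # "Q_X" is a placeholder
    {"19264": "Q1545193", "1": "Q_Y"},     # "Q_Y" is a placeholder
)
print(report)
```

When both lists are empty, the existing items are ready for the import.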

Rijksmonument

Every Rijksmonument should have heritage designation (P1435) -> Rijksmonument (Q916333). I had a bot loop over nl:Categorie:Rijksmonument and subcategories to add this statement to all articles (and thus items) about Rijksmonumenten. When this was done I asked Autolist which items are a Rijksmonument, but don't have an identifier. I worked on all the items to find the missing identifiers so the list should be empty.

Paintings

Every painting in the Rijksmuseum should have instance of (P31) -> painting (Q3305213) and collection (P195) -> Rijksmuseum (Q190804). I had a bot loop over nl:Categorie:Schilderij in het Rijksmuseum and en:Category:Collections of the Rijksmuseum Amsterdam to add this. When this was done I asked Autolist which items are in the Rijksmuseum collection, but don't have an identifier. I worked on all the items to find the missing identifiers so the list should be empty.

Basic import

Add more statements

See also