User:Multichill/Monument imports

This page describes how to do Monument imports and might serve as a basis for more general dataset imports.

We currently have the Monuments database, which contains data from a number of sources. This data should be imported into Wikidata so that in the end we can abandon the monuments database in its current form.

  • Make generators that return dictionaries (from MySQL or CSV, for example)
  • Make a bot that expects a configuration file
  • The bot fetches a generator and works on the items
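
A minimal sketch of that setup, assuming the configuration is a small JSON document that names a generator and its arguments. The names `GENERATORS`, `list_generator` and `run_bot` are made up for illustration, and the callback stands in for the real work on an item.

```python
import json


def list_generator(rows):
    """Toy generator: yield one dictionary per entry in an in-memory list."""
    for row in rows:
        yield dict(row)


# Hypothetical registry mapping a name in the config to a generator function.
GENERATORS = {"list": list_generator}


def run_bot(config_text, work_on_item):
    """Read the config, fetch the named generator and work on every item."""
    config = json.loads(config_text)
    make_generator = GENERATORS[config["generator"]]
    results = []
    for row in make_generator(**config["arguments"]):
        results.append(work_on_item(row))
    return results


# Hypothetical configuration; a real one would live in a file.
config_text = """
{"generator": "list",
 "arguments": {"rows": [{"objrijksnr": 1}, {"objrijksnr": 2}]}}
"""
handled = run_bot(config_text, lambda row: row["objrijksnr"])
```

Keeping the generator choice in the configuration means the same bot can be pointed at a different source without code changes.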

Generators

  • CSV generator
  • MySQL generator
  • XML generator
  • Wiki template usage generator?
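
As an illustration of the simplest case, a CSV generator could look roughly like this. It assumes the CSV has a header row whose column names match the database fields. The sample rows are invented.

```python
import csv
import io


def csv_generator(handle, delimiter=","):
    """Yield one dictionary per CSV row, keyed by the header names."""
    for row in csv.DictReader(handle, delimiter=delimiter):
        yield dict(row)


# Invented two-row sample; a real run would pass an open file instead.
sample = io.StringIO(
    "objrijksnr,objectnaam,lat,lon\n"
    "1,Voorbeeldkerk,52.1,5.1\n"
    "2,Voorbeeldhuis,52.2,5.2\n"
)
rows = list(csv_generator(sample))
```

All values come back as strings, so numeric fields such as `lat` and `lon` still need a transform step before use.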

Matching

Loop over all items in the monuments database.

  1. Check whether it has a monument article.
    1. It has an article. Does the article have an item?
      1. It has an item. Check if it has Rijksmonument ID (P359).
      2. It doesn't have an item. Let's create it.
    2. It doesn't have an article.
  • Does it have a Wikidata id?
  • Does the Wikidata item have a claim with the same id?
  • If not, import the data.
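
The decision logic above can be sketched as a pure function that only says what should happen, leaving the actual edits to the bot. The two lookup helpers are hypothetical and are passed in as arguments. Here they are backed by invented in-memory dictionaries instead of real Wikidata queries.

```python
def decide_action(row, find_item_for_article, item_has_monument_id):
    """Return the action to take for one row of the monuments database.

    find_item_for_article(title) -> item id or None      (hypothetical helper)
    item_has_monument_id(item_id, monument_id) -> bool   (hypothetical helper)
    """
    article = row.get("monument_article")
    if not article:
        return "no_article"
    item_id = find_item_for_article(article)
    if item_id is None:
        return "create_item"
    if item_has_monument_id(item_id, row["objrijksnr"]):
        return "already_matched"
    return "add_claim"


# Invented stand-ins for sitelink and claim lookups.
sitelinks = {"Article A": "Q1", "Article B": "Q2"}
claims = {"Q1": {"P359": [101]}, "Q2": {}}


def find_item(title):
    return sitelinks.get(title)


def has_id(item_id, monument_id):
    return monument_id in claims[item_id].get("P359", [])


actions = [
    decide_action({"objrijksnr": 101, "monument_article": "Article A"}, find_item, has_id),
    decide_action({"objrijksnr": 102, "monument_article": "Article B"}, find_item, has_id),
    decide_action({"objrijksnr": 103, "monument_article": "Article C"}, find_item, has_id),
    decide_action({"objrijksnr": 104, "monument_article": ""}, find_item, has_id),
]
```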

Transform functions

  • Wikitext -> text (remove links and other garbage)
  • String -> article
  • Wikilink -> article
  • Article -> wikidata item
  • Lat/lon -> coordinates
  • Lookup field in dict -> value in dict (for example User:Metaodi#OGD_Zurich_Import)
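
A rough sketch of a few of these transforms, using only regular expressions and plain dictionaries. It handles simple links and bold/italic quotes, not templates or nested markup. The function names and the placeholder item id in the lookup table are assumptions for illustration.

```python
import re

# [[target|label]] -> label, [[target]] -> target
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def wikitext_to_text(wikitext):
    """Strip simple wikilinks and bold/italic quotes from a string."""
    text = LINK_RE.sub(r"\1", wikitext)
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def wikilink_to_title(wikilink):
    """Return the link target of a single wikilink, or None if it is not one."""
    match = re.fullmatch(r"\s*\[\[([^\]|#]+)[^\]]*\]\]\s*", wikilink)
    return match.group(1).strip().replace("_", " ") if match else None


def to_coordinates(lat, lon):
    """Combine lat/lon into one value; None if missing or out of range."""
    if lat is None or lon is None:
        return None
    lat, lon = float(lat), float(lon)
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return {"latitude": lat, "longitude": lon}


def lookup(value, table):
    """Map a field value to a target value through a dictionary."""
    return table.get(value)


# Placeholder target, not a real item id.
PROVINCES = {"NL-UT": "Q_PLACEHOLDER_UTRECHT"}

plain = wikitext_to_text("'''[[Dom van Utrecht|Domkerk]]''' in [[Utrecht]]")
title = wikilink_to_title("[[Utrecht (stad)|Utrecht]]")
coords = to_coordinates("52.09", "5.12")
province = lookup("NL-UT", PROVINCES)
```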

monuments_nl_(nl) mappings

CREATE TABLE `monuments_nl_(nl)` (

  • `objrijksnr` int(11) NOT NULL DEFAULT '0', - P359
  • `prov-iso` varchar(255) NOT NULL DEFAULT '', - administrative unit, via dictionary lookup
  • `woonplaats` varchar(255) NOT NULL DEFAULT '', - follow the link, find the Wikidata id
  • `adres` varchar(255) NOT NULL DEFAULT '', - P969 (string) or P669 item with P670 as qualifier
  • `objectnaam` varchar(255) NOT NULL DEFAULT '', - label in Dutch
  • `type_obj` enum('G','A') DEFAULT NULL, - drop it
  • `oorspr_functie` varchar(128) NOT NULL DEFAULT '', - dictionary, 233 items
  • `bouwjaar` varchar(255) NOT NULL DEFAULT '', - contains nested templates that are hard to parse, skip it
  • `architect` varchar(255) NOT NULL DEFAULT '', - leave it for now
  • `cbs_tekst` varchar(255) NOT NULL DEFAULT '', - some might be useful
  • `lat` double DEFAULT NULL, - coordinates
  • `lon` double DEFAULT NULL, - coordinates (together with `lat`)
  • `image` varchar(255) NOT NULL DEFAULT '', - image claim
  • `commonscat` varchar(255) NOT NULL DEFAULT '', - commonscat claim
  • `postcode` varchar(255) NOT NULL DEFAULT '', - postal code P281
  • `buurt` varchar(255) NOT NULL DEFAULT '', - drop it
  • `source` varchar(255) NOT NULL DEFAULT '', - could use this as website source
  • `changed` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  • `monument_article` varchar(255) NOT NULL DEFAULT '', - article to connect to if it's not already connected
  • `registrant_url` varchar(255) NOT NULL DEFAULT '', - for sourcing
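
Part of this mapping could be written down as a configuration dictionary that the bot walks over. The sketch below covers only the fields whose target is already stated above (P359, P969, P281, the Dutch label, and two dropped fields). The rule format and the name `plan_statements` are assumptions, and the sample row is invented.

```python
# Field name -> rule; None means "drop it".
MAPPING = {
    "objrijksnr": {"property": "P359"},
    "adres": {"property": "P969"},
    "postcode": {"property": "P281"},
    "objectnaam": {"label": "nl"},
    "type_obj": None,
    "buurt": None,
}


def plan_statements(row, mapping=MAPPING):
    """Return (kind, target, value) tuples for every non-empty mapped field."""
    planned = []
    for field, rule in mapping.items():
        value = row.get(field)
        if rule is None or value in (None, ""):
            continue  # dropped field or nothing to import
        if "property" in rule:
            planned.append(("claim", rule["property"], value))
        elif "label" in rule:
            planned.append(("label", rule["label"], value))
    return planned


# Invented sample row.
row = {
    "objrijksnr": 12345,
    "adres": "Voorbeeldstraat 1",
    "postcode": "",
    "objectnaam": "Voorbeeldhuis",
    "type_obj": "G",
    "buurt": "Centrum",
}
planned = plan_statements(row)
```

Fields that need a transform first (province lookup, coordinates, links to follow) would get their own rule types on top of this.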

Todo

  • Somehow check that each monument contains a province and a municipality
  • Somehow check that each monument is instance of Rijksmonument and instance of something else (church/house/etc)
  • Figure out how to use P1134
  • Maybe add municipality to the source data
  • Maybe add Wikidata item id in the source data
  • I probably have to split and merge some items after import
  • Figure out the complexes. Is the data available somewhere in a machine-readable format?

Complex

I still have a local copy of this dataset, which contains all the monument complexes.

  • tblCOMPLEX contains the complexes.
    • COM_NUMMER is an internal number
    • COM_RIJKSNUMMER is the complex id
    • COM_NAME contains the name (might be empty)
    • COM_HFDOBJNUMMER contains an internal id to link to the "hoofd object" (main object)
  • tblOBJECT contains the Rijksmonumenten
    • OBJ_NUMMER is an internal number
    • OBJ_RIJKSNUMMER is the monument id
    • COM_NUMMER is the internal complex number (foreign key)
    • (more fields, but not going to touch that for the complexes)
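
To get from these two tables to "which monuments belong to which complex", a join on COM_NUMMER should be enough. The sketch below assumes the tables have been loaded into SQLite (the format of the local copy is not stated here) and uses invented rows. Only the columns listed above are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblCOMPLEX (
    COM_NUMMER INTEGER PRIMARY KEY,
    COM_RIJKSNUMMER INTEGER,
    COM_NAME TEXT,
    COM_HFDOBJNUMMER INTEGER
);
CREATE TABLE tblOBJECT (
    OBJ_NUMMER INTEGER PRIMARY KEY,
    OBJ_RIJKSNUMMER INTEGER,
    COM_NUMMER INTEGER REFERENCES tblCOMPLEX (COM_NUMMER)
);
-- Invented rows: one complex with two members, one stand-alone monument.
INSERT INTO tblCOMPLEX VALUES (1, 500001, 'Voorbeeldcomplex', 10);
INSERT INTO tblOBJECT VALUES (10, 900010, 1), (11, 900011, 1), (12, 900012, NULL);
""")

# One row per (complex, member monument); is_main marks the "hoofd object".
QUERY = """
SELECT c.COM_RIJKSNUMMER, c.COM_NAME, o.OBJ_RIJKSNUMMER,
       o.OBJ_NUMMER = c.COM_HFDOBJNUMMER AS is_main
FROM tblCOMPLEX AS c
JOIN tblOBJECT AS o ON o.COM_NUMMER = c.COM_NUMMER
ORDER BY c.COM_RIJKSNUMMER, o.OBJ_RIJKSNUMMER
"""
members = conn.execute(QUERY).fetchall()
```

Monuments without a complex drop out of the inner join, which is what we want when only the complexes are of interest.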

Description

It should probably be based on the address. So something like 'Rijksmonument op %(adres)%'
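
In Python's %-formatting the placeholder would be spelled `%(adres)s`. A small sketch, assuming the address has already been cleaned of wikitext (the function name is made up):

```python
DESCRIPTION_TEMPLATE = "Rijksmonument op %(adres)s"


def make_description(row):
    """Build a Dutch description from the address, or None if there is none."""
    adres = (row.get("adres") or "").strip()
    if not adres:
        return None
    return DESCRIPTION_TEMPLATE % {"adres": adres}


# Invented address.
description = make_description({"adres": " Voorbeeldstraat 1 "})
```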

Street

It probably shouldn't be too difficult to extract the street and match it with an article.