User:Magnus Manske/Mix'n'match date import

  • Mix'n'match has lots of "people entries" (biographies) from various third-party catalogs, both matched and unmatched to Wikidata
  • Many of these people entries have birth and/or death dates in their description
  • Reliably extracting these with generic code is hard, if not impossible
  • I wrote a script that has catalog-specific code (mostly regular expressions) for each catalog, where possible
  • The aim is to reliably extract dates from a specific catalog, to ensure the date is what the catalog states; some dates may be skipped if they are in an unusual form or contain "fuzzy" keywords like "died before"
  • This has yielded ~2.7 million birth and/or death dates (years, year-month, or year-month-day), so far
  • These are stored in a separate table in Mix'n'match
  • These data are used when creating a new Wikidata item from Mix'n'match
  • These data can also be used by a bot to add dates (where missing) and/or references to birth/death statements. Catalog-independent bot code exists:
    • A test edit is here.
  • These data can also be used to find new matches: if two Mix'n'match entries have identical, day-specific birth and death dates, and one entry is matched to Wikidata but the other is not, the unmatched entry is a strong candidate for matching to the same item. Reconciling these could become a separate function in Mix'n'match, or a Game. No code exists yet
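The catalog-specific extraction described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the description format, the `BORN`/`DIED` patterns, the `FUZZY` keyword list, and the function names are all assumptions made up for one imaginary catalog.

```python
import re
from typing import Optional, Tuple

# Hypothetical description format for one imaginary catalog:
#   "English painter; born 1850-03-12, died 1910-07"
# A date is a year, optionally followed by a two-digit month and day.
DATE = r"(\d{3,4}(?:-\d{2}(?:-\d{2})?)?)"
# The negative lookahead rejects dates that continue in an unexpected form.
BORN = re.compile(r"\bborn " + DATE + r"(?![\d-])")
DIED = re.compile(r"\bdied " + DATE + r"(?![\d-])")

# Hypothetical "fuzzy" keywords; a real catalog would need its own list.
FUZZY = re.compile(r"\b(?:before|after|circa|about|around|probably)\b", re.IGNORECASE)


def extract_dates(description: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (born, died) strings, or None where nothing reliable was found."""
    if "?" in description or FUZZY.search(description):
        return None, None  # skip the whole entry rather than guess
    born = BORN.search(description)
    died = DIED.search(description)
    return (born.group(1) if born else None, died.group(1) if died else None)


def precision(date: str) -> str:
    """Classify a date string as 'year', 'month' or 'day' by its number of parts."""
    return ("year", "month", "day")[date.count("-")]
```

With these assumed patterns, "born 1850-03-12, died 1910-07" gives a day-precision birth date and a month-precision death date, while "died before 1900" or a date written as "1850-3-12" is skipped entirely. The sketch does not check that month and day values are valid calendar dates.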

Technical

  • If you have a Toolforge (formerly "WMF Labs") user account, you can access the dates in the publicly readable database "s51434__mixnmatch_p"; the table is "person_dates", and its field "entry_id" links to "entry.id"
  • The code for the script that extracts the dates from the catalogs is here
  • The code for the (preliminary) bot script that adds dates and/or references to Wikidata is here
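The idea of finding match candidates through identical day-specific dates could be prototyped as a self-join on "person_dates". The sketch below runs against an in-memory SQLite database so that it is self-contained; the real database lives on Toolforge. Only the names "person_dates", "entry_id" and "entry.id" come from this page. The columns `born`, `died` and `q` (item number, NULL when unmatched), the date string format, and the function names are assumptions for illustration.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny stand-in database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    # Assumed columns: entry.q is NULL for unmatched entries; dates are
    # stored as text such as "1850", "1850-03" or "1850-03-12".
    conn.executescript("""
        CREATE TABLE entry (id INTEGER PRIMARY KEY, q INTEGER);
        CREATE TABLE person_dates (entry_id INTEGER, born TEXT, died TEXT);
    """)
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?)",
        [(1, 42), (2, None), (3, None), (4, 42), (5, None)],
    )
    conn.executemany(
        "INSERT INTO person_dates VALUES (?, ?, ?)",
        [
            (1, "1850-03-12", "1910-07-01"),  # matched, day precision
            (2, "1850-03-12", "1910-07-01"),  # unmatched, same dates as entry 1
            (3, "1850", "1910"),              # unmatched, year precision only
            (4, "1850", "1910"),              # matched, year precision only
            (5, "1850-03-12", "1911-01-01"),  # unmatched, different death date
        ],
    )
    return conn


def find_match_candidates(conn: sqlite3.Connection) -> list:
    """Return (matched_entry_id, unmatched_entry_id, q) for identical day-specific dates."""
    return conn.execute("""
        SELECT m.id, u.id, m.q
        FROM person_dates pm
        JOIN entry m ON m.id = pm.entry_id
        JOIN person_dates pu ON pu.born = pm.born AND pu.died = pm.died
        JOIN entry u ON u.id = pu.entry_id
        WHERE m.q IS NOT NULL AND u.q IS NULL
          AND length(pm.born) = 10 AND length(pm.died) = 10  -- day-specific only
        ORDER BY m.id, u.id
    """).fetchall()
```

Restricting the join to day-precision dates keeps year-only coincidences out of the candidate list; in the demo data only entries 1 and 2 are paired. Candidates found this way would still need human review before matching.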