User:Epìdosis/Strategy

Connecting Wikidata and library catalogs: reciprocal benefits

Benefits for Wikidata

  1. People can be identified with more certainty (e.g. by comparing the works listed by different library catalogs)
  2. Discovery of duplicate items
  3. Source for aliases
  4. Source for descriptions
  5. Source for statements (e.g. dates of birth/death)

Benefits for library catalogs

  1. Possibility of comparing with many other sources (e.g. authority IDs, encyclopedias, other databases)
    1. Possibility of importing these sources and/or showing them to the reader (e.g. AuthorityBox; cf. also Google Knowledge Graph)
  2. Discovery of duplicate authority IDs
  3. Source for aliases
  4. Source for descriptions
  5. Source for data (e.g. dates of birth/death)

Connecting Wikidata and library catalogs: effective actions

  • Creation of Mix'n'match (Q28054658) catalog with plenty of auxiliary data [note: this point should be expanded]
  • 0: Use Mix'n'match catalog in order to improve interconnection between Wikidata and the library catalog

Note: every change made on Wikidata should also be made on Mix'n'match

  • When an authority ID is moved from one item to another, the change must always be made on Mix'n'match as well (to prevent its automatic reinsertion into the wrong item)
  • When an authority ID is deleted (because the library catalog has redirected or deleted it), the change must always be made on Mix'n'match as well (i.e. by marking it as N/A, to prevent its automatic reinsertion)
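
One way to keep track of what has to be mirrored on Mix'n'match is to compare two snapshots of the authority ID assignments on Wikidata. The following is a minimal sketch under stated assumptions: the snapshots are plain mappings from authority ID to item, each ID sits on at most one item, and the function name `diff_snapshots` and the sample IDs are hypothetical. It only lists the changes; applying them on Mix'n'match remains a separate step.

```python
def diff_snapshots(old: dict[str, str], new: dict[str, str]):
    """Compare two {authority_id: item} snapshots (hypothetical export format).

    Returns (moved, removed):
      moved   -> {authority_id: (old_item, new_item)} for IDs now on another item
      removed -> {authority_id: old_item} for IDs no longer present on Wikidata
    """
    moved = {
        auth_id: (old[auth_id], new[auth_id])
        for auth_id in old.keys() & new.keys()
        if old[auth_id] != new[auth_id]
    }
    removed = {auth_id: old[auth_id] for auth_id in old.keys() - new.keys()}
    return moved, removed


# Illustrative data only: these IDs and items are made up.
old_snapshot = {"id-1": "Q1", "id-2": "Q2", "id-3": "Q3"}
new_snapshot = {"id-1": "Q1", "id-2": "Q5"}

moved, removed = diff_snapshots(old_snapshot, new_snapshot)
print("re-match on Mix'n'match:", moved)
print("mark as N/A on Mix'n'match:", removed)
```

IDs in `moved` correspond to the first case above (move to another item), IDs in `removed` to the second (deletion).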

Periodic revision on Wikidata [W]

  1. Check unique-value constraint violations (in order to find duplicate items)!
  2. Use queries to find all items that have the authority ID but lack some fundamental properties, then add the missing information manually (or import it semi-automatically)
    1. sex or gender (P21)
    2. date of birth (P569) or date of death (P570) or floruit (P1317)
    3. VIAF ID (P214)
    4. Aliases
    5. Description in the language of the cataloging agency
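
The queries of step 2 can be generated rather than written by hand. The sketch below is only an illustration, not an established tool: the helper name `build_missing_data_query` is hypothetical, P5739 is used merely as an example authority ID property, and the `wdt:` prefix is assumed to be predefined, as it is on the Wikidata Query Service. Each query lists the items that have the authority ID but none of the properties in one group; aliases and descriptions (points 4 and 5) are not covered.

```python
def build_missing_data_query(authority_prop: str, any_of: list[str], limit: int = 100) -> str:
    """Return a SPARQL query for items that have `authority_prop`
    but none of the properties listed in `any_of`."""
    filters = "\n".join(
        f"  FILTER NOT EXISTS {{ ?item wdt:{prop} [] . }}" for prop in any_of
    )
    return (
        "SELECT ?item ?id WHERE {\n"
        f"  ?item wdt:{authority_prop} ?id .\n"
        f"{filters}\n"
        "}\n"
        f"LIMIT {limit}"
    )


# Points 1-3 of the list above, one query per group of properties.
CHECKS = {
    "sex or gender": ["P21"],
    "dates": ["P569", "P570", "P1317"],  # item lacks all three
    "VIAF ID": ["P214"],
}

queries = {
    name: build_missing_data_query("P5739", props) for name, props in CHECKS.items()
}
print(queries["dates"])
```

The printed text can be pasted into the query service; running it automatically is left out so that the sketch stays free of network access.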

Periodic revision on library catalogs [L]

  1. Check single-value constraint violations (in order to find duplicate authority IDs)!
    • Note: if a large number of duplicates is found through Mix'n'match, it may be best to proceed as follows: merge all the duplicate authority IDs in the catalog; remove all the deleted authority IDs from Wikidata (if the property already exists); import a new Mix'n'match catalog and delete the old one
  2. Using some internal procedure, find all authority IDs that are present on Wikidata but lack some fundamental data, then add the missing information manually (or import it semi-automatically)
    1. sex or gender (P21)
    2. date of birth (P569) or date of death (P570) or floruit (P1317)
    3. VIAF ID (P214)
    4. Aliases
    5. Description in the language of the cataloging agency
  3. Using some internal procedure, compare the birth/death dates in the authority records with those in Wikidata (in order to find mismatches and to improve dates on both sides)
  4. Receive reports of possible errors from Wikidata users on the talk page of the property
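
The date comparison of step 3 could look roughly like this. It is a sketch under assumptions, not a description of any catalog's actual procedure: dates are taken to be ISO-style strings ("YYYY", "YYYY-MM" or "YYYY-MM-DD"), both sides are keyed by authority ID, and the function names and sample records are hypothetical. Two dates count as compatible when they agree at the precision of the coarser one, so "1950" and "1950-03-12" are not reported.

```python
def dates_compatible(a: str, b: str) -> bool:
    """True when two ISO-style dates agree at the precision of the coarser one."""
    return a.startswith(b) or b.startswith(a)


def find_date_mismatches(
    catalog: dict[str, str], wikidata: dict[str, str]
) -> list[tuple[str, str, str]]:
    """List (authority_id, catalog_date, wikidata_date) for conflicting dates.

    Only authority IDs present on both sides are compared.
    """
    mismatches = []
    for auth_id in sorted(catalog.keys() & wikidata.keys()):
        if not dates_compatible(catalog[auth_id], wikidata[auth_id]):
            mismatches.append((auth_id, catalog[auth_id], wikidata[auth_id]))
    return mismatches


# Illustrative birth dates only: IDs and values are made up.
catalog_births = {"A1": "1950", "A2": "1871-05-02", "A3": "1900"}
wikidata_births = {"A1": "1950-03-12", "A2": "1871-05-20", "A4": "1800"}

date_mismatches = find_date_mismatches(catalog_births, wikidata_births)
print(date_mismatches)
```

Each reported pair still needs a human decision about which side is wrong; pairs that differ only in precision are candidates for enriching the coarser side.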

Connecting Wikidata and library catalogs: institutions

Institutions effectively involved

Property 0 W1 W2 L1 L2 L3 L4
Pontificia Università della Santa Croce ID (P5739) doing periodically periodically periodically periodically
Unione Romana Biblioteche Scientifiche ID (P8750) doing scheduled periodically
Portuguese National Library author ID (P1005) scheduled
Museo Galileo authority ID (P8947) scheduled periodically periodically
Cyprus University of Technology ID (P9251) scheduled
Biblioteca Franco Serantini ID (P9178) scheduled

Engageable institutions

This list includes all the library catalogs for which a Mix'n'match catalog was scraped (or scraping is scheduled) in 2020-2021 by Bargioni, excluding the library catalogs with which contact has already been established (these are included in the previous list).

Rome
Italy
Greece and Cyprus

Europe
World

Good examples of error reporting

P.S. About conflations

I would distinguish two types of conflation:

  • just one ID is misplaced (conflation lato sensu)
  • many parts of the item (labels/descriptions/aliases, statements, IDs) refer to two different entities (conflation stricto sensu)

If the misplaced ID is very important for reconciliation (mainly VIAF ID (P214) and ISNI (P213)), there is a high risk that a conflation lato sensu degenerates into a conflation stricto sensu, because other wrong IDs tend to gather around the misplaced one.

unique-value constraint violation
  • if the ID is perfect (no duplications or conflations):
    • ID is misplaced in one (or both) cases
    • items are duplicated
  • in practice (IDs having duplications and conflations):
    • ID is misplaced in one (or both) cases
    • items are duplicated
    • ID is conflated

single-value constraint violation
  • if the ID is perfect (no duplications or conflations):
    • ID is surely misplaced in one (or both) cases
    • item is conflated
  • in practice (IDs having duplications and conflations):
    • ID is misplaced in one (or both) cases
    • item is conflated
    • IDs are duplicated
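
To make the two kinds of violation concrete, the sketch below detects both from a flat list of (item, authority ID) pairs. The function name `find_violations` and the sample pairs are hypothetical, and the input format is an assumption. The code only finds the violations; deciding which of the causes listed above applies still requires looking at each case.

```python
from collections import defaultdict


def find_violations(pairs: list[tuple[str, str]]):
    """Detect constraint violations in (item, authority_id) pairs.

    Returns (unique_value, single_value):
      unique_value -> {authority_id: [items]} where one ID is on several items
      single_value -> {item: [authority_ids]} where one item has several IDs
    """
    items_by_id = defaultdict(set)
    ids_by_item = defaultdict(set)
    for item, auth_id in pairs:
        items_by_id[auth_id].add(item)
        ids_by_item[item].add(auth_id)
    unique_value = {a: sorted(s) for a, s in items_by_id.items() if len(s) > 1}
    single_value = {i: sorted(s) for i, s in ids_by_item.items() if len(s) > 1}
    return unique_value, single_value


# Illustrative pairs only: items and IDs are made up.
pairs = [
    ("Q1", "id-100"),
    ("Q2", "id-100"),  # same ID on two items -> unique-value violation
    ("Q3", "id-200"),
    ("Q3", "id-201"),  # two IDs on one item -> single-value violation
    ("Q4", "id-300"),
]

unique_value, single_value = find_violations(pairs)
print(unique_value)
print(single_value)
```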

Another way to find conflations would be to look at single-value constraint violations of date of birth (P569)/place of birth (P19)/date of death (P570)/place of death (P20), but these can mean any of the following:

  • several dates with different precision (e.g. day vs. year; the most precise should have preferred rank)
  • several dates supported by different sources (e.g. Wikipedia vs. an authority file; the statements supported by the most authoritative source should have preferred rank, while statements supported by nothing or only by a Wikipedia should be removed)
  • item is conflated
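
The first two cases can be triaged mechanically, which leaves fewer violations to inspect for real conflations. The following is a rough sketch under several assumptions: the `DateStatement` class, the `triage` function and the source label "wikipedia" are hypothetical; precision uses Wikidata's numeric convention (9 = year, 10 = month, 11 = day); and "most authoritative source" is simplified to "any source other than a Wikipedia". It does not detect conflated items.

```python
from dataclasses import dataclass, field


@dataclass
class DateStatement:
    value: str  # ISO-style date, e.g. "1871-05-02"
    precision: int  # Wikidata time precision: 9 = year, 10 = month, 11 = day
    sources: list[str] = field(default_factory=list)


def triage(statements: list[DateStatement]):
    """Split statements into (preferred, kept, removable).

    removable -> no source at all, or only Wikipedia as a source
    preferred -> the most precise of the remaining statements (None if none remain)
    kept      -> the other supported statements, left at normal rank
    """

    def is_supported(s: DateStatement) -> bool:
        return any(src != "wikipedia" for src in s.sources)

    removable = [s for s in statements if not is_supported(s)]
    supported = [s for s in statements if is_supported(s)]
    preferred = max(supported, key=lambda s: s.precision, default=None)
    kept = [s for s in supported if s is not preferred]
    return preferred, kept, removable


# Illustrative statements only: values and sources are made up.
statements = [
    DateStatement("1871", 9, ["authority file"]),
    DateStatement("1871-05-02", 11, ["authority file"]),
    DateStatement("1872", 9, ["wikipedia"]),
    DateStatement("1870", 9),
]

preferred, kept, removable = triage(statements)
print(preferred.value, [s.value for s in kept], [s.value for s in removable])
```

If the supported statements disagree beyond precision, the item should be checked by hand for the third case, a conflation.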

In both cases (using IDs or using basic biographical statements), constraint violations are the best method for finding conflations. In fact, however, conflations (and specifically conflations stricto sensu) account for a very small percentage of violations, and they are mixed up with a very large number of less damaging problems. These lesser problems are still a nuisance, for three reasons: they can degenerate into more serious problems (conflations lato sensu); they are annoying for Wikipedia infoboxes (double statements differing in precision or in the authority of their sources); and they hinder the discovery of more serious problems (conflations stricto sensu).

P.S. On conflations (discursive explanation)

In my view, the two fundamental tools for catching conflations are precisely the constraint violations, in particular:

  • a unique-value violation can indicate either 1) a duplication in Wikidata (items to be merged), or 2) a conflation in Wikidata (the ID is present in two or more items, but in one or more of them it does not belong and should be removed), or 3) a conflation in the ID (the ID covers two or more items and should be split)
  • a single-value violation can indicate either 1) a conflation in Wikidata (one or more of the two or more IDs present do not belong and should be removed), or 2) a duplication of the IDs (the database contains two or more IDs for the same item, which should be merged)

As the two points above show, it is essential that the community keep the lists of constraint violations as empty as possible, especially for those properties (I am thinking of VIAF and ISNI) that are typically used to reconcile new databases (if a VIAF ID is wrongly present in an item, it risks attracting other wrong IDs around itself; I have often seen this happen), and in my opinion this is not done enough.

But there is another relevant point that should be highlighted: many constraint violations, especially single-value ones, depend on errors (duplications, more rarely conflations) in the databases that Wikidata links to. Establishing, especially with the most important of them, effective mechanisms for reporting and resolving the problems identified through Wikidata (e.g. through tools on Toolforge, or in any other way that also suits the institution) would help both to improve their databases and to reduce constraint violations, making it easier to spot the violations that pertain specifically to Wikidata itself (i.e. duplications and, more rarely, conflations on Wikidata itself).