Wikidata talk:Events/Data Quality Days 2022/Modeling data


The need for an improved autofix

Exactly one year after the Data Quality Days for which I prepared this presentation, I would like to expand on one specific point, namely the section #Main enforcement methods. In that section I described four solutions. The two extremes, always valid, are manual intervention (1) and bot fixes (4); the two crucial ones, however, are those that can be set up precisely and repeated periodically as a routine, namely property constraints (2) and {{Autofix}} (3). Both have a significant limitation: property constraints are not used to fix items directly, but only to educate users and to give them lists of cases that (probably) need intervention, whilst autofix is limited in the types of substitution it can perform and has other issues, which I describe more thoroughly in the table below. This thread (which will be referenced in a dedicated Phabricator ticket) therefore advocates a system that integrates the strengths of property constraints and {{Autofix}} into a new, better and stable system, designed to enforce, through periodic substitutions, specific constraints that maintain a minimal coherence in our data model.

The following table describes the present issues and limitations of {{Autofix}} and the corresponding solutions that should be incorporated into the "improved autofix" I propose:

issue: place of storage
Presently in {{Autofix}}: {{Autofix}} is a template stored in property talk pages; consequently:
  • it duplicates (to some extent) the constraints stored in property pages through property constraint (P2302)
  • it cannot be queried
  • it is not very visible
  • it may be hard to understand and use for users who are familiar with editing Wikidata items and properties, but not with MediaWiki templates
  • paradoxically, an IP/non-autoconfirmed user cannot edit property pages (per 2020 RfP) but can edit property talk pages, and thus could add (maliciously or not) problematic autofixes, potentially leading to significant damage
In the "improved autofix": the "improved autofix" should store its data in content pages, so that:
  • the data do not duplicate the constraints
  • the data can be queried
  • the data are more visible
  • it is easier for users to understand the data and to add new ones

  The easiest solution would be to store the "improved autofix" in property pages. However, an autofix in most cases concerns not only a property but also an item (e.g. the autofix from occupation (P106): member of parliament (Q486839) to position held (P39): member of parliament (Q486839) evidently concerns not only P106, but also Q486839), so the software should also show the autofix in Q486839; I don't know how (the problem is vaguely similar to phab:T209559).

issue: periodicity and tracking of edits
Presently in {{Autofix}}: {{Autofix}} is used by KrBot to enact edits, usually on a daily basis (but the exact periodicity is not stated), and these edits are not tracked through single editgroups; consequently, undoing specific groups of edits, performed because of a botched autofix, is extremely difficult.
In the "improved autofix": the "improved autofix" should be performed by a bot with a declared and stable periodicity, and the edits performed for each autofix should be tracked through single editgroups, so that they can be easily undone if needed.

issue: prevention of bot wars
Presently in {{Autofix}}: {{Autofix}} triggers KrBot, as said, more or less on a daily basis; if a statement is added by someone, then autofixed, then re-added, then autofixed again, and so on, the cycle can go on indefinitely (see this example noted here; similar others exist). The consequences are flooded page histories, wasted resources, and wasted time for the users who notice the problem and the admins who try to patch it. Since bots that periodically sync an external database with Wikidata are increasingly common, this problem is going to become worse in the near future.
In the "improved autofix": the "improved autofix" should incorporate a mechanism like the following: if exactly the same replacement is done more than X (say 5) times in the same item, the bot stops doing this replacement in that item, sends a standard alert to the user(s) who keep adding the to-be-replaced statement, and adds a report to a dedicated page that can be monitored by admins.

limitation: extent of the substitution
Presently in {{Autofix}}: {{Autofix}} performs replacements wherever a certain property/value combination occurs, viz. in main values, in qualifiers and in references; consequently, it is impossible to perform a replacement needed only in one or two of these fields (e.g. replace official website (P856) with reference URL (P854) only in references).
In the "improved autofix": the "improved autofix" should allow choosing whether a replacement is to be performed in main values and/or in qualifiers and/or in references (note: in 2022 I proposed adding this function to {{Autofix}}).

limitation: recursive subclasses
Presently in {{Autofix}}: for properties with item datatype, {{Autofix}} replaces only single property-item value combinations (PX:QA can be autofixed to one of the following: PX:QB / PY:QA / PY:QB); consequently, it is impossible to apply an autofix to a given item and all its recursive subclasses via subclass of (P279), except by creating one autofix for each item (which is sometimes impossible and always very time-consuming; and, anyway, new subclasses can emerge afterwards).
In the "improved autofix": the "improved autofix" should allow replacing not only a single item, but also a single item together with all its recursive subclasses via subclass of (P279).

limitation: combinations main value-qualifier(s)
Presently in {{Autofix}}: {{Autofix}} only replaces one [property-value] combination with another [property-value] combination.
In the "improved autofix": the "improved autofix" should allow replacing a combination [property-value + qualifier-qualifier value] with another combination [property-value + qualifier-qualifier value] or with just [property-value].
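To make the "extent of the substitution" and "recursive subclasses" rows more concrete, here is a minimal Python sketch of how such a rule might be represented and applied. It is only an illustration under stated assumptions, not how {{Autofix}} or KrBot actually work: the `Rule` class, the `(scope, property, value)` tuples, and the items `Q_SUB` and `Q_SUBSUB` are all hypothetical, and the subclass data is a toy dictionary rather than a real query. The combination main value-qualifier(s) is not covered.

```python
from dataclasses import dataclass
from typing import Optional

ALL_SCOPES = frozenset({"main", "qualifier", "reference"})


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule format; not the real {{Autofix}} syntax."""
    prop: str                        # property to match, e.g. "P106"
    new_prop: str                    # property to write instead
    value: Optional[str] = None      # value to match; None matches any value
    new_value: Optional[str] = None  # None keeps the matched value
    scopes: frozenset = ALL_SCOPES   # where the replacement may happen
    include_subclasses: bool = False


def with_subclasses(root, subclass_of):
    """Return root plus every item that reaches it through P279 edges.

    subclass_of maps an item to the set of its direct superclasses.
    """
    found = {root}
    changed = True
    while changed:  # repeat until no new subclass is discovered
        changed = False
        for child, parents in subclass_of.items():
            if child not in found and parents & found:
                found.add(child)
                changed = True
    return found


def apply_rule(rule, snaks, subclass_of):
    """Rewrite a list of (scope, property, value) tuples according to rule."""
    if rule.value is None:
        targets = None
    elif rule.include_subclasses:
        targets = with_subclasses(rule.value, subclass_of)
    else:
        targets = {rule.value}
    result = []
    for scope, prop, value in snaks:
        hit = (scope in rule.scopes and prop == rule.prop
               and (targets is None or value in targets))
        if hit:
            result.append((scope, rule.new_prop, rule.new_value or value))
        else:
            result.append((scope, prop, value))
    return result


# Toy P279 data: Q_SUB and Q_SUBSUB are made-up subclasses of Q486839.
SUBCLASS_OF = {"Q_SUB": {"Q486839"}, "Q_SUBSUB": {"Q_SUB"}}

snaks = [
    ("main", "P106", "Q_SUBSUB"),
    ("main", "P856", "https://example.org"),
    ("reference", "P856", "https://example.org/source"),
]
mp_rule = Rule(prop="P106", new_prop="P39", value="Q486839",
               include_subclasses=True)
url_rule = Rule(prop="P856", new_prop="P854", scopes=frozenset({"reference"}))

fixed = apply_rule(url_rule, apply_rule(mp_rule, snaks, SUBCLASS_OF), SUBCLASS_OF)
print(fixed)
```

In this sketch the P106 statement is rewritten to P39 even though its value is only an indirect subclass of Q486839, and P856 is rewritten to P854 in the reference while the main-value P856 statement is left alone.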

Final summary: an "improved autofix", as envisioned, would both solve a series of issues of the present autofix (issues that, in some cases, could cause real damage to the data, as shown above) and give users more options for keeping our data model coherent (without needing the skills and time to program a dedicated bot for the most complex cases, which autofix does not presently cover). --Epìdosis 00:02, 8 July 2023 (UTC)
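As an illustration of the bot-war safeguard proposed above, the following Python sketch counts identical replacements per item and refuses them once a limit is reached. It is a rough sketch under assumptions, not an existing bot feature: the `ReplacementGuard` class, the item `Q_EXAMPLE`, and the user `ExampleSyncBot` are hypothetical, and "more than X times" is interpreted here as "the bot performs the replacement at most X times, then stops and reports once".

```python
from collections import Counter


class ReplacementGuard:
    """Blocks a replacement in an item after it has been done `limit` times."""

    def __init__(self, limit=5):
        self.limit = limit
        self.counts = Counter()  # (item, old, new) -> replacements performed
        self.blocked = set()     # replacements already reported
        self.reports = []        # stand-in for the admin-monitored report page

    def allow(self, item, old, new, readded_by):
        """Return True if the bot may perform this replacement once more."""
        key = (item, old, new)
        if self.counts[key] < self.limit:
            self.counts[key] += 1
            return True
        if key not in self.blocked:
            # First refusal: record one report naming the user to alert.
            self.blocked.add(key)
            self.reports.append(
                {"item": item, "old": old, "new": new, "notify": readded_by}
            )
        return False


guard = ReplacementGuard(limit=5)
old, new = ("P106", "Q486839"), ("P39", "Q486839")

# Simulate a sync bot re-adding the same statement seven times.
decisions = [
    guard.allow("Q_EXAMPLE", old, new, readded_by="ExampleSyncBot")
    for _ in range(7)
]
print(decisions)
print(guard.reports)
```

Keeping the counter keyed on the exact (item, old, new) triple means the same autofix keeps working in every other item, which matches the idea of stopping the replacement only where the cycle occurs.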

how to progress

I'm interested in contributing here, both in design and, if possible, implementation. What is a good way to determine the design of an enhanced rewriting mechanism? The above suggestions, to me, are a good starting point but only a starting point. Peter F. Patel-Schneider (talk) 13:36, 1 December 2023 (UTC)