Shortcut: WD:WPR
Wikidata:WikiProject Redundancy
The primary aim of WikiProject Redundancy is to reduce the amount of Wikidata's data—without reducing the amount of information in Wikidata!—for the well-being of Wikidata, its community, and its downstream users.
Motivation
Wikidata's growth in recent years has sparked concerns about a possible collapse of its Query Service (WDQS) and about the increasing difficulty of editing many of its larger items. Much of this stems from a considerable amount of data being stored unnecessarily, both when the information it represents is not actively used elsewhere on Wikidata and when that information can be readily and reliably computed in other ways.
This WikiProject seeks to keep the amount of information on Wikidata constant while reducing the overall size of its data, both in terms of the lengths of item pages on wikidata.org and the number of RDF triples in WDQS. It distinguishes between several types of action that may be taken, including 1) what can in principle be done right now without affecting existing workflows, 2) what is also possible now but may require acceptable changes to existing queries, and 3) what is not currently feasible because it necessitates software changes and possibly entirely new storage systems. Some of the proposed actions may be controversial, but we hope to foster discussion about them that takes into account Wikidata's site health, community health, and usability.
It is hoped that, depending on the types of action described, participants will be inspired to either take these actions directly or encourage those who develop Wikibase, its Lua interface, and WDQS to make appropriate changes and improvements so that those actions can later be taken.
Data size aspects
There are two ways to measure Wikidata's size:
- the number of RDF triples (of relevance to WDQS); and
- the size of the Wikidata dump (whether in JSON or TTL; of relevance to external users).
The main difference between these is that adding a reference, quantity, time, or coordinate that exactly duplicates another elsewhere in Wikidata adds relatively more to the dump size than to the RDF triple count.
For statistics regarding the number of RDF triples in WDQS, cf. User:AKhatun/WDQS Triples Analysis (2021) and User:Mahir256/Triples (2022).
Actions that can be taken
Editable data on Wikidata
Ongoing and uncontroversial
- Merge duplicate items - see Help:Merge and Wikidata:WikiProject Duplicates
- Removing references to other Wikimedia projects (i.e. those containing imported from Wikimedia project (P143) and/or Wikimedia import URL (P4656)) when third-party references are also present for the same statement
- Example: George Lopez's date of birth
- (Most references to Wikipedia should be substituted wherever they occur anyway!)
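The pruning rule above can be sketched in Python against the Wikibase JSON model, where each reference object maps property IDs to snak lists under `"snaks"`. The function names and the simplified snak payloads are illustrative, not an existing API; a real bot would work through pywikibot or the wbeditentity API.

```python
# Sketch: drop Wikimedia-project references (P143/P4656) from a statement's
# reference list, but only when a third-party reference is also present.
# References are given in (simplified) Wikibase JSON: {"snaks": {Pxxx: [...]}}.

WIKIMEDIA_REF_PROPS = {"P143", "P4656"}

def is_wikimedia_only(reference):
    """True if every property used in the reference is P143 or P4656."""
    return set(reference["snaks"]).issubset(WIKIMEDIA_REF_PROPS)

def prune_wikimedia_references(references):
    """Remove Wikimedia-only references, keeping them untouched when no
    third-party reference would remain to support the statement."""
    third_party = [r for r in references if not is_wikimedia_only(r)]
    return third_party if third_party else references
```

A reference that combines P143 with, say, stated in (P248) is not Wikimedia-only and is kept as-is; only references consisting entirely of P143/P4656 snaks are candidates for removal.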
- Removing described at URL (P973) statements on (item) when the external ID corresponding to their value is also present
- Example: this P973 statement, which replicated the value of P9951
- QLever general query: https://qlever.cs.uni-freiburg.de/wikidata/b2y2UW
- Removing date values without references (or referenced only through imported from Wikimedia project (P143)) if a more precise value, referenced without using imported from Wikimedia project (P143), is also present - apply only if the more precise value is compatible with the less precise value to be removed
- Example: Alix Faucigny-Lucinge's date of birth
- cf. removals performed by User:MatSuBot on the basis of Wikidata:Requests for permissions/Bot/MatSuBot 7 (2017; the WDQS query once used to find removable values times out as of 2024)
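The compatibility condition above (remove the coarse date only if it agrees with the precise one) can be sketched as follows. The parsing assumes the usual `+YYYY-MM-DDT00:00:00Z` form of Wikibase time strings and precision codes 9 (year), 10 (month), 11 (day); the function names are illustrative.

```python
# Sketch: decide whether a less precise Wikibase date is redundant next to
# a more precise one, i.e. safe to remove under the rule above.
# Precision codes: 9 = year, 10 = month, 11 = day.

def date_fields(time_str):
    """Parse '+1881-03-05T00:00:00Z' into (year, month, day) integers."""
    sign = -1 if time_str[0] == "-" else 1
    y, m, d = time_str[1:11].split("-")
    return sign * int(y), int(m), int(d)

def is_redundant(coarse, precise):
    """True if `coarse` (lower precision) matches `precise` truncated to
    the coarse precision."""
    if coarse["precision"] >= precise["precision"]:
        return False
    cy, cm, _ = date_fields(coarse["time"])
    py, pm, _ = date_fields(precise["time"])
    if coarse["precision"] == 9:      # only the year has to agree
        return cy == py
    if coarse["precision"] == 10:     # year and month have to agree
        return (cy, cm) == (py, pm)
    return False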
- Removing more generic occupation (P106) values if they don't have references or are referenced only through imported from Wikimedia project (P143) (e.g. remove philologist (Q13418253) if classical philologist (Q16267607) is present)
- Example: Dorothy Cadzow's occupation
- If the more generic occupation (P106) value has references, how to handle it (keep it at normal rank and set the more precise value(s) to preferred rank; keep it at deprecated rank; or simply remove it) was discussed in Wikidata:Project chat/Archive/2023/10#Concept of bot edits (2023) with no clear conclusion
- Removing duplicate external IDs if the same value is present twice, the good one with qualifiers (typically subject named as (P1810)) and the bad one without qualifiers
- Example: Rolf Mühlethaler's NKC without qualifiers
- WDQS query for NL CR AUT ID (P691): https://w.wiki/AuLf
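In code, the duplicate-ID rule above amounts to grouping an item's statements for one external-ID property by value and flagging the unqualified copies whose value also occurs on a qualified statement. The statement dictionaries below are a simplified slice of the Wikibase JSON model, for illustration only.

```python
# Sketch: among statements of one external-ID property, flag as removable
# any unqualified statement whose value also appears on a statement that
# does carry qualifiers (typically subject named as (P1810)).

def removable_duplicates(statements):
    """Return indices of unqualified statements whose string value is also
    present on a qualified statement of the same property."""
    qualified_values = {
        s["mainsnak"]["datavalue"]["value"]
        for s in statements if s.get("qualifiers")
    }
    return [
        i for i, s in enumerate(statements)
        if not s.get("qualifiers")
        and s["mainsnak"]["datavalue"]["value"] in qualified_values
    ]
```

Only exact value matches are flagged; two different ID values for the same property (a genuine conflict) are left alone for human review.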
Will likely require consensus
- Removing described by source (P1343) statements on (item) when their value has main subject (P921) (item)
- Example: the 4th Earl of Sandwich and articles about him in DNB00 and EB1911
- Alternative: use such values as references on the item to substantiate that the 'source' has a 'description' of the item
- On references, removing stated in (P248) (item) where an identifier with applicable 'stated in' value (P9073) (item) is also present
- Example: Bart Simpson's given names
- This does not mean removing stated in (P248) claims where no such identifier property exists on the reference!
- Removing certain inverse properties wherever they are used (e.g. 'student' or 'doctoral student')
- Example: Willi Jäger's doctoral students
- This is meant to apply to properties for which a single item frequently has many values!
- Removing most occurrences of certain properties that may be inferred from a hierarchy (e.g. located in time zone (P421), has works in the collection (P6379))
- Example: Peter Paul Rubens (Q5599)
- How long does it take for that item to load for you?
- Removing sport (P641) statements where they can be inferred from an occupation (P106) statement.
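The P641-from-P106 inference can be sketched with an explicit occupation-to-sport mapping. The two entries shown are a small illustrative excerpt chosen for this example; in practice the mapping could be derived from the occupation items' own sport (P641) statements rather than hard-coded.

```python
# Sketch: flag sport (P641) values that are already implied by an
# occupation (P106) value. The mapping below is an illustrative excerpt,
# not an exhaustive table.

OCCUPATION_TO_SPORT = {
    "Q10833314": "Q847",   # tennis player -> tennis
    "Q937857": "Q2736",    # association football player -> association football
}

def redundant_sports(occupations, sports):
    """Return the P641 values inferable from the item's P106 values."""
    implied = {OCCUPATION_TO_SPORT[o] for o in occupations if o in OCCUPATION_TO_SPORT}
    return [s for s in sports if s in implied]
```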
Deprecated statements
- if a deprecated statement has always been wrong and is kept only because of the risk of it being reimported if removed, we should try to have it corrected in its source and then remove it safely -> cf. Wikidata:Data round-tripping (about obtaining fixes in external websites when we find mistakes in them)
- if an external ID has been deprecated because it has been redirected or deleted, are we sure we should keep it (and how should we keep it)? - cf. ongoing Wikidata:Requests for comment/Handling of stored IDs after they've been deleted or redirected in the external database (2020-; stalled with no consensus)
Possibility of moving data to external databases
- Move scientific articles to a separate database (as suggested by User:Wittylama). This would have many drawbacks and cost a lot of engineering effort because linking between Wikibases is not yet a trivial process. The graph split, to be finished during 2024 by WMF, is a middle ground where the items stay in Wikidata but are split in WDQS and accessed via federation by queries that need that data.
- Move astronomical data to a separate database (a question for User:Ghuron, possibly). This would have many drawbacks and cost a lot of engineering effort because linking between Wikibases is not yet a trivial process. Another graph split is a possible middle ground where the items stay in Wikidata but are split in WDQS and accessed via federation by queries that need that data.
Technical fixes needed
For Wikibase
- implement multilingual (mul) labels and aliases in order to reduce redundant labels and descriptions -> T285156: Add termbox language code mul to reduce redundancy in Wikidata labels and aliases
- make it impossible to set the same string as label and alias for the same language -> T157774: Make it impossible to set the same content in the same language for label and alias
- improve the technical management of inverse statements, e.g. by storing the statement only in one item and showing its inverse by default in the other (e.g. not 'wdt:P150', but '!wdt:P131', which would allow deleting P150; see also MediaWiki:Gadget-relateditems.js) -> T209559: Inverse statements duplicate work, data, and may be out of sync
- avoid dates with precision of month (10) or lower being stored with two different encodings, which often causes duplicate statements -> T310981: Duplication of dates due to different encode[1]
- cf. deduplications performed a posteriori (only for date of birth (P569) and date of death (P570)) by User:MatSuBot on the basis of Topic:Wxspna7q8jnn17u8 (2021-)
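The dual-encoding problem above arises because a year-precision date may be stored either as '+1800-00-00T00:00:00Z' or as '+1800-01-01T00:00:00Z'. A normalizer that zeroes the fields below the stated precision makes the two encodings comparable; this sketch assumes four-digit years for brevity (Wikidata also allows longer ones).

```python
# Sketch: canonicalize a Wikibase time value so that the two encodings of
# the same imprecise date compare equal. Fields below the stated precision
# are zeroed. Assumes four-digit years for brevity.

def normalize_time(value):
    """Return a canonical (time, precision) pair for a Wikibase time value."""
    time, precision = value["time"], value["precision"]
    sign, body = time[0], time[1:]
    y, m, d = body[0:4], body[5:7], body[8:10]
    if precision <= 9:        # year or coarser: zero month and day
        m, d = "00", "00"
    elif precision == 10:     # month: zero day
        d = "00"
    return f"{sign}{y}-{m}-{d}T00:00:00Z", precision
```

Deduplicating a posteriori then reduces to comparing normalized pairs instead of raw strings.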
- make it impossible to add exactly duplicate claims (but it could happen that two claims that aren't intended to be identical may be identical for a few seconds while they are being edited) -> T222274: Fix and prevent duplicate claims on Wikidata entities
- References
- store reference data separately in the JSON rather than each time a reference is used (e.g. the JSON representation of Q62513493) -> T360224: Improve Wikidata handling of duplicate references in model and UI[2]
- make it impossible to store exactly duplicate references -> T224333: It's possible to save a statement with duplicate references[2]
- avoid storing references that differ only in their retrieved (P813) value -> T270375: Saving identical references with different retrieval dates should be more difficult
- in general, avoid storing references that share most of their content, e.g. both having the same external ID, or one having the external ID X and the other having the corresponding reference URL (P854) -> no Phab yet
- cf. substitutions performed a posteriori by User:DifoolBot on the basis of Wikidata:Requests for permissions/Bot/DifoolBot 5 (code on Github) (2024-)
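A posteriori, the exact-duplicate case in the reference bullets above reduces to comparing references by their snak content while ignoring the server-assigned "hash" field. This is a minimal sketch over simplified Wikibase JSON, not DifoolBot's actual code:

```python
# Sketch: collapse exactly duplicate references on a statement by keying
# on their snak content (ignoring the server-assigned "hash"), keeping the
# first occurrence of each distinct reference.
import json

def dedupe_references(references):
    seen, kept = set(), []
    for ref in references:
        key = json.dumps(ref["snaks"], sort_keys=True)
        if key not in seen:
            seen.add(key)
            kept.append(ref)
    return kept
```

Near-duplicates (same external ID versus its corresponding reference URL (P854)) need the property-specific substitution logic described above and are not caught by this exact comparison.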
- avoid merges creating redundant statements, i.e. a statement X with no qualifiers or references and a statement X with 1+ qualifiers should be merged, but are not merged by Special:MergeItems as of now -> T334197: Improve Special:MergeItems for qualified and not-qualified statements
For Wikidata
- automated descriptions on the basis of instance of (P31), e.g. for Wikimedia disambiguation page (Q4167410) and Wikimedia category (Q4167836), or other properties -> T303677: Automatically generate descriptions for items based on their P31 (instance of) values
For Commons
- move tabular data to the Data namespace of Wikimedia Commons -> T181319: Support external tabular datasets in WDQS
Participants
The participants listed below can be notified using the following template in discussions: {{Ping project|Redundancy}}
References
- ↑ Mentioned in meta:Community Wishlist/Wishes/Fix main bugs in Wikidata and WDQS handling of dates.
- ↑ 2.0 2.1 Mentioned in meta:Community Wishlist/Wishes/Improve Wikidata handling of duplicate references in model and UI.
Related links
- Other WikiProjects
- WikiProject Duplicates (duplicate items are an obvious redundancy that should be reduced as much as possible, mostly - but not only - through manual merges)
- Wikidata:WikiProject Limits of Wikidata
- Presentations
- Mahir256, Making Wikidata Smaller Without Reducing Information (online at WikidataCon 2023) - recording
- Essays