User:ArthurPSmith/DraftRfC References

Two proposals for developers emerged from the session on alternative reference models at Wikidata:Events/Data Modelling Days 2023, both aimed at improving the treatment of duplicated references on Wikidata items:

  1. Condense the internal JSON storage so that duplicate references are stored in full only once per item.
  2. Modify the Wikidata UI representation of items with duplicated references to allow editing all copies of a reference with one edit, rather than with a separate edit for each statement the reference is on.

This RFC allows the wider community to comment on these proposed changes before detailed requests are created for the developers (anticipated in January 2024).

1. Condense internal JSON storage for duplicate references

Duplicate references can significantly increase the size of items under the current storage format. As an example, see Q21481859, which has almost 3000 authors whose statements (should) all have the same reference; the duplicated reference data accounts for over 1 MB of the item's 4.4 MB size. Each statement with a reference has a "references" attribute in the JSON that looks something like:

"references": [ { "hash": "51ae109329c13aebb6e83e53e1583cf93312f9e6", "snaks": {
"P248": [{ ... }], "P813": [{ ... }], ... }, ... } ]

It is proposed that these per-statement entries be replaced with an indirect reference using the hash:

"references": ["51ae109329c13aebb6e83e53e1583cf93312f9e6"]

(Perhaps "references" should be replaced here with "reference_hashes", or the entries kept as JSON objects with a "ref_hash" attribute.) The item would then have its own "references" attribute containing the full reference entries that were previously duplicated, once per hash value:

"references": [ { "hash": "51ae109329c13aebb6e83e53e1583cf93312f9e6", "snaks": {
"P248": [{ ... }], "P813": [{ ... }], ... }, ... } ]

This could be implemented as a layer that translates between the current format and this storage format in both directions (so that no change would be needed at higher levels such as the UI, APIs, etc.), or it could be a more integrated change that could also improve performance and size for the other layers.
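To make the translation idea concrete, here is a minimal sketch in Python of converting between the two layouts. It is only an illustration, not a proposed implementation: the function names condense and expand are hypothetical, it assumes statements are grouped by property under a "claims" attribute as in the current item JSON, and it keeps the name "references" for both the per-statement hash list and the item-level list.

```python
import copy


def condense(item):
    """Return a copy of `item` in which each statement lists only reference
    hashes, with the full reference entries stored once at item level.
    (Hypothetical helper; attribute names follow the proposal text.)"""
    out = copy.deepcopy(item)
    pool = {}  # hash -> full reference entry, in first-seen order
    for statements in out.get("claims", {}).values():
        for statement in statements:
            refs = statement.get("references")
            if refs is None:
                continue  # statement has no references at all
            for ref in refs:
                pool.setdefault(ref["hash"], ref)
            statement["references"] = [ref["hash"] for ref in refs]
    out["references"] = list(pool.values())
    return out


def expand(condensed):
    """Inverse of condense(): restore the full reference entries on each
    statement and drop the item-level list."""
    out = copy.deepcopy(condensed)
    pool = {ref["hash"]: ref for ref in out.pop("references", [])}
    for statements in out.get("claims", {}).values():
        for statement in statements:
            if "references" in statement:
                statement["references"] = [
                    copy.deepcopy(pool[h]) for h in statement["references"]
                ]
    return out
```

Because expand(condense(item)) gives back the original structure, a layer like this could sit directly above storage without the UI or APIs noticing; the more integrated option would instead expose the condensed form to those layers.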

Other techniques for shrinking the storage format should also be considered: removing unneeded whitespace characters, using a compression solution like gzip, or changing the storage format to be closer to the REST API format, which is already more compact without loss of information. However, these may have other implications and would need to be considered separately.
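As a rough way to compare the first two of these techniques on a given item, the following Python sketch measures the same JSON serialized with indentation, without any optional whitespace, and gzip-compressed. The function name storage_sizes is hypothetical, and the indented form is only a stand-in for a serialization that contains unneeded whitespace; it is not a claim about how items are actually stored.

```python
import gzip
import json


def storage_sizes(item):
    """Return byte sizes of `item` under three serializations.
    (Hypothetical helper for comparison purposes only.)"""
    # Indented output stands in for a format with unneeded whitespace.
    pretty = json.dumps(item, indent=4).encode("utf-8")
    # separators=(",", ":") removes the optional spaces json.dumps adds.
    compact = json.dumps(item, separators=(",", ":")).encode("utf-8")
    return {
        "pretty": len(pretty),
        "compact": len(compact),
        "gzip": len(gzip.compress(compact)),
    }
```

Running this on the JSON of an item with many duplicated references, before and after condensing them, would show how much of the saving a general-purpose compressor already captures.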

2. Modify the Wikidata UI for editing duplicated references

If a duplicated reference needs to be modified in some way, right now that requires a separate user interaction with each instance of the reference. Whether or not the above storage change is implemented, it would be helpful for the user interface to allow editing or deleting all duplicate copies at once. There may be other UI changes that would also be useful for handling duplicate references. Some suggestions follow:

  • A. When editing a reference, highlight other statements that have the same reference. Add a checkbox in the reference editing area with the label "apply changes to all copies of this reference □" and, if the box is checked, update all matching reference entries with the "publish" action.
  • B. When adding a reference to a statement, allow adding a reference that already exists on the item (maybe the DuplicateReferences gadget is sufficient?)
  • C. Add a new section "References" under "Statements" and "Identifiers", with each reference listed only once, and indicate references on the statements with something like the standard wikitext format.[1] Editing references on each statement would bring in the full reference text for editing, but editing references in the "References" section would make the change applicable to all copies of the reference at once.
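The "apply to all copies" behaviour in suggestion A can be sketched against the current item JSON as follows. This Python sketch is illustrative only: both function names are hypothetical, statements are assumed to be grouped by property under "claims", and the reference hash is treated as an opaque value supplied by the caller rather than being recomputed from the edited content.

```python
import copy


def statements_sharing_reference(item, ref_hash):
    """Yield (property_id, statement) for each statement carrying a reference
    with the given hash; these are the statements a UI could highlight."""
    for prop, statements in item.get("claims", {}).items():
        for statement in statements:
            refs = statement.get("references", [])
            if any(ref.get("hash") == ref_hash for ref in refs):
                yield prop, statement


def replace_reference_everywhere(item, old_hash, new_reference):
    """Replace every copy of the reference `old_hash` on `item` with
    `new_reference`, in place. Returns the number of statements changed."""
    changed = 0
    for _, statement in statements_sharing_reference(item, old_hash):
        statement["references"] = [
            copy.deepcopy(new_reference) if ref.get("hash") == old_hash else ref
            for ref in statement["references"]
        ]
        changed += 1
    return changed
```

With the checkbox unchecked, only the statement being edited would be updated; with it checked, a single publish action would do the equivalent of replace_reference_everywhere for the whole item.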

References

  1. Like this

Comments

Please indicate support for the above proposals, or add any comments on how you think they should be adjusted or other considerations that may need to be reviewed.

  •  Support in general. With regard to the use of the Duplicate References gadget, I often use User:Bargioni/UseAsRef.js multiple times when inserting a reference for many statements, so we should consider a solution for that process as well (which might be to recommend using UseAsRef once and then using Duplicate References for the rest). - PKM (talk) 20:48, 5 December 2023 (UTC)