Wikidata talk:Tenth Birthday/Wikidata 10th Birthday in Utrecht, the Netherlands


Summary of the tabular style data discussion

user:Zolo user:Jheald user:Moebeus user:Stevenliuyi user:Theklan Waldyrious (talk) Sj

Notified participants of WikiProject Tabular data. Also pinging some of the people I remember participating (I apologize to those I forgot or accidentally added): @Siebrand, Spinster, OlafJanssen, Daanvr, Egon Willighagen.

The following summarizes a discussion among about eight Wikidata users at the Wikidata 10th Birthday event in Utrecht on 2022-10-29. The discussion started with this introduction:

Given that items with too many statements[1] are hard to handle, both technically (the UI can crash) and for users (almost infinite scrolling, and it is hard to get an overview), and given the possibility of storing tabular data on Wikimedia Commons, can we imagine another way to handle tabular style data on Wikidata? For this session, the discussion centers on properties whose values are expected to change over time, such as changing politicians or populations (and not on very unusual changes, like an island ceasing to exist). We call this "tabular style data".

To get the discussion started, this provocative proposal was made:
  • Only the latest value may get added to a statement.
  • Everything historical instead is added to a tabular file on Wikimedia Commons.
  • Each such statement can have a qualifier pointing to a page in the data namespace on Commons.
    • The qualifying property (or properties) should be quite generic, perhaps labeled as "tabular data related to this statement" or "tabular historical data for this statement" to avoid hundreds of new properties.
    • Naming of the file on Commons could be based on the Qid and property to make it predictable
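As a minimal sketch of the predictable naming idea above: the function below derives a Commons page title from a Qid and a property id. The `Data:Wikidata/<Qid>/<Pid>.tab` pattern and the function name are assumptions made for illustration only; the discussion did not settle on any naming scheme.

```python
def tabular_page_title(qid: str, pid: str) -> str:
    """Build a hypothetical, predictable Commons page title for one statement group.

    The "Data:Wikidata/<Qid>/<Pid>.tab" pattern is an assumption, not an agreed convention.
    """
    if not (qid.startswith("Q") and qid[1:].isdigit()):
        raise ValueError(f"not a Qid: {qid!r}")
    if not (pid.startswith("P") and pid[1:].isdigit()):
        raise ValueError(f"not a property id: {pid!r}")
    return f"Data:Wikidata/{qid}/{pid}.tab"


# Example: population (P1082) history for the item mentioned in the footnote.
title = tabular_page_title("Q87477462", "P1082")
```

With such a scheme, the qualifier value could in principle be checked against the item and property it sits on, although a generic qualifier would still be needed for files that do not follow the pattern.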
Open questions:
  • How to keep the data file in sync?
    • What should the user interface look like? Can it automatically 'move' an old value when a new one is added?
  • How to query historical data?
  • How to use historical data in client wikis?
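To make the sync question more concrete, here is a rough sketch of what automatically 'moving' an old value could mean. The data shapes (a dict for the current statement, a list of `[time, value]` rows for the table) and the function name are assumptions for illustration; nothing like this exists in the Wikidata UI today as far as the discussion established.

```python
def replace_current_value(current, history, new_value, new_time):
    """Return (new_current, new_history), archiving the previous current value.

    `current` is None or a dict like {"value": ..., "time": ...} (assumed shape).
    `history` is a list of [time, value] rows (assumed shape); it is not modified.
    """
    new_history = list(history)
    if current is not None:
        new_history.append([current["time"], current["value"]])
    # Keep the archived rows in chronological order (times are sortable strings here).
    new_history.sort(key=lambda row: row[0])
    return {"value": new_value, "time": new_time}, new_history


# Placeholder numbers, not real data.
current, history = replace_current_value(
    {"value": 100, "time": "2020"}, [["2010", 80]], 120, "2021"
)
```

The hard part raised in the discussion is not this step itself but doing it atomically across two wikis, so that the statement and the table cannot drift apart.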

Discussion

When explaining the problem, a few other problem areas were raised (even though their data may not change often):

  • Journal articles with several hundred authors
  • Chemical compounds that exist in many species
  • Genes!
  • Many "full source available at" links from manuscripts, with one link per page

We then discussed both the problem and ideas for solutions based on the proposal. However, the consensus in the discussion was that the proposal would be a workaround, and that it would be better to engineer solutions that make it possible to keep the data on Wikidata, that is, to make sure that neither Wikidata itself nor the browser crashes because of the number of statements on an item. Only if the developers say that this is technically not feasible should the proposal be seriously considered.

An idea for a partial solution to help the user could be to "collapse" statements with many values, and perhaps not even load that data before the statement is unfolded.
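The collapsing idea could look roughly like the sketch below, which decides per property whether to send the values or only a count. The threshold of 20 and all names are assumptions chosen for illustration; no concrete number was discussed.

```python
COLLAPSE_THRESHOLD = 20  # assumed cut-off, purely illustrative


def summarize_statements(statements_by_property, threshold=COLLAPSE_THRESHOLD):
    """Return a view of an item in which large statement groups are collapsed.

    Collapsed groups carry only a count; their values would be fetched
    later, when the user unfolds the group.
    """
    view = {}
    for pid, values in statements_by_property.items():
        if len(values) > threshold:
            view[pid] = {"collapsed": True, "count": len(values), "values": None}
        else:
            view[pid] = {"collapsed": False, "count": len(values), "values": list(values)}
    return view


# Placeholder item: one small group and one very large group.
item = {"P31": ["Q5"], "P1082": list(range(500))}
view = summarize_statements(item)
```

This only addresses the user experience side; whether the backend can also avoid loading the full item is a separate question for the developers.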

There was also consensus that, if the proposal were chosen instead of resolving the underlying problem, it would be a prerequisite that the data be accessible, at minimum in queries but preferably also in client wikis. (We also noted that such access would be useful regardless of whether the proposal is implemented, as some data will always fit only on Commons.)

Participants, please add anything I forgot. Others, please contribute to the continued discussion so that we can tell the developers how we would like them to solve this issue.

  1. Like COVID-19 pandemic in Costa Rica (Q87477462)

Ainali (talk) 20:26, 27 November 2022 (UTC)

Thanks for bringing this here! We are working on uploading climate data to Commons and linking it to Wikidata items. The main problem is searching and querying it. Wouldn't it be great to have weather stations colored by highest temperature? Or by the difference in median temperature over a decade? I think that this phab task is relevant: Theklan (talk) 09:13, 28 November 2022 (UTC)
My two cents here. Most of the cases where I would like to have tabular data accessible involve point in time (P585). Let's imagine two cases: population and weather.
I have a population (P1082) property and the data changes every year. I should have the latest value and, in addition, a table with the history. We could use tabular population (P4179) or, much better, something generic. The tabular data is stored in Wikidata (after all, it's data), and each column is properly linked to a property. In this example, there would be three columns. The first one is invariant and refers to the item itself. Then there is a column called population (P1082) and a third one called point in time (P585). Then, when I'm in a Wikipedia article or elsewhere, I could query the latest population, draw a historical graph or, why not, ask for the population in 1980, because I want to make some comparisons by decade.
For weather it could be exactly the same, but instead of population (P1082) I have maximum temperature record (P6591) and minimum temperature record (P7422), or something better (because it is not an absolute record, but a daily record or something like that). I can have other properties like accumulated rain, wind speed or average temperature: the usual suspects in a weather history (P4150). So the CSV or tabular data should behave in the same way: I can query the maximum temperature for a given day or range, and plot it in a graph or show it as a result in a table.
The main difference between these examples is that we are measuring different properties, but they share something in common: they have a point in time (P585) for an item. So this is the constraint we need here.
I don't know how to query it, but I guess that storing tabular data in Wikidata would be the best solution. Theklan (talk) 09:47, 28 November 2022 (UTC)
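The three-column layout described in the comment above (item, population (P1082), point in time (P585)) might be sketched as follows. This is only an illustration under assumptions: the dict layout, the function names, the item id `Q_EXAMPLE` and the numbers are all placeholders, and no such query interface currently exists.

```python
# Hypothetical table: each column is named after the property it is linked to.
TABLE = {
    "schema": {"fields": [
        {"name": "item", "type": "string"},
        {"name": "P1082", "type": "number"},   # population
        {"name": "P585", "type": "string"},    # point in time (four-digit year)
    ]},
    "data": [
        ["Q_EXAMPLE", 1000, "1980"],  # placeholder values, not real data
        ["Q_EXAMPLE", 1200, "1990"],
        ["Q_EXAMPLE", 1500, "2000"],
    ],
}


def rows(table):
    """Turn the positional rows into dicts keyed by column name."""
    names = [field["name"] for field in table["schema"]["fields"]]
    return [dict(zip(names, row)) for row in table["data"]]


def latest(table, value_field="P1082", time_field="P585"):
    """Value with the most recent point in time (four-digit years sort as strings)."""
    newest = max(rows(table), key=lambda r: r[time_field])
    return newest[value_field]


def value_at(table, when, value_field="P1082", time_field="P585"):
    """Value recorded for exactly `when`, or None if there is no such row."""
    for r in rows(table):
        if r[time_field] == when:
            return r[value_field]
    return None
```

The weather case would use the same functions with a different `value_field`, for example `value_at(table, "2022-07-19", value_field="P6591")` on a table with daily rows, since only the point in time (P585) column is shared.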
I agree with what has been said and I am a big supporter of large tabular data (as defined here) being stored outside the usual Wikidata triples. My instinct is that it would be better if we could store metadata for each dataset directly on Commons. Adding queryable structured data to tabular data files on Commons would be one solution. If that's not technically feasible, we'll have to manually link the tabular data files to Wikidata items for the time being.
The whole situation is a bit like Catch-22. There is no community-defined standard for the format of Commons data files because the community is not interested in a data file that is not queryable from Wikidata. On the other hand, we probably cannot expect the developers to add new functionalities to Commons tabular data before clear standards are formulated and followed.
A good beginning would be to make a list of available tabular data files on Commons, or at least start a category system (Phab:T242596), to get an overview of the various types of data currently stored like this. Vojtěch Dostál (talk) 13:27, 28 November 2022 (UTC)
@Vojtěch Dostál Just to be clear, do you disagree with the consensus of the discussion that, if possible, we should not use Commons and instead improve Wikidata? That is, would you prefer us to use Commons even if it was possible to make Wikidata handle large amounts of tabular style data better? Ainali (talk) 17:17, 28 November 2022 (UTC)
@Ainali Sorry, I didn't see that consensus sentence in bold before. I have no preference as to where the tabular data should be hosted or what the technical solution should be. However, as ArthurPSmith puts it on Project Chat, "something needs to change", and it had better be a solution that is easy to maintain and query. Vojtěch Dostál (talk) 17:49, 28 November 2022 (UTC)