User:John Cummings/modelRFC

Published at https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Wikidata_to_use_data_schemas_to_standardise_data_structure_on_a_subject

Background information

What is a data schema?

A data schema is a set of rules that governs the structure of a database. Data schemas would give Wikidata a standardised structure for data on a subject area, e.g. all items about museums would use the same structure to describe basic facts such as location, collection type and date opened. Schemas for many of the kinds of data Wikidata stores are already in use by other databases, e.g. those published on schema.org.

Benefits of having community agreed data schemas on Wikidata

There are often several ways to describe the same kind of data on Wikidata. Without community agreement on how to model a subject, items within that subject are described inconsistently, which causes problems for data quality, usability, and community growth and health. Data schemas would act as maps of different subjects for people to follow when exploring and adding data to Wikidata.

Data quality

Increases data quality and data completeness by allowing people to find and use the most appropriate schema for the subject.
Provides data that could be used to improve the ‘property suggester’ tool (statements which are part of the schema could be used as the highest priority suggestions).

Usability

Makes data easier to find and use, and allows simpler and more consistent queries.
Could be used by query tools like ProWD to understand data completeness.

Community growth and health

Help people learn Wikidata more easily by having clear instructions to follow.
Reduce arguments and increase community health by having clear rules on data structure that everyone can follow.

Existing work on data schemas on Wikidata

Cradle allows people to create schemas, however each schema is created by one user and is recorded in a separate database, not as part of Wikidata. These schemas could be used as starting points for community discussions to agree schemas.
Model items and showcase items both present best practice for describing a subject, but they do not explicitly describe what to include and do not include a defined schema, e.g. Douglas Adams is a model item for a person, but few of the statements for Douglas Adams are applicable to people in different professions.
Constraints ???
Shape expressions, ??I have no idea???

Data schemas on Wikidata

A central place to discuss and create data schemas collaboratively

A central discussion area, similar to Wikidata:Property proposal, would be used to collaboratively develop schemas, drawing on existing schemas for the subject, research, and how they relate to other Wikidata schemas. This 'hub' could use FormWizard and Visual Editor to lower the barriers to participation.

Wikidata schemas could be started from:

Existing external schemas, such as those used in different sectors and on websites like schema.org.
Existing Wikidata models like featured items, model items and Cradle schemas.

Recording agreed data schemas in Wikidata items

Data schemas could be recorded in their own Wikidata items, e.g. an item called 'Wikidata schema for human' where 'Instance of = Wikidata schema'. Using Wikidata items to record data schemas would:

Be understandable by people and machines, meaning they could be used in tools.
Make them easy to understand and copy, as they would show the structure of the schema in the same format as the items that use it.
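
To make the idea of a machine-readable schema item more concrete, here is a minimal sketch in Python of how a tool might read one. It is only an illustration under stated assumptions: the item 'Wikidata schema' and the property 'Schema for' do not exist, so the identifiers Q_SCHEMA and P_SCHEMA_FOR are invented placeholders, and the function name schema_properties is hypothetical. What to store as example values is still an open question, so a neutral placeholder is used.

```python
# Invented placeholders: these identifiers do not exist on Wikidata.
WIKIDATA_SCHEMA = "Q_SCHEMA"      # stands in for the item 'Wikidata schema'
SCHEMA_FOR = "P_SCHEMA_FOR"       # stands in for the property 'Schema for'
EXAMPLE_VALUE = "EXAMPLE_VALUE"   # what belongs here is an open question

# A schema recorded in the same shape as an ordinary item: property -> values.
schema_for_human = {
    "label": "Wikidata schema for human",
    "claims": {
        "P31": [WIKIDATA_SCHEMA],   # instance of = Wikidata schema
        SCHEMA_FOR: ["Q5"],         # schema for = human
        "P569": [EXAMPLE_VALUE],    # date of birth
        "P19": [EXAMPLE_VALUE],     # place of birth
        "P106": [EXAMPLE_VALUE],    # occupation
    },
}

# Statements that describe the schema item itself rather than its subject.
META_PROPERTIES = {"P31", SCHEMA_FOR}


def schema_properties(schema_item):
    """Return the properties the schema asks for, without the meta statements."""
    return sorted(p for p in schema_item["claims"] if p not in META_PROPERTIES)


print(schema_properties(schema_for_human))
```

The sketch also exposes one wrinkle: 'Instance of' is used to describe the schema item itself, so it cannot at the same time be listed as a statement the schema asks for without some extra convention.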

Questions:

How to record what the schema is for? A property called 'Schema for' could take qualifiers, which would allow schemas to be more granular, e.g. 'author' vs '18th century author'.
How to describe how schemas relate to each other, e.g. how the schema for person and the schema for painter link together?
What to include as values in the example? Should they all just link to an example value? How to handle values that are numbers rather than items?
How could you define the most wanted or most valued statements in a schema? Having a statement ranking would:
- Provide a metric for item quality by showing which items have the most important statements.
- Provide additional data for automatic property suggestion (statements with greatest importance to the model could be the first properties suggested on new items).
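
The two uses of a statement ranking listed above can be sketched in a few lines of Python. This is a rough illustration, not a proposal for how ranking should work: the weights are invented, and the function names completeness and suggestions are hypothetical.

```python
# Invented importance weights for a 'human' schema (higher = more wanted).
schema_weights = {"P31": 3, "P569": 2, "P106": 2, "P19": 1}


def completeness(item_properties, weights):
    """Share of the schema's total weight covered by the item's properties."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    present = sum(w for prop, w in weights.items() if prop in item_properties)
    return present / total


def suggestions(item_properties, weights):
    """Missing properties, most important first (ties broken by property ID)."""
    missing = [prop for prop in weights if prop not in item_properties]
    return sorted(missing, key=lambda prop: (-weights[prop], prop))


# An item that only has 'instance of' and 'place of birth' so far.
item_properties = {"P31", "P19"}
print(completeness(item_properties, schema_weights))
print(suggestions(item_properties, schema_weights))
```

A weighted share is only one possible quality metric; a simple count of present statements, or required versus optional tiers, would fit the same data.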

Ways of finding and exploring Wikidata data schemas

A statement for 'Wikidata schema' added to the Wikidata item for the subject.
A search function in the central place to discuss schemas with some categorisation (like property proposal).
Some way of exploring how items in the same class relate to each other, e.g. the schema for humans and the schema for writers.
Show how far up the class tree the schema exists, e.g. the schema for a person would not include information about the taxonomic tree.
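
One way to picture how related schemas could link together, and where a schema stops in the class tree, is an explicit 'extends' link between schemas. The Python sketch below assumes such a link exists; the link, the toy schemas and the function name effective_properties are all hypothetical, and plain labels are used instead of real property IDs.

```python
# Toy schemas. 'extends' is a hypothetical link from one schema to another.
schemas = {
    "writer": {"extends": "human", "properties": ["notable work", "genre"]},
    "human": {"extends": None, "properties": ["date of birth", "place of birth"]},
    "taxon": {"extends": None, "properties": ["parent taxon", "taxon name"]},
}


def effective_properties(name, schemas):
    """Collect properties from a schema and everything it extends."""
    chain = []
    seen = set()
    while name is not None:
        if name in seen:
            raise ValueError("schemas extend each other in a cycle")
        seen.add(name)
        chain.append(name)
        name = schemas[name]["extends"]

    # List the most general schema first and skip repeated properties.
    properties = []
    for schema_name in reversed(chain):
        for prop in schemas[schema_name]["properties"]:
            if prop not in properties:
                properties.append(prop)
    return properties


print(effective_properties("writer", schemas))
```

Because the human schema extends nothing, the walk stops there and the taxon properties never reach the writer schema, which matches the example of a person schema leaving out the taxonomic tree.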

Outstanding questions

How to capture agreement on the granularity of items? E.g. a museum as one item, or as two items: the building and the legal entity.