A data schema is a set of rules that governs the structure of a database. Data schemas would give Wikidata a standardised structure for data on a subject area. For example, all items about museums would use the same structure to describe basic facts such as location, collection type and date opened. Schemas for many of the kinds of data Wikidata stores are already in use in other databases, e.g. those published on schema.org.
Benefits of having community agreed data schemas on Wikidata
There are often several ways to describe the same kind of data on Wikidata. Without community agreement on how to model subjects, items within the same subject are described inconsistently, which causes problems for data quality, usability, and community growth and health. Data schemas would act as maps of different subjects for people to follow when exploring and adding data to Wikidata.
Increases data quality and data completeness by allowing people to find and use the most appropriate schema for the subject.
Provides data that could be used to improve the ‘property suggester’ tool (statements which are part of the schema could be used as the highest priority suggestions).
Cradle allows people to create schemas. However, each schema is created by a single user and is recorded in a separate database, so it is not part of Wikidata. These schemas could be used as starting points for community discussions to agree on schemas.
Model items and showcase items both present best practice for describing a subject, but they do not explicitly describe what to include and do not contain a defined schema. E.g. Douglas Adams is a model item for a person, but few of the statements on Douglas Adams are applicable to people in different professions.
A central discussion area, similar to Wikidata:Property proposal, would be used to collaboratively develop schemas, drawing on existing schemas for the subject, research, how they relate to other Wikidata schemas etc. The 'hub' could use FormWizard and VisualEditor to lower the barriers to participation.
Wikidata schemas could be started from:
External existing schemas such as ones used in different sectors and websites like schema.org.
Existing Wikidata models like featured items, model items and Cradle schemas.
Data schemas could be recorded in their own Wikidata items, e.g. an item called 'Wikidata schema for human' with the statement 'Instance of = Wikidata schema'. Using Wikidata items to record data schemas would:
Be understandable by both people and machines, meaning they could be used in tools.
Make them easy to understand and copy, as they would show the structure of the schema in the same format as the items that use the schema.
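As a rough illustration of the idea above, a schema stored in the same shape as an ordinary item could be read directly by a tool. The sketch below is only one possible arrangement: P31 (instance of), P569 (date of birth), P19 (place of birth), P106 (occupation) and Q5 (human) are real Wikidata identifiers, while `Q_WIKIDATA_SCHEMA`, `P_SCHEMA_FOR`, the simplified claims layout and both function names are hypothetical placeholders invented for this example.

```python
# Hypothetical sketch: a schema recorded in the same format as the items it describes.
# Q_WIKIDATA_SCHEMA and P_SCHEMA_FOR are placeholders; no such item or property is
# assumed to exist on Wikidata. The claims layout is a simplification, not Wikibase JSON.

SCHEMA_FOR_HUMAN = {
    "label": "Wikidata schema for human",
    "claims": {
        "P31": ["Q_WIKIDATA_SCHEMA"],   # instance of = Wikidata schema (placeholder item)
        "P_SCHEMA_FOR": ["Q5"],         # hypothetical 'Schema for' property, pointing at human
        "P569": ["<example value>"],    # date of birth
        "P19": ["<example value>"],     # place of birth
        "P106": ["<example value>"],    # occupation
    },
}

# Statements that describe the schema item itself rather than the items using it.
META_PROPERTIES = {"P31", "P_SCHEMA_FOR"}


def expected_properties(schema):
    """Return the properties the schema asks for, in sorted order."""
    return sorted(set(schema["claims"]) - META_PROPERTIES)


def missing_properties(item, schema):
    """Return the schema properties that the given item does not yet have."""
    return [p for p in expected_properties(schema) if p not in item["claims"]]


# A made-up item with only two statements.
EXAMPLE_ITEM = {
    "label": "Example person",
    "claims": {"P31": ["Q5"], "P569": ["1952-03-11"]},
}

print(missing_properties(EXAMPLE_ITEM, SCHEMA_FOR_HUMAN))  # ['P106', 'P19']
```

One limitation of this arrangement is that P31 on the schema item describes the schema itself, so the sketch relies on the hypothetical 'Schema for' statement to say which items the schema applies to.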
Questions:
How to record what the schema is for? A property called 'Schema for' could record this, and qualifiers on it would allow schemas to be more granular, e.g. 'author' vs '18th century author'.
How to describe how schemas relate to each other, e.g. how the schema for person and the schema for painter link together?
What to include as values in the example? Should they all just link to an example value? How should values be handled where the value is a number rather than an item?
How could you define the most wanted or most valued statements in a schema? Having a statement ranking would:
Provide a metric for item quality by showing which items have the most important statements.
Provide additional data for automatic property suggestion (statements with the greatest importance to the model could be the first properties suggested on new items).
Ways of finding and exploring Wikidata data schemas
How to capture agreement on the granularity of items? E.g. should a museum be one item, or two items: the building and the legal entity?
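The statement ranking raised in the questions above could, for instance, be expressed as a weight per property. The sketch below shows how such weights might yield both an item quality metric and an ordering for property suggestions. The weights, the function names and the scoring rule (share of total weight present) are all assumptions made for illustration; nothing here reflects how the existing property suggester works.

```python
# Hypothetical importance weights for statements in a schema for humans.
# The property IDs are real (instance of, occupation, date of birth, place of birth);
# the numbers are invented for this example.
STATEMENT_WEIGHTS = {"P31": 4, "P106": 3, "P569": 2, "P19": 1}


def quality_score(item_properties, weights):
    """Share of the schema's total weight covered by the item's statements (0 to 1)."""
    total = sum(weights.values())
    present = sum(w for prop, w in weights.items() if prop in item_properties)
    return present / total


def suggest_properties(item_properties, weights):
    """Missing schema properties, most important first."""
    missing = [prop for prop in weights if prop not in item_properties]
    return sorted(missing, key=lambda prop: weights[prop], reverse=True)


# A made-up item that already has 'instance of' and 'date of birth'.
item_properties = {"P31", "P569"}

print(quality_score(item_properties, STATEMENT_WEIGHTS))       # 0.6
print(suggest_properties(item_properties, STATEMENT_WEIGHTS))  # ['P106', 'P19']
```

A weighted share is only one way to turn a ranking into a score; a simple ordered list of properties, without numeric weights, would support the suggestion ordering equally well.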