Wikidata:WikiProject Wikidata for research/EINFRA-9-2015


Contributions to the "Wikidata for research" project (including Wikidata:WikiProject Wikidata for research and all its pages) are dual licensed under CC BY-SA 3.0 (the Wikimedia default) and the Creative Commons Attribution 4.0 license.
Contributions by the project to the item and project namespaces of Wikidata shall be under CC0.

About

This page hosts post-submission updates about a grant proposal that was drafted in public and submitted on January 14, 2015, in response to the Horizon 2020 (Q13583472) EINFRA-9-2015 call.

Submitted version

  • Title: Enabling Open Science: Wikidata for Research (available via ZENODO: doi:10.5281/zenodo.13906 under a Creative Commons Attribution 4.0 License). Because many of the letters of support from the Associate partners carry copyrighted letterheads, they could not be released under an open license, which precluded their inclusion in the Zenodo repository or upload to Wikimedia Commons. They are available separately.

Background

Explanation in lay terms

The project proposes to create a "virtual research environment" (VRE) on Wikidata that supports both the science communities and the open-knowledge Wikidata community. Wikidata has already become a major focal point for openly sharing scientific information. However, the existing infrastructure for data lookup needs to be enriched to enable new kinds of interaction with professional science organisations in a mutually beneficial way.

The proposal does not aim to develop a virtual research environment in the sense of an "information silo": feature-complete, secure, self-contained, but also isolated and typically a discipline-specific "remote-desktop" system. Rather, it is based on the realization that the web itself, in the form of globally interconnected data (linked open data) and services, is the VRE of the future. With limited resources, the proposal will therefore focus on investigating and developing the functionality of Wikidata for professional scientific research.

Professional scientists and researchers as well as citizen scientists (including "citizen data scientists") will be able to use this environment. A popular application will be searching the intersections of data collections, e.g. linking public (governmental) data with research data. An example would be to combine epidemiology data for a disease (by country, by year) with public sales data for products like drugs or food, or with data on events like concerts or movies. Wikidata is designed for these forms of analysis, but the current service interfaces for humans and machines are not yet sufficient for them. With this virtual environment, one would be able to make any of these requests:

  • Wikidata, please graph the relationship between...
    • number of hospitals per population and the incidence of tuberculosis cases in cities in England from 2000 to 2010
    • the obesity rates of people aged 25–40 against the number of schools within the city limits
    • incidence of annual flu cases versus the number of professional sports events in a city
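
Requests of this kind boil down to two steps: intersecting two data collections on a shared key (for example a city), then relating the paired values. The following is a minimal sketch of those steps in plain Python, not part of the proposal or of any Wikidata service. The function names `join_by_key` and `pearson` are hypothetical, and the numbers in the usage lines are made-up placeholders, not real statistics.

```python
from math import sqrt


def join_by_key(left, right):
    """Pair the values of two mappings for every key present in both."""
    shared = sorted(left.keys() & right.keys())
    return [(key, left[key], right[key]) for key in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder inputs keyed by a hypothetical city label (not real data).
hospitals_per_100k = {"City A": 1.0, "City B": 2.0, "City C": 3.0}
cases_per_100k = {"City B": 20.0, "City C": 30.0, "City D": 40.0}

# Only cities present in both collections survive the join.
paired = join_by_key(hospitals_per_100k, cases_per_100k)
```

Only keys found in both collections are kept, which is why shared identifiers for the things being compared matter so much for this kind of analysis.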

There is no suggestion that the correlations Wikidata will be able to graph will lead to conclusions about causation. However, putting this kind of power in public hands, and especially the power to tie everything that is already reported to everything else that is already reported, will become the basis of much future research.

This proposal is significant because no other open collaborative project – “virtual research environment”, in EU parlance – can connect the free databases in the world across disciplinary and linguistic boundaries. With the inclusion of Freebase into Wikidata in 2015, the project will be capable of providing a unique open service. For the first time, it will allow both citizens and professional scientists from any research or language community to integrate their databases into an open global structure, to publicly annotate, verify, criticize and improve the quality of available data, to define its limits, to contribute to the evolution of its ontology, and to make all this available to everyone, without any restrictions on use and reuse.

Scientific content

Open-knowledge projects have created a remarkable knowledge infrastructure in recent years, consisting, e.g., of the Wikipedias, the structured DBpedia, and (gold) Open Access publication infrastructures. The EU is a leader in these activities.

Open-knowledge infrastructures are of great societal importance: they have become a basic utility in the discourse of an educated public with science, commerce, and politics. It is in the interest of society to facilitate access to such infrastructures for both humans and machines, and to use them to their full potential in research by involving the public much more in the creation and curation of knowledge than has been possible so far.

Information is often harvested from qualified and open scientific data sources and uploaded by automated processes ("bots") to Wikipedia in a one-way information flow. DBpedia, which harvests and aligns Wikipedia data, has made much of the data from the Wikipedias re-usable. Because of its breadth and its ability to connect many information sources, it is a central node in the Linked Open Data Cloud. However, the relation so far has been one-sided: it is difficult for scientific data providers to interact efficiently with traditional information systems curated by open-knowledge and citizen-science communities. Furthermore, in the age of linked open data, where the web itself has become a globally integrated knowledge source, identifiers for concepts have become a basic form of language and a requirement for scientific discourse.

The advent of Wikidata is a game-changer in this respect. It allows for data to be curated in a structured database in a service-oriented architecture, where humans and API-driven algorithms can interact efficiently. Its expanding collection is not an extraction of volunteers' work: it is tailor-made by and for a community that continuously improves its data model, infrastructure and content. Wikidata has thus gradually moved beyond the traditional triples (subject-predicate-object) to a very rich information scheme that also includes references and qualifiers.
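
To make the difference from a bare triple concrete, here is a minimal sketch of a statement that carries qualifiers and references alongside its subject, predicate and value. It is an illustration only and does not reproduce Wikidata's actual data model or JSON serialization. The `Statement` class, its field names, and all identifiers and values in the usage lines are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Statement:
    """A claim enriched with context (qualifiers) and provenance (references)."""

    subject: str    # identifier of the thing described
    predicate: str  # identifier of the property
    value: Any      # the asserted value
    qualifiers: dict = field(default_factory=dict)  # context, e.g. a point in time
    references: list = field(default_factory=list)  # sources backing the claim

    def as_triple(self):
        """Reduce the statement to a plain subject-predicate-object triple."""
        return (self.subject, self.predicate, self.value)

    def is_referenced(self):
        """True if at least one source is attached to the claim."""
        return len(self.references) > 0


# Hypothetical identifiers and placeholder values, for illustration only.
claim = Statement(
    subject="item:ExampleCity",
    predicate="prop:population",
    value=100000,
    qualifiers={"prop:pointInTime": "2010"},
    references=[{"prop:statedIn": "item:ExampleCensus"}],
)
```

Calling `claim.as_triple()` drops the qualifiers and references. That loss is what a triple-only representation cannot avoid, and it is the context a curating community needs in order to verify and criticize data.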

The goal of the proposal is to strengthen the interaction between citizen scientists working in the context of open knowledge projects and the professional scientific community. The focus will be on data about things or concepts that are relevant to societal dialogue, such as names of biological organisms or agents, features or traits, chemicals, historical facts and social interactions.

Timeline

2015

2014

Related information

See also