Wikidata:WikiProject Wikidata for research/EINFRA-9-2015/Notes from drafting


Contributions to the "Wikidata for research" project (including Wikidata:WikiProject Wikidata for research and all its pages) are dual licensed under CC BY-SA 3.0 (the Wikimedia default) and the Creative Commons Attribution 4.0 license.
Contributions by the project to the item and project namespaces of Wikidata shall be under CC0.

This page contains notes that have been used for drafting the EINFRA-9-2015 proposal Enabling Open Science: Wikidata for Research (Wiki4R), which are kept here for archival purposes.

Summary

The aim of the project is to draft:

In the long run, WikiProject Wikidata for research is intended to facilitate activities at the Wikidata/research interface more generally (e.g. with a regional or disciplinary focus).

Proposal management

Call

"These virtual research environments (VRE) should integrate resources across all layers of the e-infrastructure (networking, computing, data, software, user interfaces), should foster cross-disciplinary data interoperability and should provide functions allowing data citation and promoting data sharing and trust."

Political context

We are collecting key EU policy information in this document.

Submission forms

The EU has defined very precisely in which form proposals have to be submitted in response to this call. Since those template documents are not under an open license, they cannot be copied here. So we have copied them into Google docs that are editable by anyone.

The docs:

B1-3, B4-5, Excellence, Impact, WP1, WP2, WP3, WP4, WP5

Timeline

The draft has now been completed and submitted.

Upcoming

  • January 14: Submission deadline is 17:00 Brussels time
  • January 15:
  • January 16:
    • blog post
  • January 17: start thinking about follow-up activities for WikiProject Wikidata for research
    • national/regional proposals
      • Germany (DFG)
    • discipline-specific proposals ("Structured X")
      • Biodiversity
      • Chemistry
      • Computer science
      • History
      • Mathematics
      • Medicine

Past

  • January 13:
    • drafting continued on remaining parts of the proposal
    • test submission successful
  • January 12:
    • test submission failed
    • drafting continued on all parts of the proposal
  • January 11:
  • January 10:
    • drafting continued on all parts of the proposal
  • January 9:
    • drafting continued on all parts of the proposal
    • multiple letters of support/collaboration/intent received from Associate partners
  • January 8:
    • continued to flesh out Excellence, Impact, workpackages
    • suggested start and end month for some tasks
    • first concrete budget estimate
  • January 7:
  • January 6:
    • Objectives section fleshed out
    • GDocs switched from publicly editable to view/comment only
  • January 5:
    • a flurry of post-holiday activity
    • drafting continued on all aspects of the proposal, with a focus on objectives and workpackages
    • added a new Task "Wikidata for cultural heritage" to WP4
    • suggested new Task "Wikidata as a repository for nanopublications" for WP4
    • revised timeline
  • January 4:
    • continued drafting deliverables and milestones
    • reframed WP4 as "Enabling the use of Wikidata in research contexts"
    • started to flesh out Workpackage 4
    • moved the LOD for Wikidata task (3.4) into WP4
    • moved the Wikidata identifiers in lab contexts task (3.3) into WP4
    • added a new Task "Citizen science" to WP4
  • January 3:
  • January 2:
    • started to flesh out Workpackage 3
    • merged Task 2.3 into Task 3.1
    • added Task 3.4 "Linked Open Data for Wikidata"
    • defined tasks in WP4
    • started to think about #Illustrations
    • article published about the proposal in The Signpost, a Wikimedia community publication
  • January 1:
  • December 31:
  • December 30:
    • moved use cases out of WP4, with the aim of preselecting a few and building the proposal around them
    • merged "optimizing openness" (from WP4) and "Identification of resources" (3.1) tasks - we will identify CC0 sources suitable for use on Wikidata and then look at the reasons behind their choice of CC0, with the aim of compiling best practice guidelines
    • merged Task 2.4 (Consistency management), Task 3.3 (Data citation and provenance) and Task 3.4 (Review and verification) into Task 3.3: Quality assurance
    • rearranged tasks in WP3
    • Public Hangout: 10pm CET (UTC+1)
    • redefined WP4 as being about Wikibase tools for research
    • incorporated suggestions from UPM and UPS into respective workpackages
  • December 29: administrative work; scope of workpackages
  • December 28: Whole consortium section initially fleshed out; summary to wiki
  • December 27: fine tuning of tasks, partner descriptions and EU forms; refined timeline
  • December 26: fine tuning of tasks, partner descriptions and EU forms
  • December 25: fine tuning of tasks
  • December 24: Workpackages restructured, WP2 (software development) merged into tasks in other WPs
  • December 23: refinement of tasks; Public Hangout: 10pm CET (UTC+1)
  • December 22: Public Hangout: 10am-noon CET (UTC+1)
  • December 21: refinement of workpackages and scope
  • December 20: refinement of workpackages and scope
  • December 19: switch focus of drafting to EU templates (via public GDocs); second blog post
  • December 18: Definition of partners; set up mailing list; edited project summary
  • December 17: refinement of workpackages and scope; discussions of partner roles
  • December 16: refinement of workpackages and scope; discussions of partner roles
  • December 15: refinement of workpackages and scope; discussions of partner roles
  • December 14: NYC Wikidata Workshop and Skill Share; discussions of partner roles
  • December 13: refinement of workpackages and scope
  • December 12: refinement of workpackages and scope
  • December 11: first somewhat complete description of scope of workpackages
  • December 10: started to define tasks
  • December 9: more outreach; WikiProject has 11 members; 7 institutions have publicly expressed interest in joining, several more in private
  • December 8: more outreach; WikiProject has 5 members
  • December 7: refining scope; workpackages finally get numbers
  • December 6: refinement of workpackage structure; added long-term perspective by starting WikiProject Wikidata for research
  • December 5: initial blog post; initial sketch of workpackage structure
  • December 4: launch of this wiki page out of user space

Project duration

  • 3 years

Project title

  • Official title: Enabling Open Science: Wikidata for Research
  • Short title: Wikidata for Research
  • Acronym: Wiki4R

Project partners

Partner descriptions are available via the EU forms in Google docs.

  1. Natural History Museum, Berlin (Q233098) (MfN; coordinator) - contact person: User:Daniel Mietchen
  2. Wikimedia Deutschland (Q8288) (WMDE) - contact person: User:Abraham Taherivand (WMDE)
  3. Maastricht University (Q1137652) (UM) - contact person: User:Egon Willighagen
  4. Open University of Catalonia (Q3042433) (UOC) - IN3 (Open Science & Innovation group) - contact person: Eduard Aibar
  5. Europeana (Q234110) (EF)/The European Library (Q240304) - contact person: Alastair Dunning
  6. University of Paris-Sud (Q1480643) (UPS)/ ComUE Paris-Saclay University (Q13531686) (CDS - Center for Data Science of University of Paris-Saclay) - contact person: Karima Rafes
  7. Technical University of Madrid (Q25864) (UPM)/ DBpedia (Q465) (esDBpedia) - contact person: Asuncion Gomez-Perez
  • The consortium as a whole thus covers the major branches of the natural sciences, along with the information sciences and Semantic Web, the arts and humanities, the cultural and natural heritage sector, and civil society. Plus, working in the open facilitates getting feedback on whatever aspect is missing.

Associate partners

This section has now been migrated to the submission forms too.

We have received more interest from potential partners than we can accommodate in this project. We are grateful for that, since we see this proposal only as the start of a deeper and broader engagement between the research and Wikidata communities, and we encourage institutions with an interest in that to join the project as an Associate Partner. Importantly, this role is not limited to institutions eligible for funding under Horizon 2020 (Q13583472), so it is also open to entities based outside the European Research Area (Q1377820). The caveat is that Associate Partners are not eligible to receive direct funding through the project, except for travel costs related to project meetings.

To get this process going, please sign up your institution below. Institutions that had previously signaled an interest in becoming a partner have all been moved here too.

Mailing list

A mailing list has been set up for proposal preparation:

Budget

The call suggests 2-8 million euros per funded project. We aim below that, with an estimated total effort of ca. 200 person months.

  • Shall contain provisions for
    • staff time + overhead
    • travel (including for Associate Partners)
    • materials, equipment and other resources

Advisory Board

  • We need one that consists of people whom reviewers recognize as both relevant and competent. Let's aim at about 10 initially, about half of whom should be from within the European Research Area (Q1377820) and half from outside of it. Having more than 10 is fine.
  • role for community

Illustrations

Custom-made

  • several figures in suggestions by UPM and UPS

Images

Screenshots

For copyright reasons, screenshots may not be compatible with inclusion in the proposal, which is licensed CC BY 4.0.

Wikidata UI
Reasonator
Autolist
Stats

Similar projects

The introduction has to provide an overview of the current state of relevant research. This includes appropriate mentions of initiatives with overlapping focus.

Specific use cases

  • Requires easier access to Wikidata content. Currently there are two complementary APIs: the official Wikidata API (broadly speaking, for ID-based queries) and Magnus' Wikidata Query (mainly for property-based queries). Both need improved documentation.
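
As a minimal sketch of the ID-based access mentioned above, the following Python snippet builds a request URL for the `wbgetentities` action of the official Wikidata API. It only constructs the URL and does not send it. The helper name `entity_url` and the choice of default parameters are illustrative assumptions, not something specified in these notes; the property-based query service is not sketched here.

```python
from urllib.parse import urlencode

# Public endpoint of the MediaWiki/Wikibase API on Wikidata.
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def entity_url(ids, languages=("en",), props=("labels", "claims")):
    """Build a wbgetentities URL for one or more entity IDs (hypothetical helper)."""
    params = {
        "action": "wbgetentities",
        "ids": "|".join(ids),          # several IDs are separated by "|"
        "props": "|".join(props),
        "languages": "|".join(languages),
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


# Q153 is the item for ethanol, used as an example elsewhere in these notes.
print(entity_url(["Q153"]))
```

Fetching the resulting URL with any HTTP client would return a JSON document keyed by entity ID; error handling and rate limiting are left out of this sketch.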

Cross-disciplinary use cases

  • People
    • Researchers, collectors

Discipline-specific use cases

  • Epigraphy
    • EAGLE-wiki, a Wikibase-based repository of epigraphies.
  • Agriculture
    • OpenFarm
  • Data science
    • DBpedia

Other fields

Workpackages

The content of this section is gradually being migrated into the corresponding GDocs.

WP1: Management, coordination and communication

Task 1.1: Management of the consortium
  • Task leader: MfN
  • Decision-making
    • small consortium, so simple governance structure
  • ensuring that milestones are reached and deliverables produced
    • quality assurance
      • independent external evaluator
  • Internal communication
Task 1.2: Finance and reporting
  • Task leader: MfN
    • all partners involved
  • Accounting
  • Financial reports
  • Progress reports

WP2: Semantic mapping

Task 2.1: Property profiles for item classes
  • Task leader: UPM
    • interested: UM?
    • interested: UPM
  • Wikidata covers many applications, scientific and other
  • The full set of properties applicable to a given item class is the union of the properties used across the various use cases.
  • This task focuses on the development of VRE-specific property profiles (sets of properties and associated policies and best practices that meet the functional requirements of specific research use cases).
  • Wikidata development work, if the property profiles require it
  • Example of user community specific profiles (cf. WP4), e.g. as in Wikidata:WikiProject Source MetaData/Bibliographic metadata for scholarly articles in Wikidata
  • Examples for history: Identify at least three subject areas and time periods of historical significance and diverse characteristics to capture in the ontological system or systems determined to be most applicable, and prepare Wikidata entries for these subject areas using the selected ontology or ontologies.
  • see also Representing knowledge – metadata, data and linked data
  • for selected item classes, create list of properties for which statements would be expected on such items
  • Facilitate establishing guidelines on where to draw the line between data that do and do not fit into Wikidata (Notability).
    • Short term
    • Intermediate term
    • Long term
    • mention network of Wikibase instances as an alternative to putting everything into Wikidata
      • EAGLE-wiki, a Wikibase-based repository of epigraphies.
      • UPS examples
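
The idea of a property profile as the union of per-use-case property sets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the use case names, the helper functions and the selection of property IDs are hypothetical, and the actual profiles would be defined as part of Task 2.1.

```python
# Illustrative property sets per use case; the real profiles are to be defined in Task 2.1.
USE_CASE_PROFILES = {
    "identification": {"P31", "P231", "P235"},  # instance of, CAS Registry Number, InChIKey
    "composition": {"P31", "P274"},             # instance of, chemical formula
}


def class_profile(use_case_profiles):
    """Return the union of the property sets of all use cases."""
    combined = set()
    for properties in use_case_profiles.values():
        combined |= properties
    return combined


def missing_properties(item_properties, profile):
    """List properties expected by the profile but absent from an item."""
    return sorted(profile - set(item_properties))


profile = class_profile(USE_CASE_PROFILES)
print(sorted(profile))                                # every property expected for the class
print(missing_properties({"P31", "P274"}, profile))   # what a given item still lacks
```

A check like `missing_properties` is one way the "list of properties for which statements would be expected" could be turned into a completeness report for existing items.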
Task 2.2: Semantic mapping for properties
  • Task leader: UPM
    • interested: UM?
    • interested: UPM
  • Map properties identified in 2.1 as relevant for the VRE across multiple sources relevant to the user communities
  • Map to global standards as well as to local semantics used in the data sources to be connected to Wikidata in WP3.
  • using statements on properties in general
  • especially equivalent property (P1628) and subproperty of (P1647)
  • coordinate with ontologies that are already widely used in research contexts
    • e.g. OBO/ ChEBI (see talk page)
    • DBpedia
    • Freebase
    • RoboBrain
  • for the properties identified in the property profiles in 2.1, build a pattern that allows statements about those item classes to be expressed in RDF
  • There is a need to harmonize the ontologies employed in Wikidata and elsewhere. Ideally, a single ontology may be identified that will serve this purpose, but it may also be the case that several ontological systems would be preferred, to avoid our understanding of knowledge in a given area being constrained by ontological categories.
  • Subject the Wikidata entries so developed to a range of research queries or “reasoning” analyses, to see if the ontological properties are of practical utility.
  • Establish as a deliverable a workflow containing the recommendations from this activity as to the best ontological approach for any of the use cases, and recommendations for the most effective methods for incorporating data from relevant institutions.
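
One possible shape for the RDF pattern mentioned in this task is sketched below in Python: statements are rendered as N-Triples lines, and a property mapping via equivalent property (P1628) is rendered the same way. The URI prefixes are assumed to follow the conventions of Wikidata's RDF exports, the function names are hypothetical, and the example statement and the `example.org` vocabulary URI are placeholders rather than claims about actual Wikidata content.

```python
# Assumed URI prefixes, following the conventions of Wikidata's RDF exports.
WD_ENTITY = "http://www.wikidata.org/entity/"
WD_DIRECT = "http://www.wikidata.org/prop/direct/"


def item_triple(subject_id, property_id, object_id):
    """Render an item-to-item statement as one N-Triples line."""
    return "<{}{}> <{}{}> <{}{}> .".format(
        WD_ENTITY, subject_id, WD_DIRECT, property_id, WD_ENTITY, object_id
    )


def property_mapping_triple(property_id, mapping_property_id, external_uri):
    """Render a mapping from a Wikidata property to an external property URI."""
    return "<{}{}> <{}{}> <{}> .".format(
        WD_ENTITY, property_id, WD_DIRECT, mapping_property_id, external_uri
    )


# Illustrative statement: ethanol (Q153), instance of (P31), an assumed class item.
print(item_triple("Q153", "P31", "Q11173"))
# Illustrative mapping of P31 to a placeholder external property via P1628.
print(property_mapping_triple("P31", "P1628", "http://example.org/vocab/type"))
```

Statements with literal values, qualifiers or references would need a richer pattern than these single triples; that is deliberately left out here.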

WP3: Integrating research resources with Wikidata

  • WP-leader: ?
    • interested: TEL
      • TEL has an open data set based on bibliographic data from national (and some research) libraries across Europe. Currently, 90 million records have been converted into RDF, and this number is growing. In this project we want to integrate this data with Wikidata.
    • interested: UPM
      • UPM (as developer of datos.bne.es) could bring bibliographic metadata from the Spanish National Library. Currently, 8 million records have been converted into RDF and linked with other library catalogues and datasets such as VIAF, the Library of Congress, the French National Library, the German National Library, LIBRIS, SUDOC and DBpedia.
      • UPM (as developer of http://geo.linkeddata.es) could bring geographic data from the Spanish Geographic Institute.
    • interested: MfN
      • involved in numerous initiatives aimed at making biodiversity research data or cultural heritage available as open data
    • interested: UM
      • contribute isotope data from the Blue Obelisk project
  • Objectives
    • increase the amount and semanticity of scientific information available on Wikidata in the focal areas defined in WP4
  • EU forms in Google doc
Task 3.1: Import into Wikidata
  • Task leader: ?
    • interested: TEL
    • interested: MfN
    • interested: UPM
  • generally requires approval from Wikidata community (ideally via relevant WikiProject)
  • should build on the property profiles identified in 2.1 and populate items with corresponding statements, importing necessary information from external databases as appropriate.
  • may require development work (e.g. for bots)
  • may require mapping work (e.g. creating data models) - WP2
  • with provenance - Task 3.3
  • may require verification - Task 3.3
  • some data from the H2020 Open Data pilot
    • Freebase integration as a test case
  • Map identifiers of items across multiple sources relevant to the user communities
    • including DBpedia
  • example: ethanol (Q153)
  • statements and qualifiers
  • for the classes identified in 2.1, apply the patterns created in 2.2 to existing Wikidata items belonging to those classes and to newly created items (see also consistency management, 3.3)
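
The import step outlined above (populate items with statements from an external record, keep only the fields covered by the property profile, and attach provenance) might look roughly like the following Python sketch. The `Statement` class, the field-to-property map and the `example.org` source URL are hypothetical; a real import would go through a community-approved bot and the Wikibase data model.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Simplified stand-in for a Wikidata statement (not the Wikibase data model)."""
    item: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)

    def is_sourced(self):
        return len(self.references) > 0


def statements_from_record(item_id, record, property_map, source_url):
    """Turn one external record into statements, keeping only mapped fields."""
    result = []
    for field_name, value in record.items():
        prop = property_map.get(field_name)
        if prop is None:
            continue  # field is not covered by the property profile
        result.append(Statement(item_id, prop, value, references=[source_url]))
    return result


# Hypothetical external record for ethanol (Q153) and a hypothetical field mapping.
record = {"formula": "C2H6O", "cas": "64-17-5", "internal_note": "checked"}
property_map = {"formula": "P274", "cas": "P231"}
for statement in statements_from_record(
    "Q153", record, property_map, "https://example.org/records/ethanol"
):
    print(statement.prop, statement.value, statement.is_sourced())
```

Keeping the source URL on every generated statement is what later allows the provenance and verification steps of the quality assurance task to trace a value back to its origin.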
Task 3.2: Quality assurance
Data citation and provenance
  • Facilitate establishing guidelines around provenance and data citation
  • use standards like W3C PROV
    • see discussion by Paul Groth here
  • Freebase integration as a test case
Review and verification
  • Develop Wikidata-based professional review and verification mechanisms
  • checking against third-party databases
    • couple this with other forms of expert review, ideally with technical support, e.g. via annotations
  • signal to Wikidata users any available measures of trust in a given statement
Consistency management
  • Consistency checking across multiple sources of data or vocabularies
  • identify cases where multiple databases agree
    • need original source
  • identify cases where they do not agree, and then sort out why
    • if this is systematic, it might get people engaged
  • example: is this image depicting black death or leprosy?
  • pingback mechanisms to original sources
  • propagation to DBpedia
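
A minimal sketch of the consistency check described above, assuming that each source reports at most one value for a given statement: sources are grouped by the value they report, and the result is classified as agreement or conflict. The database names and the deviating value are made-up placeholders.

```python
from collections import defaultdict


def group_by_value(values_by_source):
    """Group source names by the value they report for one statement."""
    groups = defaultdict(list)
    for source, value in values_by_source.items():
        groups[value].append(source)
    return dict(groups)


def classify(values_by_source):
    """Return ("agreement" | "conflict" | "no data", groups) for one statement."""
    groups = group_by_value(values_by_source)
    if not groups:
        return "no data", groups
    if len(groups) == 1:
        return "agreement", groups
    return "conflict", groups


# Hypothetical reports for one identifier; database_c deliberately deviates.
reports = {"database_a": "64-17-5", "database_b": "64-17-5", "database_c": "64-17-6"}
status, groups = classify(reports)
print(status)
for value, sources in groups.items():
    print(value, sources)
```

Conflicts found this way only flag where to look; deciding which value is right still requires going back to the original source, as noted in the list above.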
Task 3.3: Optimizing openness
  • Task leader: ?
    • interested: MfN
Identification of sources of CC0 data
Analyzing motivations for CC0 licensing
  • Optimizing the general societal benefits of Wikidata-enabled research.

The sharing of data and other resources is an integral part of research endeavours. In the Web age, most new research objects are digital, and many legacy ones are being digitized. Once digital, they can be easily shared over the Web, and from there, it is technically only a very small step towards opening them up for reuse by a potentially global and cross-disciplinary audience. Socially, though, this step is larger, and few incentives beyond altruism exist for institutions, research groups or individuals to fully embrace openness.

This task is concerned with identifying benefits that accrue to those who share their research openly. Mid- to long-term effects of openness have been the subject of prior investigations that established, for instance, citation advantages for open-access articles or for publications associated with open data or open-source software. The situation is much less clear for immediate and short-term benefits, but if such benefits exist, an analysis of best practices around them can help harness their potential for data providers, the wider research community, and society at large.

WP4: Enabling the use of Wikidata in research contexts


Linked Open Data for Wikidata
  • Task leader: UPM

There are multiple ways in which Wikidata could be integrated with Linked Open Data. This task will explore two of them: Linked Data Fragments and a SPARQL endpoint. A third one, mapping through DBpedia, is explored as part of the mapping to external databases in WP2.

Linked Data Fragments
  • The basic idea here is to shift some of the load from SPARQL endpoint servers to the client side by making intelligent use of dumps.
  • A demo for Wikidata already exists at http://wikidataldf.com/ but needs work in order to become fit for research needs.
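
The division of labour behind this idea can be illustrated with a small in-memory Python sketch: the "server" answers only single triple patterns, and the "client" performs the join itself. This is a simplified stand-in under stated assumptions, not the interface of the demo mentioned above; the triples are illustrative, `Q0` is a placeholder item, and in practice the patterns would be requested over HTTP.

```python
# Illustrative data; Q0 is a placeholder item, not a real Wikidata entity.
TRIPLES = [
    ("Q153", "P31", "Q11173"),
    ("Q153", "P274", "C2H6O"),
    ("Q0", "P31", "Q11173"),
]


def fragment(subject=None, predicate=None, obj=None):
    """Server side: answer a single triple pattern; None acts as a wildcard."""
    return [
        t for t in TRIPLES
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def formulas_of_class(class_id):
    """Client side: join the patterns (?x P31 class_id) and (?x P274 ?formula)."""
    results = {}
    for subject, _, _ in fragment(predicate="P31", obj=class_id):
        for _, _, formula in fragment(subject=subject, predicate="P274"):
            results[subject] = formula
    return results


print(formulas_of_class("Q11173"))
```

The server never evaluates the join, which is the point of the approach: cheap pattern lookups stay on the server, while the more expensive query logic runs on the client.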
Basic SPARQL endpoint
Editing via the Wikibase API
Wikidata identifiers in the lab
  • task leader: UPS
  • Extend Wikibase to connect the private Wikis in the laboratories and the identification keys in Wikidata
  • OK for Center for Data Science: interconnect the data of the different laboratories via the concepts in Wikidata, with internal wikis for humans and a SPARQL endpoint for machines (in WP4). Contact person: Karima Rafes
Citizen science
  • Task leader: MfN
Wikidata for cultural heritage
  • Task leader: Europeana

WP5: Dissemination and stakeholder engagement

  • WP-leader: UOC
  • Objectives
    • Education about using Wikidata as a VRE and how to collaborate with the community (Wikimedia-culture, WikiProjects)
  • EU forms in Google doc
Task 5.1: Dissemination
  • Task leader: MfN
  • moved to Google doc
Task 5.2: Community engagement
  • Task leader: MfN
    • all partners involved
  • moved to Google doc
  • Wikidata community
    • participation in Wikimedia hackathon and Wikimania
      • possibly with workshops
    • project meetings as satellite meetings to above
  • Research community
    • Publications
      • Drafted and reviewed as openly as possible
      • includes publication of this research proposal
    • participation in scientific meetings focused on data and identifiers
      • possibly with workshops
  • Developer community
  • Meetings at the interfaces between the researcher, Wikimedia and developer communities
  • Upon submission, proposal shall be put on Zenodo and possibly published in a more formal way
  • Forking encouraged
Task 5.3: Development of tutorials
  • Task leader: UOC
  • moved to Google doc
Task 5.4: Development of course materials
  • Task leader: UOC
  • moved to Google doc
Task 5.5: Development of a MOOC
  • Task leader: UOC
  • moved to Google doc
Task 5.6: Organization of training events
  • Task leader: UM
  • moved to Google doc

Notes

Here, information is dumped that may be useful in drafting but for which no better place could be found so far.

Ideas not to be included in this proposal

This section collects activities that have been considered for inclusion into the project but determined to be out of scope. We are keeping them here for the time being in case some aspect thereof should become relevant during further development of the proposal.

  • Update management/ version control
    • solved for item/ property etc. pages via MediaWiki
    • out of scope for ontologies?
  • Wikibase outside Wikidata
    • Wikimedia Commons
      • basically covered by Wikidata roadmap
    • elsewhere
      • out of scope?
  • annotations
  • out of scope: activities not related to research
  • won't do: dumps, because mw:Wikidata Toolkit exists
  • Wikidata would be well suited to store legislative data. The wording of laws is very formulaic and could be made even more understandable through a relevant ontology
    • out of scope?
  • as a complement to pulling info into Wikidata, nanopublications may provide a push mechanism
  • ...
  • Lexicalization
    • a tool - perhaps as part of Wikidata games - as per Task 5 in UPM suggestions
    • creating natural language text, e.g. for bot-created Wikipedia articles, or database content more generally
Visualizations
Content mashups
Identification keys
  • Task leader: ?
    • interested: MfN
  • Build a framework for identification keys
    • e.g. for minerals, chemicals, taxa, developmental stages, historical figures
  • see also Wikidata:WikiProject Identification Keys
Identification keys for research outputs
Identification keys for chemistry
  • Identification keys for small molecules
    • interested: UM
  • Identification keys for Analytical Chemistry
  • Identification keys for metabolic pathways
    • interested: UM
  • Identification keys for minerals
    • interested: MfN
Identification keys for taxa
  • interested: MfN
Versioning
  • consider versioning effects of Wikibase and external data or vocabularies
    • simple and robust solution: via versioned dumps
    • theoretical alternative: via version history of Wikidata pages
    • perhaps try out on small scale, e.g. with data extracted from living meta analyses
    • out of scope?
Improve existing code
  • can't be a task of its own, but has to be kept in mind
  • code hardening, updating and enhancement of community-developed tools and gadgets
  • Wikidata game

Risk assessment

  • required in several ways
  • see GDocs for respective workpackage

Impact

  • mention tight integration with other Wikimedia projects
  • stress existing VRE aspects of Wikidata
  • mention DBpedia as a hub for research and centerpiece of the Linked Open Data interconnections (Graph 2014)
  • mention Freebase's merge into Wikidata
  • mention multilingual
  • mention GLAM-Wiki/ OpenGLAM
  • mention the Letters of support
  • mention open drafting
  • mention forkability
  • manage inconsistencies
  • mention Wikidata:Glossary
  • L. Candela, D. Castelli, P. Pagano (2013) Virtual Research Environments: An Overview and a Research Agenda. Data Science Journal, Vol. 12, p. GRDI75-GRDI81 DOI: http://dx.doi.org/10.2481/dsj.GRDI-013
    • "The lack of community uptake has cascading effects on the entire VRE research domain, in particular its impacts on sustainability."
    • "services and resources that are aggregated and offered by such infrastructures should, as much as possible, be independent of a specific application domain and “designed for reuse”."
    • "Actually, Virtual Research Environments should be linked to existing infrastructures with both roles of consumer, i.e., VREs should benefit from the services offered by these infrastructures, and provider, i.e., the resources produced in the context of the VRE operation should contribute to the infrastructures offering."
    • "Virtual Research Environments should be designed, since the beginning, to promote uptake, ensure usability, and guarantee sustainability."
    • "As regards usability, Virtual Research Environments building should be mainly a community building process rather than a technology development process. This implies that the focus should be primarily on using technology to identify and rationalise workflows, procedures, and processes characterising a certain research scenario rather than having technology invading the research scenario and distracting effort from its real needs. As far as sustainability is concerned, it is fundamental that the resulting VRE service is conceived as a vital tool in the community of practice it is dedicated to. Moreover, sustainability is further enhanced whenever the VRE is perceived as a useful tool in the context of larger research initiatives and communities so to benefit from economies of scale, i.e., savings gained by an incremental level of production, and economies of scope, i.e., savings gained by producing two or more distinct goods when the costs of doing so is less than that of producing each of them separately."
    • these notes about the paper are now incorporated into the Concept and approach section

Open drafting

  • The very fact of drafting the proposal in the open creates a level of community engagement that can rarely be found in the context of research projects that are not yet funded.