Wikidata:WikidataCon 2017/Notes/Wikibase: How to survive the install and data normalization to get pretty research information

Title: Installing Wikibase

Speaker(s)

Name or username: Laura Hale

Contact (email, Twitter, etc.): @purplepopple, laura@fanhistory.com

Useful links: https://github.com/mparaz/parasports-data-import

Abstract

One of Wikidata's greatest potentials is to make the underlying software more accessible and to encourage more people to use it for their own research projects and large-scale databases. Doing so would bring additional developers into the community to build tools that benefit academics, other researchers, and free culture specialists, which in turn would assist the Wikidata community by attracting people who might not otherwise have decided to use Wikibase.

This talk will focus on the technical aspects of a Wikibase and query engine install, issues around data normalization and data importing, creating queries to make use of the data, and other challenges associated with this activity in the context of building a large disability sport database. The first part will focus on some of the issues discovered during the install, provide guidance on how to complete tasks where documentation is lacking, and point out some places where things can go wrong. The second part will focus on data issues, including how to structure data and how to gather and import data when the information is not already in a structured format. The last part will focus on issues related to the query engine, including some basic tips on writing queries and how easily you can crash your query engine and Wikibase install by formulating bad queries.

Collaborative notes of the session

"We did the first successful installation of Wikibase with a Query Interface outside of the WMF Wikis that we know of." (Roughly transcribed from memory)

Installation of Wikibase is tricky; a lot of components have to be installed to use it fully: MediaWiki, the Wikibase extension, Blazegraph, the Query Interface, Lua, ...
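
As a rough illustration of how many moving parts are involved, here is a minimal Python sketch that checks whether each piece of such a stack is responding. The hostnames, ports, and paths are assumptions for a typical local install (they were not given in the session) and need to be adjusted to the actual configuration.

# Minimal health check for a local Wikibase stack.
# All endpoints below are assumptions for a typical local install;
# adjust them to match your own configuration.
import urllib.request

ENDPOINTS = {
    "MediaWiki API": "http://localhost/w/api.php?action=query&meta=siteinfo&format=json",
    "Blazegraph SPARQL": "http://localhost:9999/bigdata/namespace/wdq/sparql?query=ASK%20%7B%7D",
    "Query Service UI": "http://localhost:8282/",
}

def check(name: str, url: str) -> None:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{name}: HTTP {resp.status}")
    except Exception as exc:  # connection refused, timeout, HTTP error, ...
        print(f"{name}: FAILED ({exc})")

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        check(name, url)

If any of these checks fails, the corresponding component is the natural place to start debugging the install.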

SPARQL prefix statements: the default wd:/wdt: prefixes refer to wikidata.org and must be redeclared to point at your own Wikibase instance.
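
A hedged sketch of what this looks like in practice: on a self-hosted Wikibase the familiar prefixes are not predefined for your own entities, so queries must declare them against the instance's own concept URIs. The base URI example.org, the endpoint URL, and property P1 are placeholders, and SPARQLWrapper is just one convenient Python client, not something prescribed in the talk.

# Querying a self-hosted Wikibase SPARQL endpoint with explicit prefixes.
# "example.org", the endpoint URL, and property P1 are placeholders
# for whatever your own install uses.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"

QUERY = """
PREFIX wd:  <http://example.org/entity/>
PREFIX wdt: <http://example.org/prop/direct/>

SELECT ?item ?value WHERE {
  ?item wdt:P1 ?value .   # P1 is whatever property ID your wiki assigned
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for row in results["results"]["bindings"]:
    print(row["item"]["value"], row["value"]["value"])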

Documentation missing or outdated (technical)

Problems:

  • Cleaning bulk data
  • Data normalisation (see the sketch after this list)
  • Disambiguation: e.g. several different places are named Saint Petersburg
  • Data completeness
  • Data normalisation is the *most* time-consuming part of your work.
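
A minimal sketch of the kind of normalisation and disambiguation work meant above, assuming a hand-maintained lookup table of spelling variants and a context hint for ambiguous names. The variants, item IDs, and table contents are invented for illustration; the actual pipeline is in the parasports-data-import repository linked above.

# Hypothetical normalisation step: map messy source spellings to one
# canonical label, then disambiguate that label to an item ID.
# All names and IDs below are made up for illustration.
import re
from typing import Optional

# Variant spellings seen in bulk source data -> canonical label.
SPELLING_VARIANTS = {
    "sint-petersburg": "Saint Petersburg",
    "st. petersburg": "Saint Petersburg",
    "sankt-peterburg": "Saint Petersburg",
}

# One label can still refer to several items; a context hint
# (here: the country) picks the right one.
DISAMBIGUATION = {
    ("Saint Petersburg", "Russia"): "Q1",         # placeholder item IDs
    ("Saint Petersburg", "United States"): "Q2",  # the city in Florida
}

def normalise(raw: str) -> str:
    """Collapse whitespace, lower-case, and resolve known variants."""
    key = re.sub(r"\s+", " ", raw).strip().lower()
    return SPELLING_VARIANTS.get(key, raw.strip())

def resolve(raw: str, country: str) -> Optional[str]:
    """Return the item ID for a place name in a given country, if known."""
    return DISAMBIGUATION.get((normalise(raw), country))

print(resolve("Sint-Petersburg ", "Russia"))       # -> Q1
print(resolve("St. Petersburg", "United States"))  # -> Q2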

Questions / Answers

Q: Could you provide something like a Docker package?

A: We are now doing some automation (or so).

A: (Miguel) Yes, we can do Docker... I was thinking more along the lines of an automatic deploy to a cloud provider, but Docker works too!
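
To make the answer concrete, here is a hypothetical sketch of scripted deployment with the Docker SDK for Python. The image names, tags, and port mappings are assumptions, and a real Wikibase stack needs considerably more wiring (database, Wikibase extension configuration, query service updater) than shown here.

# Hypothetical sketch: starting two pieces of a Wikibase stack as containers
# with the Docker SDK for Python (pip install docker). Image names, tags,
# ports, and container names are illustrative assumptions, not a tested package.
import docker

client = docker.from_env()

# MediaWiki (the Wikibase extension still has to be installed/configured on top).
client.containers.run(
    "mediawiki:latest",
    name="wikibase-mediawiki",
    ports={"80/tcp": 8181},  # host port 8181 -> container port 80
    detach=True,
)

# Blazegraph as the SPARQL backend for the query service.
client.containers.run(
    "lyrasis/blazegraph:2.1.5",  # assumed public image; substitute your own
    name="wikibase-blazegraph",
    ports={"8080/tcp": 9999},
    detach=True,
)

print([c.name for c in client.containers.list()])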