Wikidata:WikiFactMine/New dictionaries and aaraa tool

The role of dictionaries in WikiFactMine is to find and control "mentions" of terms in the text of papers. In themselves, dictionaries are simple data structures: they meet the need for systematic keywords.

From another point of view, dictionaries and their construction are fundamental to the project. Good results cannot be obtained by searching, unless the search terms used are well chosen. Therefore the project sees the need to work with domain experts.

This page explains how to construct dictionaries using https://query.wikidata.org and https://tools.wmflabs.org/pagepile: in other words, with Wikidata's versatile SPARQL endpoint and the PagePile utility written by Magnus Manske. Both approaches may be used in the aaraa tool. See Dictionaries and PagePiles for further discussion.
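As a rough illustration of working with the SPARQL endpoint outside the aaraa tool, the sketch below builds a request URL for the query service and pulls Q-identifiers out of a result set. It assumes the standard SPARQL JSON results layout and a result column named ?item; the function names `build_request_url` and `item_ids` are hypothetical helpers introduced only for this example.

```python
from urllib.parse import urlencode

# SPARQL endpoint behind https://query.wikidata.org
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


def item_ids(results: dict) -> list:
    """Extract Q-identifiers from the ?item column of SPARQL JSON results.

    Assumes every binding has an "item" entry holding an entity URI.
    """
    ids = []
    for binding in results["results"]["bindings"]:
        uri = binding["item"]["value"]
        # Entity URIs end in the Q-identifier, e.g. .../entity/Q12136
        ids.append(uri.rsplit("/", 1)[-1])
    return ids
```

The URL can be fetched with any HTTP client; no network call is made in the sketch itself.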

Motivation

In August 2017 WikiFactMine started to renew the dictionaries it uses to search for facts. The aim is to move ahead with dictionary (facet-based) search through mined facts, tailored to end-users, and to begin a migration towards greater reliance on Wikidata and SPARQL. That contrasts with the provenance of the dozen legacy dictionaries.

Role of SPARQL

An early source is Chris Kittel's post "Building a new facet from wikidata", which discusses building a dictionary from a Wikidata query, with code details. The particular SPARQL query used was not crucial there.

The title suggests the connection with faceted classification (Q1391014) and faceted search (Q1519370). In turn that helps answer the question: why not just one large dictionary? From a search point of view, a long search can be broken up or segmented into several batches. Each of those segments is a kind of dictionary.
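The idea of several small dictionaries acting as facets can be sketched as follows. This is only an illustration, not how WikiFactMine implements its search: the function `find_mentions`, the facet names and the term lists are all made up for the example, and matching is a plain case-insensitive whole-word search.

```python
import re


def find_mentions(text: str, dictionaries: dict) -> dict:
    """Return, per dictionary (facet), the sorted terms that occur in the text."""
    hits = {}
    for name, terms in dictionaries.items():
        found = [
            term
            for term in terms
            # Whole-word, case-insensitive match of the literal term
            if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        ]
        if found:
            hits[name] = sorted(found)
    return hits


# Two small illustrative dictionaries instead of one large one
dictionaries = {
    "disease": ["malaria", "influenza"],
    "species": ["Anopheles gambiae"],
}
text = "Malaria is transmitted by Anopheles gambiae mosquitoes."
hits = find_mentions(text, dictionaries)
```

Keeping the dictionaries separate means each hit is already labelled with the facet it belongs to.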

Use of SPARQL and PagePiles in aaraa

To enter SPARQL into the aaraa tool, you must first enter a suitable dictionary name, using lowercase letters only and no spaces, in the box to the left of the blue "Create New" button. Click the button, and then the "Add to Dictionary" button. Paste your SPARQL query in the upper box, and then click "Run Query".

NB: For use in the aaraa tool, the SELECT (or SELECT DISTINCT) initial line of the query used must contain only ?item.
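A query of the required shape, and a quick local check that its SELECT line projects only ?item, might look like the sketch below. The example class (disease, Q12136) is chosen purely for illustration, and `projects_only_item` is a hypothetical helper; it is not part of aaraa and says nothing about how the tool itself validates input.

```python
import re

# Illustrative query: items that are instances of disease (Q12136).
# The choice of class is an assumption made only for this example.
EXAMPLE_QUERY = """
SELECT DISTINCT ?item WHERE {
  ?item wdt:P31 wd:Q12136 .
}
"""


def projects_only_item(query: str) -> bool:
    """Return True if the SELECT clause lists exactly ?item and nothing else."""
    match = re.search(
        r"\bSELECT\b\s+(?:DISTINCT\s+)?(.*?)\s*(?:\bWHERE\b|\{)",
        query,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return False  # no SELECT clause found
    return match.group(1).split() == ["?item"]
```

A query whose first line reads `SELECT ?item ?itemLabel` would fail this check, because it projects a second variable.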

Then wait for a listing to appear. This may take a couple of minutes for a query with many hits. When the listing appears, the dictionary created can be saved to your machine as a JSON download. To do that, click on the "Prepare Dictionary" button. When the "Download" link appears next to it, click on that.

The dictionary can also be explored on aaraa, by initial letter.

A PagePile numerical ID may be used in the same way, pasted in the lower box.

Aliases

A feature that should be available soon is the addition of all English-language aliases to a dictionary. This would be equivalent to having ?itemAltLabel in the first line of the query, but restricted to aliases whose language is "en".
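Until that feature exists, English aliases can be collected directly from the query service. The sketch below is one possible approach, not necessarily what aaraa will do: it uses skos:altLabel with a language filter rather than ?itemAltLabel, and because its SELECT line contains more than ?item, the query is meant for https://query.wikidata.org and not for pasting into aaraa. The helper `english_aliases` and the query itself are assumptions made for illustration.

```python
from collections import defaultdict

# Illustrative query asking explicitly for English aliases of each item.
ALIAS_QUERY = """
SELECT ?item ?alias WHERE {
  ?item wdt:P31 wd:Q12136 .
  ?item skos:altLabel ?alias .
  FILTER(LANG(?alias) = "en")
}
"""


def english_aliases(results: dict) -> dict:
    """Group English aliases by item URI, given SPARQL JSON results.

    Bindings without an alias, or with a non-English alias, are skipped.
    """
    aliases = defaultdict(list)
    for binding in results["results"]["bindings"]:
        alias = binding.get("alias")
        if alias is None or alias.get("xml:lang") != "en":
            continue
        aliases[binding["item"]["value"]].append(alias["value"])
    return dict(aliases)
```

Each alias arrives as its own result row, so aliases that themselves contain commas are kept intact.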