Wikidata:WikiFactMine/Dictionaries and PagePiles

While the SPARQL query is the workhorse for dictionary creation, as in many other areas around Wikidata, it does not cover all the possibilities. A case study illustrates why further techniques, and the use of PagePile, may be necessary. The Petscan tool in a sense "federates" the Wikimedia sister projects, and is able to draw data from both categories and templates. ContentMine dictionaries based on Wikimedia may therefore be derived in different ways.

Role of PagePile

PagePiles can be created "by hand", or as the output of tools such as Petscan. A PagePile, once created, must be converted into a list of Q-numbers. This can be done at https://tools.wmflabs.org/pagepile/?menu=filter.

A suitable PagePile number may then be entered into the "Add to Dictionary" box of the aaraa tool, once a dictionary name has been created with "Create New"; the new dictionary may then be downloaded, just as in the SPARQL case.

The other PagePile filters may be used to build up complex combinations of lists into custom dictionaries. For a fuller idea of the scope of the PagePile tool, see the original blog post.
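As an illustration of what conversion to a list of Q-numbers amounts to programmatically, here is a minimal Python sketch that fetches the contents of a pile. The api.php endpoint with action=get_data, and the "pages" field in its JSON reply, are assumptions based on the PagePile documentation; the pile number is a placeholder.

  # Minimal sketch: fetch the contents of a PagePile as a Python list.
  # Endpoint, parameters and reply format are assumptions; the pile id is a placeholder.
  import requests

  PAGEPILE_API = "https://tools.wmflabs.org/pagepile/api.php"

  def get_pile(pile_id):
      """Return the list of pages in the pile (Q-numbers, for a Wikidata pile)."""
      params = {"id": pile_id, "action": "get_data", "format": "json"}
      data = requests.get(PAGEPILE_API, params=params).json()
      return data.get("pages", [])

  q_numbers = get_pile(12345)   # placeholder pile number
  print(len(q_numbers), "items in the pile")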

Filter and combine

On https://tools.wmflabs.org/pagepile/?menu=filter there are these Boolean options:

  • Union
  • Subset (i.e. intersection)
  • Exclusive (i.e. complement of intersection)

There are also wiki manipulations:

  • Follow redirects
  • Filter namespace
  • To Wikidata
  • From Wikidata
  • No Wikidata

One other option is "Random subset".
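In set terms the Boolean filters behave roughly as follows. This is a minimal Python sketch on small placeholder piles; "Exclusive" is read here, as in the reverse Polish examples further down, as removing from the first pile anything found in the second, which is an assumption about the tool's behaviour.

  # Sketch of the Boolean filters as set operations on Q-number sets.
  pile_a = {"Q1", "Q2", "Q3"}          # placeholder contents of one pile
  pile_b = {"Q2", "Q3", "Q4"}          # placeholder contents of another

  union_result     = pile_a | pile_b   # Union
  subset_result    = pile_a & pile_b   # Subset (intersection)
  exclusive_result = pile_a - pile_b   # Exclusive (first pile minus the second)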

Inputs

There are therefore a number of ways to combine PagePiles. From the point of view of dictionary manipulation, https://tools.wmflabs.org/pagepile/?menu=new offers several ways to create inputs. In particular, any hand-compiled list of pages or items can become a PagePile with a stable identifier.

From dictionary to PagePile: and back?

The JSON files that comprise ContentMine dictionaries can be pasted as inputs into

https://tools.wmflabs.org/pagepile/?menu=new

to be rendered as a (simpler) JSON file consisting of Wikidata item Q-numbers only. This simple transformation connects dictionaries to a wider range of tools.

Since a PagePile of Wikidata items can have its number pasted into the aaraa tool, the process can be reversed. In principle, then, nothing need be lost by making the PagePile. On the other hand, there are caveats.

The content of the dictionary can be thought of as consisting of pairs such as (Q1234567, label). From Q1234567, what can be reconstructed is the current main label of the Wikidata item. If "label" was the main label, and that main label has not in the meantime been edited, it will be recovered. If it was another alias, it cannot necessarily be reconstructed. What can be added back is the collection of all English aliases (or aliases in a given language, or in all languages).
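A sketch of what can be fetched back, assuming for illustration that the dictionary reduces to (Q-number, label) pairs: the current main label and the stored aliases come back from the standard Wikidata API call wbgetentities, but nothing marks out which of them was the alias originally paired with the Q-number.

  # Sketch: from Q-numbers back to the current label and aliases via the
  # Wikidata API (wbgetentities accepts up to 50 ids per request).
  import requests

  WD_API = "https://www.wikidata.org/w/api.php"

  def labels_and_aliases(qids, lang="en"):
      params = {
          "action": "wbgetentities",
          "ids": "|".join(qids),
          "props": "labels|aliases",
          "languages": lang,
          "format": "json",
      }
      entities = requests.get(WD_API, params=params).json()["entities"]
      out = {}
      for qid, ent in entities.items():
          label = ent.get("labels", {}).get(lang, {}).get("value")
          aliases = [a["value"] for a in ent.get("aliases", {}).get(lang, [])]
          out[qid] = ([label] if label else []) + aliases
      return out

  # The pair (Q1234567, "some alias") is recovered only if "some alias" is
  # still the main label or among the stored aliases.
  print(labels_and_aliases(["Q1234567"]))   # placeholder Q-number from the text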

Obviously enough, manipulations via PagePile that are intended to end up as a modified dictionary have to work around these caveats.

Reproducibility

Using a SPARQL query to create a dictionary means that the query can be run again later. The set of results may then differ, because Wikidata is a dynamic site. A PagePile created from the initial run can act as a baseline: the union of two "exclusive" filters can show up the "diff" or XOR of the two runs. To check what has happened on Wikidata in relation to the particular query, item histories can be consulted to see why there have been additions to or subtractions from the results list.
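A minimal sketch of such a check, using the Wikidata Query Service endpoint and, as an example, a query for instances of disease (wd:Q12136); the baseline set stands in for a PagePile saved from the first run.

  # Sketch: compare a fresh SPARQL run against a baseline list of Q-numbers.
  import requests

  SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

  def run_query(query):
      r = requests.get(SPARQL_ENDPOINT,
                       params={"query": query, "format": "json"},
                       headers={"User-Agent": "WikiFactMine-diff-sketch/0.1"})
      bindings = r.json()["results"]["bindings"]
      return {b["item"]["value"].rsplit("/", 1)[-1] for b in bindings}

  # Example query: instances of disease.
  QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q12136 . }"

  baseline = {"Q1234567"}        # placeholder: Q-numbers saved from the first run
  current  = run_query(QUERY)

  added   = current - baseline   # results that have appeared since the baseline
  removed = baseline - current   # results that have dropped out
  xor     = added | removed      # the "diff" described above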

More generally, "tweaks" and Boolean manipulations on dictionaries can be represented as recipes using given PagePiles. If the provenances of hand-edited lists are also recorded, dictionary production can be followed and audited by others. If Petscan is used, its query identifiers and JSON export also allow reproducibility.

Recording PagePile use

For an example, see Wikidata:WikiFactMine/Terpenoids case study#Formal record

The PagePile "filter and combine" feature works in just the same way as a calculator employing Reverse Polish notation. In other words, combinations may be recorded as, for example

PP1 PP2 union

where PP1 and PP2 stand for the PagePile numbers involved, and the postfix union states the filter applied to the pair. Reverse Polish notation requires no parentheses:

PP1 PP2 union PP3 exclusive

parses unambiguously, meaning that the union of PP1 and PP2 was carried out, then the exclusive operation with PP3. The "diff" or XOR mentioned above would be

PP1 PP2 exclusive PP2 PP1 exclusive union

For a full record, the PagePiles can be documented:

where PP1 = output of query1
      PP2 = ...

NB: The operations discussed above are binary Boolean operations. The "wiki manipulations" list consists of unary operations. For consistency with reverse Polish, the Wikidata list of items for a PagePile PP1 of Wikipedia pages should be represented with a notation like

PP1 wd

and used only in "where" statements.
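A record in this notation can also be evaluated mechanically. A minimal Python sketch, with piles supplied as sets of Q-numbers, the binary filters as set operations, and "exclusive" read as set difference as above:

  # Sketch: evaluate a reverse Polish record of PagePile combinations.
  def evaluate_rpn(record, piles):
      stack = []
      for token in record.split():
          if token == "union":
              b, a = stack.pop(), stack.pop()
              stack.append(a | b)
          elif token == "subset":
              b, a = stack.pop(), stack.pop()
              stack.append(a & b)
          elif token == "exclusive":
              b, a = stack.pop(), stack.pop()
              stack.append(a - b)
          else:                          # a pile name such as PP1
              stack.append(piles[token])
      return stack.pop()

  piles = {"PP1": {"Q1", "Q2"}, "PP2": {"Q2", "Q3"}}    # placeholder piles
  # The "diff" or XOR from the text:
  xor = evaluate_rpn("PP1 PP2 exclusive PP2 PP1 exclusive union", piles)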

Developing search strategies

If hand-compiled lists are also noted in full, a complete "search strategy" can be documented in terms of PagePiles.

In fields such as patent search, this kind of documentation is standard. It helps deal with issues of recall (see w:precision and recall): if some expected search result is missing for unclear reasons, an audit can be done. A developed WikiFactMine search strategy is expected to look something like

initialdictionarypass stopwords exclusive tfidflist exclusive handlist union

That is, a dictionary as submitted will first have a standard stop-word list removed, and will then be processed somewhat in line with tf–idf reasoning to remove, possibly, some very common terms for performance reasons. Experience with recall may then lead to a handlist being added, perhaps including a partial set of aliases for some of the removed terms.
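In set terms, and with all four input lists as placeholders, the strategy amounts to the following:

  # Sketch of the search strategy above as set algebra; all four sets are placeholders.
  initial_dictionary = {"malaria", "dengue", "chikungunya", "the"}
  stopwords          = {"the"}
  tfidf_common       = {"malaria"}       # very common terms screened out
  handlist           = {"paludism"}      # terms added back from experience with recall

  # initialdictionarypass stopwords exclusive tfidflist exclusive handlist union
  strategy_result = (initial_dictionary - stopwords - tfidf_common) | handlist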

tf–idf surrogate

The original setting for tf–idf is a matrix where rows are search terms, columns correspond to documents out of a fixed set, and entries count the mentions of a given search term in a given document. In WikiFactMine this becomes the matrix counting hits of a given dictionary term in the current set of papers on which fact mining has been carried out. Columns will be added day-on-day.

Various tf–idf formulae are used, giving summary information for rows. The main reason to look at such numbers, in the WikiFactMine context, would be to screen out dictionary terms that are very common. For example, in a tropical diseases dictionary, "malaria" might well occur with a very high frequency when compared with other diseases. For performance reasons, to pick up better on other diseases, it could be convenient to screen out malaria from the dictionary.

If the a(i), i = 1 to N, are the row entries for a given dictionary term, then

(∑ a(i))² / ∑ a(i)²

can be calculated from the row. It takes values of at least 1, with larger values when the hits are more evenly spread across the columns. (If there are 100 hits, the extreme cases are 100 when they occur as singletons, and 1 when they are all in one paper.) High values reflect a term that occurs often but in a relatively diffuse fashion in the literature being searched.
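A minimal calculation of the statistic from a single row of counts, checking the two extreme cases mentioned:

  # (sum of a(i))^2 / (sum of a(i)^2) for one row of hit counts.
  def spread(row):
      total = sum(row)
      return (total * total) / sum(a * a for a in row) if total else 0.0

  print(spread([1] * 100))   # 100 hits as 100 singletons -> 100.0
  print(spread([100]))       # 100 hits all in one paper  ->   1.0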