Wikidata:WikiFactMine/Facets


New dictionaries are replacing those originally in use. A range of queries suitable for creating new dictionaries has been developed at Wikidata:WikiFactMine/Core SPARQL. (That page also contains some graphics examples and "utility" queries, both marked as "auxiliary".)

The major points about the new dictionaries are:

  1. Finer-grained. By analogy with Wikipedia categories, which are in some ways similar, we aim for "diffusion": for example, breaking up locations by continent.
  2. Rational, dynamic reconstructions via SPARQL. Generating dictionaries from Wikidata with SPARQL automatically builds in updated information. Federation of SPARQL queries around Wikidata means that, going forward, we will not be limited to information drawn from Wikidata.
  3. Free structure. Wikidata's coverage is still patchy in parts, and a dictionary at present can consist of a SPARQL query and an appended list. The list can plug gaps on a temporary basis.
  4. Transitional. We think in terms of syncing our dictionaries and Wikidata, by improving Wikidata (with new items, aliases and statements), in a win-win way, as we gain experience. This will allow piecemeal improvement of the SPARQL dictionaries, and eventual retirement of the legacy static dictionaries. (Some matching work could go on here, decided case by case.) This programme should allow for a rolling change in the basic dictionaries used by WikiFactMine, going forward. But we have to learn the ground first. We can take on board WikiProject input as we do it, as well as reflecting on the current content of the published papers we scrape.
  5. Playing to strengths. We can expect good recall in the "long tail" situation. For example, for two dictionaries, insects and plants, we expect more interest from the rain forest situation (many beetles and shrubs) than from the agribusiness context of a small number of pests and grain species.
  6. Digging holes in dictionaries. A conclusion is that dictionaries like "mammals without Homo sapiens", or "tropical diseases without malaria", might be a good idea. If "human" and "malaria" are dominant in hits, we should just exclude them for custom searching. And so a second list, of stop word type, may be appended in due course to filter SPARQL dictionaries. (This could be hand-fixed in SPARQL, but it is better explicitly noted.) This is for a revised version, though, not the first.
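Points 3 and 6 together suggest that a dictionary is the result of a SPARQL query, plus an appended list, minus a stop list. The following is only a minimal sketch of that composition, assuming the query has already been run and its labels collected. The function name `build_dictionary` and the sample terms are hypothetical, not part of WikiFactMine.

```python
def build_dictionary(query_terms, appended_terms=(), stop_terms=()):
    """Combine SPARQL-derived terms with an appended list, then remove stop terms.

    All arguments are iterables of strings (labels or aliases).
    Matching is case-insensitive; the first spelling seen is kept.
    """
    stop = {t.casefold() for t in stop_terms}
    seen = set()
    result = []
    for term in list(query_terms) + list(appended_terms):
        key = term.casefold()
        if key in stop or key in seen:
            continue  # either "dug out" by the stop list or a duplicate
        seen.add(key)
        result.append(term)
    return result


# Hypothetical data standing in for a query result and the two lists.
mammals = build_dictionary(
    query_terms=["Homo sapiens", "Panthera leo", "Mus musculus"],
    appended_terms=["house mouse", "panthera leo"],
    stop_terms=["homo sapiens"],
)
print(mammals)
```

Keeping the stop list as separate data, rather than folding it into the query, matches the preference above for noting exclusions explicitly.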

Some pairs of dictionaries will work much better than others in finding candidate facts for Wikidata. Dealing with this will likely involve gamification, allowing easy skipping of uninteresting facts, and a salami-slicing approach of tweaking dictionaries to tune them.

With 100 dictionaries, an initial target, there would be thousands of pairs to try, so we'll be selective, trying to find good candidates to offer: for example as "levels" or suchlike in a game.
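As a quick check on the scale, the number of unordered pairs among n dictionaries is n(n-1)/2. The sketch below computes this and enumerates pairs for a short list. The dictionary names are hypothetical placeholders.

```python
from itertools import combinations
from math import comb


def unordered_pairs(n):
    """Number of distinct pairs that can be formed from n dictionaries."""
    return comb(n, 2)


print(unordered_pairs(100))  # 4950

# Enumerating candidate pairs for a small, hypothetical set of dictionaries.
names = ["rivers", "canals", "insects", "plants"]
pairs = list(combinations(names, 2))
print(len(pairs))  # 6
```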

Multiple facets

The dictionaries are part of an emerging bigger picture of faceted search.

In a query, the combination of statements is naturally conjunctive. The lines in the "WHERE" part (graph pattern) are to be read:

L and M and N and ...
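To make the conjunctive reading concrete, here is a small sketch that assembles a graph pattern from a list of triple patterns, each of which must hold. The helper `conjunctive_query` is hypothetical. The item and property IDs (P31 "instance of", Q4022 "river", P30 "continent", Q15 "Africa") are assumptions that should be checked against Wikidata, and many river items may not carry a continent statement directly.

```python
def conjunctive_query(patterns, variable="?item"):
    """Build a SELECT query whose WHERE lines are implicitly joined by 'and'."""
    body = " .\n  ".join(patterns)
    return f"SELECT DISTINCT {variable} WHERE {{\n  {body} .\n}}"


river_in_africa = conjunctive_query([
    "?item wdt:P31 wd:Q4022",  # L: instance of river (assumed ID)
    "?item wdt:P30 wd:Q15",    # M: continent is Africa (assumed ID)
])
print(river_in_africa)
```

The `wd:` and `wdt:` prefixes are predefined on the Wikidata Query Service, so the printed query can be pasted there without PREFIX declarations.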

If we want rivers or canals, though, we should run two queries and join the lists of hits. At least, that is the natural approach to a disjunctive problem. Having a river dictionary and a canal dictionary when looking for facts is handy, and in principle also gives different facets of the topics found in a text.
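One way to picture the "run two queries and join" approach is sketched below: hits from each dictionary are merged, but each hit remembers which facet it came from. The function `merge_hits` is hypothetical and the sample hits are made up for illustration, not real query output.

```python
def merge_hits(hits_by_facet):
    """Union the hits of several dictionaries, recording the facets of each hit."""
    merged = {}
    for facet, hits in hits_by_facet.items():
        for hit in hits:
            merged.setdefault(hit, set()).add(facet)
    return merged


# Made-up sample hits for two dictionaries run over the same text.
hits = merge_hits({
    "river": {"Thames", "Weaver"},
    "canal": {"Suez Canal", "Weaver"},
})
for term in sorted(hits):
    print(term, sorted(hits[term]))
```

The keys of the result are the disjunctive answer ("rivers or canals"), while the sets of facet names preserve the information that would be lost by a plain union.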