Wikidata:WikiFactMine/Terpenoids case study

This page presents a case study on building a terpenoids dictionary. It explores dictionary-building techniques that go beyond a single SPARQL query.

Terpenoids

According to the English Wikipedia, the terpenoids (Q426694) concept is key to understanding naturally-occurring chemicals: they are "the largest group of natural products". Finding a representative list of them, though, is not straightforward in Wikidata terms. A simple query shows that only a couple of compounds here are given as instance of (P31) terpenoids (Q426694), while (at the time of writing) terpenoids (Q426694) has no useful subclasses.
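As an illustration, the kind of "simple query" meant here can be sketched in Python. This is an assumption about its shape, not the query actually linked from the page: it asks the Wikidata Query Service for items that are direct instances (P31) of a given class. The function names `instances_query` and `query_url` are hypothetical helpers, and no network request is made.

```python
from urllib.parse import urlencode

# Public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def instances_query(class_qid: str, limit: int = 100) -> str:
    """Build SPARQL selecting items that are direct instances (P31) of class_qid.

    Hypothetical helper: the exact query used in the case study is not shown.
    """
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def query_url(sparql: str) -> str:
    """Return a GET URL that would run the query and ask for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": sparql, "format": "json"})


if __name__ == "__main__":
    # terpenoids (Q426694)
    print(instances_query("Q426694"))
```

A query of this form only finds items tagged directly with the class, which is why it returns so few compounds when the class has no useful subclasses.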

Zeaxanthin, a terpene alcohol that gives paprika its colour

First steps

To bring in the English Wikipedia category system as a support, one can use a Petscan query. These queries are saveable searches with their own permanent number. This one searches en:w:Category:Terpenes and terpenoids to depth six, finding around 1000 hits. The output page allows export to a PagePile.

That query contains much that is not a natural product, though. en:w:Category:Steroids is a subcategory, and numerous steroids do not occur as natural products. A modified Petscan query takes out the whole steroid category (with subcategories), leaving around 400 compounds. (NB: anything not showing up as a chemical compound on Wikidata is excluded here by a SPARQL filter, entered in a box on the "Other sources" page.)

Replacing natural products

Some steroids do occur as natural products, though, such as testosterone. Since all steroids were taken out by fiat, it is appropriate to put back those whose Wikidata items have a found in taxon (P703) statement. Indeed, for the purposes of listing terpenoids, adding such statements is clearly a step in the right direction. A special "natural steroids" query picks up over 50 steroids by enWP subcategory, filtered by SPARQL requiring that the Wikidata item is a compound with found in taxon (P703).
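The SPARQL side of such a filter might look like the sketch below. It is an assumption, not the text of the saved Petscan query: it treats "compound" as instance of (P31) chemical compound (Q11173), and the helper name `natural_compound_query` is hypothetical. Items modelled differently on Wikidata would need a different class test.

```python
def natural_compound_query(compound_class: str = "Q11173") -> str:
    """Build SPARQL for items that are compounds with a found in taxon (P703) statement.

    Assumption: compounds are typed with P31 pointing at compound_class
    (default Q11173, chemical compound).
    """
    return (
        "SELECT DISTINCT ?item WHERE {\n"
        f"  ?item wdt:P31 wd:{compound_class} ;\n"
        "        wdt:P703 ?taxon .\n"
        "}"
    )


if __name__ == "__main__":
    print(natural_compound_query())
```

Combined with a category search, a filter like this keeps only pages whose items satisfy both conditions.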

At this point we want to "put back the natural steroids". Here we can rely on PagePile to take the union of the piles corresponding to the steroid-free query and the "natural steroids" query.
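In set terms, the construction so far is a difference followed by a union. The sketch below models it with toy data: the page titles are illustrative only, the `has_found_in_taxon` set is hypothetical, and `build_candidates` is a made-up helper name. The real lists come from Petscan and PagePile.

```python
def build_candidates(category: set[str], steroids: set[str], natural: set[str]) -> set[str]:
    """Remove every steroid from the category, then add back the natural ones."""
    without_steroids = category - steroids      # modified Petscan query
    natural_steroids = steroids & natural       # "natural steroids" query
    return without_steroids | natural_steroids  # union taken in PagePile


if __name__ == "__main__":
    # Toy inputs, not real query results.
    terpene_category = {"Zeaxanthin", "Limonene", "Testosterone", "Prednisone"}
    steroid_category = {"Testosterone", "Prednisone"}
    has_found_in_taxon = {"Zeaxanthin", "Testosterone"}  # hypothetical P703 coverage

    print(sorted(build_candidates(terpene_category, steroid_category, has_found_in_taxon)))
    # prints ['Limonene', 'Testosterone', 'Zeaxanthin']
```

By construction the two piles being united are disjoint, since one contains no steroids and the other contains only steroids.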

There is in fact a quite different way to use Wikipedia categories on this problem. The German Wikipedia's de:Kategorie:Terpenoid does not include de:Kategorie:Steroid: both are subcategories of de:Kategorie:Lipid. A Petscan query based on subcategories of the Terpenoid category produces over 200 hits.

Formal record

Wikidata:WikiFactMine/Dictionaries and PagePiles suggests a style of formal recording (reverse Polish) for constructions with PagePile. Here is how it applies to this case.

The process in PagePile begins with

10311 = 10310 wd (i.e. list of 418 enWP pages filtered to Wikidata), still 418 hits

where

10310 = output of PSID 1181824, Petscan query based on enWP category excluding steroids, 418 hits on 2017-08-22.

Then

10313 = 10312 wd (56 hits)

where

10312 = output of PSID 1184961, Petscan query based on natural products

Also

10315 = 10314 wd (232 hits from a list of 236 pages on deWP)

where

10314 = output of PSID 1212768, Petscan query based on de:Kategorie:Terpenoid

The four missing deWP pages

10319 = 10314 nowd

turned out to be redirects.[1]

The final list

10318 = 10311 10313 union 10315 union

came out to 585 entries. The intermediate PagePile 10317 had 474 entries, exactly 418 + 56, which implies that 10311 and 10313 were indeed disjoint.
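The reverse Polish records above can be read as instructions for a small stack machine. The following is a minimal sketch under stated assumptions, not PagePile's own implementation: piles are modelled as Python sets of page titles, `wd` keeps the pages that have a Wikidata item, `nowd` keeps those that do not, and `union` merges the top two piles. The function name `eval_rpn` and the toy piles are hypothetical.

```python
def eval_rpn(expression: str, piles: dict[str, set[str]], on_wikidata: set[str]) -> set[str]:
    """Evaluate a reverse Polish pile expression such as '10311 10313 union'."""
    stack: list[set[str]] = []
    for token in expression.split():
        if token == "union":
            right = stack.pop()
            left = stack.pop()
            stack.append(left | right)
        elif token == "wd":
            stack.append(stack.pop() & on_wikidata)   # pages with a Wikidata item
        elif token == "nowd":
            stack.append(stack.pop() - on_wikidata)   # pages without one
        else:
            stack.append(piles[token])                # a pile number
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


if __name__ == "__main__":
    # Toy piles standing in for the real ones.
    piles = {"10311": {"A", "B"}, "10313": {"C"}, "10315": {"B", "D"}}
    final = eval_rpn("10311 10313 union 10315 union", piles, on_wikidata={"A", "B", "C", "D"})
    print(sorted(final))  # prints ['A', 'B', 'C', 'D']

    # Count check with the figures recorded above (inclusion-exclusion).
    assert 418 + 56 == 474        # no shared entries between 10311 and 10313
    overlap = 474 + 232 - 585     # entries of 10315 already present in 10317
    print(overlap)                # prints 121
```

The same count check shows that, on these figures, 121 of the 232 deWP-derived entries were already in the enWP-derived pile.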

Notes

  1. de:Yangonin (see yangonin (Q8048635)), de:Epigallocatechin (see epigallocatechin gallate (Q393339)), de:Epicatechingallat (see epicatechin gallate (Q5382492)), de:Phytocannabinoide (not present). There seems to be no chemical justification for including the first three in a terpenoids dictionary, since they belong to other groups of naturally-occurring chemicals. Had there been one, they could have been added by hand to a dictionary, or (better from a formal point of view) made into a short PagePile to add.