Wikidata:AddScholarTopics Script

General idea

The script adds main subject (P921) to scholarly articles.

This enables topic-based search in tools such as Scholia.

The script uses a list of keyword / item pairs for which we can assume that, if the keyword appears in a publication's title, the item is one of the publication's main topics.

For example :

Keywords that have a homonym, and short acronyms, must never be added to the list. Unfortunately, this process does not work for them.

The bot only adds a claim to an item if the item's title contains an exact match for the keyword. If you are familiar with regex, you can extend this behaviour by adding regular expressions for the keyword.

For each new pair, the consistency of the query results must be checked manually by:

  • the contributor, before adding the pair to the list below,
  • the bot operator, before running the bot.

The metadata of scholarly articles in Wikidata is virtually impossible to maintain by hand, because these articles are created faster than the community is able and willing to set and maintain such data. Controlled automation of this kind is therefore the only way we can maintain the data.

Before adding a keyword / topic pair to the list

The request

For each line of the keyword / topic dictionary, the following request is run.

Please run the request manually on your keyword / topic pair and check the consistency of the results before adding the pair to the list.

The following code was contributed in this request for a query:

SELECT DISTINCT ?item ?itemLabel 
WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "keyword haswbstatement:P31=Q13442814".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)
  FILTER NOT EXISTS { ?item wdt:P921 wd:Q202864. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". } 
}

Details

mwapi:srsearch "keyword haswbstatement:P31=Q13442814".

"keyword" is to be replaced with the name you are searching for. Please note that if you search for two words, the search returns any label that contains both words, regardless of where they appear in the label. A function called "error shield" later ensures an exact match with the keyword or its regular expressions before any claim is added to an item.

instance of (P31) has to be set to scholarly article (Q13442814) because we are only working on scholarly articles here. All articles mass-imported from PubMed or via their DOI have this set. The approach could probably be extended to other types of scientific publication.

FILTER NOT EXISTS { ?item wdt:P921 wd:Q202864. }

This ensures that main subject (P921) is not already set to the targeted item. Replace Q202864 with your target Qid.
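
The two substitutions described above (the keyword in the search string and the target Qid in the filter) can be sketched in Python as follows. This is only an illustration, not the bot's actual code: the names `QUERY_TEMPLATE` and `build_query` are hypothetical, and the input checks are an assumption added to avoid producing a malformed query.

```python
import re

# The request from this page, with the keyword and the target Qid
# turned into placeholders. Literal braces are doubled for str.format.
QUERY_TEMPLATE = """SELECT DISTINCT ?item ?itemLabel
WHERE {{
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {{
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "{keyword} haswbstatement:P31=Q13442814".
    ?title wikibase:apiOutput mwapi:title.
  }}
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)
  FILTER NOT EXISTS {{ ?item wdt:P921 wd:{qid}. }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }}
}}"""


def build_query(keyword: str, qid: str) -> str:
    """Return the request for one keyword / topic pair (hypothetical helper)."""
    if not re.fullmatch(r"Q[1-9][0-9]*", qid):
        raise ValueError(f"not a valid Qid: {qid!r}")
    if '"' in keyword or "\\" in keyword:
        # Assumption of this sketch: refuse characters that would break
        # the quoted SPARQL string instead of trying to escape them.
        raise ValueError(f"unsupported character in keyword: {keyword!r}")
    return QUERY_TEMPLATE.format(keyword=keyword, qid=qid)


if __name__ == "__main__":
    print(build_query("fibromyalgia", "Q540571"))
```

The printed text can be pasted into the Wikidata Query Service to review the results by hand before the pair is added to the list.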

The Keywords / Topics dictionary

The following dictionary contains keyword / topic Qid pairs.

It works by pairs of strings only, in order to keep it simple.

Before adding to this list, you must test-run the request above and check that the results are consistent.

Please respect the syntax. If this page is broken the bot will not launch.

    '':'',

The list is in alphabetical order. However, you can temporarily add an item at the top to ensure it is processed first.

{
    'anandamide':'Q410228',
    'adenosine triphosphate':'Q80863',
    'astrocyte':'Q502961',
    'aluminum':'Q663',
    'aluminium alloy':'Q447725',
    'avitaminoses':'Q194435',
    'Behçet disease':'Q911427',
    'borrelia burgdorferi infection':'Q201989',
    'cannabichromene':'Q410949',
    'Cannabicyclol':'Q907909',
    'cannabidivarin':'Q1104117',
    'cannabidiol':'Q422917',
    'cannabinoid':'Q422936',
    'cannabinoid receptor':'Q421237',
    'cannabinol':'Q265831',
    'cannabigerol':'Q412122',
    'chelation therapy':'Q1069061',
    'chromium element':'Q725',
    'chronic cancer related pain':'Q58490762',
    'chronic pain':'Q1088113',
    'chronic postsurgical pain':'Q58490799',
    'chronic post traumatic pain':'Q58490799',
    'chronic neuropathic pain':'Q58490835',
    'chronic neurogenic pain':'Q2798704',
    'chronic neurologic pain':'Q58490835',
    'chronic primary headache pain':'Q58491426',
    'chronic primary orofacial pain':'Q58491426',
    'chronic primary musculoskeletal pain':'Q58491405',
    'chronic primary pain':'Q58490715',
    'chronic primary visceral pain':'Q58491345',
    'chronic secondary musculoskeletal pain':'Q58490917',
    'chronic secondary visceral pain':'Q58490904',
    'chronic widespread pain':'Q58491385',
    'cobalt':'Q740',
    'cortisol':'Q190875',
    'curcumin':'Q312266',
    'cyanocobalamin':'Q252251',
    'dextro naltrexone':'Q63100609',
    'diagnosis':'Q16644043',
    'diagnostic':'Q16644043',
    'diagnostic procedure':'Q177719',
    'dopamine':'Q170304',
    'dynorphin':'Q324076',
    'Ehlers Danlos':'Q1141499',
    'endorphin':'Q190528',
    'enkephalin':'Q325101',
    'fibromyalgia':'Q540571',
    'glucose':'Q37525',
    'heavy metal intoxication':'Q19904193',
    'heavy metal poisoning':'Q19904193',
    'heavy metal toxicity':'Q19904193',
    'heavy metal toxicosis':'Q19904193',
    'hexavalent chromium':'Q2660666',
    'imidazoline':'Q419555',
    'interleukin':'Q194908',
    'internal medicine':'Q11180',
    'intoxication':'Q18621601',
    'iridium':'Q877',
    'low dose naltrexone':'Q5259325',
    'Lyme disease':'Q201989',
    'Lyme borreliosis':'Q201989',
    'magnesium':'Q660',
    'magnesium sulfate':'Q288266',
    'medical cannabis':'Q1033379',
    'medical genetics':'Q1071953',
    'melatonin':'Q180912',
    'metal intoxication':'Q4215775',
    'metal poisoning':'Q4215775',
    'metal toxicity':'Q4215775',
    'metal toxic':'Q4215775',
    'metal toxicosis':'Q4215775',
    'microglia':'Q1622829',
    'microglial inhibitor':'Q63100844',
    'microglia inhibition':'Q63100844',
    'neuroborreliosis':'Q201989',
    'neuroglia':'Q177105',
    'neuroinflammation':'Q17157137',
    'neuronitis':'Q17157137',
    'neurogenic pain':'Q2798704',
    'neurologic pain':'Q2798704',
    'neuropathic pain':'Q2798704',
    'nutrition disorder':'Q1361144',
    'nickel':'Q744',
    'nociceptin':'Q4327722',
    'norepinephrine':'Q186242',
    'obesity':'Q12174',
    'opioid':'Q427523',
    'overweight':'Q12174',
    'pain management':'Q621261',
    'pathology':'Q7208',
    'poisoning':'Q114953',
    'psychiatry':'Q7867',
    'serotonin':'Q167934',
    'stainless steel':'Q172587',
    'tachykinin receptor':'Q426034',
    'titanium':'Q716',
    'tetrahydrocannabinol':'Q190067',
    'toll like receptor':'Q408004',
    'toxic heavy metal':'Q19904193',
    'tumour necrosis factor':'Q18032037',
    'vasopressin':'Q12009087',
    'vitamin':'Q34956',
    'vitamin deficiency':'Q194435',
}
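
Since a broken page stops the bot from launching, it can help to check the dictionary text locally before saving. The following is a minimal sketch, assuming the dictionary is a plain Python dict literal as shown above; `parse_pairs` and `QID_PATTERN` are hypothetical names, and how the bot itself reads the page is not documented here.

```python
import ast
import re

QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def parse_pairs(page_text: str) -> dict:
    """Parse a keyword / Qid dict literal and check every pair."""
    try:
        # literal_eval only accepts literals, so no code is executed.
        pairs = ast.literal_eval(page_text.strip())
    except (SyntaxError, ValueError, TypeError) as err:
        raise ValueError(f"the dictionary is not valid: {err}") from err
    if not isinstance(pairs, dict):
        raise ValueError("expected a dictionary of 'keyword':'Qid' pairs")
    for keyword, qid in pairs.items():
        if not isinstance(keyword, str) or not keyword.strip():
            raise ValueError(f"empty or non-string keyword: {keyword!r}")
        if not isinstance(qid, str) or not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"bad Qid for {keyword!r}: {qid!r}")
    return pairs


if __name__ == "__main__":
    sample = """{
        'cobalt':'Q740',
        'nickel':'Q744',
    }"""
    print(parse_pairs(sample))
```

Note that a key repeated in the literal is silently collapsed to its last value by Python, so this check does not detect duplicate lines.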

Exclusions

The following dict is an exclusion list:

  • The keys are the Qids of scientific articles.
  • Each value is a list of the topic Qids that should not be added to that article.

It will very probably only be useful in very specific cases.

{
    'Q46788624':['Q797668','Q5384031']
}
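
How such an exclusion list could be applied is sketched below. This is an assumption about the behaviour, not the bot's code: `allowed_topics` is a hypothetical helper, and `Q_OTHER_ARTICLE` is a placeholder rather than a real Qid.

```python
EXCLUSIONS = {
    'Q46788624': ['Q797668', 'Q5384031'],
}


def allowed_topics(article_qid, candidate_topics, exclusions=EXCLUSIONS):
    """Drop the candidate topics that are excluded for this article."""
    blocked = set(exclusions.get(article_qid, []))
    return [topic for topic in candidate_topics if topic not in blocked]


if __name__ == "__main__":
    # The excluded topic is removed for the listed article only.
    print(allowed_topics('Q46788624', ['Q797668', 'Q12174']))
    print(allowed_topics('Q_OTHER_ARTICLE', ['Q797668', 'Q12174']))
```

Articles that are not in the dict keep all their candidate topics, which matches the intent that exclusions only cover very specific cases.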

Using regular expressions

The following dict is for items that can be written in lots of different ways.

By default, in order to avoid wrong attributions, the property is only added if the exact match for the keyword has been found.

Since the search function returns results even when there is no exact match, this is implemented by an "error shield" function in the bot that checks for an exact match using regex.
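
The idea of such a check can be sketched as follows. This is not the bot's actual "error shield": `is_exact_match` is a hypothetical name, and the choice to require that the keyword is not glued to other word characters is an assumption of this sketch.

```python
import re


def is_exact_match(keyword: str, title: str) -> bool:
    """Return True if the keyword appears as a whole phrase in the title."""
    # Both strings are lowercased, then the keyword is escaped so that it
    # is matched literally, and must not touch other word characters.
    pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
    return re.search(pattern, title.lower()) is not None


if __name__ == "__main__":
    print(is_exact_match("chronic pain", "Management of chronic pain in adults"))
    print(is_exact_match("chronic pain", "Pain in chronic disease"))
```

The second title contains both words, so the search would return it, but the phrase itself is absent and the check rejects it.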

One way of handling name variants is to add every possible spelling of the item name to the keywords / Qid list. However, if you are familiar with regular expressions, you can use the following list to speed up the process.

Tools to help with regex :

Please use only lowercase letters in this dict, as the strings are converted to lowercase before checking.

Please double-check each regex before saving the page, as a single typo can generate many errors in Wikidata.

{
    'chronic pain':'chronic.*pain',
    'vitamin a':['vitamine?s?[ \-_]?a[1-9]*','a[1-9]*[ \-_]?vitamine?s?'],
    'vitamin b':['vitamine?s?[ \-_]?b[1-9]*','b[1-9]*[ \-_]?vitamine?s?'],
    'vitamin c':['vitamine?s?[ \-_]?c[1-9]*','c[1-9]*[ \-_]?vitamine?s?'],
    'vitamin d':['vitamine?s?[ \-_]?d[1-9]*','d[1-9]*[ \-_]?vitamine?s?'],
    'vitamin e':['vitamine?s?[ \-_]?e[1-9]*','e[1-9]*[ \-_]?vitamine?s?'],
    'vitamin b6':['vitamine?s?[ \-_]?b6[1-9]*','b6[1-9]*[ \-_]?vitamine?s?'],
    'vitamin b12':['vitamine?s?[ \-_]?b12[1-9]*','b12[1-9]*[ \-_]?vitamine?s?'],
    'vitamin k':['vitamine?s?[ \-_]?k[1-9]*','k[1-9]*[ \-_]?vitamine?s?'],
}

Note:

'chronic pain':'chronic.*pain',

This matches any title that contains both "chronic" and "pain", in that order. This type of regex is very powerful. Use it carefully!
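
To see how broad a pattern is before saving it, you can try it on a few sample titles. The sketch below assumes the patterns are applied with Python's `re.search` to the lowercased title; how the bot really applies them is not stated on this page, and `matches_any` and the sample titles are made up for illustration.

```python
import re

# Two entries copied from the dict above, written as raw strings.
REGEX_PATTERNS = {
    'chronic pain': r'chronic.*pain',
    'vitamin d': [r'vitamine?s?[ \-_]?d[1-9]*', r'd[1-9]*[ \-_]?vitamine?s?'],
}


def matches_any(keyword, title, patterns=REGEX_PATTERNS):
    """Return True if any pattern registered for the keyword matches."""
    entry = patterns.get(keyword, [])
    if isinstance(entry, str):
        entry = [entry]  # a value is either one pattern or a list of them
    lowered = title.lower()
    return any(re.search(pattern, lowered) for pattern in entry)


if __name__ == "__main__":
    for keyword, title in [
        ('chronic pain', 'Chronic low back pain in adults'),
        ('chronic pain', 'Chronic fatigue after acute pain episodes'),
        ('chronic pain', 'Pain in chronic disease'),
        ('vitamin d', 'Vitamin D3 supplementation'),
        ('vitamin d', 'Vitamin deficiency in children'),
    ]:
        print(keyword, '|', title, '|', matches_any(keyword, title))
```

Under this assumption, the second title matches 'chronic.*pain' although it is not about chronic pain, and the last title matches the 'vitamin d' pattern because "vitamin d" is the start of "vitamin deficiency". Testing a handful of near-miss titles like these is a cheap way to catch an over-broad pattern.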

Launching the bot

The bot is only run manually by an operator.

Ask an operator to run the bot after editing the list.

Operators will double check your edits before launching the script.


Current operators are:

Operator account    Bot account used
User:Thibdx         User:Tdbot

Source code

The bot is open source and can be found here: https://paws-public.wmflabs.org/paws-public/User:Tdbot/addScholarTopics.ipynb

Feel free to fork and adapt this bot.

/!\ You are responsible for any edit made by a bot you launched. Please read Wikidata:Bots and request permission before operating a bot.

Ideas for improvements

  • Get topics from the metadata of PubMed and other major sites.
  • Once a good set of data is in place, deep learning could be tested to see if it helps to fill in the blanks. This could be based on the data retrieved, but also on the article abstract and the full text when available.

Examples


Paws

Minimal environment to run Pywikibot within a notebook.

Toolforge

Full featured server.

Gerrit

Wikimedia's git

Phabricator

Project management and code review

General Wikimedia dev