Wikidata:Requests for permissions/Bot/NPImporterBot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 07:13, 3 December 2020 (UTC)
NPImporterBot
Operator: Bjonnh
Task/s: The objective of this bot is to import missing molecules, organisms and references for the Wikidata Chemistry Natural products project.
Code: The bot is written in Kotlin (Q3816639) using the Wikidata Toolkit, and its code is available on GitHub.
Function details:
The bot has the following functions:
- Check if a molecule already exists, and add an entry if it does not (uses InChIKey (Q21445422) and International Chemical Identifier (Q203250)).
- Check if a taxon already exists, and add an entry if it does not (uses GBIF and other database IDs plus the name).
- Check if a scientific article already exists, and add an entry if it does not (uses DOI only).
- Annotate the compound with the property natural product of taxon (P1582) to link the compound with the taxon and the scientific article (added as a reference to this property).
- The bot does not have the ability to delete or change already existing data.
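The check-then-create behaviour listed above can be sketched as follows. This is a minimal illustration in Python, not the bot's actual Kotlin code: `ItemStore` is a hypothetical in-memory stand-in for Wikidata, the generated item IDs and the DOI are placeholders, and the use of "stated in" (P248) for the reference is an assumption about how the article might be attached.

```python
class ItemStore:
    """Hypothetical in-memory stand-in for Wikidata (not the real API)."""

    def __init__(self):
        self._by_key = {}

    def __len__(self):
        return len(self._by_key)

    def get_or_create(self, key):
        """Return (item_id, created). Existing items are never modified or deleted."""
        if key in self._by_key:
            return self._by_key[key], False
        item_id = f"Q{len(self._by_key) + 1}"  # placeholder ID, not a real QID
        self._by_key[key] = item_id
        return item_id, True


def build_statement(store, inchikey, taxon_name, doi):
    """Resolve the three items, creating only the missing ones, and describe the link."""
    compound, _ = store.get_or_create(("compound", inchikey))
    taxon, _ = store.get_or_create(("taxon", taxon_name))
    article, _ = store.get_or_create(("article", doi.upper()))
    return {
        "subject": compound,
        "property": "P1582",  # natural product of taxon
        "value": taxon,
        "references": [{"P248": article}],  # assumption: article attached via "stated in"
    }
```

Calling `build_statement` twice with the same inputs returns the same statement and creates no additional items, which mirrors the "add only if missing" rule.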
We have a manually created entity here (using valid data): [1]
And an automatically created entity here on test.wikidata.org by this bot (real data as well): [2]. The data was created using another account, but that account was not named specifically enough for a single project, so I made a new user just for this bot.
Ideally, we would also like this bot to upload the structures of the molecules to Wikimedia and link them to the molecules here.
Scheduling and triggering
The bot is scheduled to run when we have sufficient new data to add, or at regular intervals if we have a steady influx of data. It will be triggered manually, and the data will be curated and validated before each run.
Matching of existing data
Taxa
We use a SPARQL query that checks the taxon rank and taxon name. We use and add the identifiers from the following databases (in order of preference):
Database | Used for matching |
---|---|
GBIF | Yes |
NCBI Taxon | Yes |
ITIS | Yes |
Index Fungorum | Yes |
IRMNG | Yes |
World Register of Marine Species | Yes |
VASCAN | Yes |
GBIF Backbone | Yes |
*The following databases are not used for matching; they are unsorted and used only for references* | |
AmphibiaWeb | |
ARKive | |
Biolib.cz | |
BirdLife International | |
Encyclopedia of Life | |
EUNIS | |
FishBase | |
GRIN Taxonomy for Plants | |
iNaturalist | |
IUCN Red List of Threatened Species | |
Phasmida Species File | |
The eBird/Clements Checklist of Birds of the World | |
The Interim Register of Marine and Nonmarine Genera | |
IPNI | |
The Mammal Species of the World | |
Tropicos - Missouri Botanical Garden | |
uBio NameBank | |
USDA NRCS Plant Database | |
ZooBank | |
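As an illustration of the taxon matching described above, here is a minimal Python sketch (not the bot's Kotlin code). It builds a SPARQL query on taxon name (P225) and taxon rank (P105) for the Wikidata Query Service, and picks the most preferred identifier in the order given in the table. The function names and the shape of the `ids` dictionary are assumptions made for this example.

```python
# Databases used for matching, in the order of preference given in the table.
MATCHING_DATABASES = [
    "GBIF",
    "NCBI Taxon",
    "ITIS",
    "Index Fungorum",
    "IRMNG",
    "World Register of Marine Species",
    "VASCAN",
    "GBIF Backbone",
]


def taxon_query(name: str, rank_qid: str) -> str:
    """Build a SPARQL query for items with this taxon name (P225) and taxon rank (P105)."""
    # Escape backslashes and double quotes so the name is a valid SPARQL string literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE { "
        f'?item wdt:P225 "{escaped}" ; '
        f"wdt:P105 wd:{rank_qid} . "
        "}"
    )


def preferred_identifier(ids: dict):
    """Return (database, identifier) for the most preferred matching database, or None."""
    for database in MATCHING_DATABASES:
        if database in ids:
            return database, ids[database]
    return None
```

For example, `taxon_query("Coffea arabica", "Q7432")` looks for a species (Q7432) with that exact name. Databases from the second half of the table are ignored by `preferred_identifier`, since they are not used for matching.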
Compounds
Compounds are currently matched only by InChIKey.
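A lookup by InChIKey could look like the following Python sketch (again an illustration, not the bot's code). It checks the standard 27-character InChIKey layout before building a SPARQL query on the InChIKey property (P235). Normalizing the input to upper case and rejecting malformed keys are assumptions of this example.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def inchikey_query(inchikey: str) -> str:
    """Build a SPARQL query for items whose InChIKey (P235) equals the given key."""
    key = inchikey.strip().upper()
    if not INCHIKEY_RE.fullmatch(key):
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return f'SELECT ?item WHERE {{ ?item wdt:P235 "{key}" . }}'
```

Validating the key first avoids sending a query that can never match, and keeps malformed input out of the query string.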
Articles
Articles are matched using WDTK directly, so that we can match DOIs case-insensitively. We match only on DOI, as titles have proved to be highly unreliable. Our database currently contains only articles that have DOIs.
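One generic way to obtain case-insensitive DOI matching is to normalize both sides before comparing, as in the Python sketch below. This is not necessarily how the bot or WDTK does it; the function names, the prefixes that are stripped, and the `known` dictionary of DOI to item ID are assumptions for illustration.

```python
def normalize_doi(doi: str) -> str:
    """Strip common prefixes and upper-case the DOI so comparison ignores case."""
    value = doi.strip()
    lowered = value.lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            value = value[len(prefix):]
            break
    return value.strip().upper()


def find_article(doi: str, known: dict):
    """Return the item ID whose DOI matches case-insensitively, or None."""
    index = {normalize_doi(existing): item for existing, item in known.items()}
    return index.get(normalize_doi(doi))
```

With `known = {"10.1000/ABC": "Q1"}` (placeholder values), `find_article("10.1000/abc", known)` returns `"Q1"`, whereas a plain string comparison would miss the match.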
Links Compounds-Taxa-Article
Only new links are added.
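Filtering out links that already exist can be sketched as below. This Python example is illustrative only: it assumes each link is represented as a `(compound, taxon, article)` tuple of item IDs, and the IDs used in any call are placeholders.

```python
def new_links(candidates, existing):
    """Return the candidate links not already present, keeping input order and dropping repeats."""
    seen = set(existing)
    fresh = []
    for link in candidates:
        if link not in seen:
            seen.add(link)  # also guards against duplicates within the candidates
            fresh.append(link)
    return fresh
```

Because `seen` is updated as links are accepted, a link repeated within the same batch is added only once.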
Source of Data
Data is sourced from several databases and then aggregated and curated; there is no direct data from any database that is not linked to others. We plan to add a system that would add the identifiers from other databases to the entries, but we are not ready for that yet.
Reliability and duplicates
The existence checks use SPARQL or direct queries (the latter for DOIs, as they are case-insensitive and SPARQL queries were far too slow), so unfortunately they cannot be tested on the test.wikidata.org instance.
Discussion
--Bjonnh (talk) 20:03, 28 August 2020 (UTC)
- How many new items will be imported? --GZWDer (talk) 03:30, 29 August 2020 (UTC)
- Currently we expect ~2 million natural product of taxon (P1582) triples, and for the molecules we expect to add at least 50k. We think that the taxa are already pretty complete, so it may be just a handful. For the references, we are probably above 10k. Bjonnh
- Of the really clean data we have:
- 35k taxa
- 196k molecules
- 89k references with a DOI, plus a few more with only a PubMed ID, but we may not include these in the first round.
- 2.5m taxon-molecule-reference "triplets". In practice, as we have several references for the same "taxon-molecule" couplet, this doesn't translate 1:1 to the number of triplets (as in RDF triplets) we are going to add. --Bjonnh (talk) 17:36, 29 August 2020 (UTC)
- Support. Note the submitters have extensively informed and discussed with members of the Chemistry WikiProject. See their information page: Wikidata:WikiProject_Chemistry/Natural_products --SCIdude (talk) 15:14, 4 September 2020 (UTC)
- A general Support, but I would like to see some things worked out further. In particular, I would like to see a shape expression, some more information on how items are matched (in what order properties are matched, and how conflicts are handled, e.g. where InChIKey says identical but PubChem CID says not), and some further thought on whether other bots cannot / do not already handle taxa and DOIs. I think citing the primary literature is an awesome idea, but I would also consider citing the database from which the info is imported. Please have a look at the metadata/provenance models by the ProteinBoxBot to see how the references can be improved. Finally, do you expect this bot to be run once, or will it run every few months, to keep up with the literature and/or the content of the source database? --Egon Willighagen (talk) 05:54, 5 September 2020 (UTC)
OK, we will work on getting shape expressions and think about this conflict problem. For citations, we import the info from Crossref and PubMed. We plan to keep the bot running regularly; once every few months sounds reasonable for us. --Bjonnh (talk) 15:17, 8 September 2020 (UTC)
@Lymantria, Egon Willighagen: We have reworked the bot; it has now added over 100 entries properly, and we verified them manually. Let us know the next steps for the permission. Another question: if we decide to add a functionality (let's say also checking for PubChem CID), should we go through the request for permission again, or are we allowed to make incremental changes? Bjonnh (talk) 18:12, 30 November 2020 (UTC)
- As long as a new functionality fits the task description you have given and that is approved here, you can add it. Otherwise you should indeed come back here.
- I will approve in a couple of days, provided that no objections are raised. Lymantria (talk) 06:27, 1 December 2020 (UTC)