Wikidata talk:WikiProject Chemistry Natural products


It could be :

found in taxon (P703) sounds good to me, but take notice of the semantic annotation that the NP is then a secretion or excretion. --Egon Willighagen (talk) 06:52, 5 September 2020 (UTC)[reply]
@Egon Willighagen can you elaborate here on the "semantic annotation that the NP is then a secretion or excretion"? GrndStt (talk) 17:33, 4 October 2022 (UTC)[reply]
I am afraid I do not remember the context :( What I can imagine it was about is the following: secretion and excretion (there was a typo in my original message) are two different ways a natural product can be "found" in an organism. Actually, there is a third way: it doesn't leave the organism. I guess the discussion (elsewhere) was about whether we could explain in more detail (i.e. more semantically) what the "found" referred to. Btw, a related discussion came up earlier this week: Wikidata_talk:WikiProject_Chemistry/Natural_products#Mapping_%22chemical_found_in_taxon%22_with_more_precision. --Egon Willighagen (talk) 04:41, 5 October 2022 (UTC)[reply]

Volatile compounds (EssoilDB and CEVOpen)


These are ongoing projects to create open knowledgebases of plants and their volatile components. The projects have been running for some years, initially manually and more recently automatically. In the last two years we have been indexing against Wikidata by matching terms.

There are a lot (possibly up to 100-250K) of simple, often open, articles of the following sort:

  • introduction (plant, medical need, etc.)
  • location
  • plant methods (growing conditions, location, plant parts, treatment)
  • extraction methods (e.g. steam distillation)
  • chemical composition (names, percentages) - 20-100 compounds are common
  • medicinal and other activities of the oils

We have created about 10 dictionaries for all of these ("facets") and most terms are linked to Wikidata. In a typical run we:

  • search EuropePMC for essential oils
  • download the papers
  • search the full text with the dictionaries.

We'd like to add the data back to Wikidata - but there is quite a lot (often hundreds of data points per article, or hundreds of papers per plant) and there may be quality control concerns. Petermr (talk) 12:58, 21 January 2021 (UTC)[reply]
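The dictionary-search step described above could be sketched roughly as follows. This is only an illustration, not the actual EssoilDB/CEVOpen code: the facet names, the terms, the placeholder identifiers and the `annotate` helper are all assumptions made up for the example, and a real run would work on full texts downloaded from EuropePMC.

```python
import re

# Hypothetical facet dictionaries mapping a term to a Wikidata item.
# The identifiers are placeholders, NOT real QIDs.
DICTIONARIES = {
    "compound": {"limonene": "Q_LIMONENE", "linalool": "Q_LINALOOL"},
    "extraction": {"steam distillation": "Q_STEAM_DISTILLATION"},
}


def annotate(text, dictionaries):
    """Return sorted (facet, term, item) hits for whole-word, case-insensitive matches."""
    hits = []
    for facet, terms in dictionaries.items():
        for term, item in terms.items():
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append((facet, term, item))
    return sorted(hits)


# Invented sentence standing in for the full text of a paper.
sample = "The oil was obtained by steam distillation; Limonene (42%) dominated."
hits = annotate(sample, DICTIONARIES)
for facet, term, item in hits:
    print(facet, term, item)
```

Each hit corresponds to one candidate data point, which gives an idea of how quickly a single article can yield hundreds of them.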

Hundreds of references are definitely a concern. There is also no positive effect of having 20 instead of 5 references to the same statement. So maybe you want to limit these. If the quality varies and it can be quantified, even better. But that is my personal opinion. --SCIdude (talk) 16:32, 21 January 2021 (UTC)[reply]
It's actually thousands of plants that produce the same chemical and hundreds of chemicals produced by the same plant. If WD records the plant(s) that a chemical is found in, do we want just a random sample, or the commonest chemical in a plant? The statements are all distinct - an article could be making 1000+ statements. Petermr (talk) 08:37, 22 January 2021 (UTC)[reply]
1000+ structure-organism pairs in a single article?! Looks enormous... When linking back to articles, the final goal is to be able to verify the data, so I would personally avoid "big reviews" listing 100+ compounds and keep only the "high confidence" sources. Out of pure curiosity, since these are volatiles, did you also collect the odor? AdrianoRutz (talk) 12:41, 15 May 2021 (UTC)[reply]
I think hundreds+ are a good number for generalizations. If you find that the taxa have a common ancestor this would be evidence for a common phenotype. Doing this for all these sets would be worth a paper. --SCIdude (talk) 16:55, 15 May 2021 (UTC)[reply]

Mapping of artifacts


{{Ping project|Chemistry}}

As reported in https://github.com/lotusnprod/lotus-web/issues/26, we currently do not describe known artifacts (of the extraction process, for example), i.e. compounds that are not present in the taxon itself but result from the process. It is a very interesting case, since such compounds differ somewhat from pesticides found in plants.

We should think of the best way of mapping such compounds.

For the moment, I could think of annotating either the 'found in taxon' statement or the chemical itself with something like 'instance of'(?) (chemical) artifact.

Any good idea is welcome! AdrianoRutz (talk) 21:58, 31 May 2022 (UTC)[reply]

Great example. Indeed, such a distinction would be useful.
Note that regarding the compound cited in the issue (https://www.wikidata.org/wiki/Q419811, diethyl phthalate), at least one reference (https://www.wikidata.org/wiki/Q43777355) seems to indicate that it is indeed produced by a living organism (Helicobacter pylori). GrndStt (talk) 06:13, 1 June 2022 (UTC)[reply]
Thus I guess that such precision should be added on the property linking the chemical to the organism rather than on the chemical itself. GrndStt (talk) 06:15, 1 June 2022 (UTC)[reply]

False information in allylbenzene (Q56435819)


Yesterday, a new article was written in pl.wiki about allylbenzene (Q56435819). It included information about natural occurrence in Alpinia officinarum based on the LOTUS entry, which uses the information from WD. This information was added in 2021 by a bot [1] based on Isolation and structural elucidation of some glycosides from the rhizomes of smaller galanga (Alpinia officinarum Hance). (Q44093652). However, I don't see any information about allylbenzene in this paper; I only see a structure that has allylbenzene as its substructure.

So, how did it happen that allylbenzene (Q56435819) was mapped from Isolation and structural elucidation of some glycosides from the rhizomes of smaller galanga (Alpinia officinarum Hance). (Q44093652)? And most importantly: is this an isolated incident, or might more false data be present in WD? Wostr (talk) 15:28, 8 August 2022 (UTC)[reply]

Hi @Wostr! Thank you very much for your report (and sorry for the late reply, I did not get the notification). You are right, the information was incorrect and has now been removed. I would love to say it was an isolated incident, but we rely on what humans marked as "correct" in other source databases. Such cases sometimes happen; we do our best to avoid them but cannot guarantee there are no others. If there are, they are very limited in number anyway.
Thanks again! AdrianoRutz (talk) 11:37, 4 October 2022 (UTC)[reply]

Mapping "chemical found in taxon" with more precision

Adriano Rutz
GrndStt
Jonathan Bisson
Egon Willighagen
Daniel Mietchen
Rod Page
Ralf Stephan
Peter Murray-Rust
Tiago Lubiana
Photocyte

Notified participants of WikiProject Chemistry/Natural Products

This is an open discussion to decide which mapping to adopt when the source mentions more granular information about the localization of the chemical.

In the case of plant (Q756), this can be aerial parts (Q96022820) for example. Some items already exist such as passion flower (herb) (Q96211598), but this does not look good.

In the case of Homo sapiens (Q15978631), it could be blood (Q7873) or urine (Q40924).

There is also the case of cell (Q7868) and its compartmentalization, but cells are already only a part of a taxon (Q16521). I guess we should adopt a model that could be used in all those cases. @TiagoLubiana, any advice?

My first idea was to use applies to part, aspect, or form (P518), but I feel this deserves better thinking. anatomical location (P927) also looks like a good candidate.

Fictive proposal:

aspirin (Q18216) found in taxon (P703) Spiraea (Q148745) anatomical location (P927) follicle (Q147807)

aspirin (Q18216) found in taxon (P703) Homo sapiens (Q15978631) anatomical location (P927) urine (Q40924)

aspirin (Q18216) found in taxon (P703) Homo sapiens (Q15978631) anatomical location (P927) colon (Q5982337) applies to part, aspect, or form (P518) Caco-2 (Q5016050) anatomical location (P927) nucleus (Q40260) AdrianoRutz (talk) 12:03, 4 October 2022 (UTC)[reply]
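To make the shape of the proposal more concrete, here is a minimal sketch of how such a statement could be held in memory before upload. The `Statement` class and its `render` method are hypothetical helpers invented for this illustration; only the item and property IDs come from the examples above. As far as I understand the Wikidata data model, qualifiers are flat property-value pairs attached to one statement, so the chained third example would have to be flattened or split.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """Hypothetical in-memory form of one statement with flat qualifiers."""
    subject: str
    prop: str
    value: str
    qualifiers: list = field(default_factory=list)  # (property, value) pairs

    def render(self):
        # Main triple first, then each qualifier, separated by " | ".
        parts = [f"{self.subject} {self.prop} {self.value}"]
        parts += [f"{p} {v}" for p, v in self.qualifiers]
        return " | ".join(parts)


# Second example above: aspirin found in taxon Homo sapiens,
# qualified with anatomical location (P927) urine.
stmt = Statement("Q18216", "P703", "Q15978631", [("P927", "Q40924")])
print(stmt.render())
```

Swapping P927 for P518 in the qualifier list is all it would take to try the other candidate property.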

@AdrianoRutz: Good proposal; I think using the applies to part, aspect, or form (P518) qualifier is a nice way to go about it. I'm not so sure about trying to represent too complex relations (like the last example), though. TiagoLubiana (talk) 15:48, 4 October 2022 (UTC)[reply]
It is not complicated. If Caco-2 was replaced with GI epithelial cell, then all three statements are anatomical. Logically the cell line entry (Caco-2) should always point to the tissue type (GI epithelial) where it comes from. SCIdude (talk) 07:24, 5 October 2022 (UTC)[reply]
@AdrianoRutz: Looks like a complex but interesting proposal! Part of the Plant Ontology (PO) describes anatomical structures. These could be used to control the plant parts used. See https://bioportal.bioontology.org/ontologies/PO/?p=classes&conceptid=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPO_0025034&jump_to_nav=true I guess it should also be well described for humans... But what about fungi? Bacteria? Then (and here I am going in another direction than anatomical localization), what about the culture conditions? Or the environmental conditions? In fact, we could also think about refining the occurrence of a chemical at the specimen level, in the sense of what Plazi defines as a Treatment; see https://tb.plazi.org/GgServer/html/039787D4205DFFBCBCD5FDD51BD7FAE1 for example. GrndStt (talk) 17:17, 4 October 2022 (UTC)[reply]
Wikidata has imported and integrated several anatomical ontologies. Our anatomy ontology spans from organ-->smaller entities-->cell-->cell organelles-->protein complexes, i.e., all included. However, I agree that experimental conditions could be given as qualifiers. SCIdude (talk) 07:09, 5 October 2022 (UTC)[reply]
Caco-2 is a cell line, but a tissue type should be given here, like gastrointestinal epithelial cell. But if the experiment was actually done with Caco-2 cells then you can't really state that the compound was found in a human body, can you? Through hundreds of generations Caco-2 cells have evolved away from typical human GI epithelial cells. SCIdude (talk) 07:22, 5 October 2022 (UTC)[reply]

Mapping near to ubiquitous compounds


Following a recent discussion on the Wikidata Telegram channel, it looks like the max size for a page is around 4.3MB, which corresponds more or less to 5,000 statements, less if those are referenced (as in our case). Looking at the usual suspects like Q121802 with > 2,000 statements, we should probably think about a better way to handle this. What do you think? @Egon Willighagen, @SCIdude, @GrndStt, @Bjonnh, @Daniel_Mietchen AdrianoRutz (talk) 14:57, 12 December 2022 (UTC)[reply]
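As a rough back-of-the-envelope check of those figures (a sketch only: it assumes "4.3MB" means 4.3 * 10**6 bytes and that statements are of similar size, which is not true for heavily referenced ones):

```python
# Figures quoted from the Telegram discussion; the byte interpretation is an assumption.
MAX_PAGE_BYTES = 4.3e6   # "around 4.3MB"
MAX_STATEMENTS = 5000    # "more or less 5,000 statements"

# Average size budget per (unreferenced) statement.
bytes_per_statement = MAX_PAGE_BYTES / MAX_STATEMENTS

# Share of the nominal statement budget used by an item with 2,000 statements.
# This is a lower bound, since referenced statements take more space.
fraction_used = 2000 / MAX_STATEMENTS

print(bytes_per_statement, fraction_used)
```

So an item with more than 2,000 referenced statements is already well on its way to the limit.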

A possible solution would be to move from mapping "found in taxon" on the chemical compound item to the reference item, using either main subject (P921) or something that could express an "inverse stated in (P248)" (and this for all statements, not only ubiquitous ones). AdrianoRutz (talk) 19:32, 13 December 2022 (UTC)[reply]
Another idea (for the "found in taxon" problem) could be to have a property that instead expresses "mostly found in taxon", and use that with the parent taxon. This would also make sense biologically because taxa actually most often are defined by their chemistry. Or, even simpler, for any problematic compound you create an unofficial taxon (i.e. group) that is defined by the presence of the compound. Example: creating "caffeine-producing Camellia" as a group and using that instead of Camellia. Lastly, the problem with Q121802 is that it is a primary metabolite, and I think this distinction should be somehow addressed, because this is more a problem with primary than with secondary metabolites. --SCIdude (talk) 09:05, 14 December 2022 (UTC)[reply]
This is a very important point, which in fact may not only apply to "near to ubiquitous" compounds but has further implications.
An alternative mapping could be to have the chemical compound items on the taxon page. Maybe via a generic contains (P4330) statement?
As an immediate effect, this would reduce the maximal number of statements (per item) linking chemical compounds and taxa from ca. 4000 to ca. 700 (see Fig. 4 in https://elifesciences.org/articles/70780).
In the longer term, we can anticipate that technical limits (limits of detection) and physical/chemical/biological limits (the metabolic content of a given taxon) should lead to a quicker plateau for the curve corresponding to "taxon X contains chemical structures" than for the curve corresponding to "structure X found in taxa".
In other words, we have ca. 8M estimated species on the planet, and we can expect most (all?) of them to contain water (Q283), glucose (Q37525) or adenine (Q15277). On the other hand, an estimated 500,000 to 1M natural products have been reported to date (very broad estimate); given the metabolite specialization gradient, it is very unlikely (impossible?) to find a single organism containing all of them.
Furthermore, if in the longer term things progress in the reporting of taxonomic treatment (P10594) in the literature, "taxon X contains chemical structures" would evolve into "specimen X contains chemical structures", further reducing the number of statements per item while offering possibilities of describing the intra-specific metabolic variations governed by environmental conditions. But this is another story. GrndStt (talk) 08:40, 22 December 2022 (UTC)[reply]
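The effect of the mapping direction on the largest item can be illustrated with a toy count. The pairs below are invented for the example (the "Taxon A/B/C" names are placeholders, not real data from LOTUS); the point is only that the same set of pairs gives a different maximum depending on which side holds the statements.

```python
from collections import Counter

# Invented (chemical, taxon) pairs for illustration only.
pairs = [
    ("water", "Taxon A"), ("water", "Taxon B"), ("water", "Taxon C"),
    ("glucose", "Taxon A"), ("glucose", "Taxon B"),
    ("quercetin", "Taxon C"),
]

# Statements stored on the chemical item ("found in taxon").
per_chemical = Counter(chemical for chemical, _ in pairs)
# Statements stored on the taxon item (e.g. a generic "contains").
per_taxon = Counter(taxon for _, taxon in pairs)

max_per_chemical = max(per_chemical.values())
max_per_taxon = max(per_taxon.values())
print(max_per_chemical, max_per_taxon)
```

Running the same count on the real structure-organism pairs would show whether the ca. 4000 versus ca. 700 figures still hold for the current data.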

Notified participants of WikiProject Chemistry/Natural_Products

As this is an important topic, we would love to have all your opinions. I am currently quite convinced that mapping it as http://www.wikidata.org/entity/statement/Q91218352-693517aa-4f7a-1a83-6669-d48d35007878 might be the right way to go, mainly because it greatly reduces the number of possible statements per item in the future. If you know external people (taxonomists, etc.) who might help, do not hesitate to ping them. AdrianoRutz (talk) 08:40, 28 December 2022 (UTC)[reply]

The example statement you give is one part of the solution. However, I think primary metabolites shouldn't even be mentioned, as the primary metabolism is by definition a general feature of cells. There may be some specific to plants or animals, but this is the only restriction. I'm in the process of marking compounds like in https://www.wikidata.org/wiki/Q283#P279, which should catch most of the high-volume "found in" cases. When this is done we can decide how to handle existing statements. SCIdude (talk) 15:07, 23 February 2023 (UTC)[reply]
OK, I have to concede that marking primary metabolites does not solve the problem. While beta-sitosterol is a PM, the compound that has the 2nd most occurrences is quercetin (Q409478) with 2,392 occurrences. And it is a secondary metabolite, so this is not a good criterion on its own, although I still think PMs should not list their occurrences. SCIdude (talk) 15:22, 15 March 2023 (UTC)[reply]

Many capital letters


I’ve come across a group of bot-created items such as Q115784449. They are said to be instances of chemical compounds. I’ve googled and found no matches. I’ve tried using P235 as provided and also found nothing. I don’t know chemistry, so maybe there is an explanation I don't see.

In addition, the same bot is creating items for scientific articles (e.g. Q115784447) with LABELS ENTIRELY IN CAPITAL LETTERS. I think that's a mistake that should be addressed before creating, but in any case it has to be mended.

Thank you! B25es (talk) 10:30, 21 December 2022 (UTC)[reply]

Hi @B25es! Thank you very much for letting us know so rapidly!
Regarding the chemical compounds, we try to import the IUPAC or common name when available, except when it is >249 characters long. In this case, we upload the InChIKey, as a partial name would make no sense. This identifier can then be used to look for possible shorter synonyms, as in PubChem, for example. This solution is not ideal, but sadly we aren't able to do better currently. I am running maintenance queries regularly to see if some "InChIKey-labeled items" can be renamed.
Regarding the titles of the articles, we rely on CrossRef metadata. If the entry has FULL CAPITAL LETTERS in the title, then we upload it as is. We are already trying to remove all HTML tags from titles automatically, but since the way publishers insert them is so dramatically heterogeneous, it is a non-trivial task. Happy to hear if you have some proposals to improve the current situation.
AdrianoRutz (talk) 11:35, 21 December 2022 (UTC)[reply]
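For readers wondering how the label ends up being an InChIKey, the rule described above amounts to something like the following sketch. This is not the bot's actual code: the `choose_label` function is a hypothetical name, and the key used in the usage lines is a dummy placeholder rather than a real InChIKey.

```python
MAX_NAME_LENGTH = 249  # threshold stated above: names longer than this are not used


def choose_label(name, inchikey, max_length=MAX_NAME_LENGTH):
    """Use the IUPAC/common name if present and short enough, else the InChIKey."""
    if name and len(name) <= max_length:
        return name
    return inchikey


# Dummy identifier for illustration only, not a real InChIKey.
DUMMY_KEY = "AAAAAAAAAAAAAA-BBBBBBBBBB-C"

print(choose_label("aspirin", DUMMY_KEY))   # short name is kept
print(choose_label("x" * 250, DUMMY_KEY))   # too long, falls back to the key
```

Items labeled this way still carry the InChIKey in P235, which is why a search on that identifier (for example in PubChem) is the way to find a readable synonym.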