Wikidata talk:WikiProject Chemistry/Archive/2021

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

A lot of duplicate data

For several weeks now, a lot of duplicate data has been generated. I don't want to blame anyone; I just want to remind everyone that a check is necessary after a merge or the addition of data.

See constraint violation reports for

InChI : [1]
InChIKey: [2]
CAS: [3]

Most of those problems are corrected after some days, but please have a look. Snipre (talk) 14:30, 11 November 2019 (UTC)

Mmmm... a lot of new chemical entries with very minimal information and indeed many duplicate CAS registry numbers. Not so happy about this either. It has been brought up, but it's not clear what the situation of resolving the problems is. --Egon Willighagen (talk) 15:02, 22 November 2019 (UTC)

The current situation is that we have a lot of duplicates and we have to merge them manually. The format of CAS numbers in these new items has been corrected, so some items can be quickly merged. However, because some chemical compounds may have more than one CAS number, there may be items that are in fact duplicates but won't show up on any constraint violation list, and it will be problematic to find those duplicates. Wostr (talk) 16:00, 22 November 2019 (UTC)
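As an illustration of the kind of format check mentioned above, here is a minimal Python sketch that validates the shape and check digit of a CAS registry number. The function name `is_valid_cas` is hypothetical and not part of any bot used in this project. The check-digit rule (the digits before the check digit, weighted 1, 2, 3, ... from the right, summed modulo 10) is the published CAS rule. As noted above, this only catches malformed numbers; it cannot find duplicates where one compound has several valid CAS numbers.

```python
import re

# CAS numbers have the shape NNNNNNN-NN-N (2 to 7 digits, 2 digits, 1 check digit).
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` is well-formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit
```

For example, both CAS numbers discussed later on this page (4996-48-9 and 1334-82-3) pass this check, while a number with a wrong final digit or missing hyphens does not.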

Note the conflict reports are somewhat behind. Also, I went through all InChIKey duplicates and had to leave those pairs that were tautomers (I marked them), because InChIKeys for tautomers apparently can be (are?) identical. The actual numbers from fresh queries are:

InChI: distinct 18 (report 5+1), single 26 (report 33)
CAS: distinct 400 (report 536), single 87 (report 91+8)
InChIKey: distinct 27 (report 28+2), single 26 (report 32)

With this query I count 13 tautomer pairs that have identical InChIKeys, so I'll go through the others again:

SELECT DISTINCT ?item1 ?item1Label ?item2 ?item2Label ?value
{
  ?item1 wdt:P235 ?value .
  ?item2 wdt:P235 ?value .
  ?item1 wdt:P6185 ?item2 .
  FILTER( ?item1 != ?item2 && STR( ?item1 ) < STR( ?item2 ) ) .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
}

--SCIdude (talk) 17:08, 23 November 2019 (UTC)

Standard InChI/InChIKey is identical for tautomers, but InChI software can produce non-standard versions of InChI/InChIKeys. However, I don't know of any software that can easily generate non-standard InChI. If we have one, we could change single-value constraint (Q19474404) in InChI (P234) to single-best-value constraint (Q52060874) or, better, update it with separator (P4155) so that we could have both InChIs in one item with a qualifier that describes whether it's a standard or non-standard InChI. Wostr (talk) 18:21, 23 November 2019 (UTC)

The Chemistry Development Kit (Q2383032) can do this. I can make a script for this. --Egon Willighagen (talk) 20:28, 23 November 2019 (UTC)

How could it work, i.e. how could we generate non-standard InChI/InChIKeys with it? (I'm not very good at technical things; is it software that anyone can run?) Wostr (talk) 15:49, 30 November 2019 (UTC)

Tautomer/zwitterion

@Wostr, Egon Willighagen, SCIdude: By creating dedicated items for different tautomers or zwitterionic forms, and adding all identifiers to all tautomer/zwitterion forms, we are generating constraint violations for most identifiers. How can we handle that problem?

Some solutions:

1. Put all constraint violations related to tautomers and zwitterions in the exception list.
2. Among the different forms, and according to a defined set of criteria, choose one form which will become the chemical compound; the other forms will be defined as instance of tautomer/zwitterion. All undefined identifiers will be linked to the chemical compound item, along with all general properties.

The second one is the best in my opinion because we avoid working with 2 items at the same time: most of the time we only have data for an undefined tautomer/zwitterion form. Snipre (talk) 15:01, 30 November 2019 (UTC)

We are not the only ones having separate entries, ChEBI has too, so with solution 2 you need to decide which ChEBI id to link or get constraint violations with two ids. You could also remove some constraints as a different solution. --SCIdude (talk) 15:20, 30 November 2019 (UTC)

While I am not arguing against the concerns, I have mixed feelings about not allowing tautomers and zwitterions. Tautomers in particular have different physchem properties, and even zwitterions can be linked to experimental data (e.g. crystal structures). I also do not currently have a good suggestion. One issue is that tautomers are ill defined, particularly in the context of the InChI(Key), where the algorithm has its limitations. --Egon Willighagen (talk) 15:32, 30 November 2019 (UTC)

@Egon Willighagen: Nobody was proposing to ban the creation of item for tautomer or zwitterion: the discussion is to find a good way to integrate those particular cases in WD. Snipre (talk) 16:29, 30 November 2019 (UTC)

Which IDs cause constraint violations? I know that InChI/InChIKey does, but that problem requires finding a way to generate non-standard InChI/InChIKey. The standard InChIs/InChIKeys should be present in both items: neither InChI nor InChIKey is 100% unique for chemicals. For zwitterions: instance of (P31) zwitterion (Q245115)/subclass of (P279) zwitterion (Q245115) should always be present, and for zwitterions you can tell whether an ID refers to the neutral or the zwitterionic form by SMILES or systematic name, for example. For carbohydrates (chain/ring structure): the same, InChIs are different, SMILES are different, even systematic names are different. For compounds with mobile-H: usually the same.
The real problem with tautomers is the InChI/InChIKey, but that's not only our problem, it's the problem of the standard configuration of InChI software and it's a known issue that is solvable by generating non-standard InChI/InChIKey. Then we only have to decide what to do with two InChI values in one item (deprecate StdInChI, prefer NonStdInChI etc.). Wostr (talk) 15:45, 30 November 2019 (UTC)
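To make the standard/non-standard distinction concrete, here is a small Python sketch that reads the flag character of an InChIKey. It relies on the documented InChIKey layout: 14 letters hashing the connectivity, a hyphen, 8 more letters, then a flag (`S` for standard, `N` for non-standard), a version letter, a hyphen, and a protonation letter. The function names are hypothetical, and using such a check to set a qualifier in Wikidata is only an assumption about how the proposal above might be supported.

```python
import re

# 14-letter connectivity block, 8-letter second hash, flag (S or N),
# version letter, hyphen, protonation letter.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{8}([SN])[A-Z]-[A-Z]$")


def inchikey_kind(key: str) -> str:
    """Classify an InChIKey as 'standard', 'non-standard' or 'invalid'."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        return "invalid"
    return "standard" if match.group(1) == "S" else "non-standard"


def connectivity_block(key: str) -> str:
    """Return the first 14 characters, which encode the molecular skeleton."""
    return key.strip()[:14]
```

For instance, the standard InChIKey of ethanol, LFQSCWFLJHTTHZ-UHFFFAOYSA-N, is classified as standard. Two keys that share the same connectivity block but differ in the second block describe the same skeleton with different stereo, isotopic or mobile-H detail.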

@Wostr, Egon Willighagen, SCIdude: I will try another approach:

Zwitterion case:

Always consider the neutral form as the chemical compound form. A second item for the zwitterionic form can be created with the following properties

Neutral form	Zwitterion
instance of: chemical compound	instance of: zwitterion
All IDs and properties for the neutral form, for mixtures of neutral form and zwitterion form, or for an undefined form (Std InChI and InChIKey)	IDs and properties only for the zwitterion form (non-standard InChI and InChIKey)

Tautomer case:

The form that is most stable or present in excess under standard conditions is defined as Form A. The other form is defined as Form B.

Form A	Form B
instance of: chemical compound	instance of: tautomer
All IDs and properties for form A, for mixtures of A and B forms, or for an undefined form (Std InChI and InChIKey)	IDs and properties only for form B (non-standard InChI and InChIKey)

Snipre (talk) 16:42, 30 November 2019 (UTC)

No objection from me. Implementation of the zwitterion case can be automated if the compounds are in ChEBI (ChEBI explicitly names zwitterions). Additionally, a metaclass "class or group of zwitterions" may be needed, ChEBI has a hierarchy for them. --SCIdude (talk) 17:05, 30 November 2019 (UTC)

I can't agree to everything above. StdInChI is valid for both forms (neutral and zwitterionic) and we should find a way to model this properly, Non-standard InChI is an addition that may help in distinguishing the forms, but is not a substitute. instance of (P31)tautomer (Q334640) for only one tautomer is also wrong; both are tautomers in the same way of each other; also, as tautomer of (P6185) is present, I don't think we need to explicitly classify compounds as tautomers (similarly, we don't classify compounds as stereoisomers); both should be classified according to its structure etc. I can agree to that part 'all IDs and properties for form A, for mixtures of A and B forms or undefined form' with an exception for cases (if there would be any such cases) when ID clearly distinguish form A/form B/mixture of A and B/undefined form. Also, there may be situations when we should keep an ID with a deprecated rank in one item and have it in a second item with a normal rank. 'Additionally, a metaclass "class or group of zwitterions" may be needed' is not needed — zwitterionic form has a charge of 0, so I don't think we need to classify them in a different way as chemical compounds (only instance of (P31)zwitterion (Q245115)/subclass of (P279)zwitterion (Q245115)). Wostr (talk) 20:29, 30 November 2019 (UTC)

@Wostr: The problem of the StdInChI applies to most identifiers, so why do we have to treat StdInChI in a particular way? We have to find a solution for all identifiers.

Then can both tautomers be a chemical compound, or will tautomer be a subclass of chemical compound? This is more critical in terms of ontology.

In any case, we can't treat both tautomers in the same way, or we will have to create a third item which will be the undefined tautomer. Snipre (talk) 11:46, 1 December 2019 (UTC)

tautomers in the same way – in regards to classification; classifying only one tautomer as tautomer is not correct, classifying both seems redundant to me (these items already have tautomer of (P6185)). I asked, which IDs are causing problems similar to InChI/InChIKey? Because I think most of the problems can be solved only by checking the data in the source: we have DTXSID50274234 in pyridine-3,4-diol (Q74411505) and 3-hydroxypyridin-4(1H)-one (Q27891533), but the source clearly states the IUPAC name, has structure shown, has SMILES. If we have a real problem in which the source has e.g. IUPAC names for both tautomers, SMILES for both etc., we can either move the IDs to the prevalent form, or (IMHO better option) deprecate the IDs in the less common form with proper reason for deprecated rank (P2241). Wostr (talk) 15:01, 1 December 2019 (UTC)

@Wostr: This is perhaps not correct in an ideal classification, but we need a pragmatic solution. So please provide a complete answer to my question: how do you plan to link the tautomers to higher classes? Do you plan to define both tautomers as instance of chemical compound or of any subclass of chemical compound? This is not correct because both tautomers are not different chemical compounds.

And following your proposal for IDs, this means we will have, for the same chemical, a splitting of the IDs between 2 items, thus reducing the capacity to connect external databases through a unique WD item, especially when external databases do not define different IDs for tautomers. Snipre (talk) 14:43, 13 December 2019 (UTC)

@Snipre: I thought I had answered this, but apparently it was not saved. I don't think we need any special solution for tautomers with regard to their classification; any tautomer should be classified according to its structure and/or other qualities. E.g. carbohydrates have 'group of isomers' items, which can be linked to carbohydrates (there could also be links to specific classes of heterocyclic compounds for closed ring forms and to aldehydes/ketones for open chain forms, etc.). In Wikipedias there was always a problem with categories for compounds having different tautomeric forms: which category should be assigned? Here we can assign different classes to different tautomeric forms. 'This is not correct because both tautomers are not different chemical compounds': this is not so obvious; tautomers are defined simply as 'isomers' with one specific feature, namely that they are 'readily interconvertible'. 'this means we will have for the same chemical a splitting of the IDs between 2 items, this reducing the capacity of connections of external databases through an unique WD item': we already have this in items for which an external, reliable source incorrectly gave an ID which is correct for another chemical compound (such a statement is deprecated in WD, but the ID still exists in two items). This is not something that should occur frequently, but it is unavoidable. We just have to limit this to cases where it is necessary and mark such statements clearly (qualifier, rank). Wostr (talk) 17:42, 5 January 2020 (UTC)

@Wostr:

This is not correct because both tautomers are not different chemical compounds — this is not so obvious, tautomers are defined simply as 'isomers' with one specific feature that are 'readily interconvertible'.

This is not correct if you consider the fact that chemical compound is a subclass of chemical substance and if you consider the definition of chemical substance: "Matter of constant composition best characterized by the entities (molecules, formula units, atoms) it is composed of. Physical properties such as density, refractive index, electric conductivity, melting point etc. characterize the chemical substance."

Using the inheritance property of the subclass relation, a chemical compound should have defined physical properties. Isolated tautomers don't exist, but in most cases the equilibrium between tautomers favors one form. Based on that reasoning, I continue to say that one form is a chemical compound, the most thermodynamically stable one, because the measured properties mainly result from that form, and the second form should only be defined as a tautomer because it is a kind of hypothetical chemical compound (it exists, but is not isolable). As a simple rule, for keto-enol tautomers we should define keto tautomers as chemical compounds and enol forms as tautomers only, as the keto forms are the most stable.

this means we will have for the same chemical a splitting of the IDs between 2 items, this reducing the capacity of connections of external databases through an unique WD item – we already have this in items for which an external, reliable source incorrectly gave an ID which is correct for other chemical compound (such statement is deprecated in WD, but still an ID exists in two items)

This way of doing things just propagates errors and inconsistencies. Wikidata is not only a simple compilation of data; it should generate an ontology and should be able to provide a logic for machines. This implies not only observing and marking errors but also trying to correct them by alerting the databases and bringing the problem to their attention. Snipre (talk) 04:26, 5 March 2020 (UTC)

Ad 1: Using your argumentation your proposal that one tautomer should be an instance of chemical compound and the other(s) should be instance(s) of tautomer is not correct, because the chemical compound being a hybrid (in fact a mixture) of tautomers should be an instance of chemical compound and every tautomer an instance of tautomer – as you never have a 100% pure substance composed of only one tautomer. Using your proposal for simple annular tautomers may seem simple, but in which phase/conditions you want to measure which one tautomer is prevalent? It seems not so simple for carbohydrates: open chain-ring tautomers – which one is prevalent and why?

Ad 2: As I said, having the same IDs in more than one item is unavoidable (not necessarily for tautomers, but in general), so if there is a need in a particular item describing tautomer to add ID that is added somewhere else, the only thing we should care about is to properly describe the situation using qualifiers and ranks. Wostr (talk) 13:37, 5 March 2020 (UTC)

@Wostr: Concerning your Ad1: if we are not able to isolate one form then there is no reason to create 2 items, one for each form and none should be defined as instance of chemical compound. I was not the one creating items without having a correct data model to propose, so I think we should merge all tautomers: we eliminate the problem of the constraint violations and we keep a coherent data model. Snipre (talk) 19:21, 27 May 2020 (UTC)

Merging items about tautomers is not an option IMHO. That would be a nuclear option. Wostr (talk) 23:08, 27 May 2020 (UTC)

@Wostr: Your argumentation is very impressive:

What do we lose by merging? Nothing, because we can always store the data in one item. If each tautomer can't be isolated, then it can't be considered a chemical compound (just have a look at the definition of chemical substance, which is the upper class of chemical compound: the notion of defined physical properties is mentioned): if we can't isolate the tautomer and perform some physical measurement, then this is not a chemical substance and not a chemical compound. This means we will have 3 items: one for each tautomer, defined as instance of hypothetical chemical compound (Q50308749) or tautomer (Q334640) but not as chemical compound (Q11173), and a third one which is a kind of chemical with undetermined structure but which meets the criteria of the chemical substance definition (defined chemical composition and measured physical properties). But tautomer items could not use properties like InChIKey (not specific to a tautomer) or physical properties. This will just add mess to the current situation during data import from other databases. Snipre (talk) 10:46, 2 June 2020 (UTC)

My argumentation reflects my desire to discuss this further: I really see no point, as I don't like to write and read the same arguments over and over. We lack too many things to proceed with this topic: a proper classification of compounds that can be a basis for attempting to include tautomers in this classification, and participants in this project and this discussion, because without more participants I don't think there will be any sort of consensus here. Ad rem: some tautomers can be isolated, some cannot. We can't isolate some chemical compounds, but we still have items about them; likewise, we can't isolate some ions, but we still have items about them. InChI and InChIKey can be properly assigned to every tautomer, but it won't be StdInChI/StdInChIKey. As I wrote above, I don't think that assigning the same StdInChI to more than one item while also having Non-StdInChI in these items is a problem. Quite the opposite: in ChemSpider, for example, you have such situations. Wostr (talk) 14:58, 2 June 2020 (UTC)

Meanwhile Rhea, currently the top enzymatic reaction database, has extended the set of its reactions that use the physiologically correct zwitterions. Since it consistently uses ChEBI, ChEBI now has zwitterions for each and every metabolite that was shown to be in that form. As our enzymatic activities link to reaction participants via ChEBI identifier, I am now in the (manual) process of de-merging all items with duplicate ChEBI identifiers, yielding a lot of zwitterions and probably some ChEBI duplicates to submit. This is good in terms of constraint conflicts too. --SCIdude (talk) 16:27, 1 April 2021 (UTC)

Non-standard InChI

ChemSpider does have non-standard InChIs/InChIKeys (I don't know, however, with what options), but there are no entries for tautomers (at least not for the few I checked). Wostr (talk) 22:40, 10 December 2019 (UTC)

@Snipre, Egon Willighagen, SCIdude: Regarding duplicated InChIs and InChIKeys for tautomers: in pyridine-3,4-diol (Q74411505) and 3-hydroxypyridin-4(1H)-one (Q27891533) you can find my proposal for handling InChI/InChIKey for tautomers that cause constraint violations (I've modified InChI (P234) and InChIKey (P235) a bit, but left distinct-values constraint (Q21502410) in place). For each tautomer we have StdInChI+Key (identical for tautomers) and a different Non-StdInChI+Key. Instead of distinct-values constraint (Q21502410), we can add single-best-value constraint (Q52060874) for both properties, so items that are not manually curated (with more than one ID) should still cause a constraint violation. The non-standard IDs in these two examples were generated using the official InChI Software based on a MOL file. Wostr (talk) 18:50, 14 July 2020 (UTC)
BTW there is also the possibility to create our own adapted constraint (so called "complex constraint"), see example. With an adapted constraint the original constraint can be removed. Complex constraints are NOT run with every statement edit, so do not produce warning signs, they do produce reports however, see example. --SCIdude (talk) 07:19, 15 July 2020 (UTC)

I will add a complex constraint that does not trigger if both items have mutual "tautomer" statements. --SCIdude (talk) 07:21, 15 July 2020 (UTC)

Please discuss the complex constraint at Property_talk:P235#modified_unique_value_constraint. --SCIdude (talk) 07:36, 15 July 2020 (UTC)
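The logic of such a modified constraint can be sketched in plain Python, independent of how the actual complex constraint is written in SPARQL. This is only an illustration under stated assumptions: the function name `unexplained_duplicates` and the input shapes (a mapping from item to InChIKey and a set of directed "tautomer of" pairs) are hypothetical, and the real constraint is the one discussed on the property talk page.

```python
from collections import defaultdict
from itertools import combinations


def unexplained_duplicates(inchikeys, tautomer_of):
    """Report item pairs sharing an InChIKey without mutual tautomer links.

    inchikeys:   dict mapping item ID -> InChIKey (hypothetical input shape)
    tautomer_of: set of directed (item, item) pairs, one per P6185 statement
    """
    items_by_key = defaultdict(list)
    for item, key in inchikeys.items():
        items_by_key[key].append(item)

    violations = []
    for key, items in items_by_key.items():
        for a, b in combinations(sorted(items), 2):
            # The pair is exempt only if each item names the other as tautomer.
            mutual = (a, b) in tautomer_of and (b, a) in tautomer_of
            if not mutual:
                violations.append((key, a, b))
    return sorted(violations)
```

With this rule, two items that share a key and point at each other with "tautomer of" are not reported, while a pair with only a one-directional link, or no link at all, still shows up in the report.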

GZWDer added all (most?) of the US EPA CompTox dashboard

Hi all, GZWDer copied in more or less the full CompTox Chemistry Dashboard (Q26998510), which brings in some 800 thousand new DSSTox substance ID (P3117)s. Along with that, it also brings the number of CAS registry numbers to >800 thousand. Let's see how that goes with Chemical Abstracts. Currently, molecular formula, mass, and SMILES info are missing, but I can write a script tomorrow to generate QuickStatements to add the missing info (using PubChem to convert the InChIKey to SMILES). Please don't do this manually. --Egon Willighagen (talk) 08:53, 30 January 2020 (UTC)

This is insane... [4]: +533 685 bytes.... Wostr (talk) 23:14, 30 January 2020 (UTC)

@GZWDer: how do you propose to resolve this? --SCIdude (talk) 09:43, 31 January 2020 (UTC)

Hi all, after a long quarantine (well, ongoing), but without a holiday, I started adding missing SMILES. I'm currently doing the easy ones: InChIKeys that have the SMILES of a single molecule (no salts) and have full stereochemistry defined. The workflow is like this: get a SMILES from PubChem using the InChIKey, use the CDK to recalculate the InChIKey, and proceed only if they match. This (chiral) SMILES is then taken to a second step in which the SMILES is looked up in Wikidata by a match on the InChIKey (again, with the same CDK) and with the PubChem CID (so, some redundant work, but it allows me to work with already proven code; see https://github.com/egonw/ons-wikidata/tree/master/Wikidata). This creates QuickStatements. This way, I've "resolved" some 20 thousand of the 800 thousand issues (at the time of writing). This is going to take some time, and there is room for improvement. One thing I started working on, which will improve performance, is outputting v2 QuickStatements. I'm finishing a last round with v1 QuickStatements, but the next one should be with v2. --Egon Willighagen (talk) 08:10, 26 July 2020 (UTC)
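The round-trip check in this workflow can be sketched as follows. This is not the code from the linked repository (which uses the CDK), only a minimal Python illustration under stated assumptions: `recompute_inchikey` is a hypothetical callable standing in for whatever toolkit recalculates the InChIKey from a SMILES, the function name `smiles_quickstatements` is made up, and isomeric SMILES (P2017) is assumed as the target property because the workflow handles chiral SMILES. Output lines follow the tab-separated v1 QuickStatements form with the string value in double quotes; escaping of special characters is not handled.

```python
def smiles_quickstatements(candidates, recompute_inchikey):
    """Build v1 QuickStatements lines for SMILES that survive a round trip.

    candidates:         iterable of (qid, inchikey, smiles) tuples, where the
                        SMILES was looked up from the InChIKey beforehand
    recompute_inchikey: hypothetical callable, SMILES -> InChIKey
    """
    lines = []
    for qid, inchikey, smiles in candidates:
        # Proceed only if recalculating the key from the SMILES gives it back.
        if recompute_inchikey(smiles) != inchikey:
            continue
        # v1 syntax: item <TAB> property <TAB> "string value"
        lines.append(f'{qid}\tP2017\t"{smiles}"')
    return lines
```

Keeping the toolkit behind a callable means the same filter can be tested with a lookup table and later run with a real InChI generator.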

Okay, playing with v2 did not help. The code is updated for it, but it has a number of limitations: 1. it doesn't do sparse data well (v2 is tabular, so you get a lot of empty cells); 2. it still does not group edits for a single item (I think this was already known, but now I've seen it with my own hands). So, I reverted back to v1 QuickStatements. By now, I've added missing info for another 100 thousand items, and the number of Wikidata items with InChIKey but no SMILES is now below 700 thousand. --Egon Willighagen (talk) 05:38, 13 August 2020 (UTC)

(Topic continued at 604_duplicate_InChIKeys)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:35, 11 May 2021 (UTC)

New property proposals

I have proposed some new identifier properties. Comments are welcome.--GZWDer (talk) 04:41, 28 March 2020 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:34, 11 May 2021 (UTC)

Difference between CAS numbers

Hi, we have 2 items which are similar: pentyl 2-furoate (Q27269583) and pentyl furoate (Q72479642). The only difference is the CAS numbers: 4996-48-9 and 1334-82-3. Reaxys has two entries too, but no clear explanation of the difference. Does anyone have an idea about the reason for the 2 CAS numbers? Thanks Snipre (talk) 11:22, 12 June 2020 (UTC)

The first CAS specifies synonyms with the acid on 2-position, the second does not. --SCIdude (talk) 14:55, 12 June 2020 (UTC)

I have checked them in SciFinder. 4996-48-9 is Pentyl 2-furoate or 2-Furancarboxylic acid, pentyl ester. 1334-82-3 is Amyl furoate or Furancarboxylic acid, pentyl ester. The former is one of the isomers of the latter. --Leiem (talk) 13:33, 17 June 2020 (UTC)

@Leiem: Thank you for your answer. So if I understand, 1334-82-3 is for mixtures of pentyl 2-furoate and pentyl 3-furoate. Snipre (talk) 19:15, 14 July 2020 (UTC)

Yes. --Leiem (talk) 11:54, 15 July 2020 (UTC)

Done Snipre (talk) 11:35, 27 July 2020 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:34, 11 May 2021 (UTC)

Q5173335

The Wikipedia article seems about a group of compounds instead of a specific one.--GZWDer (talk) 02:59, 20 June 2020 (UTC)

Note that some WP articles are about Kortistatin A and some about the group. --SCIdude (talk) 07:32, 20 June 2020 (UTC) Resolved.

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:33, 11 May 2021 (UTC)

604 duplicate InChIKeys

(continued from GZWDer_added_all_(most?)_of_the_US_EPA_CompTox_dashboard)

Just a note that we are at 604. Wasn't it below 200 half a year ago? --SCIdude (talk) 07:12, 11 July 2020 (UTC)

@SCIdude: There was a huge data import some months ago from the DSSTOX database, where a lot of InChIKey duplicates exist. The reason is that a lot of DSSTOX entries were created from ChemIDplus, where several entries exist for the same InChIKey but with different CAS numbers. So DSSTOX prefers to ensure a unique entry per CAS number even if this generates InChIKey duplicates. The root cause is a poor definition of chemicals in ChemIDplus, where racemates or some stereoisomers were not correctly identified.

I have no contact for the DSSTOX database, and my emails about how to clean the DSSTOX database never got any feedback. Snipre (talk) 12:26, 12 July 2020 (UTC)

Thanks for the background. Still it is easy to check if an InChi key already exists, so the person doing the import had no idea what s/he was doing and should be stopped from running bots, in general. --SCIdude (talk) 08:05, 13 July 2020 (UTC)

@SCIdude: @Snipre: I am the project lead for the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard), which is the community-facing website for the DSSTox database. I have requested a dump file for the latest release to look at the duplicates. There are various reasons this can happen, including InChIs not having advanced stereochemistry: while the V3000 molfile may have different stereochemistry from a different stereovariant, when the InChI is generated they will become equivalent. So chemicals can have different names, CASRNs and V3000 mols but the same InChIKey. --Antony Williams 23:22, 26 July 2020 (UTC)

I have now cleared about 100 of these duplicates, and in my opinion the duplicate keys come from the associated CAS. The person doing the import only checked for CAS uniqueness and so even created duplicate DSSTOX statements. --SCIdude (talk) 07:05, 27 July 2020 (UTC)

@ChemConnector: Thank you for your answer. Duplicates and curation are a problem we can handle, but only if corrections are made in the database which generates the data. It would be good if we could report, in a simplified way, the duplicates we found and the result of the analysis, in order to provide good input to the database administrator. Do you see a problem if we mention the cases on your talk page? I can use the dashboard you pointed to, but I am missing feedback saying that the problem is being resolved. I suppose you have plenty of other things to do, so perhaps working through batches of cases instead of sending one mail for each case can help. Let us know. Snipre (talk) 11:50, 27 July 2020 (UTC)

@Snipre: The ideal way for us to do this is for someone to register the comment(s) directly against a particular chemical record. If you watch the video here https://www.youtube.com/watch?v=9A9sWRbJrYA starting at 45:05, it tells you the process for submitting comments; when they are resolved, the submitter gets a response and the comment is public. See: https://comptox.epa.gov/dashboard/comments/public_index. This keeps track of the comments publicly, registered against the actual record, and makes the curation public. Would this work? --Antony Williams 12:22, 27 July 2020 (UTC)

@ChemConnector: Thanks, I will test that. Regards. Snipre (talk) 19:38, 27 July 2020 (UTC)

@Snipre: Please see your first comment response on here: https://comptox.epa.gov/dashboard/dsstoxdb/results?search=DTXSID20975867#comments

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:33, 11 May 2021 (UTC)

CAS and unspecified stereochemistry

When CAS doesn't define some stereo or cis/trans center I have come to the conclusion that they always mean the racemic mixture. One reason is they are a product-oriented database, and they have also no ontologic hierarchy for their items, unlike ChEBI. Do you agree? If you agree then there are 250 such wrongly placed CAS statements:

SELECT ?item ?itemLabel 
WHERE 
{
  VALUES ?class { wd:Q55662548 wd:Q55662547 wd:Q15711994 }
  ?item wdt:P31 ?class.
  ?item wdt:P231 [].
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}


--SCIdude (talk) 08:13, 3 August 2020 (UTC)

Jun Namkung (Q55662547)? You're probably right, but it has to be done with caution. Many databases do not differentiate 'cpd with unspecified stereochemistry' and 'cpd with unknown stereochemistry' (we treat these two as one) from 'racemic mixture', so in many situations InChI/InChIKey and/or other IDs are incorrectly linked in these DBs with racemic mixture entries. I would advise not simply deleting CAS numbers from our items, but deprecating them with a proper reason for deprecated rank (P2241). And in the future more attention should be given to automatic imports to items about racemic mixtures. Wostr (talk) 09:04, 3 August 2020 (UTC) Edit: we still have some items that mix the two concepts, and we still have thousands of items without proper classification as a group of isomers. Wostr (talk) 09:06, 3 August 2020 (UTC)

Usually I only delete CAS numbers when there is no (longer a) CAS page, or if the link redirects to a CAS we have. What is the point of keeping these? You would also not import a deprecated CAS, would you? --SCIdude (talk) 09:24, 3 August 2020 (UTC)

The point of keeping deprecated IDs is simple: such IDs won't be imported in the future as correct ones. This, of course, applies when at least one of the databases has the CAS number linked to the wrong entry, or when the database does not differentiate between concepts like we do. If a CAS number was correct in the past and is now deprecated, this is also a valid reason for keeping it in WD (that's why there is a deprecated rank at all). I wouldn't import deprecated statements to WD, but some deprecated IDs should be kept in WD to ensure an appropriate linkage between databases, to provide an adequate explanation of why the ID is in a certain item and not in another, and to guard against automatic import of incorrect data in the future. BTW which 'CAS page' do you mean? Wostr (talk) 14:13, 3 August 2020 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:13, 11 May 2021 (UTC)

InChI strings in Wikidata missing 'InChI=' prefix

There are almost 1 million (999176) chemical compounds with an InChI string identifier in Wikidata. However, none of them have the prefix 'InChI=' (capitalization important), even though it is in the specification^1,2.

Can the entries please be updated to include the 'InChI=' prefix?

References

See Property_talk:P234#Missing InChI=. Wostr (talk) 20:10, 16 September 2020 (UTC)

I can do. I'll make it a module of the maintenance bot, i.e. Scidudebot. After some thinking, there may be a way to solve any timing problem. --SCIdude (talk) 14:27, 17 September 2020 (UTC)

Bot is running now. At about 20 edits/minute max (actually less at the moment) it will take more than a month for all items with InChI. The only problem with this is that the updated links will not work until the P234 entry is changed. Ideally we want this change when half of the items are done, in order to minimize complaints about broken links. --SCIdude (talk) 14:44, 18 September 2020 (UTC)
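The prefix fix and the time estimate above can be sketched in a few lines of Python. This is only an illustration, not the code Scidudebot runs; the function names are made up, and the item count and edit rate are the figures quoted in this thread.

```python
INCHI_PREFIX = "InChI="  # capitalization as required by the InChI specification


def with_inchi_prefix(value: str) -> str:
    """Return the InChI string with exactly one 'InChI=' prefix (idempotent)."""
    value = value.strip()
    if value.startswith(INCHI_PREFIX):
        return value
    return INCHI_PREFIX + value


def estimated_days(n_items: int, edits_per_minute: float) -> float:
    """Days needed if every item takes one edit at a constant edit rate."""
    return n_items / edits_per_minute / (60 * 24)


if __name__ == "__main__":
    print(with_inchi_prefix("1S/CH4/h1H4"))        # methane, prefix added
    print(with_inchi_prefix("InChI=1S/CH4/h1H4"))  # already prefixed, unchanged
    print(round(estimated_days(999176, 20), 1))    # figures quoted above
```

Making the function idempotent means the bot can be restarted at any point without producing doubled prefixes.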

Thanks for setting up the bot. Just out of interest why does the process take so long? I am monitoring the updating using this query. --Stuchalk (talk) 19:07, 23 September 2020 (UTC)

A miscalculation. It already finished a week ago. --SCIdude (talk) 13:27, 11 October 2020 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:12, 11 May 2021 (UTC)

PubChem 2D structures

Can you confirm that the 2D structure in https://pubchem.ncbi.nlm.nih.gov/compound/198165 does not correspond to the InChI 3D? In particular, if you place the benzene ring with the methyl to the left, the N-heterocycle should be behind the ring system, contrary to what the 2D suggests. I might have seen more such cases already. --SCIdude (talk) 14:24, 21 September 2020 (UTC)

Redrawing the PubChem structure in ChemDraw gives the same stereochemistry and the same InChI as in PubChem. Saving this structure as .mol and opening it in the InChI 1.05 software gives the same result (the InChI from PubChem and from InChI 1.05 is the same). However, the 2D structure in PubChem is unintuitive and does not seem to be the best option to visualise this stereochemistry. Wostr (talk) 21:32, 23 September 2020 (UTC)

Thanks for the confirmation. --SCIdude (talk) 05:09, 24 September 2020 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:12, 11 May 2021 (UTC)

ChEBI and mapping type

The new constraint on ChEBI ID (P683) to always have a qualifier mapping relation type (P4390) is an interesting idea. When is the mapping exact? The ChEBI InChi key has to match the item key, of course. Since I'm soon done with checking all differences, it might be an idea to add exact mapping for all items with ChEBI that have a single key with the latest ChEBI release as reference, because these are the ones that were confirmed to be matching. Opinions? --SCIdude (talk) 17:44, 19 October 2020 (UTC)

As ChEBI could be our best chance for classification of chemical species, I thought that it would be good to know if there is a 1:1 relation between ChEBI and WD. That is not always true, because we sometimes have IDs for a zwitterion linked to the regular item etc. I added this constraint as a suggestion constraint (Q62026391). I usually use SKOS to indicate that the ID was manually checked and there is certainty that there is 100% equivalency between the WD entry and the ChEBI entry. Wostr (talk) 18:29, 19 October 2020 (UTC)

First it would be necessary to add the constraint in the examples mentioned in ChEBI ID (P683) to understand what kind of value to add to this new constraint. Then, if the reference information is added based on Help:Sources, there is no need for an additional constraint. Snipre (talk) 13:45, 23 October 2020 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:11, 11 May 2021 (UTC)

Q3268366/Q56702552

Some sitelinks need moving. Check the labels too.--GZWDer (talk) 18:02, 15 January 2021 (UTC)

Done. --SCIdude (talk) 18:14, 15 January 2021 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:10, 11 May 2021 (UTC)

Untangling CAS IDs

Q27430423/Q5954337. ChemIDplus named "silicic acid".
Q866179/Q59624471. ChemIDplus named "carbon".
Q912226/Q4251817. ChemIDplus named "sodium hypochlorite".
Q4103521/Q21057316. ChemIDplus named "pitch, coal tar, high-temp."
Q381899/Q28852421
Q114675/Q1722299
Q219660 (described as both a color and a compound)/Q16039698

--GZWDer (talk) 15:48, 1 February 2021 (UTC)

AICS obsolete

The AICS ids on chemicals are a complete loss. The database has been replaced [5] and the new database does not use the old identifiers. Not only do our identifiers now link to the Wayback Machine but, because the database was never really functional, the archived pages do not show anything.

I'm proposing to delete the identifiers. There is no point in having deprecated identifiers that were never functional, unless you can show that these IDs are in use somewhere else. --SCIdude (talk) 07:00, 28 March 2021 (UTC)

@SCIdude: You should probably bring this up in the Wikidata:Properties for deletion page. ArthurPSmith (talk) 12:58, 29 March 2021 (UTC)

Thanks. Please see Wikidata:Properties_for_deletion#AICS_Chemical_ID_(P7049). --SCIdude (talk) 14:53, 29 March 2021 (UTC)

First CAS number validation results

Hi all, continuing on the earlier discussion, and noting it took me a bit longer to update the data (deadlines), but I have the first CAS validation results. I am checking if and when I can share all the details, but have started adding confirmation that some CAS numbers are correct: https://w.wiki/39jj The model I am using for the reference at this moment looks like this:

stated in [P248]: CAS Common Chemistry [Q18907859]
retrieved [P813]: 2021-04-01
reference URL [P854]: "https://commonchemistry.cas.org/detail?cas_rn=133-99-3"
InChIKey [P235]: "GUBGYTABKSRVRQ-QUYVBRFLSA-N"

I plan to convert this into a shape expression sometime this Easter break. And then I move on to using this to check Wikipedia. --Egon Willighagen (talk) 11:09, 2 April 2021 (UTC)

So, about the problems I ran into. The CAS Common Chemistry (Q18907859) set contains many salts and inorganic compounds. Not all 500 thousand have an InChI to use to match up with Wikidata. For those that do have an InChI, the stereochemistry is not always defined. I will discuss next week with the CAS team how to proceed. --Egon Willighagen (talk) 12:05, 2 April 2021 (UTC)
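As an illustration of the reference model above, here is a minimal Python sketch that writes one QuickStatements (V1, tab-separated) line carrying the same four reference parts. It is not the script used for the actual edits. The function name is made up, P231 is assumed as the property for the CAS Registry Number statement, and Q4115189 (the Wikidata sandbox item) stands in for a real compound item.

```python
def cas_statement_with_reference(qid: str, cas_rn: str, inchikey: str, retrieved: str) -> str:
    """Build one QuickStatements V1 line: CAS statement plus a four-part reference.

    `retrieved` is an ISO date such as '2021-04-01'; '/11' marks day precision.
    """
    fields = [
        qid,
        "P231", f'"{cas_rn}"',                  # CAS Registry Number (assumed property)
        "S248", "Q18907859",                    # stated in: CAS Common Chemistry
        "S813", f"+{retrieved}T00:00:00Z/11",   # retrieved
        "S854", f'"https://commonchemistry.cas.org/detail?cas_rn={cas_rn}"',  # reference URL
        "S235", f'"{inchikey}"',                # InChIKey the match was based on
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    # Q4115189 is the sandbox item, used here only as a placeholder.
    print(cas_statement_with_reference(
        "Q4115189", "133-99-3", "GUBGYTABKSRVRQ-QUYVBRFLSA-N", "2021-04-01"))
```

Keeping the InChIKey inside the reference records on what basis the match was made, so a later change of the item's InChIKey can be detected.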

I created the obligatory shape expression: https://www.wikidata.org/wiki/EntitySchema:E299 --Egon Willighagen (talk) 21:39, 2 April 2021 (UTC)

So, if there is no match, it may need inspection. There can be multiple reasons why there is no match, some of a cheminformatics nature, others because the CAS database has a different representation of the chemistry than Wikidata does. At this moment I find 7890 CAS numbers in Wikidata where matching on the InChIKey against CAS Common Chemistry says the number should be something else. One example:

CAS in Wikidata (Q72447099 / OSAJVUUALHWJEM-UHFFFAOYSA-N) does not match: expected 503065-10-9 but found 52217-60-4

Tomorrow I have a meeting with the CAS team and I'll ask if I can share the other 7889 too. --Egon Willighagen (talk) 18:07, 4 April 2021 (UTC)

CAS number mismatches (almost 7900) are now reported here: https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry/CAS_Validation_Results --Egon Willighagen (talk) 11:10, 6 April 2021 (UTC)

There seems to be something wrong. The line with Q159683 (Citric acid) states: expected 141633-96-7 but found 77-92-9. But 77-92-9 is a fine entry and it matches the InChi key, 141633-96-7 is something different (Citric acid polymer) with the same key. --SCIdude (talk) 14:46, 6 April 2021 (UTC)

Yes, this can happen a few more times, I'm afraid. Another collaborator also found this issue in the data set we got. I cannot check these automatically; they reflect a limitation of the CAS internal system. I've started a section to report these things here: https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Chemistry/CAS_Validation_Results#Incorrect_error_reports Please do add things there, and I will add exceptions to my script so that they won't show up in later lists (and I'll make the CAS team aware of that list too). --Egon Willighagen (talk) 21:27, 6 April 2021 (UTC)
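The mismatch check with an exception list can be sketched as follows in Python. This is a guess at the general shape, not the actual validation script. The function and variable names are made up, and the InChIKey used for citric acid in the usage example is a placeholder string. The sketch also keeps a set of registry numbers per InChIKey, which is an assumption about how a case like citric acid could be kept off the list when both numbers are present in the CAS data.

```python
from collections import defaultdict


def find_mismatches(wikidata_rows, cas_rows, exceptions=frozenset()):
    """Report Wikidata CAS numbers that disagree with the InChIKey-based match.

    wikidata_rows: iterable of (qid, inchikey, cas_rn) taken from Wikidata
    cas_rows:      iterable of (cas_rn, inchikey) taken from the CAS data set
    exceptions:    set of (qid, cas_rn) pairs that were checked by hand
    """
    expected = defaultdict(set)          # one InChIKey may carry several CAS RNs
    for cas_rn, inchikey in cas_rows:
        expected[inchikey].add(cas_rn)

    reports = []
    for qid, inchikey, found in wikidata_rows:
        candidates = expected.get(inchikey)
        if not candidates or found in candidates:
            continue                     # key unknown to CAS, or number confirmed
        if (qid, found) in exceptions:
            continue                     # manually reviewed false alarm
        reports.append(
            f"CAS in Wikidata ({qid} / {inchikey}) does not match: "
            f"expected {', '.join(sorted(candidates))} but found {found}"
        )
    return reports


if __name__ == "__main__":
    cas = [("503065-10-9", "OSAJVUUALHWJEM-UHFFFAOYSA-N"),
           ("141633-96-7", "PLACEHOLDER-KEY-CITRIC-ACID"),
           ("77-92-9", "PLACEHOLDER-KEY-CITRIC-ACID")]
    wd = [("Q72447099", "OSAJVUUALHWJEM-UHFFFAOYSA-N", "52217-60-4"),
          ("Q159683", "PLACEHOLDER-KEY-CITRIC-ACID", "77-92-9")]
    for line in find_mismatches(wd, cas):
        print(line)
```

If the CAS data only lists one of the numbers for a key, the set does not help and the `exceptions` argument is the fallback.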

Script to check against CAS Common Chemistry

CAS has recently provided an API that is free to use. Sending an HTTP request yields a JSON object with name, CAS-No. and SVG code for the structural formula.

I have written a short Python script to check lists against the API.

If you provide me a list (csv for example) of chemical names (trivial/trade or IUPAC) and/or CAS-Nos., I can check many hundreds of entries against the CAS DB. If I check the name and the number and list them in four columns (name request, CAS request, name reply, CAS reply), discrepancies might be easily cleared using a spreadsheet with some bool columns and filters.
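The four-column comparison could look roughly like the following Python sketch. It is not the script mentioned above. The lookup is passed in as a function so that the example runs without network access; a real lookup would call the CAS API, whose request and response details are not shown here. All names and the stub data are made up for illustration, apart from the 12673-75-5 to 1343-98-2 redirect that is reported elsewhere on this page.

```python
import csv
import io

COLUMNS = ["name request", "CAS request", "name reply", "CAS reply",
           "name matches", "CAS matches"]


def compare_rows(requests, lookup):
    """requests: iterable of (name, cas_rn); lookup: cas_rn -> (name, cas_rn) or None."""
    rows = []
    for name_req, cas_req in requests:
        reply = lookup(cas_req)
        name_rep, cas_rep = reply if reply else ("", "")
        rows.append({
            "name request": name_req,
            "CAS request": cas_req,
            "name reply": name_rep,
            "CAS reply": cas_rep,
            # Boolean columns make discrepancies easy to filter in a spreadsheet.
            "name matches": name_req.strip().lower() == name_rep.strip().lower(),
            "CAS matches": cas_req == cas_rep,
        })
    return rows


def to_csv(rows):
    """Serialise the comparison rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Stub standing in for the API; compound names are placeholders.
    stub = {"1343-98-2": ("compound A", "1343-98-2"),
            "12673-75-5": ("compound A", "1343-98-2")}
    rows = compare_rows([("compound A", "12673-75-5"), ("compound B", "0-00-0")], stub.get)
    print(to_csv(rows))
```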

Bonus: The structural formula at CAS can also be included in the output. This can be used to retrieve the structural formula recorded at CAS.

Apart from the obvious error-checking, I had the idea to add a Wikidata item like "CAS structure". This would allow for many wd entries to be checked for errors in their structural formulae by comparing the structural formula in wd to the one registered at CAS when one is editing/checking the wd entry anyways.

The script can of course be easily transcribed to JS, or Lua, or whatever. It's only a few lines. Is anyone interested? --Nothingserious (talk) 21:10, 8 April 2021 (UTC)

Edit: Just read the section above this. Seems a lot more sophisticated, but maybe this can be helpful, nonetheless. --Nothingserious (talk) 21:16, 8 April 2021 (UTC)

Hi, no worries. Mind you, the script I have does a few more things, but CAS did provide me with a spreadsheet. They know there are more people hoping for a download functionality, and that may happen. So, I hope you've seen the results, particularly the mismatches too? I think I'll finish the batch of some 120 thousand matching CAS numbers (well, based on the InChIKey) this weekend, just in time for the Monday presentation at the American Chemical Society Spring meeting. But there remains a lot to be done after that. The set of almost 500 thousand CAS registry numbers also contains many compounds with a SMILES/InChIKey that also need matching. Based on the name. But that is very error prone. A script that would create input for Magnus' Mix'n'Match would be very interesting. Grtz, Egon --Egon Willighagen (talk) 07:47, 10 April 2021 (UTC)

Wikipedia - Wikidata mismatches?

So, everyone who started looking at the missing CAS RNs will immediately recognize some patterns. One pattern is that sitelinks between Wikipedia and Wikidata are not always correct. Wikipedia may be more stereo-specific, or less. There are multiple ways to solve this: 1. make the least stereospecific page more specific, 2. make a new Wikidata page to match the English Wikipedia (and make the appropriate links), 3. accept as is. And probably a few more. The WikiPathways team has solved a number of these kinds of issues over the years. What do the two WikiProject Chemistry teams think the best course of action is? Cross-post: https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry#Wikipedia_-_Wikidata_mismatches%3F --Egon Willighagen (talk) 11:44, 10 April 2021 (UTC)

diallyl diglycol carbonate (Q409703)

This item refers to a polymer according to instance of (P31) polymer (Q81163) and its sitelinks. However, the CAS RN 142-22-3 and other content refer to its monomer. The correct CAS RN for the polymer is 95567-48-9. This CAS number was contained in earlier versions of this item. What should be done?

Converting the item consistently to the polymer
Moving the sitelinks to a new item and converting the existing item to the monomer

--Leyo 09:22, 28 April 2021 (UTC)

If two concepts are visible in one item, especially with external links, we usually make two items. --SCIdude (talk) 06:12, 30 April 2021 (UTC)

My question was if the current item (Q409703) should be converted to consistently represent the polymer or the monomer. The content of the second item will be dependent on this decision. --Leyo 19:22, 30 April 2021 (UTC)

Six of one, half a dozen of the other. Such a decision may sometimes depend on the number of existing links to the item that would need to be repaired, but here only maintenance pages link to it. It is less work to create the polymer item and move the sitelinks there than to move all compound claims to a new item. Let me do it. --SCIdude (talk) 15:14, 1 May 2021 (UTC)

Okay, thank you. --Leyo 19:42, 2 May 2021 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 16:10, 11 May 2021 (UTC)

~1~ and _1_ in labels

~1~ and _1_ (or with any other digits) in labels stand for super- and subscript digits, respectively. Hence, such occurrences should be replaced by the correct unicode digits (¹²³⁴⁵⁶⁷⁸⁹⁰₁₂₃₄₅₆₇₈₉₀), as in this example. As there are many affected items (to be restricted to items on chemicals), this should be done by an automated task. Any thoughts? --Leyo 22:18, 29 April 2021 (UTC)

The labels should be fixed, not by using sub/superscript in the English label, but by using the bare bones ASCII version. Unicode can go in an alias. My bot could do it when it's finished with other things. If someone wants to do it earlier please go ahead. --SCIdude (talk) 06:17, 30 April 2021 (UTC)

What is the bare bones ASCII version? --Ameisenigel (talk) 07:36, 30 April 2021 (UTC)

"N(1)" for "N¹" for example. BTW, ChEBI gives the conversion, example: https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:142660. --SCIdude (talk) 08:05, 30 April 2021 (UTC)

Thanks --Ameisenigel (talk) 08:54, 30 April 2021 (UTC)

It should be done in the opposite way: Correctly formatted name, i.e. using unicode digits, for the label, pure ASCII version(s) for the alias(es). --Leyo 19:17, 30 April 2021 (UTC)
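Both variants discussed in this thread (Unicode digits and the bare ASCII "N(1)" form) can be produced by a short Python sketch like the one below. It is only an illustration of the replacement rule, not an agreed bot task, and it leaves open which form goes in the label and which in the alias. The function names and the example labels are made up.

```python
import re

SUPERSCRIPT = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
SUBSCRIPT = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def to_unicode(label: str) -> str:
    """Replace ~digits~ by superscript and _digits_ by subscript Unicode digits."""
    label = re.sub(r"~(\d+)~", lambda m: m.group(1).translate(SUPERSCRIPT), label)
    label = re.sub(r"_(\d+)_", lambda m: m.group(1).translate(SUBSCRIPT), label)
    return label


def to_ascii(label: str) -> str:
    """Replace ~digits~ and _digits_ by the parenthesised ASCII form, e.g. N(1)."""
    # The backreference \1 requires the same delimiter on both sides of the digits.
    return re.sub(r"([~_])(\d+)\1", r"(\2)", label)


if __name__ == "__main__":
    for raw in ["N~1~-methyl example", "N~1~,N~12~-dimethyl example", "C_6_H_6_"]:
        print(raw, "->", to_unicode(raw), "|", to_ascii(raw))
```

A bot run would still need to be restricted to items on chemicals, as noted above, since underscores and tildes around digits can occur in unrelated labels.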

acetyl hexamethyl tetralin (Q2409972)

This compound has 2 valid CAS numbers according to the EU Risk Assessment Report and the CosIng database, namely 1506-02-1 and 21145-77-7. In the CAS Common Chemistry database, I can't see any difference when it comes to the structure (e.g. their InChI is identical). Any idea why the Chemical Abstracts Service has not declared one of the CAS numbers as obsolete? --Leyo 06:57, 4 May 2021 (UTC)

The structure has one undefined stereo center. One possibility is that 21145-77-7 stands for the racemate (it refers to the patent), and 1506-02-1 is the general entry. --SCIdude (talk) 15:16, 4 May 2021 (UTC)

Class ontology

There is a draft proposal for a project sub-page about a referenced class ontology: User:SCIdude/Modeling#Chemical_ontology. If no one objects I would move this under Wikidata:WikiProject_Chemistry/Entity_Classes, and update the visualization weekly (manually). People can add references that they used to extend the ontology. A lot can be added from the Blue Book alone, I guess. --SCIdude (talk) 06:54, 11 July 2021 (UTC)

I support that and would join the initiative. This is extremely important for our LOTUS project[6] and we are currently limited by existing ontologies and associated mapping tools to annotate structures. Bjonnh (talk) 14:12, 19 July 2021 (UTC)

Wikidata:WikiProject_Chemistry/Chemical ontology or classification would be probably better IMHO. Discussion page of that sub-page would be a good place to centralise discussions about chemical classification which are now scattered on different discussion pages. Wostr (talk) 16:32, 25 July 2021 (UTC)

It is at Wikidata:WikiProject_Chemistry/Chemical_classification. --SCIdude (talk) 14:58, 28 July 2021 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 15:10, 11 September 2021 (UTC)

Class patterns

In order to automate compound classification it is tempting to use SMARTS notation (P8533). Naively every compound class would have such a string, and a bot would use these SMARTS patterns to decide if a compound belongs to the set. However, the SMARTS language as defined by Daylight cannot express exclusivity. Example: there is no way to specify a pattern that hits compounds having component A and B only, or having one A, between 2 and 5 B, an arbitrary number of C, but nothing else. Of course this can be implemented by using set logic in the bot. The problem is rather that there is no way to express it in a SMARTS string, which would be associated with a Wikidata compound class item.

My proposal is therefore to extend the SMARTS language with two operators, both would act on the component level of SMARTS. Let `A`, `B`, `C` be SMARTS patterns. Then `(A)_1` would match a component that matches A once, `(B)_(2-5)` would match a component that matches B between twice and five times, and `(C)_n` would match a component that matches C at least once. Also `#(X.Y.Z)` would match a molecule that matches the components X,Y,Z exactly, disallowing any other or superfluous atoms. Example: one definition of straight-chain fatty acid would be:

#(([CR0;D1]C)_1.([CR0;D2](C)C)_n.([CX3](=O)[OX2H1])_1)

Has someone seen something like this already? Am I reinventing the wheel? Thanks for your comments. --SCIdude (talk) 08:26, 19 July 2021 (UTC)
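The "set logic in the bot" mentioned above could be sketched like this in Python. It is one possible reading of the proposed `(A)_1`, `(B)_(2-5)`, `(C)_n` and `#(...)` semantics, not a parser for the extended notation. The substructure matches are assumed to come from a cheminformatics toolkit as tuples of atom indices; here they are written by hand for butanoic acid (heavy atoms numbered 0 to 5 along CCCC(=O)O), and all names are made up.

```python
def matches_exactly(n_atoms, spec, matches):
    """Check occurrence counts per pattern and that no atom is left unmatched.

    n_atoms: number of heavy atoms in the molecule
    spec:    list of (pattern_name, min_count, max_count or None for 'n')
    matches: dict pattern_name -> list of atom-index tuples (one per match)
    """
    covered = set()
    for name, lowest, highest in spec:
        # Matches hitting the same atoms in a different order count once.
        hits = {frozenset(m) for m in matches.get(name, [])}
        if len(hits) < lowest:
            return False
        if highest is not None and len(hits) > highest:
            return False
        for hit in hits:
            covered |= hit
    # Exclusivity: every atom must belong to at least one matched component.
    return covered == set(range(n_atoms))


# Straight-chain fatty acid: one methyl end, at least one chain carbon, one carboxyl.
FATTY_ACID = [("methyl_end", 1, 1), ("chain_carbon", 1, None), ("carboxyl", 1, 1)]

if __name__ == "__main__":
    butanoic = {
        "methyl_end": [(0, 1)],
        "chain_carbon": [(1, 0, 2), (1, 2, 0), (2, 1, 3), (2, 3, 1)],
        "carboxyl": [(3, 4, 5)],
    }
    print(matches_exactly(6, FATTY_ACID, butanoic))
    print(matches_exactly(7, FATTY_ACID, butanoic))  # one extra, unmatched atom
```

In this reading, overlapping matches are allowed (the chain pattern includes its neighbours), and exclusivity is tested on the union of all matched atoms.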

That would be useful, do you have any idea how this could be implemented? Bjonnh (talk) 14:23, 19 July 2021 (UTC)

The plan is to release python code that grabs the ontology and associated patterns from Wikidata and tries to match structure(s), returning the matching ontology nodes (or probably the matching leaf nodes and their path/s to the root node) and associated biosynthetic pathway information (e.g. https://www.wikidata.org/wiki/Q107621952#P361) which will all come from the Gene Ontology. Of course anyone could write their own software using that WD data. --SCIdude (talk) 09:37, 25 July 2021 (UTC)

PS. Actually, due to WD string size restrictions, the patterns would need to exist outside of Wikidata, in the end. WD would hold copies of the ones that fit the restrictions. --SCIdude (talk) 13:02, 25 July 2021 (UTC)

I don't think that we can or should modify the SMARTS notation. If we need to adapt the notation to WD needs, it should be properly qualified — so WD-modified SMARTS should be either added using a qualifier to SMARTS property or there should be a precise qualifier to SMARTS property with WD-modified SMARTS saying that the SMARTS notation added is our modified version, not the standard one. Wostr (talk) 16:26, 25 July 2021 (UTC)

I understand that. Probably the right procedure would be to develop the extension externally and write a paper. --SCIdude (talk) 05:13, 26 July 2021 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 15:09, 11 September 2021 (UTC)

diphenylphosphoryl azide (Q723781)

There seems to be a conflation in this item: chemical formula and chemical structure belong to different compounds. --Ameisenigel (talk) 07:16, 11 October 2021 (UTC)

@Ameisenigel What exactly is wrong? The formula C12H10N3O3P and diagram appear to match. Graeme Bartlett (talk) 00:07, 16 December 2021 (UTC)

@Graeme Bartlett: But actually the statement for the chemical formula is C12H10N3OP. ChemSpider and PubChem seem to belong to C12H10N3OP. CAS and ECHA seem to belong to C12H10N3O3P. --Ameisenigel (talk) 07:42, 16 December 2021 (UTC)

Please always start from the InChI of the item. The InChI is about the O1 compound, the O3 is missing in WD so it needs to be created. Let me do that--SCIdude (talk) 08:13, 16 December 2021 (UTC)

I think that this discussion is resolved and can be archived. If you disagree, don't hesitate to replace this template with your comment. SCIdude (talk) 08:24, 16 December 2021 (UTC)

Validation of CAS numbers; collaboration with Wikipedia?

Hi all, for the past few months we have been talking to a source of trusted CAS number information, and likely we can use this to confirm many CAS numbers, similar to CAS Common Chemistry (Q18907859). Together with this source, we're exploring how to get this data into Wikipedia and Wikidata, and we have been talking about using ChemBox to pull out the information from Wikidata (which I think it does for various other fields already). On the Wikidata side, I want a clear data model: We don't just want to give the CAS, but also this new source as reference, when it was added/verified, etc. Importantly, I am also thinking about indicating on what basis the statement was made. For example, was this based on InChI(-Key) matching? The model should ideally say this, so that we can detect items where the InChIKey changed after the match was done. We're likely talking about a few hundred thousand CAS registry numbers, so I would like to work out these details early. We may use the bots used for proteins/genes.

Notified participants of WikiProject Chemistry --Egon Willighagen (talk) 07:21, 11 October 2020 (UTC)

Now cross-posted as Validation of CAS numbers; collaboration with Wikidata?. --Egon Willighagen (talk) 07:51, 11 October 2020 (UTC)

So, which part of a CAS entry is definitive? The InChi key, the name, the 2D structure, any of the links or synonyms? I ask because, usually, there are multiple mismatches between any of these properties, and this is why I stopped relying on their entries. --SCIdude (talk) 08:11, 11 October 2020 (UTC)

Agreed. I hope this will become public soon. --Egon Willighagen (talk) 09:49, 11 October 2020 (UTC)

After several years trying to clean some chemical data, I have some kind of action list:

Create a policy to ensure a correct definition of what should be included in your data set
- how to handle tautomers (one entry for both forms, or one entry for each, and in the latter case how to manage the data from databases which do not distinguish between the tautomers, ...)
- how to handle partially defined stereoisomers
- how complexes (ligand bond) and salt (ionic bond) should be defined
Use a structural identifier as primary key for identification like InChI or InChIKey
Generate a list of your database identifiers with the structural identifier (for example Wikidata Q number /InChIKey)
Wikidata can't be a source, so you can't upload data without a reference, this is the key factor to allow external persons to trust data from WD: they don't have to trust Wikidata, they can trust the reference related to each value in WD.
From your list defined above start to fill a table with other identifiers by matching the structural identifier. For example, if you want to link your identifier with the identifier of the PubChem database, you have to find which entries in your list and in PubChem have the same structural identifier. Problems appear if your policy concerning tautomers or way of describing complex/salt is not similar or if the other database is not strict with the rule one structural identifier = only one database identifier.
Once your table is finished, with the list of identifiers and their related references, you can import the data into Wikidata.
Finally, periodically, using your table as master data, you check for changes of identifier values in WD, and if you find a difference, you investigate the origin of the change.
Less frequently than the previous point, you check your table against the external databases to see if some changes occurred in their data sets.

The reason to use an intermediate table is to have the possibility to perform different checks before importing data to WD: to ensure that each external identifier is unique (if that is not the case, the data has to be curated in the external database), ...
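The intermediate table described above can be sketched in Python as a join on the structural identifier, with every key that does not map one-to-one set aside for curation. This is a minimal illustration under the assumption that both sides are available as simple (identifier, InChIKey) lists; the function name and the sample data are made up.

```python
from collections import defaultdict


def build_mapping(own_rows, external_rows):
    """Join two identifier lists on the structural identifier.

    own_rows:      iterable of (qid, inchikey), e.g. exported from Wikidata
    external_rows: iterable of (external_id, inchikey) from the other database
    Returns (table, conflicts): table holds the clean 1:1 rows, conflicts holds
    every key that maps to more than one identifier on either side.
    """
    own_by_key = defaultdict(set)
    ext_by_key = defaultdict(set)
    for qid, key in own_rows:
        own_by_key[key].add(qid)
    for ext_id, key in external_rows:
        ext_by_key[key].add(ext_id)

    table, conflicts = [], []
    for key in sorted(own_by_key.keys() & ext_by_key.keys()):
        qids, ext_ids = own_by_key[key], ext_by_key[key]
        if len(qids) == 1 and len(ext_ids) == 1:
            table.append((next(iter(qids)), key, next(iter(ext_ids))))
        else:
            # Duplicates on our side or non-unique external IDs: curate first.
            conflicts.append((key, sorted(qids), sorted(ext_ids)))
    return table, conflicts


if __name__ == "__main__":
    own = [("Q1", "KEY-A"), ("Q2", "KEY-B"), ("Q3", "KEY-B"), ("Q4", "KEY-C")]
    ext = [("X10", "KEY-A"), ("X20", "KEY-B"), ("X30", "KEY-C"), ("X31", "KEY-C")]
    table, conflicts = build_mapping(own, ext)
    print(table)
    print(conflicts)
```

Only the rows in `table` would be candidates for import; each still needs its reference before it goes into WD.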

I would not start a mass import of CAS values before two things are done: 1) we curate the WD data set: as long as we have constraint violations for our InChIKey/InChI values, this means we have duplicates or wrongly defined Q numbers; 2) we curate the other databases: as the CAS registry database does not provide an InChIKey/InChI value for each CAS number, we need to rely on other databases to create that relation, so we first need to ensure the uniqueness of their values. Only after these two steps can we start to consider CAS numbers. Snipre (talk) 21:34, 13 October 2020 (UTC)

@Egon Willighagen: To answer your question, I would propose using your new data set to curate an established and well-known database, for example PubChem, and then importing the curated CAS numbers from that database into WD. Why? Because WD can't be a source. We need to rely on external documents or databases; we need references for the values imported into WD. WD should be the connection between references and other authorities, not become the reference itself. Snipre (talk) 21:40, 13 October 2020 (UTC)

Regarding Snipre's list above about stereoisomers/tautomers/etc., I'd also say that it has long been a problem in WD with no proper solution. Also: we still have no clue how to classify chemical entities. Without solutions to these problems, no real work can be done here. Wostr (talk) 14:33, 14 October 2020 (UTC)

Notified participants of WikiProject Chemistry. The news is out: CAS updated Common Chemistry and it now contains almost half a million registry numbers, https://www.cas.org/resources/press-releases/common-chemistry . I already have scripts to validate Wikidata that I can now share. Will do so as soon as possible, but a bit under the weather this week. --Egon Willighagen (talk) 15:58, 17 March 2021 (UTC)

@Egon Willighagen: Are you registered at the site? Besides the API do they provide mappings or lists of deprecated IDs? It would be a relief if we had the deprecations, as I suspect a lot of them in WD. --SCIdude (talk) 17:37, 17 March 2021 (UTC)

With the new subset you can search on the deprecated id and it will return the current one. eg https://commonchemistry.cas.org/detail?cas_rn=12673-75-5 returns 1343-98-2 . Graeme Bartlett (talk) 04:46, 18 March 2021 (UTC)

Accessing one URL per CAS is not feasible to check all statements we have. --SCIdude (talk) 07:37, 18 March 2021 (UTC)

I received a spreadsheet with the data, and have been comparing against it. I need to rerun the comparison this weekend for the latest data, and will make the results available as soon as possible. --Egon Willighagen (talk) 14:07, 18 March 2021 (UTC)

I've been checking CAS numbers in zhwp for some time, and correct numbers are marked with the template {{cascite|correct|CAS}}. A robot is creating CAS number redirects to these articles. However, some articles may have more than one number, such as copper(II) sulfate and its hydrates, or ML₄X₂ and its ionic form [ML₄]²⁺(X⁻)₂. Collaboration with Wikipedia is a good idea but discussion about such cases is needed. --Leiem (talk) 02:24, 18 March 2021 (UTC)

Links in Wikipedia chemboxes

@Egon Willighagen: Would it be possible to store the information in each item on whether a corresponding entry in the Common Chemistry database is available? This information could then be used in Wikipedia chemboxes to decide whether a CAS RN is linked to the Common Chemistry database or remains unlinked. --Leyo 20:09, 18 March 2021 (UTC)

Yes, I think that's the idea: we use the References approach to mark the CAS Registry Number (Q102507) as confirmed by CAS Common Chemistry (Q18907859), with date, etc. I have a meeting with the CAS team next week again. Basically, the steps are: 1. make a ShEx for the reference model, 2. update my code to determine the confirmed CAS numbers, 3. create QuickStatements to do it. The first is useful (see A protocol for adding knowledge to Wikidata: aligning resources on human coronaviruses (Q105037759)): once we have defined what the annotation should look like, multiple people can work on it, and future reruns (e.g. with new CAS Common Chemistry (Q18907859) releases) can easily check if the validation has already been done. --Egon Willighagen (talk) 07:05, 19 March 2021 (UTC)

I asked for assistance on en:WP:VPT#Accessing Wikidata references as plain text on how to implement a switch for (non-)linking in en:Template:Chembox CASNo/format based on the presence of the CAS Common Chemistry data in the items. --Leyo 14:27, 26 April 2021 (UTC)

The link is now omitted for chemicals without a referenced CAS number in their WD item (chembox, drugbox). However, so far it does not seem to be possible to consider only references to the CAS Common Chemistry database. Hence, there are still dead links, but at least for fewer chemicals. --Leyo 21:54, 5 May 2021 (UTC)

Awesome! Any list of CAS links broken in Wikipedia I can pass to the CAS team. --Egon Willighagen (talk) 09:20, 6 May 2021 (UTC)

Local bot on DE:WP

Hi! I just wanted to note that I have prepared a local bot on DE:WP that checks and embeds the links to the CAS web entries (done via CAS API call), see the local notification. Furthermore, the bot notes the corresponding Q-numbers of all CAS numbers as template parameters within an article (in case the CAS number does not match that of the lemma itself). This bot could also run on EN:WP or others to solve the remaining problems like links to non-existing CAS web entries. Here I wanted to note that links to such CAS web entries are also generally generated on Wikidata, even if these CAS web entries don't exist. However, this could be fixed too. Regards, Uwe Martens (talk) 17:28, 14 May 2021 (UTC)

Bot to populate missing GHS data from pubchem LCSS

I've noticed that a lot of chemicals are missing the GHS data (this has been a little annoying because I've written some custom software to generate labels for chemical bottles, based on data from here). I'd like to write a bot to take the GHS data (and possibly other things too?) from the pubchem laboratory chemical safety summary (LCSS) dataset, and put it into wikidata.

Unfortunately the pubchem pug_rest API doesn't seem to expose the GHS data in particular, so it would have to come from the less-structured pug_view API (or more accurately, the published dumps of LCSS pug_view data). I've already written a series of XSL transforms that take that data and turn it into something a bit more usable.

Anyway, I hope this idea is agreeable, and I am looking for some input on how to go about this without stepping on anyone's toes.

ChemHobby (talk) 06:28, 30 November 2020 (UTC)

Concerns here are mostly about duplicate items or claims so please check the existing data and property constraints first before writing. --SCIdude (talk) 07:27, 30 November 2020 (UTC)

I think, at first, I would have it only add data to items that already exist. ChemHobby (talk) 16:48, 30 November 2020 (UTC)

Yes but you shouldn't add duplicate claims to existing items, as well. Just a heads up. --SCIdude (talk) 09:14, 1 December 2020 (UTC)

No, no, no. PubChem GHS data is usually labelled with source 'Regulation (EC) No 1272/2008', but this data is not valid EU GHS! It's more similar to US GHS than to EU GHS. There is also the ECHA database (CLI), from which there is also no possibility to import correct data to Wikidata. I do not know of any database from which one can import valid GHS data to WD. Wostr (talk) 12:32, 2 December 2020 (UTC) BTW which data set from [11 available for ethanol] would you like to import? Of the 6 EU GHS data sets, I see that none is valid EU GHS data. There are also 3 JP GHS datasets, but I can't tell right now if they are compatible with JP GHS regulations. Wostr (talk) 12:35, 2 December 2020 (UTC)

I don't understand. Can you elaborate on why the data is not valid? Surely at least the data labelled 'Regulation (EC) No 1272/2008' can go against Q2005334? ChemHobby (talk) 04:41, 3 December 2020 (UTC)

It cannot. We have GHS labelling in WD, not GHS classification. P-phrases in PubChem are added automatically, in numbers exceeding the limit on P-phrases for EU GHS. Sometimes there are H-phrases that should be omitted in labelling. Sometimes information on additives or impurities is lacking. What's more, I don't think that data from ECHA can be legally imported into Wikidata. Wostr (talk) 15:19, 3 December 2020 (UTC)

Hmm.... What about starting with importing the signal word and pictograms, and leaving H and P phrases as unknown value for now? Then maybe later the bot can populate H/P statements by applying the appropriate rules for labelling. Or, we could take the data from the table 3.1 here which specifically includes both the classification and labelling H codes, as well as signal word/pictograms. Again P statements could be left as unknown value. ChemHobby (talk) 15:47, 3 December 2020 (UTC)

It is not possible to apply any rules for P-phrases. Current constraints for safety classification and labelling (P4952) do not permit partial labelling, quite correctly. EU GHS labelling can be added manually using proper sources, or semi-automatically by making a spreadsheet with data from such sources and adding this data using QS. I know of no other possibility right now for EU GHS. Wostr (talk) 13:34, 4 December 2020 (UTC)

Alright... In that case, what is a 'proper source' to use for this? ChemHobby (talk) 19:10, 4 December 2020 (UTC)

There are databases like GESTIS, and safety data sheets from trusted companies. It depends on the jurisdiction; I think there are more sources available for e.g. OSHA GHS, because there are different rules for GHS in the US. Wostr (talk) 04:10, 5 December 2020 (UTC)

Yes, GESTIS is fine, as opposed to PubChem. Have the GHS data from the Table of harmonised entries in Annex VI to CLP already been imported? --Leyo 19:48, 1 April 2021 (UTC)

Of course not, as this is not a complete set of GHS labelling. This would be sufficient for GHS classification. Wostr (talk) 23:32, 2 April 2021 (UTC)

I assume you refer to the missing P phrases. They are of lower importance compared to the other elements (hazards). --Leyo 11:48, 6 May 2021 (UTC)

In EU GHS labelling you can't automatically assign P-phrases so such databases are useless for EU GHS labelling. It would be sufficient only for GHS classification which is not yet implemented in WD. Wostr (talk) 20:56, 7 May 2021 (UTC)

The EU had good reasons for not including P phrases in the harmonised C&L. As stated above, P phrases could be left as unknown value. The associated constraints need to be adapted.

As an alternative, a separate property could be created for harmonised C&L (i.e. from Annex VI of Regulation (EC) No. 1272/2008 (Q2005334)). --Leyo 22:05, 7 May 2021 (UTC)

The existing property was created with the intended purpose of covering all possible hazardous material classifications and labellings, not only GHS/NFPA 704, so I would be against creating another property just for harmonised EU GHS. This solution has some disadvantages, because one cannot add e.g. 'specific organ' information to some H-phrases. However, it was modelled like this because, with different properties for GHS, it would not be possible to retrieve the correct set of pictograms, phrases etc. What you propose above is possible with the existing model and without creating another property. For EU GHS it can be added just like now, with a main value 'Regulation (EC) No. 1272/2008 (Q2005334)', P-phrases set to 'somevalue' and an additional restrictive qualifier like criterion used (P1013) or similar that would point to 'harmonised labelling'. But the question is: would it be useful for anyone? I don't think that e.g. de.wiki or pl.wiki would use it, as it is better to use GESTIS, another reliable source or an SDS, because that way one has complete labelling info. Wostr (talk) 14:08, 11 May 2021 (UTC)

Well, its presence would be useful for finding articles that should have harmonised labelling but don't. Up to the 14th ATP it should be more or less completely present in the articles (see de:Kategorie:Wikipedia:Vom Gesetzgeber eingestufter Gefahrstoff), but newly created articles might have been missed. --Leyo 22:19, 11 May 2021 (UTC)

Many possible standard InChIs for the same chemical entity

I found this problem in hematoporphyrin (Q908742), but it is probably true for many porphyrin-like structures. Different databases give different InChIs for apparently the same chemical entity. My first thought was that the spatial configuration is different, but that is not the case. Depending on how you place the double bonds in the structure (which is basically arbitrary considering the delocalization), the generated InChI will be slightly different. I checked many possibilities using the IUPAC InChI software and I get a few different Standard(!) InChIs, every time with a /b sublayer (the one that contains information about double bond configuration). It is interesting how PubChem's StdInChI is generated, because it does not contain a /b sublayer at all (and even reproducing this InChI in the IUPAC software gives an InChI with a /b sublayer...).

The problem is that for this chemical entity more than one StdInChI may be correct, which should not happen at all. I propose that we use the InChI without the /b sublayer (PubChem's one) as the primary InChI (with preferred rank); the rest could be of normal rank, all with criterion used (P1013) with proper values like standard InChI for tetrapyrrole with/without /b sublayer. I don't think we should delete any InChIs, because every InChI seems to be valid. Any thoughts or comments? Or maybe I don't get this situation right? Wostr (talk) 23:56, 23 June 2021 (UTC)

Using PubChem InChIs (and keys) is fine with me; it would also solve the norbornan problem. Would this be just a recommendation? Should we have a bot checking these? --SCIdude (talk) 09:32, 24 June 2021 (UTC)
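
One way a checking bot could spot InChIs that differ only in the /b sublayer is sketched below in Python. This is purely string handling and rests on an assumption: dropping the /b segment from the text gives a usable comparison key, which is not guaranteed to be identical to what the InChI software itself would output. The but-2-ene strings only demonstrate the mechanics (its E and Z forms really are different compounds, unlike the porphyrin case), and the function names are made up for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def strip_layer(inchi: str, prefix: str = "b") -> str:
    """Drop every layer starting with `prefix`; the formula layer is always kept."""
    head, *layers = inchi.split("/")
    kept = [
        layer
        for index, layer in enumerate(layers)
        if index == 0 or not layer.startswith(prefix)
    ]
    return "/".join([head, *kept])


def differ_only_in_layer(inchis: Iterable[str], prefix: str = "b") -> Dict[str, List[str]]:
    """Group InChIs by their stripped form; keep groups with more than one variant."""
    groups: Dict[str, set] = defaultdict(set)
    for inchi in inchis:
        groups[strip_layer(inchi, prefix)].add(inchi)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Illustrative strings only.
candidates = [
    "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+",
    "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3-",
    "InChI=1S/CH4/h1H4",
]

conflicts = differ_only_in_layer(candidates)
```

Items whose InChIs land in the same group would still need a manual look, since only a chemist can tell whether the /b difference is an artefact of arbitrary double-bond placement or a real stereochemical difference.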

Query for isotope consistency check

I recently found a number of mistakes (or vandalism) where the number of neutrons for isotopes had been modified, apparently without anybody noticing. I found more problems using this query:

SELECT ?isotope ?rank ?neutron_number ?at_number {
  ?isotopeclass wdt:P279* wd:Q25276 .
  ?isotope p:P31 [ ps:P31 ?isotopeclass ;
                   pq:P1545 ?rank ] .
  ?isotope wdt:P1148 ?neutron_number ;
           wdt:P1086 ?at_number .

  # check that the ranking number is the sum of the atomic number and the neutron number
  FILTER (?at_number + ?neutron_number != xsd:integer(?rank))
}

Maybe it’s a good idea to add this as a complex constraint somewhere, but where? On the isotope (Q25276) item?

Also, the use of series ordinal (P1545) is not obvious, because the number of the first isotope of an atom is not "1" … author TomT0m / talk page 19:27, 24 November 2021 (UTC)

I finally added it on neutron number (P1148). I also added a constraint to check that the English label is correctly "[element name]-[number of nucleons]", which caught a few more mistakes and should make vandalism more difficult. Maybe you should add Wikidata:Database_reports/Complex_constraint_violations/P1148 to your watchlist :) author TomT0m / talk page 20:50, 24 November 2021 (UTC)

Thanks, watchlisted. (I've made a minor fix because you reversed the atomic number and neutron number properties). --99of9 (talk) 02:37, 25 November 2021 (UTC)
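
For anyone who wants to run the same two checks offline, for example on exported query results, here is a minimal Python sketch. It assumes the values have already been fetched; the `Isotope` record and its field names are invented for this illustration, and the label rule is simplified to "ends with a hyphen followed by the mass number", so labels of nuclear isomers with a trailing "m" would be flagged.

```python
import re
from typing import List, NamedTuple


class Isotope(NamedTuple):
    label: str           # English label, e.g. "carbon-14"
    atomic_number: int   # value of atomic number (P1086)
    neutron_number: int  # value of neutron number (P1148)
    series_ordinal: str  # series ordinal (P1545) qualifier, stored as a string


def find_problems(isotope: Isotope) -> List[str]:
    """Return the names of the checks that fail for one isotope record."""
    problems: List[str] = []
    mass_number = isotope.atomic_number + isotope.neutron_number

    # Check 1: the series ordinal must equal atomic number + neutron number.
    ordinal_ok = re.fullmatch(r"\d+", isotope.series_ordinal) is not None
    if not ordinal_ok or int(isotope.series_ordinal) != mass_number:
        problems.append("series ordinal")

    # Check 2: the label must end in "-<mass number>".
    match = re.fullmatch(r".+-(\d+)", isotope.label)
    if match is None or int(match.group(1)) != mass_number:
        problems.append("label")

    return problems


consistent = Isotope("carbon-14", atomic_number=6, neutron_number=8, series_ordinal="14")
tampered = Isotope("carbon-14", atomic_number=6, neutron_number=9, series_ordinal="14")

consistent_problems = find_problems(consistent)
tampered_problems = find_problems(tampered)
```

Because both checks derive the expected value from the atomic number and the neutron number, a change to either of those two values trips both checks at once, which is what makes this kind of silent edit easy to catch.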

Contents

A lot of duplicate data

Tautomer/zwitterion

Non-standard InChI

GZWDer added all (most?) of the US EPA CompTox dashboard

New property proposals

Difference between CAS numbers

Q5173335

604 duplicate InChIKeys

CAS and unspecified stereochemistry

InChI strings in Wikidata missing 'InChI=' prefix

PubChem 2D structures

ChEBI and mapping type

Q3268366/Q56702552

Untangling CAS IDs

AICS obsolete

First CAS number validation results

Script to check against CAS Common Chemistry

Wikipedia - Wikidata mismatches?

diallyl diglycol carbonate (Q409703)

~1~ and _1_ in labels

acetyl hexamethyl tetralin (Q2409972)

Class ontology

Class patterns

diphenylphosphoryl azide (Q723781)

Validation of CAS numbers; collaboration with Wikipedia?

Links in Wikipedia chemboxes

Local bot on DE:WP

Bot to populate missing GHS data from pubchem LCSS

Many possible standard InChIs for the same chemical entity

Query for isotope consistency check
