Wikidata talk:WikiProject Chemistry/ChemID


Database licences

These databases do not have an explicit public domain statement, and while ChEMBL and ChEBI are at least CC-licensed, their licences are not as liberal as CCZero. You must get a statement from them confirming that extracting the IDs from those two databases and publishing them as CCZero in Wikidata is OK. I have contacts in both teams and can assist. KEGG is closed, but they too may be willing to allow Wikidata to add the KEGG compound and drug identifiers. Egon Willighagen (talk) 23:00, 6 December 2013 (UTC)

@Egon Willighagen: Thanks for highlighting the licence problem. I didn't check this important point. So we have to organize some discussion with these databases to see what we can do with their IDs. In my opinion the best solution is to prepare a clear explanation of what we want to do and then to involve the Wikidata development team before contacting the database managers: an institution-to-institution discussion is better to ensure that the agreement will last even if the people who take part in the discussion disappear. Snipre (talk) 08:19, 8 December 2013 (UTC)

Database | Licence | Agreement
Wikidata | CC0 | Example
ChEBI | CC BY |
PubChem | US gov licence | ?
ChEMBL | CC BY-SA |
KEGG (public part) | ? | Need an agreement
ChemSpider | ? | ?
CAS number | Copyright | ~7000 CAS numbers are free of any agreement
UNII | US gov licence | ?
DrugBank | own licence | Need an agreement because of the non-commercial term

ChEBI

Text about licence for ChEBI data:

  • Download: here
  • Download format: SDF
  • Licence: CC BY

PubChem

  • Download: here
  • Download format: SDF
  • Licence: Text about copyright for PubChem data (see here):

Fair Use Disclaimer

Databases of molecular data on the NCBI FTP site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein. However, some submitters of the original data may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims and, therefore, cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.

KEGG

KEGG is a closed database, but a small part of the data is open data (see this information). We just need to get an agreement to extract data from that part (see data here). We can start from there and see if we can extend the agreement to other chemicals from the closed part of the database. Snipre (talk) 09:01, 8 December 2013 (UTC)

ChemSpider

Copyrighted: see here for terms of use. Snipre (talk) 09:22, 8 December 2013 (UTC)

CAS number

There have already been some discussions between the CAS registry database and en:WP. A list of free CAS numbers is available at http://commonchemistry.org/

UNII

  • Download: here
  • Download format: text file
  • Licence: Text about copyright for UNII data (see here):

Government information at NLM Web sites is in the public domain. Public domain information may be freely distributed and copied, but it is requested that in any subsequent use the National Library of Medicine (NLM) be given appropriate acknowledgement. When using NLM Web sites, you may encounter documents, illustrations, photographs, or other information resources contributed or licensed by private individuals, companies, or organizations that may be protected by U.S. and foreign copyright laws. Transmission or reproduction of protected items beyond that allowed by fair use as defined in the copyright laws requires the written permission of the copyright owners. Specific NLM Web sites containing protected information provide additional notification of conditions associated with its use.

DrugBank

  • Download: here
  • Download format: xml
  • Licence: Text about copyright for DrugBank data (see here):

DrugBank is offered to the public as a freely available resource. Use and re-distribution of the data, in whole or in part, for commercial purposes requires explicit permission of the authors and explicit acknowledgment of the source material (DrugBank) and the original publication (see below). We ask that users who download significant portions of the database cite the DrugBank paper in any resulting publications.

  • Open data sets: here
  • The DrugBank Open Data datasets are public domain datasets that can be used freely in your application or project (including commercial use). It is released under a Creative Common’s CC0 International License.

Available data in external databases Table

Hi,

Looking at the table in the section called Available data in external databases, it needs an explanation of what it signifies. I think I know what the x's mean, but without a clear statement of what the table means it is difficult to contribute. --The chemistds (talk) 12:05, 15 April 2014 (UTC)

Done Snipre (talk) 08:48, 17 April 2014 (UTC)

Where is this project up to?

Hi all. A collaborator and I are interested in this initiative. Can someone in the know comment on where it is up to? @Snipre, Egon Willighagen, Almondega, The chemistds: --99of9 (talk) 06:22, 26 March 2018 (UTC)

@99of9: No real progress: too few contributors working on chemical topics and no real coordinated work concerning chemicals. Personally I am working on curating data about chemicals, and more specifically on solving duplicate conflicts, mainly on CAS numbers (see here the list of conflicts for CAS numbers). But before starting more work on data reconciliation we need a better policy concerning special chemicals like keto-enol pairs: how do we treat those compounds? As 2 different compounds, so 2 items, or as one compound with 2 SMILES/InChI/InChIKey values?
If you want to start working on identifiers, perhaps you should first prepare a policy setting out the rules for considering a chemical sufficiently identified to allow the creation of a WD item.
As a starting job, try to solve the current violations under Wikidata:Database_reports/Constraint_violations/P235#"Unique_value"_violations or Wikidata:Database_reports/Constraint_violations/P234#"Unique_value"_violations: if we consider InChIKey or InChI to be absolute identifiers, we shouldn't have these violations. Snipre (talk) 00:01, 31 March 2018 (UTC)
And once the policy is adopted, we need to curate data before starting to add new data to existing items or even creating new items: uncontrolled data imports into WD created a lot of duplicates, or added data to the wrong items because they were not clearly identified. This is why I stopped working on this initiative. And now I am struggling to find good authority sources to define clearly the structure of some compounds, because some IDs like CAS numbers are often used to define mixtures or even chemicals of undefined composition. Typical examples are drugs where little information is available about the stereo structure actually present in the drug: is it a mixture of stereoisomers or a pure stereoisomer? Snipre (talk) 00:15, 31 March 2018 (UTC)
@99of9: There is a contradiction in the licence terms for DrugBank: your link indicates a CC0 licence, but section 13 of the terms of use page states that "Your access to DrugBank Content on the Platform is provided under, and subject to, a Creative Common’s Attribution-NonCommercial 4.0 International License." Clarification is needed. Snipre (talk) 13:10, 5 April 2018 (UTC)
@Snipre: Most of the data, and the website itself, are CC BY-NC 4.0, but they are offering just two datasets which are released as CC0. This is a reasonably sensible way to structure it, because the stuff they are releasing under CC0 is the identifiers, which is exactly what we need to link to their pages, even if the data on those pages is not PD. --99of9 (talk) 00:20, 6 April 2018 (UTC)
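
As a rough offline complement to the "Unique value" violation reports mentioned above, a downloaded query result can be scanned for values, such as InChIKeys, that appear on more than one item. The following is a minimal Python sketch, not an official tool: the function name `find_shared_values` and the sample rows (placeholder item IDs and keys) are hypothetical.

```python
from collections import defaultdict


def find_shared_values(rows):
    """Return {value: sorted item IDs} for values used by more than one item.

    `rows` is an iterable of (item_id, value) pairs, e.g. (QID, InChIKey)
    taken from a downloaded query result.
    """
    items_by_value = defaultdict(set)
    for item_id, value in rows:
        items_by_value[value].add(item_id)
    # A "unique value" constraint is violated when a value maps to 2+ items.
    return {
        value: sorted(items)
        for value, items in items_by_value.items()
        if len(items) > 1
    }


# Hypothetical sample data: placeholder item IDs and keys, not real records.
sample_rows = [
    ("Q-A", "KEY-ONE"),
    ("Q-B", "KEY-TWO"),
    ("Q-C", "KEY-ONE"),
    ("Q-A", "KEY-ONE"),  # a repeated row must not count as a second item
]

print(find_shared_values(sample_rows))
```

Grouping into sets rather than counting rows keeps repeated rows for the same item from being reported as conflicts.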

ChemID constraint violations: not so bad

Two weeks ago an extension of the RDF was announced that makes constraint violations accessible via the query service... I played with it and created a query to list all violations for properties that are instances of Wikidata property to identify substances (Q19833835):

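# List constraint violations on statements whose property is an instance of
# "Wikidata property to identify substances" (Q19833835).
SELECT ?item ?itemLabel ?prop ?propLabel ?violation ?violationLabel ?constraint ?class ?classLabel WITH {
  # Subquery: every statement ?s flagged as violating constraint ?y
  SELECT DISTINCT ?item ?z1 ?s ?y WHERE {
    ?s wikibase:hasViolationForConstraint ?y.
    ?item ?z1 ?s .
  }
} AS %RESULTS {
  INCLUDE %RESULTS
  # Keep only statements made with a substance-identifier property
  ?prop wikibase:claim ?z1 ;
        wdt:P31 wd:Q19833835 .
  OPTIONAL { ?y ps:P2302 ?violation }   # type of the violated constraint
  OPTIONAL { ?y pq:P1793 ?constraint }  # format as a regular expression, if any
  OPTIONAL { ?y pq:P2308 ?class }       # class qualifier, if any
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}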
Try it!

The good news is that the list is a lot shorter than I expected... fewer than 400 violations (and some interesting ones, like ChemIDs for works :). BTW (related), I also wrote up some thoughts on why I contribute to the ChemID project on my blog: http://chem-bla-ics.blogspot.com/2018/08/compound-class-identifiers-in-wikidata.html --Egon Willighagen (talk) 14:53, 18 August 2018 (UTC)

Query of chemicals having one InChIKey and the associated English article if available

SELECT * WHERE {
  ?compound wdt:P31 wd:Q11173 ;
            wdt:P235 ?inchikey.
  OPTIONAL { ?compound wdt:P231 ?cas } . 
  OPTIONAL { ?article schema:about ?compound; schema:name ?title; schema:isPartOf <https://en.wikipedia.org/> }.
}
Try it!
@ChemConnector: To get your list of WD items with InChIKey, plus the CAS number and the link to the English Wikipedia article if available, please follow the "Try it!" link above, then click the button in the bottom-left corner to run the query and wait 2-3 seconds. Once the data appear, you can download them using the "Download" button on the right, above the data. Let me know if you have any trouble. Snipre (talk) 18:41, 29 April 2019 (UTC)
SELECT * WHERE {
  ?compound wdt:P31 wd:Q11173 ;
            wdt:P235 ?inchikey.
  OPTIONAL { ?compound wdt:P231 ?cas } . 
  OPTIONAL { ?compound wdt:P234 ?inchi } . 
  OPTIONAL { ?compound wdt:P683 ?chebi } . 
  OPTIONAL { ?compound wdt:P592 ?chembl } . 
  OPTIONAL { ?compound wdt:P662 ?pubchem } . 
  OPTIONAL { ?compound wdt:P652 ?unii } .
  OPTIONAL { ?compound wdt:P715 ?drugbank } .
  OPTIONAL { ?compound wdt:P661 ?chemspider } .
  OPTIONAL { ?compound wdt:P3117 ?DSSTOX } .
  OPTIONAL { ?compound wdt:P2062 ?HSDB } .
  OPTIONAL { ?compound wdt:P2057 ?HMDB } .
}
Try it!


@Snipre: Thank you. This was very useful indeed. I have run the necessary query and downloaded the file to review offline. Cheers, Antony Williams 16:00, 3 May 2019 (UTC)
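
For repeated downloads, the same query can also be sent to the query service programmatically instead of through the "Try it!" page. Below is a minimal Python sketch under stated assumptions: it targets the public endpoint `https://query.wikidata.org/sparql` and relies on the standard SPARQL JSON results layout. The helper names `build_query_url` and `flatten_bindings` and the sample response (with placeholder values) are hypothetical, and no network request is made here.

```python
import json
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata query service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT * WHERE {
  ?compound wdt:P31 wd:Q11173 ;
            wdt:P235 ?inchikey.
  OPTIONAL { ?compound wdt:P231 ?cas } .
}
"""


def build_query_url(query, endpoint=WDQS_ENDPOINT):
    """Build a GET URL asking the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})


def flatten_bindings(response_text):
    """Turn a SPARQL JSON result document into a list of plain dicts.

    Unbound OPTIONAL variables are simply absent from a row.
    """
    document = json.loads(response_text)
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in document["results"]["bindings"]
    ]


# Hypothetical response with placeholder values, shaped like the
# SPARQL JSON results format; it is not real query output.
sample_response = json.dumps({
    "head": {"vars": ["compound", "inchikey", "cas"]},
    "results": {"bindings": [
        {
            "compound": {"type": "uri", "value": "http://example.org/item-1"},
            "inchikey": {"type": "literal", "value": "PLACEHOLDER-KEY"},
        },
    ]},
})

rows = flatten_bindings(sample_response)
print(build_query_url(QUERY))
print(rows)
```

The URL from `build_query_url` can be passed to any HTTP client; automated clients are generally expected to identify themselves with a descriptive User-Agent header.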

Confirmation tinyatoxin

@ChemConnector: If you are ready to help, could you confirm that CAS number 58821-95-7 corresponds to InChIKey WWZMXEIBZCEIFB-ACAXUWNGSA-N, as defined in ChemIDplus (https://chem.nlm.nih.gov/chemidplus/rn/58821-95-7)? InChIKey WWZMXEIBZCEIFB-ACAXUWNGSA-N is a completely defined stereoisomer, and I want to know whether that CAS value is instead used to describe a mixture of stereoisomers, defined as InChIKey WWZMXEIBZCEIFB-BNTGGEEQSA-N as in ChEBI. The Reaxys database links CAS number 58821-95-7 with InChIKey WWZMXEIBZCEIFB-JLNQTMMISA-N (Reaxys number 1416658). InChIKey WWZMXEIBZCEIFB-ACAXUWNGSA-N is defined as Reaxys number 18482743 without any CAS number.

WD item: tinyatoxin (Q539395), WP:en: Tinyatoxin

Thank you in advance Snipre (talk) 13:01, 1 May 2019 (UTC)[reply]

@Snipre: This is the type of thing I like to help run down. And, bottom line, what a mess. Looking on PubChem there are multiple chemicals with that skeleton, a number of them named Tinyatoxin https://pubchem.ncbi.nlm.nih.gov/#query=WWZMXEIBZCEIFB. ChemSpider has six variants http://www.chemspider.com/InChIKey/WWZMXEIBZCEIFB. The InChIKey matches the chemical for FDA, ChemIDplus, DrugPortal AND one of the PubChem entries. However, they are all NLM related, so it can be erroneous.

The CAS Number is explicit to an individual chemical as far as I can tell and the name is Benzeneacetic acid, 4-hydroxy-, [(2S,3aR,3bS,6aR,9aS,9bR,10R,11aR)-3a,3b,6,6a,9a,10,11,11a-octahydro-6a-hydroxy-8,10-dimethyl-11a-(1-methylethenyl)-7-oxo-2-(phenylmethyl)-7H-2,9b-epoxyazuleno[5,4-e]-1,3-benzodioxol-5-yl]methyl ester. The ChemIDplus structure, downloaded and named, is [(2S,3aR,3bS,6aR,9aR,9bR,10R,11aR)-2-benzyl-6a-hydroxy-8,10-dimethyl-7-oxo-11a-(prop-1-en-2-yl)-3a,6,6a,7,9a,10,11,11a-octahydro-2H,3bH-2,9b-epoxyazuleno[4',5':5,6]benzo[1,2-d][1,3]dioxol-5-yl]methyl (4-hydroxyphenyl)acetate, so it differs in ONE stereocenter. The name for the CAS structure equates to WWZMXEIBZCEIFB-CNYBVQIGSA-N, I believe. The SMILES should be Oc1ccc(cc1)CC(=O)OCC3=C[C@H]5[C@H]6O[C@@]2(O[C@]6(C[C@@H](C)[C@]5(O2)[C@H]4C=C(C)C(=O)[C@@]4(O)C3)C(=C)C)Cc7ccccc7. I went through a number of transformations to get here, but I think one stereocenter needs flipping relative to ChemIDplus, based on the name. Regarding ChEBI, one stereocenter is not defined in the stereo layer: t23-,28+,29-,32-,33-,34-,35?,36-/m1/s1 75.163.54.219 01:53, 2 May 2019 (UTC)

@Snipre: I have looked further into this regarding the CASRNs. Resiniferatoxin stereochemistry is represented as (2S,3aR,3bS,6aR,9aR,9bR,10R,11aR), versus (2S,3aR,3bS,6aR,9aS,9bR,10R,11aR) for tinyatoxin under 58821-95-7. Based on this check BETWEEN chemicals of the same class, I believe the stereoform should be the same. So my conclusion is that the stereoform on ChemIDplus is CORRECT. Antony Williams 19:15, 2 May 2019 (UTC)
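
The comparisons in this thread rely on the layout of a standard InChIKey: a 14-letter first block derived from the skeleton (connectivity), a 10-letter second block that changes with stereochemistry and other layers, and a final single letter. Below is a minimal Python sketch of that comparison, using only keys quoted above. The helper names `split_inchikey` and `same_skeleton` are hypothetical, and the check is purely textual: it shows that the keys share a skeleton, not which stereoisomer the CAS number really denotes.

```python
import re

# Layout of a standard InChIKey: 14 letters, 10 letters, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{10})-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its three hyphen-separated blocks."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return match.groups()


def same_skeleton(key_a, key_b):
    """True when both keys share the first (connectivity) block."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


# Keys quoted in the discussion above, labelled by where they were reported.
keys = {
    "ChemIDplus": "WWZMXEIBZCEIFB-ACAXUWNGSA-N",
    "ChEBI": "WWZMXEIBZCEIFB-BNTGGEEQSA-N",
    "Reaxys (for the CAS number)": "WWZMXEIBZCEIFB-JLNQTMMISA-N",
    "derived from the CAS name": "WWZMXEIBZCEIFB-CNYBVQIGSA-N",
}

reference = keys["ChemIDplus"]
for source, key in keys.items():
    skeleton, second_block, _ = split_inchikey(key)
    print(source, skeleton, second_block, same_skeleton(reference, key))
```

All four keys share the first block and differ only in the second, which is consistent with the disagreement being about stereochemistry rather than connectivity.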