User:SCIdude

"things, not strings"

Amit Singhal

Hello, I'm trying to improve the molecular biology part of Wikidata through manual and batch editing. Although I am a software developer (main language C++), I have prepared many books for Project Gutenberg (Q22673), contributed to German Wikipedia (Q48183) from 2006 to 2012 (as User:Ayacop), and biocurated extensively for GOA (Q28018111) and Reactome (Q2134522).

Ralf Stephan (Q67363620)

Babel user information
de-N This user speaks German as a native language.
en-3 This user has advanced knowledge of English.
fr-1 This user has basic knowledge of French.
la-1 This user can contribute with a simple level of Latin.
ru-0 This user has no knowledge of Russian (or understands it with difficulty).
it-0 This user is not able to communicate in Italian (or understands it only with considerable difficulty).
This user is a member of WikiProject Microbiology.
This user is a member of WikiProject Molecular biology.
This user is a member of WikiProject Chemistry.
This user is a member of WikiProject COVID19.

Current ideas:

Current TODO list:

  • add refs to rotavirus literature main subjects / uses that we did
  • use P10228 (facilitates flow of)

Also:

  • instance of protein fragment with "of"
  • construct accessible pipe for verifying TCDB ID of proteins / families
  • complexes from Protein Ontology
  • MeSH protein entries are usually species-independent. Check heuristically and use
  • connect Reactome entities with existing families
  • Arabidopsis and Dictyostelium import
  • PMCREF: use https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pmc&linkname=pmc_refs_pubmed&retmode=json&id=2685584
  • use "substrate of"
  • subc-of-enzyme inhibitor + phys. interact. with XY --> inhibitor of XY
  • subc-of-agonist + phys. interact. with XY --> agonist of XY
  • use subc to describe pgroups exactly
  • UniProt protein families
  • sync prot-->part of-->enzfam if exact molfunc is annotated
  • for every GO complex, list parts and make subunit families
  • multifunctional enzymes?
  • some proteins encoded by same gene, mark as variants
  • interpro and superfamily with description "InterPro Domain"---> really are domain superfamilies
  • check IPR items for correct Pfam (via IPR), also move Pfam from other item
  • GO items with a changed label are suspected to be the result of Wikipedians fumbling
  • if TCDB fam X subclass-of TCDB fam Y --> missing reference dbhierarchy heuristics
  • Reactome candidate sets missing "has part"
  • peptidases with endopep func
  • IUPHAR IDs without Wikidata, anyone?
  • IUPHAR family IDs, anyone?
  • membranome classes https://membranome.org/
  • add "stated as" qual. to ChEBI ids of amino acids / their zwitterions; make special constraint including this
  • BindingDB ids?
  • missing OMIM phenotypes, e.g. 1?
  • OMIM phenotypic series, see their FAQ
  • orthology group ids/groups, see bot issue
  • do all ions have charge in their label?
  • next MONDO sync?
  • German labels from Brockhaus
  • remove em-dashes from labels
  • items with dewiki but without enwiki and en-label
  • industry processes don't have-parts all reactants/modifiers
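
The PMCREF item in the list above could be sketched in Python roughly as follows. The URL and its parameters are taken from the list entry; the function names are hypothetical, and the JSON layout (`linksets` containing `linksetdbs` with a `links` array) is an assumption about the elink response that should be checked against a real reply before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL as given in the TODO entry.
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def pmc_refs_url(pmc_id):
    """Build the elink URL that asks for the PubMed IDs cited by a PMC article."""
    params = {
        "dbfrom": "pmc",
        "linkname": "pmc_refs_pubmed",
        "retmode": "json",
        "id": str(pmc_id),
    }
    return ELINK + "?" + urlencode(params)


def extract_pmids(payload):
    """Collect PubMed IDs from a decoded elink reply.

    Assumed layout: payload["linksets"][i]["linksetdbs"][j]["links"].
    Missing keys are tolerated and simply yield no IDs.
    """
    pmids = []
    for linkset in payload.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("linkname") == "pmc_refs_pubmed":
                pmids.extend(str(x) for x in linksetdb.get("links", []))
    return pmids


def fetch_pmc_refs(pmc_id):
    """Fetch and parse the reference list (performs a network request)."""
    with urlopen(pmc_refs_url(pmc_id), timeout=30) as response:
        return extract_pmids(json.load(response))
```

Calling `fetch_pmc_refs(2685584)` would request the same URL as in the list entry; keeping URL building and parsing separate from the network call makes both easy to check offline.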

In a manual attempt to create and curate Wikidata items for cleavage products (fragments) of proteins, I worked on the items around preproinsulin (Q7240673), angiotensinogen (Q267200), Ghrelin and obestatin prepropeptide (Q66216544), proglucagon (Q66310097), proopiomelanocortin (Q418896), Cerebellin 1 precursor (Q21115606), Natriuretic peptide B (Q422288), Endothelin 1 (Q66361339), Apelin (Q2386988), Tachykinin precursor 1 (Q21123080), Secretogranin II (Q21105303), Thymosin beta 4 X-linked (Q7799643), Vasoactive intestinal peptide (Q66499176), VGF nerve growth factor inducible (Q21122290), augurin precursor (Q66535298), Chromogranin A (Q3698322), and Cathelicidin antimicrobial peptide (Q411181).

What I'm doing is roughly this:

  • if gene and protein are in one item, duplicate to get separate items (moving sitelinks first to the protein)
  • remove wrong statements on either (e.g. no PDB/protein IDs/GOA function/localization annotations on genes), make sure the gene has at most GO process annotations
  • create/check all relevant fragment objects, move statements to the resp. item: EnsemblP should be on prepro/pro
  • separate out aliases to resp. objects
  • add "has part" with all fragments to prepro object
  • complete "encodes/encoded by" everywhere
  • add "exact match" qualifier to fragment UniProt like e.g. https://www.uniprot.org/uniprot/Q9UBU3#PRO_0000019202
  • add Reactome, ChEBI, ChEMBL, IUPHAR IDs to fragment if existing (Reactome labels like GENE(1-100) also to fragment aliases)
  • add "part of" Reactome process or reaction if missing
  • (maybe) move GOA function annotations to resp. fragment if applicable
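
Two of the steps above, the Reactome-style alias and the UniProt "exact match" URL, are mechanical enough to sketch in Python. This is only an illustration: the `Fragment` record and both function names are hypothetical, the gene symbol `GENE` and the range 1-100 are the placeholders from the list, and the accession and feature ID come from the example URL given there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """One cleavage product of a precursor protein (hypothetical record type)."""
    gene: str       # gene symbol as used in Reactome-style labels
    start: int      # first residue of the fragment, 1-based
    end: int        # last residue of the fragment, inclusive
    accession: str  # UniProt accession of the precursor
    chain_id: str   # UniProt feature ID of the fragment ("PRO_...")


def reactome_alias(frag):
    """Reactome-style label such as GENE(1-100), to be added as a fragment alias."""
    return f"{frag.gene}({frag.start}-{frag.end})"


def uniprot_fragment_url(frag):
    """URL pointing at the fragment's feature within the precursor's UniProt entry."""
    return f"https://www.uniprot.org/uniprot/{frag.accession}#{frag.chain_id}"


# Placeholder fragment: gene symbol and residue range are not real data.
example = Fragment("GENE", 1, 100, "Q9UBU3", "PRO_0000019202")
print(reactome_alias(example))
print(uniprot_fragment_url(example))
```

Generating both strings from one record keeps the alias and the qualifier consistent when many fragments of the same precursor are edited in a batch.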

misc

{{section resolved|~~~~}} {{Q|21105303}}