Wikidata:Requests for permissions/Bot/Phenobot
Phenobot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- There having been no action here in two years, and there being no response from the bot operator, I am closing this request as a procedural, non-admin action. Anyone is welcome to reopen it at any time. — PinkAmpers&(Je vous invite à me parler) 04:19, 12 March 2018 (UTC)
Operator: Jjkoehorst
Task/s: The first step will be to improve the lineage annotation of organisms, including taxon identifiers, correct species names and corresponding references, using the UniProt Taxonomy database. The next step will be to add missing organisms to Wikidata, along with phenotypic information such as biosafety level, oxygen requirements and other features. Ongoing discussion can be found at User:Phenobot/Discussion.
Code: https://bitbucket.org/jjkoehorst/wikidatabots
Function details: This bot is based on the ProteinBoxBot framework. It will use the UniProt Taxonomy SPARQL endpoint for data extraction. Initially it will work on completing existing entries as far as possible with correct names and taxon identifiers, and missing species will be added to WD. For strains with existing phenotypic information, this can be complemented from various sources that are currently under investigation, such as GOLD or DSMZ. --jjkoehorst (talk) 15:13, 4 February 2016 (UTC) WikiProject Taxonomy has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead.
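As an illustration of the data extraction step described in the function details, here is a minimal sketch of how a query for one taxon might be built. It is not taken from the bot's repository. The endpoint URL, the `up:` and `taxon:` prefixes, and the predicates `up:scientificName`, `up:otherName` and `up:rank` are assumptions based on the public UniProt RDF vocabulary, and `build_taxon_query` is a hypothetical helper name. The sketch only builds the query text and does not contact the endpoint.

```python
# Sketch only: builds SPARQL text, sending it to the endpoint is left out.
# Assumed endpoint URL (not stated in the request itself).
UNIPROT_SPARQL_ENDPOINT = "https://sparql.uniprot.org/sparql"


def build_taxon_query(ncbi_taxon_id: int) -> str:
    """Return a SPARQL query for the names UniProt stores for one taxon.

    The predicates used here (up:scientificName, up:otherName, up:rank)
    are assumptions about the UniProt core vocabulary.
    """
    if ncbi_taxon_id <= 0:
        raise ValueError("NCBI taxonomy IDs are positive integers")
    return f"""
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
SELECT ?scientificName ?otherName ?rank
WHERE {{
  taxon:{ncbi_taxon_id} up:scientificName ?scientificName .
  OPTIONAL {{ taxon:{ncbi_taxon_id} up:otherName ?otherName . }}
  OPTIONAL {{ taxon:{ncbi_taxon_id} up:rank ?rank . }}
}}
""".strip()
```

For example, `build_taxon_query(208964)` yields a query about the taxon discussed further down as Pseudomonas aeruginosa PAO1.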
- @Succu: Can you have a look at this request? --Pasleim (talk) 10:32, 5 February 2016 (UTC)[reply]
- I have some problems with the task "correct species names". NCBI is not a nomenclatural database. It contains spelling errors, like other databases. I also have problems with this kind of sourcing. The NCBI-ID is already referenced; nothing is imported from UniProt. The disclaimer tells us „The NCBI taxonomy database is not an authoritative source for nomenclature or classification - please consult the relevant scientific literature for the most reliable information.“ --Succu (talk) 11:40, 5 February 2016 (UTC)
- Here the Bot removed taxon name (P225). --Succu (talk) 12:10, 5 February 2016 (UTC) PS: Pseudomonas putida 10-23 (Q22661287) P225 is missing. --Succu (talk) 07:24, 6 February 2016 (UTC)[reply]
- I agree with Succu. Why go change species names, based on UniProt? Could do serious damage. And indeed that kind of sourcing is unwanted and adds nothing: database is slow enough as it is. - Brya (talk) 11:58, 5 February 2016 (UTC)[reply]
- This proposal does not seem to be mature. The Uniprot taxonomy database is a customized version of the NCBI taxonomy database, which itself is not reliable for taxonomy anyway. It is currently not clear if the bot owner knows enough about taxonomy and nomenclature to understand the issues associated with Wikidata taxon items. Also the proposed use of imported from Wikimedia project (P143) does not seem appropriate.
- Nevertheless my understanding is that many of this bot's contributions would be made in microbiology, and the issues would be a little different if its contributions were limited to this area. Otherwise I see no reasons to prevent the bot from adding “biosafety levels, oxygen requirements and other [such] features”.
- —Tinm (d) 18:29, 5 February 2016 (UTC)[reply]
- Yes, the main basis of this bot will be within microbiology, and I can restrict the bot to prokaryotes. About the naming: what I am currently doing is leaving the name alone if it exists in UniProt taxonomy as either an other name or a scientific name. But I can leave the name as it is, since I am mostly relying on the taxonomic identifier from NCBI/UniProt. My main priority is to have the NCBI Taxonomy identifier correct and filled in, so that I can include the phenotypic characteristics and can also easily verify whether an organism page has been created and, if not, create one. I can also skip adding references if one is already available. --jjkoehorst (talk) 06:45, 6 February 2016 (UTC)
- Yes, this taxon name is pretty bad. And again, the fact that the rank is that of species does not need a reference (this is so by definition), and as there is a link to NCBI, the fact that the taxon name is accepted by NCBI does not need to be repeated in the form of a reference to taxon name. - Brya (talk) 07:44, 6 February 2016 (UTC) -also beyond understanding - Brya (talk) 07:51, 6 February 2016 (UTC) - And "instance of taxon" means that "taxon name" is present in the item. UniProt cannot know anything about that, so adding a reference to "instance of taxon" is pure misrepresentation. - Brya (talk) 07:58, 6 February 2016 (UTC)[reply]
- Sorry about the naming. I'll restrict the bot to prokaryotes only, if you prefer, and to updating only missing names and NCBI Taxonomy information. When that works out well, I'll make some property requests for the phenotypic information as stated earlier, ok? --jjkoehorst (talk) 09:31, 6 February 2016 (UTC)
- If that means 1) only missing names of prokaryotes and 2) sourcing only for NCBI Taxonomy information, then yes, OK. - Brya (talk) 13:00, 6 February 2016 (UTC)[reply]
- Looks like the databases are out of sync. NCBI taxonomy ID (P685)=208964 gives Pseudomonas aeruginosa PAO1 (www.ncbi.nlm.nih.gov/taxonomy) and Pseudomonas aeruginosa (strain ATCC 15692 / PAO1 / 1C / PRS 101 / LMG 12228) (www.uniprot.org/taxonomy). This explains „adjustments“ like this one. --Succu (talk) 11:05, 6 February 2016 (UTC)[reply]
- Looks like UniProt provides five separate names, rolled up into one entry? - Brya (talk) 13:00, 6 February 2016 (UTC)[reply]
- There is a mapping between NCBI taxonomy ID (P685) and a so-called „Official (scientific) name“ used by UniProt. So maybe we need a qualifier for P685 to indicate this name. --Succu (talk) 16:44, 6 February 2016 (UTC)
- Yes, I had an email conversation with UniProt, and this was their reply about that case: "The idea is not to use a concise name. The same strain may be known by different names because it has been deposited in different organizations (institutions, private companies, etc.) under different names. So we try to track these co-identical strain names used by the major concerned organizations for a specific strain. This name is stored as scientificName and all the variants are stored among the other names." --jjkoehorst (talk) 19:30, 6 February 2016 (UTC)
- So what's your conclusion? BTW: I stumbled over User:Phenobot/Discussion, which looks like an outline of the intended bot task, but not mentioned here. --Succu (talk) 20:28, 6 February 2016 (UTC)[reply]
- Well, in one way it makes sense to use a general nomenclature that encapsulates all possible extra names, but it is not the true scientific name. Maybe a taxon synonym name entry could be used that lists the other names belonging to this organism. Yes, the discussion page is for discussing the roadmap after the general taxon identification and naming is completed. Sorry that I did not mention it here; in my opinion it was not complete yet, but feel free to comment on it if you like... --jjkoehorst (talk) 08:02, 7 February 2016 (UTC)
- Strictly speaking these are not scientific names at all. The ICNP does not cover names at a rank lower than subspecies. AFAIK there is no formal system for naming strains, so this may well happen on an ad hoc basis, or according to a local standard. In fact, it would help somewhat not to put these in "taxon name". - Brya (talk) 08:29, 7 February 2016 (UTC)[reply]
- Then I would suggest that the names currently in WD should correspond to the NCBI nomenclature or to any of the UniProt names (scientific names / other names). If this is not the case and there is no reference available, then the name should be either the scientific name from NCBI or the one from UniProt. What do you think? And where would you place the other names? As a common name or something else? --jjkoehorst (talk) 08:40, 7 February 2016 (UTC)
- ? The names in NCBI/Uniprot are not scientific names (not regulated by a Code of nomenclature). The most obvious way to handle strains would be to have a property "strain name" (perhaps to be combined with "parent taxon", etc). - Brya (talk) 09:33, 7 February 2016 (UTC)[reply]
- My considerations are the same. --Succu (talk) 10:10, 7 February 2016 (UTC)
- I agree that a strain property should then be created which specifies the name of a strain. However, taxon name then becomes obsolete, at least for strains, if I am correct. The elements that are obligatory for strains are then parent taxon, taxon rank, NCBI Taxonomy ID, general labels and instance of. Is there anything else that can be used with the current properties? --jjkoehorst (talk) 11:49, 7 February 2016 (UTC)
- Yes, this new property should be used instead of P225. This would reduce "Format" violations of P225 too. --Succu (talk) 12:54, 7 February 2016 (UTC)[reply]
- Sounds good. Who is going to propose a new property for taxon name, and can this taxon name then also contain multiple values, such as synonyms of the strain name, or should another property be made for that? --jjkoehorst (talk) 14:47, 7 February 2016 (UTC)
- I think we need a second property, UniProt name, to model the relationship to the NCBI ID. In the case of strains we could use aliases to add the name variants. You can propose them at Wikidata:Property proposal/Natural science. --Succu (talk) 18:49, 7 February 2016 (UTC)
- A property "UniProt" to link to the UniProt-entries may be handy. Not sure what else you mean, as UniProt-entries may concern regular taxa as well as strains and whatever else UniProt includes. - Brya (talk) 06:40, 8 February 2016 (UTC)[reply]
- I am not much in favour of multiple names in one item, and including out-of-use names beside the current name seems like a recipe for disaster. But we really do need a separate property "taxon synonym (string)" beside the present "taxon synonym [item]". - Brya (talk) 15:53, 7 February 2016 (UTC)[reply]
- Yes we should request for a taxon synonym string variant. Then by default it would be the scientific name of the NCBI nomenclature if no better name is available? --jjkoehorst (talk) 19:50, 7 February 2016 (UTC)[reply]
- Synonyms are an area full of hidden dangers. What we may really need are:
- "taxon synonym, homotypic (item)"
- "taxon synonym, heterotypic (item)"
- "taxon synonym, homotypic (string)"
- "taxon synonym, heterotypic (string)"
- Especially heterotypic synonyms may vary strongly, depending on point of view (references!). Brya (talk) 06:40, 8 February 2016 (UTC)[reply]
- I looked into: Property:P1843 which is a common name for a given taxon. As basis we could use the NCBI nomenclature for strains (and/or others?). And over time add the homotypic/heterotypic naming. Shall I run a test with the restricted settings I have now? Only bacteria, no name updating if there is a name available and no reference adding if the value is already present? --jjkoehorst (talk) 08:01, 8 February 2016 (UTC)[reply]
- @Brya: Regarding how to handle synonyms, I have thought of a way of doing things that would solve a very big part of the issues we encounter with the current one. I'm going to make a post about that on the project talk page when I'll have a bit of time. It would imply significant changes but I really believe it would answer many issues efficiently. Anyway, I guess you will see when I put it up. —Tinm (d) 02:34, 9 February 2016 (UTC)[reply]
- I will be most interested to see what you come up with. - Brya (talk) 06:13, 9 February 2016 (UTC)[reply]
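The restricted settings proposed above for a test run (only bacteria, no name update when a name is already present, no reference added when the value is already present) can be sketched as a small decision function. This is an illustrative sketch under those stated rules, not the bot's actual code. `ItemState` and `plan_edits` are hypothetical names, and the action strings are placeholders rather than real API calls.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ItemState:
    """Hypothetical snapshot of the relevant parts of an existing item."""
    is_prokaryote: bool
    taxon_name: Optional[str]      # current P225 value, if any
    ncbi_taxon_id: Optional[str]   # current P685 value, if any


def plan_edits(item: ItemState, source_name: str, source_ncbi_id: str) -> List[str]:
    """List the edits allowed under the restricted settings.

    Rules taken from the discussion: skip non-prokaryotes entirely,
    never overwrite an existing name, and add a value with its
    reference only when the value is missing.
    """
    if not item.is_prokaryote:
        return []
    actions: List[str] = []
    if item.taxon_name is None:
        actions.append(f"add taxon name: {source_name}")
    if item.ncbi_taxon_id is None:
        actions.append(f"add NCBI Taxonomy ID with reference: {source_ncbi_id}")
    return actions
```

An item that already carries both values produces an empty plan, which corresponds to the bot leaving existing statements and their references untouched.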
Greetings all. I am part of the GeneWiki team and I am adding genes and proteins for bacteria under our MicrobeBot account (see the MicrobeBot task page). For my project it is important that distinct strain items with NCBI taxonomy identifiers remain, so that I can link genes and proteins to them via found in taxon (P703). Just a thought, but could we distill some of the views here into a mockup of a Wikidata strain item in the table below, using Pseudomonas aeruginosa PAO1 (Q21065234) as an example? I added some of the basics that are there for strain items now. I personally think a new 'NCBI strain name' type of property would be a good thing to have, as these strain names are directly linked to the NCBI Taxonomy ID. Putmantime (talk) 18:46, 9 February 2016 (UTC)
Property | Description | Datatype | Expected value (if not listed, see property definition) |
---|---|---|---|
P225 | taxon name | String | Species name? From NCBI, UniProt? |
P??? | strain name | String | Strain name from NCBI, UniProt, etc... |
P171 | parent taxon | Item | Bacterial species item e.g. Pseudomonas aeruginosa (Q31856) |
P105 | taxon rank | Item | Strain e.g. strain (Q855769) |
P685 | NCBI Taxonomy ID | String | 208964 |
What we are talking about is this:
Property | Description | Datatype | Expected value (if not listed, see property definition) |
---|---|---|---|
P??? | strain name | String | Strain name from NCBI, UniProt, etc... e.g. Pseudomonas aeruginosa PAO1 (Q21065234) |
P171 | parent taxon | Item | Bacterial species item e.g. Pseudomonas aeruginosa (Q31856) |
P105 | taxon rank | Item | Strain e.g. strain (Q855769) |
P685 | NCBI Taxonomy ID | String | 208964 |
P??? | UniProt ID | String | from UniProt, different from UniProt protein ID (P352) |
- Brya (talk) 04:42, 10 February 2016 (UTC)[reply]
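The strain item layout in the table above can be written down as a small data structure. This is only a sketch of the proposed model. `strain_statements` is a hypothetical helper, the "strain name" key is a placeholder because no such property existed at the time of this discussion, and the UniProt ID row is left out because the table gives no example value for it.

```python
from typing import Dict

STRAIN_RANK_QID = "Q855769"  # strain, as given in the table


def strain_statements(strain_name: str, parent_taxon_qid: str,
                      ncbi_taxonomy_id: str) -> Dict[str, str]:
    """Return the proposed statements for a strain item, keyed by property.

    Note that P225 (taxon name) is deliberately absent: under this
    proposal a strain carries a strain name instead.
    """
    if not ncbi_taxonomy_id.isdigit():
        raise ValueError("NCBI Taxonomy ID must be a string of digits")
    return {
        # Placeholder key: hypothetical, no property number assigned yet.
        "strain name (P???)": strain_name,
        "P171": parent_taxon_qid,   # parent taxon
        "P105": STRAIN_RANK_QID,    # taxon rank
        "P685": ncbi_taxonomy_id,   # NCBI Taxonomy ID (stored as a string)
    }
```

Calling `strain_statements("Pseudomonas aeruginosa PAO1", "Q31856", "208964")` reproduces the example row values from the table.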
- I agree. P225, P1420 and P1843 should not be taken from NCBI or UniProt. No items should be created on this basis. --Succu (talk) 06:51, 10 February 2016 (UTC) PS: I added UniProt protein ID (P352) and am now missing something like a UniProt name. --Succu (talk) 08:02, 10 February 2016 (UTC)
- Not sure what you mean by "UniProt name". Is this something like "Pseudomonas aeruginosa (strain ATCC 15692 / PAO1 / 1C / PRS 101 / LMG 12228)", which to me does not look like a name but five names, for what may be (deemed to be) one strain. - Brya (talk) 11:39, 10 February 2016 (UTC)[reply]
- Yes, the so-called „Official (scientific) name“ used by UniProt, mapped to NCBI taxonomy ID (P685). --Succu (talk) 12:01, 10 February 2016 (UTC)
- It is a long list, and many names are regular scientific names. Could you point out a few examples? - Brya (talk) 12:07, 10 February 2016 (UTC)
- 634452 ← Acetobacter pasteurianus (strain NBRC 3283 / LMG 1513 / CCTM 1153)
- 4024 ← Acer saccharum
- 441768 ← Acholeplasma laidlawii (strain PG-8A)
- 237531 ← Actinomycete sp. (strain K97-0003)
- 928294 ← Human adenovirus C serotype 1 (strain Adenoid 71)
- 262698 ← Brucella abortus biovar 1 (strain 9-941)
- 48984 ← Pantoea agglomerans pv. gypsophilae
- 45222 ← Parana mammarenavirus (isolate Rat/Paraguay/12056/1965)
- --Succu (talk) 12:23, 10 February 2016 (UTC)[reply]
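The examples listed above show that a UniProt name can bundle several strain designations inside one parenthesised qualifier. If those variants were to be stored as aliases, as suggested earlier in this discussion, they would first have to be split apart. The following is a minimal sketch that assumes the qualifier always has the form "(strain ...)" or "(isolate ...)" with variants separated by " / ", which is inferred from these few examples only. `split_uniprot_name` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Assumed pattern: base name, then "(strain ...)" or "(isolate ...)" at the end.
_QUALIFIER = re.compile(
    r"^(?P<base>.+?)\s*\((?P<kind>strain|isolate)\s+(?P<names>.+)\)$"
)


def split_uniprot_name(name: str) -> Tuple[str, List[str]]:
    """Split a UniProt name into its base name and strain/isolate variants.

    Names without a trailing qualifier are returned unchanged with an
    empty variant list.
    """
    cleaned = name.strip()
    match = _QUALIFIER.match(cleaned)
    if match is None:
        return cleaned, []
    # Split only on " / " with spaces, so "Rat/Paraguay/12056/1965" stays whole.
    variants = [part.strip() for part in match.group("names").split(" / ")]
    return match.group("base"), variants
```

Regular names such as Acer saccharum pass through untouched, which fits the observation that many entries in the list are ordinary scientific names.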
But not all these names are unique to UniProt. For example, Acer saccharum is a regular botanical name, and Pantoea agglomerans pv. gypsophilae appears to be in fairly widespread use, as is Brucella abortus biovar 1 (strain 9-941). - Brya (talk) 17:32, 10 February 2016 (UTC)
- My thought was that jjkoehorst wants to integrate these names somehow. If the speclist is important for the planned bot's job, I can provide some statistics. --Succu (talk) 18:36, 10 February 2016 (UTC)
- Eventually I would like to create a comprehensive but still useful taxonomy resource where people can easily search for organisms and their phenotypic characteristics, and where the information for a newly sequenced strain can easily be integrated into WD according to a defined data model. For this, however, a solid foundation needs to be established first, and that is what I was thinking of. In general the primary identifier is the NCBI Taxonomy ID, which can be complemented with NCBI scientific names and UniProt scientific / other names. If this would obviously introduce too many errors, or does not fit the idea of how we should define a strain, then that is perfectly fine with me. What has driven me from the beginning is that I want to connect phenotypic information from multiple resources to taxonomic identifiers and the corresponding genetic makeup. I could of course do this on my own machine for my own little project, and that would work fine, but no one else would benefit from it, and that's why I started working on the idea of this Phenobot (hence the name). On the bot's discussion page mentioned by Succu, I am expanding this idea with further phenotypic characteristics that I can get my hands on and that could in theory be integrated into WD, but I am still writing this up at User:Phenobot/Discussion. --jjkoehorst (talk) 21:04, 10 February 2016 (UTC)
- As an example, these are statements that would be interesting to add. Not all have properties, and I am preparing for that.
Property | Description | Datatype | Expected value |
---|---|---|---|
P1604 | biosafety level | Item | Level 1 Q18396533; Level 2 Q18396535; Level 3 Q18396538; Level 4 ... see Q21079489 |
P2043 | length / size | string | 902320 bp Q21481789 |
P??? | GC content | float | |
P??? | Gram staining | item | Gram positive Q857288; Gram negative Q632006 |
P??? | Pathogenic to | item | Human, Plant, Animal, etc... |
P??? | Motility | item | Chemotactic (Chemotaxis) Q658145; Motile Q3359; Nonmotile (not yet found) |
P??? | Environment | item or string | soil, seawater, marine sediment, forest soil, etc... |
P??? | Temperature range | item | Hyperthermophile Q1784119 |
P2076 | Temperature (optimal temperature) | | Q21079489 |
--jjkoehorst (talk) 09:11, 11 February 2016 (UTC)[reply]
- If all that is to be included in an item, it becomes understandable that Succu would like a UniProt name, and (presumably?) a separate item for each such UniProt entity. - Brya (talk) 17:26, 11 February 2016 (UTC)[reply]
- If I understand you correctly you mean to store the Biosafety/Gram/Temp/etc.. in a UniProt item? These are generic features from different sources (DSMZ/GOLD/etc) and are linked via the NCBI Taxonomy ID and in that case would not make sense to store these items under a uniprot name entry. --jjkoehorst (talk) 19:46, 11 February 2016 (UTC)[reply]
Back to the roots
Oppose: Back to the roots. „Code“ is protected. I see no reactions to error reports. The task is obscure. jjkoehorst, please roll back your bot's contributions. --Succu (talk) 22:32, 11 February 2016 (UTC)
- The code is unlocked and all revisions have been rolled back. Please, let's continue discussing what kind of shape would be acceptable for phenotypic information. --jjkoehorst (talk) 06:51, 18 February 2016 (UTC)
I think there is great value in elements of what is proposed, and it would make the microbial data on Wikidata a much richer resource. Metadata such as biosafety level, Gram -/+, etc. would be very useful, but UniProt may not be the best source for taxonomy identifiers and names. I think it would benefit this proposal to have a clear picture of what the scope of the project would be, and a clear definition of each bot task. Putmantime (talk) 23:16, 11 February 2016 (UTC)
- Putmantime, mind to help? --Succu (talk) 23:21, 11 February 2016 (UTC)[reply]
- Succu, yes, definitely... can we keep the discussion going on this proposal? I think it has merit, but needs to be clearer. The naming issue for subspecies items seems to have thrown a wrench in things. Personally I think NCBI is a good authority for strain names, because the name was submitted by the researcher who submitted the sequence data to NCBI, and that is when the NCBI Taxonomy ID was generated, as well as the genome IDs. It is not a scientific name, though, nor consistently formatted. I view it as an appropriate label, and maybe a new 'strain name' property, but I see that it shouldn't be a taxon name. Any synonyms could be aliases, IMHO. Putmantime (talk) 23:34, 11 February 2016 (UTC)
- I am in the process of rolling back the changes made by the bot. I think the focus of the conversation has shifted towards the naming issues, which still exist and need to be discussed thoroughly. Existing names will not be modified by the bot; its main focus is on the metadata that is available from various resources through the NCBI taxonomic identifier, which will not interfere with current information. I know that I initially started with the naming, but the main focus is on the metadata. Hopefully we can keep the discussion going on the naming scheme and microbial metadata, and come to a good agreement that improves the quality of information in Wikidata. --jjkoehorst (talk) 17:36, 12 February 2016 (UTC)
- In the NCBI Taxonomy, strains have no rank. We should reach a consensus that stating taxon rank (P105)=strain (Q855769) is OK. Otherwise we can use instance of (P31)=strain (Q855769) with taxon rank (P105)=novalue. --Succu (talk) 18:51, 12 February 2016 (UTC) E.g. Shigella flexneri 2a str. 301 (Q21102941), Putmantime. --Succu (talk) 22:13, 12 February 2016 (UTC)
- There are similar cases elsewhere: "virus" as a subspecific entity is not regulated by a Code of nomenclature. This goes also for "forma specialis", "pathovar", etc. We should have a structure for this. - Brya (talk) 06:17, 13 February 2016 (UTC)[reply]
- Yes, we should. If I remember right, f.sp. is used by IF and MycoBank as a rank. Strongly related to this bot's task is the question of Candidatus (Q857968). --Succu (talk) 19:18, 13 February 2016 (UTC)
- Yes, forma specialis is used by IF and MycoBank as a rank, but that does not make it a rank. And, yes, "Candidatus" is a similar problem case. - Brya (talk) 09:55, 14 February 2016 (UTC)[reply]
- @Jjkoehorst: Do you still plan on creating this bot? If not, I can close this request. — PinkAmpers&(Je vous invite à me parler) 23:46, 4 March 2018 (UTC)[reply]
We can close this request. --Jjkoehorst (talk) 17:58, 13 June 2018 (UTC)[reply]