Wikidata:Requests for permissions/Bot/Orcbot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
Approved--Ymblanter (talk) 20:11, 4 February 2022 (UTC)
Orcbot
Operator: EvaSeidlmayer
Task/s: The bot uses author-publication matches from the ORCID database to link existing publication items and author items in Wikidata.
Code: https://github.com/EvaSeidlmayer/orcid-for-wikidata
Function details: The bot aims to match author items and publication items on the basis of the ORCID database. Only author items and publication items that already exist in Wikidata are matched.
In 2019, the ORCID data consisted of eleven archive files. For the first archive file we were able to detect:
- 457K Wikidata publication-items (3.8M publications in total)
- 425K publication-items do not have any author-item registered
- 32K publications are identified in Wikidata with registered authors
- of those 32K publication-items:
- 3.7K author-items listed in Wikidata are correctly allocated to their publication-items (11.7%)
- 4.2K author-items listed in Wikidata are not yet allocated to publication-items (24.6%)
- The other authors are not registered to Wikidata yet.
These are the numbers only for the *first* of *eleven* ORCID-files. It would be cool to introduce the matching of authors to publications on the basis of ORCID.
- @EvaSeidlmayer: Thanks for working on this. One thing I don't see in your github README or your statement here is how you plan to match up the authors with the existing author name string (P2093) entries for these articles - or is the plan just to add the author (P50) entries with no qualifiers and not removing the existing name strings? Matching name strings is quite tricky, especially as given names are often abbreviated, some parts of names may be left out, joined together in different ways, etc. Not to mention name changes... And there can be two authors on the same paper with the same surname, or partially matching surnames. These issues have tripped up a number of automated approaches here in the past. There are also issues with duplicate or otherwise erroneous ORCID records, which have also tripped things up - for example there have been some major imports of this sort of author data from Europe PMC which, when there are duplicate ORCIDs, list both, resulting in an offset for all the author numbers (series ordinal (P1545) qualifiers) after that point. Anyway, this is definitely useful, but can be harder than it seems. ArthurPSmith (talk) 17:50, 30 July 2020 (UTC)
- @EvaSeidlmayer, ArthurPSmith: Support. It is true that the bot task is harder than described. However, it is important to begin even if the bot is not complete; she can develop the bot's source code further later on. Just a brief note: I advise Ms. Eva to create a user page for the bot on Meta. --Csisc (talk) 18:25, 30 July 2020 (UTC)
- @Csisc: I created a user page on Meta. However, I now don't see how I can create or retrieve an API token for the bot. Is there any documentation?
- Please also make some test edits.--Ymblanter (talk) 19:15, 12 August 2020 (UTC)
- @Ymblanter: I tried to make some test edits in the test Wikidata instance, taking into account the different property numbers as well. However, I was told I do not have the bot right for pushing to the test instance. Where can I get the bot right for the test instance? --Eva (talk) 16:03, 26 August 2020 (CET)
- Sorry, I am not sure I understand the question. Can you make about 50 test edits here? You do not need the bot flag for the test edits.--Ymblanter (talk) 18:58, 26 August 2020 (UTC)
- @Ymblanter: Hm. Strange. I checked again, the instance I refer to is the test instance: "wb config instance / https://test.wikidata.org/w/api.php" But when I push only *one* file (such as "wb create-entity Q123.json") I get: "{ assertbotfailed: assertbotfailed: You do not have the "bot" right, so the action could not be completed...." What did I do wrong? --Eva (talk) 09:44, 27 August 2020 (CET)
- Unfortunately, I do not know. You may want to ask at a better watched place such as the Project Chat--Ymblanter (talk) 18:27, 27 August 2020 (UTC)
- I managed to do some test edits with Orcbot in the test instance. In order to connect them with Orcbot afterwards by adding author statements (P242) to article items, I manually created some scientific article items and author items.
- authors:
- Josepha Barrio Q212734
- Shuai Chen Q212749
- Raphael de A da Silva Q212755
- articles:
- Prevalence of Functional Gastrointestinal Disorders in Children and Adolescents in the Mediterranean Region of Europe. Q212738
- Dietary Saccharomyces cerevisiae Cell Wall Extract Supplementation Alleviates Oxidative Stress and Modulates Serum Amino Acids Profiles in Weaned Piglets Q212750
- Amino-acid transporters in T-cell activation and differentiation. Q212751
- Dietary L-glutamine supplementation modulates microbial community and activates innate immunity in the mouse intestine. Q212752
- Insight in bipolar disorder: a comparison between mania, depression and euthymia using the Insight Scale for Affective Disorders. Q212753
- Changes in absolute theta power in bipolar patients during a saccadic attention task. Q212754
The articles now have an author statement, which was missing before.
The template for the connection looks like this: {"id": "Q212754", "claims": {"P242": {"value": "Q212755", "qualifier": [{"P80807": "('Rafael', 'de Assis da Silva')"}]}}}
@Csisc, Ymblanter: What is the next step to establish the Orcbot? --Eva (talk) 14:04, 1. September 2020 (CET)
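The connection template quoted above can be generated programmatically. The following is only a minimal sketch, not the code from the Orcbot repository: the function name `build_author_edit` and its parameter names are hypothetical, and the default property IDs (P242, P80807) are simply the test-instance IDs that appear in the quoted template.

```python
import json


def build_author_edit(article_qid, author_qid, name_parts,
                      author_property="P242", qualifier_property="P80807"):
    """Build an edit template shaped like the one quoted in the discussion.

    name_parts is a (given name, family name) tuple. The qualifier value is
    the string form of that tuple, mirroring the quoted example.
    The property IDs are test-instance IDs and will differ on the live wiki.
    """
    given, family = name_parts
    return {
        "id": article_qid,
        "claims": {
            author_property: {
                "value": author_qid,
                "qualifier": [{qualifier_property: str((given, family))}],
            }
        },
    }


edit = build_author_edit("Q212754", "Q212755", ("Rafael", "de Assis da Silva"))
print(json.dumps(edit))
```

Keeping the property IDs as parameters makes it possible to reuse the same builder for the test instance and for Wikidata itself, where the property numbers differ.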
- Could you please do a few edits here (they may be the same as on test wikidata if appropriate).--Ymblanter (talk) 20:06, 1 September 2020 (UTC)
- @EvaSeidlmayer: Can you write down the message in red issued by the compiler. --Csisc (talk) 09:49, 2 September 2020 (UTC)
- @Csisc: Not sure if this is the message expected, but this is what I get when I try to log in after I reset the credentials to: "invalid json response body at http://www.wikidata.org/w/api.php?action=login&format=json reason: Unexpected token < in JSON at position 0" This is the red part. However, first I am asked to "use a BotPassword instead of giving this tool your main password". --Eva (talk) 14:02, 2. September 2020 (CET)
- @EvaSeidlmayer: Try to use requests.post instead. See https://www.wikidata.org/w/api.php?action=help&modules=login for login documentation. --Csisc (talk) 14:17, 3 September 2020 (UTC)
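As a rough illustration of the suggestion above, a BotPassword login through the MediaWiki Action API is a two-step exchange: fetch a login token, then POST the credentials together with that token. The sketch below is an assumption about how this could be wired up, not the operator's code. The function name `bot_login` is hypothetical, and `session` is expected to be a `requests.Session()` or any object with compatible `get` and `post` methods.

```python
API_URL = "https://www.wikidata.org/w/api.php"


def bot_login(session, username, password, api_url=API_URL):
    """Log in with a BotPassword and return the API's login result string.

    session: a requests.Session() (or any object offering get/post that
    return responses with a .json() method).
    username: BotPassword user name, e.g. "Orcbot@Orcbot".
    """
    # Step 1: request a login token.
    token_response = session.get(api_url, params={
        "action": "query",
        "meta": "tokens",
        "type": "login",
        "format": "json",
    })
    token = token_response.json()["query"]["tokens"]["logintoken"]

    # Step 2: POST the credentials together with the token.
    login_response = session.post(api_url, data={
        "action": "login",
        "lgname": username,
        "lgpassword": password,
        "lgtoken": token,
        "format": "json",
    })
    return login_response.json()["login"]["result"]
```

With the `requests` library installed, this could be called as `bot_login(requests.Session(), "Orcbot@Orcbot", bot_password)`. Using a session matters because the login token is tied to the cookies returned in the first step.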
- Hey @Csisc:, when I'm logged in as EvaSeidlmayer@Orcbot using abc1def2ghi3jkl4mno5pqr6stuv7wxyz as password I receive this message: "permissiondenied: You do not have the permissions needed to carry out this action." I use Wikidata-CLI for the interaction. --Eva (talk) 22:43, 4. September 2020 (CE
- @EvaSeidlmayer: Try to use Orcbot as a username (just the bot username). You can also change to Wikidata Integrator (https://pypi.org/project/wikidataintegrator/). --Csisc (talk) 11:35, 8 September 2020 (UTC)
- It worked after I updated the bot password including "edit existing pages". :) Afterwards, I was able to do the test edits:
The authors are now registered (P50) to their publications:
- Q48080592 Changes in absolute... → Q47701823 Raphael de A da Silva
- Q40249319 Insight in bipolar... → Q47701823 Raphael de A da Silva
- Q43415493 The complete picture of changing pediatric inflammatory... → Q85231573 Josefa Barrio
- Q37721105 Dietary Saccharomyces cerevisiae... → Q61824599 Shuai Chen
- Q41082700 Amino-acid transporters.. → Q61824599 Shuai Chen
- Q51428341 Dietary L-glutamine supplementation.. → Q61824599 Shuai Chen
@Csisc:, sorry it took so much time! --Eva (talk) 09:27, 9. September 2020 (CET)
- @EvaSeidlmayer: This is an honour for me. --Csisc (talk) 15:00, 9 September 2020 (UTC)
What is the next step to get this approved? NMaia (talk) 13:28, 24 November 2020 (UTC)
- I still do not see test edits--Ymblanter (talk) 19:56, 25 November 2020 (UTC)
- @EvaSeidlmayer: Did you make the test edits by running the bot script with your account, e.g. this edit to add an author? I notice that you didn't add object named as (P1932) or series ordinal (P1545) qualifiers and the author name string (P2093) claim for the same author was not removed. Will Orcbot make these edits when importing data?
- Presuming Orcbot is going to add stated as qualifiers, will the name formatting be consistent with an item's existing author and author name string statements? Since the large imports of scholarly article (Q13442814) bibliographic data were, to the best of my knowledge, primarily from PubMed and CrossRef, there is a risk that using a different source (i.e. ORCID) could result in inconsistent data, such as a combination of initialised and full given names. It won't be an issue when adding authors to new publication items created by Orcbot. But it might be preferable to handle existing items differently and copy data from the existing author name string to a new author claim. Simon Cobb (User:Sic19 ; talk page) 01:08, 8 January 2021 (UTC)
- Hey @Sic19:, thank you for thinking along! Regarding the problem of potential different "name formatting" from PubMed, CrossRef and ORCID, the OrcBot requests all labels and aliases for an author QID (which is supposed to be registered as author (P50) to an article (reference ORCID public data file)). OrcBot uses the following command for doing this:
wb d author_QID | jq -r '.labels,(.aliases|.[])|.[].value' | sort | uniq
Then, OrcBot compares all of these spellings with the names stated in author name string (P2093). By this means, OrcBot makes sure that the series ordinal from author name string (P2093) can be transferred correctly to author (P50). Does this solve your objection? Did I understand you correctly? Eva (User:EvaSeidlmayer ; talk page) 19:07, 14 January 2021 (UTC)
Hey, sorry for the late response. Yes, OrcBot runs as EvaSeidlmayer. I can change this if necessary. When User:Rdmpage pointed out the missing handling of series ordinal (P1545) and author name string (P2093), I stopped OrcBot (in November 2020). I am currently working on improving OrcBot, which involves the transfer of information (series ordinal) from author name string (P2093) to author (P50). Afterwards, the author name string (P2093) statement will be deleted, as some tools cannot deal with both statements (author (P50), author name string (P2093)) at the same time. Eva (User:EvaSeidlmayer ; talk page) 18:44, 14 January 2021 (UTC)
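The comparison described above can be sketched in Python. This is an illustrative assumption, not Orcbot's actual implementation: the function names are hypothetical, the entity is assumed to be the JSON that `wb d` prints (with `labels` and `aliases` keyed by language), and the matching rule used here (exact match ignoring case and surrounding whitespace) is a deliberately simple stand-in for whatever rule the bot really applies.

```python
def name_variants(entity):
    """Collect all label and alias values of an entity, deduplicated and
    sorted, similar to what the jq | sort | uniq pipeline prints."""
    variants = set()
    for label in entity.get("labels", {}).values():
        variants.add(label["value"])
    for alias_list in entity.get("aliases", {}).values():
        for alias in alias_list:
            variants.add(alias["value"])
    return sorted(variants)


def match_name_string(variants, name_strings):
    """Find the author name string that equals one of the variants.

    name_strings: list of (name, series_ordinal) pairs taken from the
    article's P2093 statements; series_ordinal may be None.
    Returns the matching pair, or None if nothing matches.
    """
    wanted = {variant.strip().casefold() for variant in variants}
    for name, ordinal in name_strings:
        if name.strip().casefold() in wanted:
            return (name, ordinal)
    return None


author = {
    "labels": {"en": {"language": "en", "value": "Fiona D. McBryde"}},
    "aliases": {"en": [{"language": "en", "value": "Fiona McBryde"}]},
}
strings = [("J. Smith", "1"), ("Fiona McBryde", "2")]
print(match_name_string(name_variants(author), strings))
```

An exact comparison like this will miss abbreviated given names such as "F. McBryde" unless that spelling is present as an alias, which is the kind of case raised earlier in the discussion.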
Dear all @Csisc:, @Rdmpage:, @Sic19:, @JakobVoss: it took some time (due to some job issues) to rework the code of Orcbot in response to the remarks of User:Rdmpage. The new version takes care to transfer series ordinal (P1545) from author name string (P2093) to author (P50). It will also remove the author name string (P2093) statement afterwards and add a reference giving the source of the information via stated in (P248), currently ORCID Public Data File 2021 (Q110411020). In the last weeks I prepared the data from the ORCID public data file 2021 for ingest. I also tried to change the user indicating Orcbot activities from EvaSeidlmayer@Orcbot to Orcbot@Orcbot. However, I am not sure if it worked out, since I am not able to use Orcbot anymore. In late 2020 Orcbot was able to edit thousands of entries. Now I get the error message: "assertbotfailed: assertbotfailed: You do not have the "bot" right, so the action could not be completed." My config is:
{
  "instance": "https://www.wikidata.org",
  "credentials": {
    "https://www.wikidata.org": {
      "username": "Orcbot@Orcbot",
      "password": "..."
    }
  },
  "bot": true
}
I thought I had the bot rights, but right now I am not sure about the status of Orcbot in general. I would still like to complete the task of connecting journal articles to author (P50), as I am convinced this will help to improve the data quality of Wikidata. Thank you for shedding any light on this.
- @Ymblanter: Thoughts on this proposal? Thanks. Mike Peel (talk) 21:34, 18 January 2022 (UTC)
- It is too technical for me to have an opinion, I rely on others' opinions.--Ymblanter (talk) 21:35, 18 January 2022 (UTC)
- @Mike Peel: When I try to do test edits with the improved version of Orcbot to the test instance of Wikidata I also get an error: https://www.wikidata.org/wiki/Wikidata:Project_chat#permission_to_wikidata_test_instance? Eva (talk) 10:16, 19 January 2022 (UTC)
- @EvaSeidlmayer: Orcbot doesn't have 'bot' status on this wiki at the moment. I'm not sure how it works on the test instance - I can't see the chat you link to? But probably worth filing something on phabricator, or maybe asking at Wikidata:Report a technical problem. Thanks. Mike Peel (talk) 10:21, 19 January 2022 (UTC)
Orcbot made real progress. I managed to change the user to Orcbot, so all edits to the Wikidata test instance are finally marked as done by Orcbot.
These are some examples of adding authors to articles that have an entry on Wikidata but did not previously include the author (no author, no author name string). Orcbot adds the author's Q-ID as author. It checks the labels and aliases of the author. And Orcbot registers ORCID Public Data File 2021 (Q223807) as the reference for the information. In ORCID, the person themselves had identified the article as their own (with DOI, PubMed ID or similar):
- Q223858: Effects of Early Placement of Transjugular Portosystemic Shunts in Patients With High-Risk Acute Variceal Bleeding: a Meta-analysis of Individual Patient Data
author Horia Stefanescu (Q223860) stated as Horia Ștefănescu stated in ORCID Public Data File 2021 (Q223807)
- Q223906: Role of the Carotid Body in an Ovine Model of Renovascular Hypertension
author Fiona D. McBryde (Q223907) stated as Fiona McBryde stated in ORCID Public Data File 2021 (Q223807)
- Q223908 Severe Acute Respiratory Syndrome Coronavirus 2, COVID-19, and the Renin-Angiotensin System: Pressing Needs and Best Research Practices
author Matthew A. Sparks (Q223909) stated as Matthew Sparks stated in ORCID Public Data File 2021 (Q223807)
These are some examples of adding authors to articles that previously included them only as an author name string (and have an entry on Wikidata). Orcbot checks the labels and aliases of the author. It checks whether the author is already listed as an author name string. It checks whether a series ordinal is given in the author name string statement. Orcbot reads the series ordinal. It adds an author statement including the series ordinal and information on the reference. Finally, Orcbot deletes the author name string:
- Q223836 "The U.S. Geological Survey’s Rapid Seismic Array Deployment for the 2019 Ridgecrest Earthquake Sequence"
author Emily Wolin (Q223849) stated as Emily Wolin series ordinal 2 stated in ORCID Public Data File 2021 (Q223807)
- Q223835 "Pressure coring a Gulf of Mexico deep-water turbidite gas hydrate reservoir: Initial results from The University of Texas–Gulf of Mexico 2-1 (UT-GOM2-1) Hydrate Pressure Coring Expedition"
author Stephen C. Phillips (Q223848) stated as Stephen C. Phillips series ordinal 2 stated in ORCID Public Data File 2021 (Q223807)
- Q223829 "Influence of dilatancy on the frictional constitutive behavior of a saturated fault zone under a variety of drainage conditions"
author Derek Elsworth (Q223846) stated as Derek Elsworth series ordinal 2 stated in ORCID Public Data File 2021 (Q223807)
- Q223864 “Quality of cardiopulmonary resuscitation: Degree of agreement between instructor and a feedback device during a simulation exercise”
author José Ríos-Díaz (Q223870) stated as José Ríos-Díaz series ordinal 6 stated in ORCID Public Data File 2021 (Q223807)
- Q223862 Nup84 persists within the nuclear envelope of the rice blast fungus, Magnaporthe oryzae, during mitosis
author Chang Hyun Khang (Q223869) stated as Chang Hyun Khang series ordinal 2 stated in ORCID Public Data File 2021 (Q223807)
- Q223855 MR Imaging of Osteoid Osteoma: Pearls and Pitfalls
author Monica Epelman (Q223866) stated as Monica Epelman series ordinal 2 stated in ORCID Public Data File 2021 (Q223807)
- Q223853 Between the Big Trees: A Project-based Approach to Investigating Shape and Spatial Thinking in a Kindergarten Program
author Jane Page (Q223865) stated as Jane Page series ordinal 3 stated in ORCID Public Data File 2021 (Q223807)
- Q223879 Advanced Liver Fibrosis Predicts Unfavorable Long-Term Prognosis in First-Ever Ischemic Stroke or Transient Ischemic Attack
author Ji Hoe Heo (Q223880) stated as Ji Hoe Heo series ordinal 3 stated in ORCID Public Data File 2021 (Q223807)
- Q223881 A dose response relationship between accelerometer assessed daily steps and depressive symptoms in older adults: a two-year cohort study
author Brendon Stubbs (Q223882) stated as Brendon Stubbs series ordinal 2 stated in ORCID Public Data File 2021 (Q223807)
- @Ymblanter, Mike Peel: how many test edits do you want me to perform? --Eva (talk) 13:26, 26 January 2022 (UTC)
- We usually ask for about 50, but a lower number of fully representative tests would also be fine.--Ymblanter (talk) 13:43, 26 January 2022 (UTC)
- Dear @Ymblanter: please check Orcbot's multiple test edits:
- https://test.wikidata.org/wiki/Q223914 (no author name string yet)
- https://test.wikidata.org/wiki/Q224017 (no author name string yet, multiple other author name strings, one other author)
- https://test.wikidata.org/wiki/Q223928 (given author name string, given series ordinal)
- https://test.wikidata.org/wiki/Q223935 (given author name string, given series ordinal)
- https://test.wikidata.org/wiki/Q224002 (given author name string, given series ordinal)
- https://test.wikidata.org/wiki/Q223998 (given author name string, given series ordinal; other author had been stated)
- https://test.wikidata.org/wiki/Q223976 (given author name string with different alias, given series ordinal)
- https://test.wikidata.org/wiki/Q223987 (given author name string with different alias, given series ordinal)
- https://test.wikidata.org/wiki/Q223992 (given author name string with different alias, given series ordinal)
- https://test.wikidata.org/wiki/Q224072 (added multiple authors, some not stated before, one stated as author-name string with alias, one as author-name-string without series ordinal, one as author name string including series ordinal, ignoring author which is already stated as author, ignoring author which had no Q-Id yet.)
- https://test.wikidata.org/wiki/Q224070 (added multiple authors, some not stated before, one stated as author-name string with alias, one as author-name-string without series ordinal, one as author name string including series ordinal, ignoring author which is already stated as author, ignoring author which had no Q-Id yet.)
- https://test.wikidata.org/wiki/Q224085 (added multiple authors, some not stated before, one stated as author-name string with alias, one as author-name-string without series ordinal, one as author name string including series ordinal, ignoring author which is already stated as author, ignoring author which had no Q-Id yet.)
- https://test.wikidata.org/wiki/Q224087 (added multiple authors, some not stated before, one stated as author-name string with alias, one as author-name-string without series ordinal, one as author name string including series ordinal, ignoring author which is already stated as author, ignoring author which had no Q-Id yet.)
- https://test.wikidata.org/wiki/Q224094 (added multiple authors, some not stated before, one stated as author-name string with alias, one as author-name-string without series ordinal, one as author name string including series ordinal, ignoring author which is already stated as author, ignoring author which had no Q-Id yet.) Eva (talk) 17:00, 31 January 2022 (UTC)
- Looks good to me, I can approve the bot in a few days provided no objections have been raised.--Ymblanter (talk) 17:32, 31 January 2022 (UTC)
- Hi. Sorry, I think for articles with PubMed ID (P698) like https://test.wikidata.org/wiki/Q223914, the task Wikidata:Requests for permissions/Bot/Cewbot 4 will do a more complete job (with series ordinal (P1545))... Please see test edits like Genes and SNPs Involved with Scrotal and Umbilical Hernia in Pigs (Q110650530). Maybe you can edit the 10 articles shown in Wikidata:Requests for permissions/Bot/Cewbot 4, to let us know how these two bots interact? And will you create new researcher items with ORCID? Kanashimi (talk) 21:34, 2 February 2022 (UTC)
- Dear @Kanashimi:, I guess this is a misunderstanding: Orcbot will *not create any new item*, neither new article items nor author items. Orcbot establishes semantic connections (via author (P50)) between article items and author items only if both already exist in Wikidata. The aim is *no quantitative expansion* of Wikidata but a *qualitative enhancement*, as there are a lot of items with few semantic connections. Eva (talk) 12:52, 3 February 2022 (UTC)
- So in the case of https://test.wikidata.org/wiki/Q223914 the workflow was like this: A person "Ichiro Ikuta" had claimed authorship of an article in the ORCID public data file. We harvested that information and prepared it. Orcbot then checks whether the author and the publication are registered in Wikidata. If this is the case, Orcbot checks whether the author is already stated in the article item. If the author is stated as author (P50), nothing happens. If the author is not stated at all, Orcbot will register the author as author (P50). If the author is already listed as author name string (P2093) (which is quite often the case), she will get a proper author (P50) claim instead. The author name string (P2093) will be deleted after the other information (series ordinal (P1545)) has been transferred to the author (P50) statement. To be honest, https://test.wikidata.org/wiki/Q223914 is a fake article, as at some point I got tired of reproducing the articles in the Wikidata test instance for editing purposes. Eva (talk) 17:14, 3 February 2022 (UTC)
- Thanks for your explanation. I understand now. My bot is doing the same thing using different source. Kanashimi (talk) 21:47, 3 February 2022 (UTC)
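The three cases in the workflow described above can be summarised as a small decision function. This is only a sketch under stated assumptions: the function name `plan_author_edit`, the simplified claim dictionaries, and the case-insensitive exact name match are all hypothetical and are not taken from the Orcbot code. It only plans an edit and does not talk to Wikidata.

```python
def plan_author_edit(author_qid, author_names, article_claims, source_qid):
    """Decide what to do for one author/article pair.

    author_names: labels and aliases of the author item.
    article_claims: simplified claims of the article, e.g.
        {"P50": [{"value": "Q1"}],
         "P2093": [{"value": "Emily Wolin", "series_ordinal": "2"}]}
    source_qid: item used in the stated in (P248) reference.
    """
    def author_statement(qualifiers):
        return {
            "property": "P50",
            "value": author_qid,
            "qualifiers": qualifiers,
            "references": {"P248": source_qid},
        }

    # Case 1: the author is already stated as author (P50): nothing happens.
    for claim in article_claims.get("P50", []):
        if claim["value"] == author_qid:
            return {"action": "skip"}

    # Case 2: the author is listed as author name string (P2093):
    # carry over the series ordinal (P1545), then remove the name string.
    wanted = {name.strip().casefold() for name in author_names}
    for claim in article_claims.get("P2093", []):
        if claim["value"].strip().casefold() in wanted:
            qualifiers = {}
            if claim.get("series_ordinal") is not None:
                qualifiers["P1545"] = claim["series_ordinal"]
            return {
                "action": "replace_name_string",
                "add": author_statement(qualifiers),
                "remove": claim,
            }

    # Case 3: the author is not stated at all: add a plain author claim.
    return {"action": "add_author", "add": author_statement({})}
```

Returning a plan instead of editing directly keeps the decision logic testable without API access, and the removal of the name string can then be carried out only after the new author statement has been saved.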