Wikidata:Requests for permissions/Bot/StreetmathematicianBot 2
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Not done, no progress since January. Feel free to re-open this if work resumes. Thanks. Mike Peel (talk) 18:42, 24 September 2022 (UTC)
StreetmathematicianBot
Operator: Streetmathematician
Task/s: Use the Crossref API to turn author name string (P2093) statements into disambiguated author (P50) statements based on ORCID iDs.
Code:
Function details: This task does not create new items. It adds statements linking existing author items to existing article items.
Bot operation: Where
- there is an item with a best DOI
- there is another item with a best ORCID iD
- Crossref API data associates the DOI with the ORCID iD
- the existing items are not linked
- all names match*: Crossref's first+family name, the author item's en: label, and the author name string of the article item
- the positions of the author in the lists of authors also match
the bot will:
- add the author (P50) statement with the author item as object, qualified with object named as (P1932) set to the existing author name string.
- copy references, and the position, from the author name string (P2093) statement to the author (P50) statement
- add an extra reference to the Crossref API, since that is the reference for the actual identity of the author
- remove* the author name string (P2093) statement
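The matching conditions above can be sketched roughly as follows. This is an illustrative sketch, not the bot's actual code (none was linked in this request). The function names `bare_orcid` and `find_match` are hypothetical, the author records are assumed to follow the shape of Crossref REST API author entries (`given`, `family`, `ORCID`), and the ORCID iD in the usage note is a made-up placeholder.

```python
from typing import Optional

# Crossref usually returns ORCID iDs as URLs; Wikidata stores the bare iD.
ORCID_PREFIXES = ("https://orcid.org/", "http://orcid.org/")


def bare_orcid(value: str) -> str:
    """Strip a leading orcid.org URL prefix, if present."""
    for prefix in ORCID_PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def find_match(crossref_authors, name_strings, author_label, author_orcid) -> Optional[dict]:
    """Return the position and name to convert, or None if any check fails.

    crossref_authors: author dicts in publication order (assumed Crossref shape).
    name_strings: maps a 1-based position to the article's author name string.
    author_label: the en: label of the candidate author item.
    author_orcid: the bare ORCID iD stored on the candidate author item.
    """
    hits = []
    for position, author in enumerate(crossref_authors, start=1):
        # Crossref must associate this ORCID iD with the DOI.
        if bare_orcid(author.get("ORCID", "")) != author_orcid:
            continue
        crossref_name = f"{author.get('given', '')} {author.get('family', '')}".strip()
        # Exact match of all three names, at the same position in the author list.
        if crossref_name == author_label == name_strings.get(position):
            hits.append({"position": position, "name": crossref_name})
    # Anything other than exactly one hit is left for human review.
    return hits[0] if len(hits) == 1 else None
```

For example, with `authors = [{"given": "Ada", "family": "Lovelace", "ORCID": "http://orcid.org/0000-0001-2345-6789"}]` and `name_strings = {1: "Ada Lovelace"}`, calling `find_match(authors, name_strings, "Ada Lovelace", "0000-0001-2345-6789")` returns the position and name to convert; any mismatch in name or position yields `None`, so the article is skipped.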
The Crossref API and the Crossref public data file are free to use for any purpose, and their results are not covered by copyright, with the exception of abstracts (which are not used for this task).
The data quality of the data provided by the Crossref API seems good enough to me to allow adding the new statements at normal rank.
I will perform test edits and link to them if there are no objections.
As the code currently stands, the bot will visit an article several times if several of its authors can be disambiguated, making one edit per author. This is suboptimal for articles with many authors, since it will result in lengthy edit histories.
Footnotes:
- Precise name matching will be used for the first runs. If there are no problems, less-precise matching (missing periods after initials; accents and diacritics missing in one version of the name but not the other; names that differ only in hyphenation of their components) may be useful, but I'm happy to restrict this request to the precise matches and create another request for less-precise matching once I can give full details on how it will operate.
- Whether author name string (P2093) statements should really be removed is currently being discussed on WD:PC. However, it should be pointed out that even if they are removed, the intention is that no information is lost in the process, and it will thus be possible to restore the author name string (P2093) statements at a later point automatically.
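The less-precise matching mentioned in the first footnote (periods after initials, diacritics, hyphenation) could be approached by comparing a normalized key instead of the raw string. The following is only a sketch under that assumption; `loose_key` is a hypothetical name, and the request itself gives no details of how such matching would operate.

```python
import re
import unicodedata


def loose_key(name: str) -> str:
    """Build a comparison key that ignores diacritics, periods, and hyphens."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Treat periods and hyphens as separators between name components.
    separated = re.sub(r"[.\-\u2010\u2011]", " ", no_marks)
    # Collapse the resulting runs of whitespace.
    return " ".join(separated.split())
```

With this key, "J. Smith" and "J Smith" compare equal, as do "Wei-Min Wang" and "Wei Min Wang". "Weimin Wang" is deliberately not merged with them, since joining components is a riskier transformation. The key is for comparison only; the original name strings would still need to be preserved as written.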
--Streetmathematician (talk) 19:45, 20 November 2021 (UTC)
- sounds good to me BrokenSegue (talk) 02:02, 21 November 2021 (UTC)
- 5 test edits. Streetmathematician (talk) 08:13, 21 November 2021 (UTC)
- @ArthurPSmith: you commented at WD:PC, but I hope it's okay to respond here:
- There are many articles with several matching author name strings. Some of those are mistakes, others appear legit. I would suggest skipping such articles for now, since I believe they need human attention.
- My plan is to start with exact name matches. After that:
- punctuation, capitalization, diacritics, and whitespace changes are worth handling automatically, I believe, but care must be taken to preserve or add all variants in use
- I'm not sure about name order and expanded names vs initials vs omitted names, which may cause a small number of false positives
- stemming, Damerau-Levenshtein neighbors (typos, minor spelling differences) and further transformations: as suggestions for semi-manual edits only
- Just to clarify, I'm not using data from ORCID, just ORCID iDs provided by Crossref and ORCID iDs already in Wikidata. Streetmathematician (talk) 07:37, 23 November 2021 (UTC)
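For the "Damerau-Levenshtein neighbors" idea in the plan above, a minimal sketch of how near-miss names might be surfaced for semi-manual review is shown below. It uses the optimal string alignment distance, the restricted variant of Damerau-Levenshtein, rather than the unrestricted algorithm. The names `osa_distance` and `suggest` and the threshold of 1 are assumptions for illustration, not anything specified in this request.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ht" versus "th".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def suggest(name: str, candidates, max_distance: int = 1):
    """List candidates that are close to, but not identical with, the name."""
    return [c for c in candidates if 0 < osa_distance(name, c) <= max_distance]
```

Results from `suggest` would only be candidates for a human to confirm, in line with the "semi-manual edits only" restriction; exact matches are excluded because they are already covered by the automatic path.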
- @Streetmathematician: There are many subtleties in name-matching, and some previous bots doing this sort of thing have done it poorly. Examples of issues beyond punctuation/hyphenation/capitalization are: (1) Handling of suffixes: Jr, III, etc. (2) The source for many author name strings (PubMed, I think) often reverses the name so the last name comes first, with the initials strung together after it, e.g. "Smith AP". (3) Spanish and other names with multiple "family name" components, with one source having only one of the family names as "last name": "Jose Garcia Hernandez" may need to match a last name of "Garcia", for example. (4) Chinese and other names where the family name is often first, but in western scientific publications often reversed as an author name, and the given name is often two syllables separated by a hyphen, or sometimes joined together depending on the source: "Wang Wei-Min" might also be "Wei-Min Wang", "Wei Min Wang", "Weimin Wang", "Wang WM", "W.-M. Wang", "W. Wang", "W-M. Wang", etc. Anyway, if this request is sticking to just exact matching of the name string and avoiding cases where there are multiple matches, then that should be fine for now; it will certainly be a big help to start with and we can revisit the other issues later. ArthurPSmith (talk) 15:58, 23 November 2021 (UTC)
- Thank you. I agree that matching names is very difficult. I'm proposing to do it only as a safety check to catch the (regrettably common) case in which our sources are inconsistent (so it's "is it implausible those two names refer to the same person" not "here are two lists of a million names, find out who's who"). Nevertheless, I would like to amend my original proposal to be restricted to exact matches only. Also, I've come across a few articles with very many authors, and I'd also like to leave those for later. Streetmathematician (talk) 16:53, 23 November 2021 (UTC)
@Streetmathematician: This seems to be stale, is this still active? Perhaps @Ymblanter, Lymantria: could comment? Thanks. Mike Peel (talk) 22:18, 18 January 2022 (UTC)
- The plan seems good to me, I would like to see some more test edits, say 100. Lymantria (talk) 06:17, 19 January 2022 (UTC)