User talk:LargeDatasetBot/archive

From Wikidata

Bad DOIs

This bot has been entering thousands of incorrect DOI values - presumably doing some sort of HTML escaping which leads to incorrect values. They should be stored in Wikidata without any such HTML escaping - in particular '<' and '>' have been translated to '&lt;' and '&gt;' - or even worse double-translated to '&amp;lt;' etc. These format errors do not appear to be in the original source - at least not when I look, for example, at Q70852128: the Europe PMC DOI I see when following the link appears to be correct. An example with double-escaping is Q70829386. ArthurPSmith (talk) 16:18, 7 November 2019 (UTC)[reply]
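Such values can be repaired mechanically. A minimal sketch, assuming Python (the helper name is hypothetical), of how single- and double-escaped DOI values could be normalised:

```python
import html

def unescape_doi(doi: str, max_rounds: int = 3) -> str:
    """Repeatedly HTML-unescape a DOI until it stops changing, so that
    double-translated values like '&amp;lt;' collapse all the way to '<'."""
    for _ in range(max_rounds):
        unescaped = html.unescape(doi)
        if unescaped == doi:
            break
        doi = unescaped
    return doi
```

Note that `html.unescape` also understands legacy uppercase entities such as `&LT;`, which matches the double-translation pattern reported above.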

Please let me know if you see any further issues for entities with PMID > 9000000.--GZWDer (talk) 13:55, 8 November 2019 (UTC)[reply]

Double authors in scientific articles?

There seems to be a problem with double author entries; please see: https://www.wikidata.org/w/index.php?title=Q53935729&oldid=1043090344 Kpjas (talk) 08:12, 19 November 2019 (UTC)[reply]

Another one: Gemma Nedjati-Gilani (Q87800322) and Gemma Nedjati-Gilani (Q38848664). Finn Årup Nielsen (fnielsen) (talk) 09:57, 17 March 2020 (UTC)[reply]

Block

Hey, I have blocked your bot because it's going too fast and causing issues on the infrastructure; please reduce the speed. Amir (talk) 14:50, 3 December 2019 (UTC)[reply]

removing authors? Please stop!

What was this edit about? Many authors had been identified with items, and the bot reverted most of them?! ArthurPSmith (talk) 18:52, 23 February 2020 (UTC)[reply]

@ArthurPSmith: The order of authors in PubMed and CrossRef is not always the same (in the example above, A. Piucci is the 500th author in CrossRef and 501st in PubMed). To prevent issues, the bot removed all authors before adding new ones. The bot will not edit at all if there are fewer resolved authors than already exist in the item (i.e. the bot will only increase the number of resolved authors). In the future some of the other authors may be resolved via other sources (e.g. the ORCID API). The current script will be largely EOL once all PubMed articles are imported.--GZWDer (talk) 04:28, 26 February 2020 (UTC)[reply]
@GZWDer: Your latest edit there just removed 539 identified authors on that item and replaced them with 140; that is NOT an improvement. Your edit also had the following issues: (1) none of the author statements has the "stated as" qualifier, which is important for identifying and correcting problems such as one author on the paper being matched to more than one name; (2) the last author in your version (#728) is wrong - C Prouve is #517 (and still listed); (3) you have TWO "Andrea Mauri"s linked to the paper, only one of which is correct - there's only ONE "A Mauri" in the original; (4) thanks to the two Mauris, it appears that every author after that (#412 and on) has an author number one higher than it should have. These problems are not simple to correct. Your code should never assert itself over previously human-curated entries like this. Adding references is fine; adding errors and wiping out previous work is not ok. ArthurPSmith (talk) 14:42, 26 February 2020 (UTC)[reply]
Before the bot's first edit the item had 120 authors, so the bot did increase the number of resolved authors. I have reverted my edit and the bot will not edit the item in its current state. Though it is clearly an error, Europe PubMed Central (the source of the bot's data) does list two "A Mauri"s, and also lists "LHCb Collaboration" as an author with the same ORCID as C Prouve's.--GZWDer (talk) 00:56, 27 February 2020 (UTC)[reply]
Thanks for undoing that one. On Mauri: the second Europe PMC link goes to this ORCID page, which says it's locked - it looks like somebody was claiming papers that were not really theirs. I suspect Europe PMC is applying some heuristic here rather than going to the original source article, so a lot of these large-collaboration papers may have issues like this. In particular, you might want to flag any paper where you find that particular ORCID id, as it looks likely to (A) be wrong and (B) have had the effect of bumping up all subsequent author numbers... ArthurPSmith (talk) 13:21, 27 February 2020 (UTC)[reply]
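A merge that aligns incoming names against the authors already on the item, rather than wiping and re-adding, would avoid both the duplicate and the renumbering. A rough sketch (assumption: plain name strings, not the bot's real data model; helper names are hypothetical):

```python
def name_key(name: str) -> str:
    # crude normalisation: lowercase, dots treated as spaces
    return " ".join(name.lower().replace(".", " ").split())

def align_authors(existing: list, incoming: list) -> list:
    """Keep every existing (possibly human-curated) author and add only
    those incoming names that match nothing already present."""
    seen = {name_key(n) for n in existing}
    return existing + [n for n in incoming if name_key(n) not in seen]
```

Under this scheme "A. Mauri" and "A Mauri" collapse to the same key, so the existing curated entry survives and no duplicate is created.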

Bad data in ORCID

Do you have a way of blacklisting external ids that have bad data? Rakesh Srivastava (Q56839662) has apparently hoovered up articles attributed to "R. Srivastava", regardless of identity. E.g., I changed Synthesis of 2,3,6-trideoxy sugar triazole hybrids as potential new broad spectrum antimicrobial agents. (Q38978416) to Ranjana Srivastava (Q89232283), and the other 70 or so articles are also questionable. Ghouston (talk) 03:38, 2 April 2020 (UTC)[reply]

@Ghouston: If you notify ORCID about the problem they usually lock the id so it won't be visible or used for these updates. ArthurPSmith (talk) 18:06, 2 April 2020 (UTC)[reply]
Thanks, it has been locked, although the bad data remains in Wikidata. Ghouston (talk) 13:54, 14 April 2020 (UTC)[reply]

This analytics link shows a subset of one hour of bot edits from 1 March:

https://tools.wmflabs.org/wikidata-todo/user_edits.php?sparql=SELECT+%3Fq%0D%0AWHERE+%7B%3Fq+wdt%3AP5008+wd%3AQ55439927%7D&pattern=%2F*+wbeditentity-update%3A0%7C+*%2F&user=LargeDatasetBot&start=2020030101&end=2020030102&doit=Do+it&format=html

It clearly shows ten diffs in which instance of (P31) review article (Q7318358) statements are simply removed. This is not acceptable, and it seems that large-scale removal of statements has been happening.
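For readability, the `sparql` parameter embedded in that link decodes to a one-triple query selecting every item on the ScienceSource focus list; a small Python sketch of the decoding:

```python
from urllib.parse import parse_qs, urlparse

# The analytics URL quoted above, with its percent-encoded SPARQL query.
url = ("https://tools.wmflabs.org/wikidata-todo/user_edits.php"
       "?sparql=SELECT+%3Fq%0D%0AWHERE+%7B%3Fq+wdt%3AP5008+wd%3AQ55439927%7D"
       "&user=LargeDatasetBot")
query = parse_qs(urlparse(url).query)["sparql"][0]
print(query)
# SELECT ?q
# WHERE {?q wdt:P5008 wd:Q55439927}
```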

In other words, the bot has been removing key data about medical literature.

Simply judging by this subset of edits to the focus list of the ScienceSource (Q55439927) project, the code used for editing items about scientific articles is flawed. Most of the review article (Q7318358) statements have been added by the NCBI2wikidata bot, over about nine months in 2019.

Please review your code, and please do not run any more such batches. We should discuss reverting what you have been doing. The edit summaries do not carry useful information about what you are doing, so I have to ask you to stop these edits. Charles Matthews (talk) 11:38, 6 April 2020 (UTC)[reply]

The code will mostly be EOL (it will only be used to create items for newly added PubMed articles; existing items will no longer be touched). A new system will be developed for the ORCID API.--GZWDer (talk) 19:13, 10 April 2020 (UTC)[reply]
Soon I am going to re-add those instance of (P31) statements using this result.--GZWDer (talk) 19:17, 26 April 2020 (UTC)[reply]

In relation to your recent edit summary: no, I will revert all such edits where your code has removed instance of (P31) statements added by my project. I estimate there are some thousands of such edits. Your comment above is unhelpful. It is not a real response to the points I made. Charles Matthews (talk) 19:16, 26 April 2020 (UTC)[reply]

@Charles Matthews:.--GZWDer (talk) 19:18, 26 April 2020 (UTC)[reply]

I am investigating more closely what has been happening, and posting to Wikidata:ScienceSource project/Removals of statements by LargeDatasetBot. Obviously I can do more to refine the analytics. It is open to you to offer your own figures. But it can be shown that something like 17K items have lost statements, and that the tool query I'm using only looks at a subset of those items.

My estimate above did not exaggerate. Charles Matthews (talk) 15:53, 27 April 2020 (UTC)[reply]

Hi GZWDer, it seems that your results are not of the same order as the 17K removals Charles Matthews is mentioning. Could you please elaborate? Lymantria (talk) 11:28, 29 April 2020 (UTC)[reply]
The current code does remove instance of (P31) = review article (Q7318358) if the bot edits the item. These statements will be added back in the future: a list of review articles will be generated and instance of (P31) backfilled to Wikidata.--GZWDer (talk) 11:33, 29 April 2020 (UTC)[reply]
Please note that my concern is not solely with review article (Q7318358). Additions may also be instance of (P31) systematic review (Q1504425), meta-analysis (Q815382), case report (Q2782326), or editorial (Q871232). There is information about all these publication types in PubMed, and the MeSH descriptor ID (P486) system of identifiers has something like 100 types that may occur there. It is really not going to be possible, easily, to give an exhaustive list of the types of instance of (P31) statements that may have been removed without analysing all the edits. I can only speak for those that I have added, with the bot or by hand (which I have done frequently).
Such information is exactly what is needed to understand the value of papers for medicine. Wikidata:ScienceSource project/MEDRS report is all about the use of Wikidata to implement the guideline w:WP:MEDRS that is fundamental to writing medical articles on enWP. Statements instance of (P31) systematic review (Q1504425) from PubMed are the gold standard for medical referencing. Charles Matthews (talk) 11:52, 1 May 2020 (UTC)[reply]
This diff shows LargeDatasetBot removing a "review article" statement a few minutes after your "The code will mostly be EOL" reply on 10 April. This shows that you didn't take the problem seriously, some days after I left a message here.
What is said on Wikidata:Bots is this: "In the case of any damage caused by a bot, the bot operator is asked to stop the bot." You did not stop the bot: you gave an unhelpful answer. "The bot operator is responsible for cleaning up any damage caused by the bot." Your comments make it harder to see you understand that. "When working in namespaces that allow for customized edit summaries, bots should always use descriptive edit summaries that indicate what task is being performed and indicate that the action is being performed by a bot." It seems you changed the edit summary, but did not fix the code. "Bots must stay within reasonable bounds of their approved tasks." It seems to me that your work on changing URLs did not stay within the scope of your approval.
At Wikidata:Requests for permissions/Bot/LargeDatasetBot there are various tasks mentioned, but it becomes vague. As Jura1 says on that page, "I'd rather not have someone run a bot who 'hopes' that known bugs get fixed later by someone else." Did you understand that reasonable comment?
This is not about the imports from PubMed. It is about what you are doing now with the bot, and the approach you have taken recently. Charles Matthews (talk) 14:40, 4 May 2020 (UTC)[reply]
Well, it is now one month since I first posted here, and I don't see any definite action to address the problem. This is a serious situation.
What else does the code remove? Others may be adding instance of (P31) statements to items in this area. We need to establish what LargeDatasetBot has done in terms of writing over the work of other editors. You simply cannot discuss a large problem like this one without giving it proper attention. You need to reply to the points. Charles Matthews (talk) 14:08, 6 May 2020 (UTC)[reply]
For now, this bot will not touch instance of (P31) and publication date (P577) of existing items.--GZWDer (talk) 09:45, 7 May 2020 (UTC)[reply]

Thank you for that answer. This is now a dispute in which I am asking you to comply fully with Wikidata:Bots. I will also draw attention to the fact that there seems to be no adequate process for dealing with the problems you are causing for other editors of Wikidata. You should reply promptly and helpfully, and the evidence is that you are reluctant to do that. Charles Matthews (talk) 14:27, 16 May 2020 (UTC)[reply]

@Charles Matthews, GZWDer: Please see the discussion at Towards more consistent P31 usage across the WikiCite corpus. --Daniel Mietchen (talk) 01:59, 20 May 2020 (UTC)[reply]

@GZWDer: LargeDatasetBot removed a statement from Mast cells in meningiomas and brain inflammation (Q21146660) on 24 February that was instance of (P31) systematic review (Q1504425), at the same time as removing a review article statement. I would be grateful for your explanation of the feature of the bot software that made this happen. It is clearly difficult to define precisely the damage that has been done, and replacing the review article statements will not be enough.

The investigation I have begun at Wikidata:ScienceSource project/Removals of statements by LargeDatasetBot#Removal from the focus list shows damage being done over the period 21 December to 4 May. On the figures I have given above, it suggests that daily removals of statements averaged 100 over that time. It also indicates that you ran faulty bot software until 4 May, for several weeks after I raised the issue with you. I'd be grateful for an explanation of that, too. Charles Matthews (talk) 15:01, 31 May 2020 (UTC)[reply]

@Charles Matthews: Before 7 May, once the bot edited an item, it would remove all existing statements with the same property as those the bot wanted to add. This is how WikidataIntegrator is designed. If your recent run does not add back the P31 statements, I am planning to import them from another source.--GZWDer (talk) 15:14, 31 May 2020 (UTC)[reply]
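The behavior described here corresponds to a replace-by-default merge for each property the bot writes (WikidataIntegrator's item engine does, if memory serves, offer an append mode for selected properties, but the default replaces). A simplified illustration, not the library's actual code:

```python
def write_statements(existing: list, incoming: list, append: bool = False) -> list:
    """Merge the statement values for one property.

    Default (append=False): the incoming list wins and every prior
    statement for that property is dropped - the failure mode discussed
    in this thread. With append=True, existing statements survive.
    """
    if append:
        return existing + [v for v in incoming if v not in existing]
    return list(incoming)
```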

@GZWDer: Thanks for the explanation. It seems that WikidataIntegrator is completely the wrong software to be using for the task. But I don't actually understand the explanation you have given, since sometimes statements are removed and sometimes not.

In any case, I go back to what I said originally. The whole run should be reverted. Why not? Your design decision disregarded the work of other people working here.

The fact that you continued to run the bot, without consultation, and without a definite plan, seems to me very serious.

Please let me know when you started using the bot with this program. Charles Matthews (talk) 16:08, 31 May 2020 (UTC)[reply]

@Charles Matthews: Why is it not enough just to re-add such statements? I have collected a list of two million review articles and it will be imported in the next 2-4 weeks.--GZWDer (talk) 16:11, 31 May 2020 (UTC)[reply]

@GZWDer: Please note that you are responsible for fixing all the damage you have caused. All of it. Every removed statement, of every kind. I mentioned other kinds of statements on 1 May.

Please answer the question about the date of the beginning of the bot run. Charles Matthews (talk) 16:16, 31 May 2020 (UTC)[reply]

@Charles Matthews: It is also possible to get lists of editorials (574,817 entries), meta-analyses (114,484 entries) and case reports (199,043 entries).--GZWDer (talk) 16:27, 31 May 2020 (UTC)[reply]
Note the issue may exist since the first edit of the bot (22 August 2019), though the bot will only edit an existing item if one of the following is met: 1. it lacks all of publication date (P577), published in (P1433) and title (P1476); 2. it lacks an ID (one of DOI, PubMed, PMC) that would be added; or 3. more resolved authors are found than the item currently has.--GZWDer (talk) 16:32, 31 May 2020 (UTC)[reply]
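Those three conditions amount to a gating predicate roughly like the following (the property IDs for DOI/PubMed/PMC and the dict shapes are assumptions for illustration, not the bot's actual code):

```python
CORE_PROPS = ("P577", "P1433", "P1476")  # publication date, published in, title
ID_PROPS = ("P356", "P698", "P932")      # DOI, PubMed ID, PMCID

def bot_may_edit(item: dict, incoming: dict) -> bool:
    # condition 1: the item lacks all three core bibliographic properties
    lacks_all_core = not any(p in item for p in CORE_PROPS)
    # condition 2: the incoming data carries an identifier the item is missing
    adds_missing_id = any(p in incoming and p not in item for p in ID_PROPS)
    # condition 3: the incoming data resolves more authors than the item has
    more_authors = incoming.get("resolved_authors", 0) > item.get("resolved_authors", 0)
    return lacks_all_core or adds_missing_id or more_authors
```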

Thank you for all this information. The overall situation with the issues reported on this page, and at Wikidata:Requests for permissions/Bot/LargeDatasetBot, seems completely unsatisfactory. User:Lymantria knows my views on this particular issue, and the other steps I have taken.

It is not acceptable to create hundreds of hours of work for others. The bots policy is not to become the subject of a negotiation. I know there are others who have posted to this page who think the same way.

While a powerful bot to create new items is a good idea, there need to be stronger safeguards for edits on existing items.

You need to respond constructively in the "Repairing" section below, and to accept that you now need to give it the highest priority. Charles Matthews (talk) 16:55, 31 May 2020 (UTC)[reply]

After detailed investigation, no other types of removal of "review article" statements appear. There are removals by LargeDatasetBot over the period 21 December 2019 to 4 May, spread evenly.

These statements contribute to the metrics for the project grant under which I was working in 2018-19. It is unacceptable that around 25% of those additions, made over around nine months in 2019, should simply be removed by another bot. They are for a curated list, developed by the project, relating to the reliability of chosen, Creative Commons-licensed sources.

If there are funding or project reasons for which you felt entitled to continue to run LargeDatasetBot after you knew the damage was occurring, please give them.

@GZWDer:, I have a serious request for you now. Please co-operate in getting the Wikidata:History Query Service tool working. There are technical issues (see User talk:Tpt) but I think as a developer you may be able to help. This tool, if available, would make it much easier to define and fix the kind of problems discussed in this thread, and others. See [1] for details.

It is not enjoyable to have disputes here that cannot be solved, and last several months. It would be a great benefit to the community to have the appropriate technology for dispute resolution.

Charles Matthews (talk) 13:39, 21 June 2020 (UTC)[reply]

@Charles Matthews: I have generated a list of 30016 pages that may be affected (the numbers are curids). As the next step, I will use a bot to check each of them.--GZWDer (talk) 14:31, 21 June 2020 (UTC)[reply]
Also: please let me know if there are users other than you who added instance of (P31) = review article (Q7318358).--GZWDer (talk) 14:33, 21 June 2020 (UTC)[reply]
@Charles Matthews: This is a list of the P31 statements removed, for items edited by you:

--GZWDer (talk) 14:22, 22 June 2020 (UTC)[reply]

@GZWDer: Thank you for this work addressing the issue raised here. Charles Matthews (talk) 14:40, 22 June 2020 (UTC)[reply]

LargeDatasetBot still removing linked authors

@GZWDer: I've noticed several edits in which LargeDatasetBot removed author (P50) statements and replaced them with author name string (P2093) statements, which undoes work linking author items. For example, see this edit before and after. LargeDatasetBot should not be removing author (P50) statements under most circumstances. John P. Sadowski (NIOSH) (talk) 22:13, 13 April 2020 (UTC)[reply]

The workflow of WikidataIntegrator is to create a "diff" against the existing item. For a publication, this includes a list of known (with ORCID) authors and a list of unknown authors. Because the order of authors in PubMed and other sources is not always the same, all existing authors are removed. As the diff contains a new PubMed ID, the bot thinks it is an improvement and saves it. In the future, more authors may be resolved using other sources.--GZWDer (talk) 22:28, 13 April 2020 (UTC)[reply]
Again, this workflow will change significantly in the future.--GZWDer (talk) 22:32, 13 April 2020 (UTC)[reply]
@GZWDer: In this case the number and order of authors before and after was the same. You should change the bot so that it does not remove any author (P50) statements. Removing these is always undesirable behavior, unless there is a verified mismatch. John P. Sadowski (NIOSH) (talk) 01:49, 14 April 2020 (UTC)[reply]
  • I agree, it's stupid to convert P50s into P2093s. I reverted the change at [2]. I don't see much value in adding PubMed references either, when the original data didn't come from PubMed but from the original DOI or the article itself. E.g., [3] says the publication date is 16 March 2020; PubMed says April 2020, because it's published in the "April issue", I suppose. Ghouston (talk) 10:43, 25 April 2020 (UTC)[reply]

Continuing concerns

Hi GZWDer,

I see this edit from a minute ago. It appears that the issues mentioned before have not been solved: instance of (P31) review article (Q7318358) still gets removed, and publication date (P577) is still replaced by a less precise one. I think those issues should be taken care of before proceeding. I will therefore block the bot. If the bot is repaired, let me know and do a test run. Lymantria (talk) 17:19, 4 May 2020 (UTC) P.S. The block duration is a week, but I expect you understand that I will lift the block when you are ready to test the repaired bot. Lymantria (talk) 17:22, 4 May 2020 (UTC)[reply]

@Lymantria: please unblock the bot for testing.--GZWDer (talk) 23:52, 4 May 2020 (UTC)[reply]

[Not Available]

For items like Q93996405, I'm not sure how much value there is if the title of the paper does not exist. If that happened on just one item it would be all in the game, but it seems to be the general label for many recently created items. Is this really the title of the article in your source, or is this a sign of an error in the process? Edoderoo (talk) 20:00, 11 May 2020 (UTC)[reply]

Not Available

Hi! In Q93995246, Q93995244, and Q93995242 you asserted that these articles are titled "[Not Available]", which I believe to be a false claim. I invite you to clean up this and any other cases. Cheers, Bovlb (talk) 20:03, 11 May 2020 (UTC)[reply]

When I chased the PubMed links, I easily found titles, in English and German. [See more on this below.]
I think the bot should never add a title of "[Not available]". (If that requires a special case, so be it.)
It also looks like there are titles available in PubMed which the bot isn't finding? —Scs (talk) 11:53, 12 May 2020 (UTC)[reply]
  • For whatever reason, there was a very high percentage of these on May 11 around 19:00; see contributions.
They all have non-English titles; I wonder if that has to do with the problem?
@GZWDer: Do you have plans to have the bot re-import these titles from PubMed correctly, or should we find another way? —Scs (talk) 12:41, 12 May 2020 (UTC)[reply]
  • I just noticed -- some of you probably knew this already -- that what we have here are two very different ways of fetching data from PubMed.
Take for example Q93991171. If you view our entity (as imported by LargeDatasetBot), it's got a PubMed ID (P698) of 11633140, which is linked to https://pubmed.ncbi.nlm.nih.gov/11633140/, which shows a German title of "Zwei Medizinische Texte Aus Assur". (We don't even have to guess the language; the page explicitly says "Article in German".)
So why did LargeDatasetBot give it an English title of "[Not Available]"? Well, that's easy to see, since LargeDatasetBot has properly referenced the claim: It's fetched from a webservice at www.ebi.ac.uk, specifically here, where the title is indeed listed as "[Not Available]." I don't know anything about www.ebi.ac.uk, or how it might be persuaded to return meaningful, non-English titles. —Scs (talk) 13:00, 12 May 2020 (UTC)[reply]
The English title may simply be removed, but I don't know whether it is better to replace it with the foreign one (i.e. add the German title as the English label) once available.--GZWDer (talk) 14:16, 12 May 2020 (UTC)[reply]
I don't think we should add title in German as English label. I think we should add title in German as German label. —Scs (talk) 14:33, 12 May 2020 (UTC)[reply]
If you know the title, but not its language, you could add it with title (P1476) and language code "und". Also, the title can be an alias. --- Jura 15:19, 12 May 2020 (UTC)[reply]
@GZWDer: Why is this still happening five days later? Mahir256 (talk) 09:05, 17 May 2020 (UTC)[reply]
@Mahir256: Currently only "[Not Available]." and "Not Available." are caught. This is going to be fixed. Please ping me if this occurs on an item with a PubMed ID higher than 32000000.--GZWDer (talk) 09:58, 17 May 2020 (UTC)[reply]
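Exact string matching is fragile here. A slightly looser placeholder check (a sketch, assuming Python; the variants covered are guesses from the reports above, not the bot's actual code) would catch bracketed and unbracketed forms, with or without a trailing period:

```python
import re

# Matches "[Not Available].", "Not Available", "[not available]", etc.
PLACEHOLDER_RE = re.compile(r"^\[?not available\]?\.?$", re.IGNORECASE)

def is_placeholder_title(title: str) -> bool:
    return bool(PLACEHOLDER_RE.match(title.strip()))
```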

Hi! I'm seeing 10,474 items with the title "[Not Available]". [4] Do you have a plan for tidying this up? Cheers, Bovlb (talk) 14:33, 21 May 2020 (UTC)[reply]

@GZWDer: You don't seem to have had time to respond to my question, but I see you have been making good progress on fixing up the titles and my query immediately above now returns no results. Unfortunately, the same problematic string has crept into many labels, as shown in the 10,211 results for https://w.wiki/Ru7 Cheers, Bovlb (talk) 19:23, 26 May 2020 (UTC)[reply]

ORCID ID attached to wrong author name string

I noticed that this version of James T Yurkovich (Q87993450) had a label "Bernhard O Palsson" and an ORCID of 0000-0002-9403-509X, which comes with a name string of "James Yurkovich" and links to a Loop profile for "James T Yurkovich". That latter author also seems to be the correct choice for the currently 11 papers linked to the item. I have thus changed the label of the item from "Bernhard O Palsson" to "James T Yurkovich" but am puzzled as to how this particular kind of mismatch might have come about. --Daniel Mietchen (talk) 01:29, 20 May 2020 (UTC)[reply]

@Daniel Mietchen: I have noticed similar issues, though not quite so bad, which seem to stem from the source (Euro-PubMed?) having linked the ORCID to the wrong author in the author list. Their matching heuristics are quite imperfect - often they seem to look only at the last name, and may match the same ORCID to several authors at once. ArthurPSmith (talk) 03:24, 21 May 2020 (UTC)[reply]

Repairing

According to the comments above, there is damage done to existing items. I see you are repairing the bot in order to stop it causing more trouble, but I see no plans or actions to actually repair the earlier damage. According to bot policy, you should ("The bot operator is responsible for cleaning up any damage caused by the bot"). Please set out your plans. Lymantria (talk) 15:38, 27 May 2020 (UTC)[reply]

I will start a second round of rechecking PubMed in several hours (affecting ~40,000 items); once this is done I will fix the existing ones.--GZWDer (talk) 20:15, 27 May 2020 (UTC)[reply]
Note some newly created items may have the same DOI as existing ones, but with different titles. This may be due to 1. an error in EuropePMC or 2. a DOI assignment pointing to a page that has multiple articles. If the community wants, the duplicated DOIs may be marked as deprecated.--GZWDer (talk) 06:07, 30 May 2020 (UTC)[reply]

LargeDatasetBot still removing linked authors, again

LargeDatasetBot is still removing author (P50) statements in at least some cases. See [5] and [6] for just two recent examples I found. As discussed multiple times before (#removing authors? Please stop! and #LargeDatasetBot still removing linked authors), this is undesirable behavior, as existing P50 statements should never be removed unless there is a verified error. The bot also seems to be removing object named as (P1932) qualifiers, which are useful to keep for verification. Could you please immediately stop the bot from removing any author (P50) statements, as it's time-intensive for me to track down and restore them? John P. Sadowski (NIOSH) (talk) 01:03, 30 June 2020 (UTC)[reply]

More junk

Why is this bot adding invalid data (examples: [7], [8], [9], [10]), when validation checks can easily be done? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:24, 21 July 2020 (UTC)[reply]