Wikidata talk:WikiProject Source MetaData/Archive 3


Keywords

Hi! How should I present the keywords in the case of scientific papers (most papers contain 5-7 keywords, which best describe the topic of the article)? Should I use main subject (P921), or do we have another property for that? Samat (talk) 23:06, 20 December 2017 (UTC)

P921 is fine. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 11:51, 21 December 2017 (UTC)

The problem is that the keywords in scientific articles are very specific expressions (they are not as general as, say, materials science or medical science), while main subject (P921) requires existing Wikidata items. Most of the keywords (or even a more general topic or subject) don't exist in Wikidata, and I am not sure whether I should create them... I tried to find the most closely related existing items, but 1) often even that doesn't help, and 2) it means I don't use the original keywords but only similar, related ones. Samat (talk) 22:03, 7 January 2018 (UTC)

@Samat: do you have examples, please? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 14:23, 11 January 2018 (UTC)

New property proposal


Since the tasks to improve referencing in Wikidata scored low on the Community Wishlist Survey, I'm thinking of taking a baby step.

I'd like to propose a new pair of properties <reference ISBN-13> and <reference ISBN-10>. These would link to an external target just as ISBN-10 (P957) and ISBN-13 (P212) do, and could be qualified with page numbers, author name strings, etc. as the editor chooses.

By themselves, these properties would make sourcing statements to books easier for editors - especially new editors. Over time, this data could be used by bots to create new work and edition items, and ultimately to replace the <reference ISBN-XX> statement with <stated in>, but that step is not essential to making these references useful for verifying statements.

What do project members think? - PKM (talk) 19:49, 30 December 2017 (UTC)

@PKM:, Wikidata won't solve the reference problem until we have an easy way of adding references. After going through the long process of creating a book reference I wrote a sandbox proposal for a reference string to hold something like "Jane Lancaster (2004), Making Time: Lillian Moller Gilbreth, a Life Beyond "Cheaper by the Dozen", Northeastern University Press". I have not proposed it since there is such strong insistence here that sources must be Wikidata items themselves (and authors also items and publishers also items and references used to support author and publisher items also items and references to support those items also items...).
I think your suggestion has a lot of merit, but ISBNs are not unique. It costs money to register one, so small publishers reuse them (defeating the purpose, of course). My late husband and I had a bookstore that sold on Amazon, and we had several serious problems with this.
What about suggesting to editors that they create an abbreviated item entry for book sources, just the title and ISBN? This is much easier than creating a complete source item. The project importing tens of thousands of scientific articles has set a standard of using the full article title as the label as well as the title. This would work for labels for books, including the subtitle with the title. The resulting item would have the ISBN-10 or -13 link with enough information that the non-uniqueness wouldn't be a problem. We could add that to Help:Sources. StarryGrandma (talk) 19:59, 8 January 2018 (UTC)

Importing all articles from the EFSA journal

The EFSA Journal is an open-access publication of the European Food Safety Authority. It carries periodic evaluations of food additives, and it would thus be very useful to link those articles to the food additives that exist in Wikidata, paving the way for sourced recommendations in outside projects like Open Food Facts.

http://onlinelibrary.wiley.com/doi/10.2903/j.efsa.2017.4787/full

I have no idea how the import of periodicals is done in Wikidata. Is that a request for titles? Does it require tedious work?

Teolemon (talk) 19:28, 7 November 2017 (UTC)

How many articles do you expect to import? In any case please start with an item about the journal (either create or improve) before creating article items -- JakobVoss (talk) 21:15, 9 November 2017 (UTC)
@JakobVoss: Possibly all articles related to food additive evaluations (and re-evaluations), so potentially quite a lot. I've just created the item: Q45098548. Teolemon (talk) 13:04, 9 December 2017 (UTC)
@Teolemon: Can you do an example for one article? Snipre (talk) 13:17, 21 December 2017 (UTC)
I just created Q46394883; there is much more structured data at https://api.crossref.org/v1/works/http://dx.doi.org/10.2903/j.efsa.2017.4787 (from http://onlinelibrary.wiley.com/doi/10.2903/j.efsa.2017.4787/full), but I can't believe a tool doesn't exist yet to do this automatically. Teolemon (talk) 21:42, 21 December 2017 (UTC)
Crossref's metadata are often incomplete or partly wrong. In your case the API returns slightly more than 5000 articles. I don't think all of them are really useful as a source here, but if there is an agreement, my bot can create them. BTW: your example Re-evaluation of sodium nitrate (E 251) and potassium nitrate (E 252) as food additives (Q46394883) is missing a lot of basic properties. :( --Succu (talk) 22:21, 21 December 2017 (UTC)
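(For anyone wondering how to pull such records programmatically: below is a minimal sketch, assuming Python with the requests library against the public Crossref REST API; the ISSN shown for the EFSA Journal is an assumption, and the field names are only the common ones. It fetches one DOI's record and then pages through a journal's registered works using Crossref's cursor mechanism.)

    import requests

    CROSSREF = "https://api.crossref.org"

    def work_for_doi(doi):
        """Fetch the Crossref metadata record for a single DOI."""
        r = requests.get(f"{CROSSREF}/works/{doi}")
        r.raise_for_status()
        return r.json()["message"]

    def works_for_journal(issn):
        """Yield every work Crossref has registered for a journal, using cursor paging."""
        cursor = "*"
        while True:
            r = requests.get(f"{CROSSREF}/journals/{issn}/works",
                             params={"rows": 200, "cursor": cursor})
            r.raise_for_status()
            message = r.json()["message"]
            if not message["items"]:
                return
            yield from message["items"]
            cursor = message["next-cursor"]

    # The EFSA Journal article discussed above, then the whole journal (ISSN assumed).
    record = work_for_doi("10.2903/j.efsa.2017.4787")
    print(record["title"][0], record["DOI"])
    for work in works_for_journal("1831-4732"):
        print(work["DOI"])

Crossref asks heavy users to identify themselves (for example via a mailto parameter or User-Agent), and, as noted above, the returned records still need checking before being turned into Wikidata items.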
Duly noted for Crossref. Hopefully the more basic data (authors…) is right. Those articles have a lot of valuable data that probably won't be structured (whether the additive is authorized or dangerous, and if so in which doses…), in addition to the more classic data. Also, having items for the authors is especially important, since any conflict of interest with the food lobbies should be documented (and they shouldn't have any, ideally). I'll be adding a lot of data on this item to turn it into a showcase. Teolemon (talk) 08:20, 22 December 2017 (UTC)
I expanded your example Re-evaluation of sodium nitrate (E 251) and potassium nitrate (E 252) as food additives (Q46394883) with information taken from Crossref. If nobody objects I can create the items. --Succu (talk) 16:11, 22 December 2017 (UTC)
Thanks. Added the actual food additives the article is talking about, the XLS sources mentioned by the article, and a couple of other tiny things. acceptable daily intake (P2542) will also be interesting to summarize the findings. Teolemon (talk) 11:25, 23 December 2017 (UTC)
@Succu: Nothing to add(itive) it seems :-P I guess you can proceed Teolemon (talk) 21:20, 30 December 2017 (UTC)
Maybe next year? ;) --Succu (talk) 21:28, 30 December 2017 (UTC)
😇 Teolemon (talk) 15:31, 8 January 2018 (UTC)
Planned for this week. --Succu (talk) 15:34, 8 January 2018 (UTC)
Done, Teolemon. --Succu (talk) 16:02, 10 January 2018 (UTC)
<3 <3 <3 Teolemon (talk) 10:15, 11 January 2018 (UTC)
@Succu: Food Additives & Contaminants: Part A Chemistry, Analysis, Control, Exposure & Risk Assessment would be a nice addition: Food Additives and Contaminants Part A (Q3076429) and Food Additives And Contaminants Part B: Surveillance Communications (Q15724517), as well as Food Additives and Contaminants (Q15760547). Teolemon (talk) 21:23, 7 February 2018 (UTC)

Importing a defined batch of articles - best approach?

I have a batch of articles I'd like to import records for (they're the various papers in Biographical Memoirs of Fellows of the Royal Society (Q4914871), which I'm planning to crosslink to their subjects - we have about 300 but there's another 1600 to go). I have a list of valid DOIs but doing them one at a time through sourceMD looks impractical. Is there a tool for this somewhere, or is it best for me to get the metadata off crossref and produce entries from that? I'm willing to do that, but obviously would prefer to use a tool if one is available :-). Andrew Gray (talk) 16:57, 20 January 2018 (UTC)

  • Okay, I ended up writing a script for this :-). It takes a pile of JSON files from CrossRef and turns them into a QuickStatements upload (a rough sketch of that kind of conversion follows after this list). It's a bit disjointed and has some odd things hardcoded at the moment, but I'll try and get it polished up in the next few days.
Doing this import threw up some interesting issues, which I'll note here in case they're of use to anyone.
  • We seem to tend to call everything a scholarly article (Q13442814) even when it's not strictly speaking describable as scientific and a more generic academic journal article (Q18918145) would be appropriate. Then it turns out that academic journal article (Q18918145) is a subclass of scientific publication (Q591041) - there's probably some taxonomy cleanup needed here. Likewise I suspect a lot of non-scientific titles are labelled as scientific journal (Q5633421).
  • It's interesting how much changes in terms of what gets exported by CrossRef over a couple of years - I asked my import to match and update older papers where necessary, and very often the exact details had changed - author name strings had gone from initials to full names (or sometimes vice versa), pagination had shifted slightly, etc. This probably reflects one-off data cleanup by the source and is unlikely to happen several times over for a given paper, but it still surprised me. This also means doing a quick cleanup afterwards looking for (eg) two authors tagged as position 1, or two values for pagination, is a good idea.
  • Some older imported articles on Wikidata have come in from PubMed and have a PMID but no DOI (probably the DOI did not exist when they were first imported to PubMed and they've never been reconciled). These require manual deduplication or else some matching on title/author - but they also tended to be the ones with weaker title/author metadata, such as only having the first part of the title or slightly garbled author initials.
  • CrossRef can return some valid information for a deleted DOI, so if you're identifying all papers in a journal by querying a predictable DOI range, be careful to check for deleted values. Otherwise this happens...
  • It's not clear how we should model "in press" papers - available online but where we know that the metadata (date, pagination, etc) will change in the near future (usually a couple of months, but the extreme case I've seen is >3 years), e.g. Vladimir Igorevich Arnold. 12 June 1937—3 June 2010 (Q47485609). Is there a way to code "publication status"? Andrew Gray (talk) 17:45, 24 January 2018 (UTC)
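(A minimal sketch of the kind of CrossRef-JSON-to-QuickStatements conversion referred to in the first bullet - not the actual script. The property choices, the fixed day precision on dates, and typing everything as scholarly article (Q13442814) are simplifying assumptions; as noted above, the last of these is not always right.)

    def crossref_to_quickstatements(msg, journal_qid):
        """Turn one Crossref 'message' record into QuickStatements (v1) lines
        creating a new article item. Property choices here are illustrative only."""
        title = msg["title"][0].replace('"', "'")
        year, month, day = (list(msg["issued"]["date-parts"][0]) + [1, 1])[:3]
        lines = [
            "CREATE",
            'LAST\tLen\t"%s"' % title,
            "LAST\tP31\tQ13442814",                   # instance of: scholarly article (not always right)
            'LAST\tP356\t"%s"' % msg["DOI"].upper(),  # DOI
            'LAST\tP1476\ten:"%s"' % title,           # title (monolingual text)
            # simplified: always day precision, padding a missing month/day with 1
            "LAST\tP577\t+%04d-%02d-%02dT00:00:00Z/11" % (year, month, day),
            "LAST\tP1433\t%s" % journal_qid,          # published in
        ]
        for i, a in enumerate(msg.get("author", []), start=1):
            name = " ".join(p for p in (a.get("given"), a.get("family")) if p)
            lines.append('LAST\tP2093\t"%s"\tP1545\t"%d"' % (name, i))  # author name string + series ordinal
        return "\n".join(lines)

Each record becomes a short block of tab-separated lines starting with CREATE, which can be pasted into the QuickStatements tool; matching against existing items (by DOI or PMID) would have to happen before this step.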

Citation templates

I miss citation templates and modules such as Template:Cite Q (Q22321052) and Module:Cite Q (Q33429959) in Wikidata. There is Module:Cite, but what is the preferred way to use Wikidata items to create citations? -- JakobVoss (talk) 16:12, 3 February 2018 (UTC)

This grant proposal has another week to run. Plenty of comment has built up so far, including some longer threads on the Talk page.

One of the stated goals of the project is to add metadata to items about scientific articles. Also of some interest, I think, for this WikiProject is that such metadata would be put to use, for checking medical referencing. Charles Matthews (talk) 12:46, 13 February 2018 (UTC)

"bots set standards for best practices"??

Quote from the current version of the page:

I'm translating this and I don't really see how the global failure of the RfC implied that "bots set standards for best practices". Could anyone explain, or I'll rewrite the sentence. author  TomT0m / talk page 14:46, 2 March 2018 (UTC)

Wikimania submissions

As per Wikidata:Wikimania_2018#WikiCite, we now have a draft doc to coordinate WikiCite-related submissions. About 48h left until the deadline. --Daniel Mietchen (talk) 00:21, 17 March 2018 (UTC)

Satellite WikiCite track at the Wikimedia Hackathon 2018 in Barcelona

If you're attending the Wikimedia Hackathon 2018 in Barcelona in May, we'd love to have you at our WikiCite satellite track. Please add your name and any relevant project you'd like to hack on. And ICYMI: we also submitted a workshop proposal at Wikimania 2018 in Cape Town. See you there? --Dario (WMF) (talk) 22:33, 23 March 2018 (UTC)

copyright holder property?

For some publications, copyright is transferred to the publisher; for others, it remains with the authors. For instance, PLOS copyrights remain with the authors (example), and they can relicense their paper again if they like. On the other hand, Elsevier requires the authors to sign a w:Copyright transfer agreement, which means Elsevier can later relicense it (although retroactively making the CC license more restrictive doesn't work). Could we add a property for the copyright holder, in cases where it may not be the author? HLHJ (talk) 02:06, 27 March 2018 (UTC)

It exists, P3931, so I've added it to "Wikidata properties related to bibliographic metadata". HLHJ (talk) 23:43, 6 April 2018 (UTC)

Conflict-of-interest metadata

Some sources have serious conflicts of interest, which are not immediately obvious. For example:

Rippe, J. M; Angelopoulos, T. J (2016). "Sugars, obesity, and cardiovascular disease: Results from recent randomized control trials". European Journal of Nutrition. 55 (Suppl 2): 45–53. doi:10.1007/s00394-016-1257-2. PMC 5174142. PMID 27418186.

This looks like it might be a solid medical review, and a good medical source. However, there is some information missing from this citation.

This isn't any old article in the w:European Journal of Nutrition. It says "Suppl"; it's actually from a "supplement sponsored by Rippe Health" (list of accessible COIs in European Journal of Nutrition supplements). Rippe Health is in turn sponsored by producers of sugary foods, among others, like the w:Corn Refiners Association (sic). The editor of the supplement is w:James M. Rippe, the founder and director of Rippe Health. Apparently the editor and the lead author are the same person.

James Rippe's COIs as an author are declared in the paper; there is no declaration of his COIs as an editor or as the director of the supplement funder that I can find. I can't find information about the funding sources or COIs of the European Journal of Nutrition, or its editorial staff (although the latter might be available via academic homepages, articles published by them, etc.).

To take another example, the American Journal of Clinical Nutrition (and the Journal of Nutrition) are run by the w:American Society for Nutrition. The ASN has received some criticism for industry funding; see w:American Society for Nutrition#Corporate relationship concerns and the list at w:Talk:American Society for Nutrition#Funding. The COIs of the editorial board of AJCN are declared here and summarized here.

These examples are expanded from w:Wikipedia talk:WikiProject Medicine#Sponsored supplement?.

The European Food Safety Authority runs a journal; see this project's influence on the use of this journal at Wikidata talk:WikiProject Source MetaData/Archive 3#Importing all articles from the EFSA journal. See w:European Food Safety Authority#Criticism for third-party statements about the agency's COIs. I looked at the journal's website for COI information and found this page, which appears to state that the database is empty and that the EFSA will tell you the COIs of its editors on request.

If most of the more standard COIs could be tracked automatically, and missing information flagged, it would make scholarly communications much more transparent. Wikimedia is a high-value target for shilling and misinformation, and finding truly independent sources can be difficult and time-consuming for editors. I think a pop-up COI details flag on references, for instance, would be great.

We have a start with Crossref funder ID. Does anyone have suggestions for other properties or approaches that would be useful? HLHJ (talk) 18:40, 26 March 2018 (UTC)

It seems as if supplement (Q2915731) with the properties sponsor (P859) and editor (P98) would be best. I'm not sure how one would indicate the relationship of a supplement to the journal that it is a supplement of, though. HLHJ (talk) 01:38, 7 April 2018 (UTC)
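(A sketch of what those statements could look like if entered via QuickStatements, generated here from Python. Every Q-number except the supplement class is a placeholder, and the part of (P361) link to the journal is only a guess at the open modeling question above, not established practice.)

    # Hypothetical QuickStatements for a journal-supplement item.
    # Q2915731 = supplement (the class); every other Q-number below is a placeholder,
    # and P361 (part of) is only one possible supplement-to-journal link.
    supplement_statements = "\n".join([
        "CREATE",
        'LAST\tLen\t"Example journal supplement"',
        "LAST\tP31\tQ2915731",   # instance of: supplement
        "LAST\tP361\tQ0",        # part of: <the journal>  (placeholder QID; modeling uncertain)
        "LAST\tP859\tQ0",        # sponsor: <the supplement's funder>  (placeholder QID)
        "LAST\tP98\tQ0",         # editor: <the supplement's editor>  (placeholder QID)
    ])
    print(supplement_statements)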
Tried putting it in Template:Bibliographic properties. Please let me know if I've messed it up. HLHJ (talk) 01:49, 7 April 2018 (UTC)
Example made at Sugars, obesity, and cardiovascular disease: results from recent randomized control trials (Q56479527). Note that the editor, the supplement publication sponsor, and the lead author are the same person. Note the sponsorship, too. This paper is not an independent source, but it is currently still cited as a MEDRS in Wikipedia. (some text copied from self on WP:MED talk) HLHJ (talk) 18:45, 5 September 2018 (UTC)
It should have been at the pre-existing Sugars, obesity, and cardiovascular disease: results from recent randomized control trials (Q37521442), as Daniel Mietchen pointed out. I've merged them and resolved all the duplicate data except the supplement (which actually has a name, Supplement on Sugar Consumption Controversy (Q56479539)). The supplement has both separate funding and a separate editor from the European Journal of Nutrition, and according to DGG would have been mailed to individual subscribers, but not to libraries (he categorizes it as a non-peer-reviewed ad). If the supplement does not have its own item, I'm not sure how to tag an individual article with an editor, especially as it would presumably clash with the editor of the EJN. Tagging an individual article with a sponsor makes sense, and I've done a related example at Dietary Fats, Carbohydrates and Atherosclerotic Vascular Disease (Q40050232), but here the sponsor sponsored the article, not the publication, which was apparently in ignorance of the sponsorship.
All the other papers from this supplement are already in Wikidata, too, although WhatamIdoing has recently removed them from en:Sugar:
What new properties do you think might be needed? I wrote a summary of some of the issues we might want to document at en:Conflicts of interest in academic publishing. Tags for journals' pledges to follow widely-recognised codes of conduct might be useful; the most recent version of the most common of those is Good publication practice in physiology 2017: Current Revisions of the Recommendations for the Conduct, Reporting, Editing and Publication of Scholarly Work in Medical Journals. (Q50061640). It seems to me that we need a way to tag papers with honorary author (Q42889533) and ghost author (Q43155099) (when documented), and a way of listing the declared or reliably reported institutional COIs of the journal.
I've added the consulting listed by the original paper's lead author in the COI declaration to his record as "employers". The same could be done for journal staff, and peer reviewers in the case of open peer review. I've probably got some of this wrong, please let me know what. HLHJ (talk) 19:35, 9 September 2018 (UTC)

Books, editions, volumes, and exemplars

Wikidata:WikiProject Books allows different items to be created for a book (Q571) (ie the underlying work); a version, edition or translation (Q3331189); a volume (Q1238720); or an individual copy of a book (Q53731850).

Presumably, the citation template should allow any of these to be cited, with the user free to specify a particular edition or volume or copy either by choosing a particular Q-number (which may or may not exist), or by specifying that parameter explicitly.

However, items for the more specific levels will not necessarily re-specify all the bibliographic information, if it is the same as that for the parent level -- eg an item for a copy would not usually repeat author/publisher/publication-date information specified for an edition.

Are the citation templates able (or will they be able) to supply the missing fields by tracing back up the hierarchy? Jheald (talk) 22:43, 18 May 2018 (UTC)

@Jheald: The problem is to define at which level each piece of data has to be stored. Currently the data model does not provide a clear overview of the different levels and of the related properties. Once this classification is done, it will be easy to develop a program to retrieve information from the correct level. I started to develop a table with everything in it for books (see Wikidata:WikiProject Books/Book data model), but recent discussions in the project didn't convince me to continue. Snipre (talk) 21:10, 15 August 2018 (UTC)
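(For what it's worth, the fall-back that Jheald describes can already be expressed as a query. A minimal sketch, assuming edition or translation of (P629) as the edition-to-work link per the Books data model, and using a placeholder edition QID:)

    import requests

    WDQS = "https://query.wikidata.org/sparql"

    # For an edition item, take author and publication date from the edition itself
    # when present, otherwise fall back to the parent work reached via P629.
    QUERY = """
    SELECT ?authorLabel ?date WHERE {
      VALUES ?edition { wd:Q12345678 }          # placeholder edition QID
      OPTIONAL { ?edition wdt:P629 ?work . }    # edition or translation of
      OPTIONAL { ?edition wdt:P50  ?eAuthor . }
      OPTIONAL { ?work    wdt:P50  ?wAuthor . }
      OPTIONAL { ?edition wdt:P577 ?eDate . }
      OPTIONAL { ?work    wdt:P577 ?wDate . }
      BIND(COALESCE(?eAuthor, ?wAuthor) AS ?author)
      BIND(COALESCE(?eDate,   ?wDate)   AS ?date)
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """

    r = requests.get(WDQS, params={"query": QUERY, "format": "json"})
    for row in r.json()["results"]["bindings"]:
        print(row.get("authorLabel", {}).get("value"), row.get("date", {}).get("value"))

A citation template or module could apply the same COALESCE-style fallback field by field once the data model settles which level each property belongs to.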

Multiple versions of the same statement

An item can contain multiple versions of essentially the same statement, if different information has been contributed in qualifiers like object named as (P1932) -- see eg the authorship at The history and description of the county of Salop (Q29572671) for an example.

Can the citation templates condense multiple occurrences, if they have the same value? Jheald (talk) 20:06, 19 May 2018 (UTC)

Identifying duplicate items without duplicate IDs

A while ago, for a separate project, I put up all articles from the Royal Society Biographical Memoirs and associated titles, with a tracking page at User:Andrew Gray/Royal Society biographies. I looked it over recently and realised that it's proving to be a useful way of seeing how much duplicate uploading we have going on. I've merged a few but left others up for now as a demonstration, eg Franz Bergel. 13 February 1900-1 January 1987 (Q52399202) & Franz Bergel. 13 February 1900-1 January 1987 (Q47480577)

A couple of things are worth noting:

  • This is likely to be more common for older papers in a pre-universal-DOI era - in most cases here they're being imported from Pubmed, which doesn't always have DOIs for older papers, and a DOI-based import won't have Pubmed IDs, so there are no overlapping identifiers.
  • The slight metadata differences may make purely automated matching a bit challenging - note different title punctuation, different author string punctuation (sometimes full name in one and initials in another), different publication dates (the Pubmed imports seem to be inferring 1/1/xx for year-only data?), different approaches to counting pagination, and sometimes discrepancies with issue numbers (presumably because this journal didn't systematically use issue numbering). All of these are "obviously the same" to a human reader but might cause difficulties for a script.

None of this is a massive problem, of course (I estimate ~10 duplicate records for that title last month, and that was high), but it's something to be aware of and I thought I'd flag it up. I don't know what, if anything, is currently being done to catch and merge duplicate uploads. Andrew Gray (talk) 08:42, 3 October 2018 (UTC)
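(As a rough illustration of how such candidates might be surfaced automatically: the sketch below, run from Python against the public query service, groups the articles of Biographical Memoirs of Fellows of the Royal Society (Q4914871) by a crudely normalised title and lists any title that maps to more than one item. The normalisation is deliberately naive and would still miss the truncated-title cases mentioned earlier.)

    import requests

    WDQS = "https://query.wikidata.org/sparql"

    # Group the articles of one container by a crudely normalised title and report
    # any normalised title shared by more than one item.
    QUERY = """
    SELECT ?key (GROUP_CONCAT(STR(?article); SEPARATOR=" ") AS ?items) (COUNT(?article) AS ?n)
    WHERE {
      ?article wdt:P1433 wd:Q4914871 ;   # published in: Biographical Memoirs of Fellows of the Royal Society
               wdt:P1476 ?title .
      BIND(REPLACE(LCASE(STR(?title)), "[^a-z0-9]", "") AS ?key)
    }
    GROUP BY ?key
    HAVING (COUNT(?article) > 1)
    """

    r = requests.get(WDQS, params={"query": QUERY, "format": "json"},
                     headers={"User-Agent": "duplicate-check-sketch/0.1"})
    for row in r.json()["results"]["bindings"]:
        print(row["n"]["value"], row["key"]["value"][:60], row["items"]["value"])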

Where do I see the record of Source M.D.?

I ran a batch, but see no evidence that it worked. I cannot find anything about the ISBNs that I listed. I cannot find the books that it might have created based on those ISBNs. I see nothing that indicates the effort produced anything. Where should I look? If the "batch created" is relevant, it was 20181010142459. Thank you. -Trilotat (talk) 19:42, 10 October 2018 (UTC)

Done. Thank you. -Trilotat (talk) 14:35, 16 November 2018 (UTC)

Author name strings

I'm new here and very interested in the project, looking forward to using the references database for Wikipedia. Just a technical question (hope this is the right place): Why did you decide to have a simple author name string instead of providing separate fields for surname and given name? When reusing the data (e.g., for Wikipedia), we really need to know which is which. This is an issue if the surname is composed of more than one word (e.g., "K. van Bibber" – any automatic tool would list this incorrectly as "Bibber, K. van" when it is in fact "van Bibber, K."). And this can lead to great confusion with Chinese names, where given name and surname are reversed relative to what we are used to in Western societies (except in American sources, which force them into the Western order). I fear that this issue can make the data quite useless, as these cases are actually quite common; how are they handled? --Jens Lallensack (talk) 14:40, 16 November 2018 (UTC)

A similar idea came to my mind: for authors who have an item, the string used to credit the author should not be a plain statement but a qualifier of the "author" statement:
 author: (the author item)
     credited as : K. van Bibber
@Jens Lallensack: As you point out, the notion of name parts is very cultural. Therefore we should not base any information system on the assumption that any name can be split into a first and last name. See for instance DBLP: Some Lessons Learned, which explains why you should avoid parsing names into subfields in a bibliographic information system. − Pintoch (talk) 18:40, 16 November 2018 (UTC)
Thanks for the answers. That is rather disappointing though. Why not provide optional fields for first and last name, in addition to the author name string? Thinking about it, the lack of this data is a big issue. The citation format "last name, first name" is the standard both in academia and on at least the English Wikipedia (I am active in both). Also, in many cases you want to cite only the initials of the first names (on the English Wikipedia, we often go for initials because we usually do not know the full first names of a number of authors, and we do not want a mixture of initials and full names). None of this will be possible with Wikidata if first names are not separated from last names. With this issue, I fear that source metadata from Wikidata will never be widely used by either academia or Wikipedia. --Jens Lallensack (talk) 22:36, 16 November 2018 (UTC)
@Jens Lallensack: family name (P734) and given name (P735) on the item for the author may give what you are looking for. Jheald (talk) 09:10, 17 November 2018 (UTC)
Thanks, that looks a bit better, but it will not work in the many cases where authors published under different names: we always need to cite the name as it was presented in the source, per convention and for reasons of retrievability. If, for example, an author variously published works both with and without a middle initial, we have to cite the name exactly as it was presented in the respective source, even if we end up with several different variants of the name of a single person in the data. If the author changed his/her name (e.g., on marriage), we again need to cite the name under which he/she published the respective source. Consequently, the "family name" given in the item of the author is not necessarily what we need to cite. For those reasons, we need to include the name as published within the item for the source itself, because this data is source-specific. I see no way around it; but please let me know if I am mistaken, I'm eager to learn. --Jens Lallensack (talk) 10:21, 17 November 2018 (UTC)
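(The source-specific form can in fact already be stored on the article item itself: author (P50) statements can carry an object named as (P1932) qualifier with the name exactly as printed, and author name string (P2093) covers authors without an item. Below is a sketch of reading an author list back in credited order; the article used is just the example item from the "Multiple versions of the same statement" thread above, and the fallback to the English label is an assumption about what a consumer might want.)

    import requests

    WDQS = "https://query.wikidata.org/sparql"

    # Authors of one article in series-ordinal order, preferring the name exactly as
    # credited in the source (P1932) over the author item's English label, and merging
    # in plain author-name-string (P2093) values for authors without an item.
    QUERY = """
    SELECT ?ord ?name WHERE {
      VALUES ?article { wd:Q29572671 }        # example item only
      {
        ?article p:P50 ?st .
        ?st ps:P50 ?person .
        OPTIONAL { ?st pq:P1545 ?ord . }
        OPTIONAL { ?st pq:P1932 ?credited . }
        OPTIONAL { ?person rdfs:label ?label . FILTER(LANG(?label) = "en") }
        BIND(COALESCE(?credited, ?label) AS ?name)
      } UNION {
        ?article p:P2093 ?st .
        ?st ps:P2093 ?name .
        OPTIONAL { ?st pq:P1545 ?ord . }
      }
    }
    ORDER BY xsd:integer(?ord)
    """

    r = requests.get(WDQS, params={"query": QUERY, "format": "json"})
    for row in r.json()["results"]["bindings"]:
        print(row.get("ord", {}).get("value", "?"), row.get("name", {}).get("value", ""))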
One question: I would like to read (and possibly join) the discussions regarding the design of the data models of the different publication types, however I was unable to find them. Does anybody know where these discussions are taking place or whom to talk to? Thanks, --Jens Lallensack (talk) 09:16, 18 November 2018 (UTC)
@Jens Lallensack: these discussions mostly happen through debates around the creation of new properties. On each property you should be able to find a link to the page where it was discussed, indicated with property proposal discussion (P3254). For instance, for author name string (P2093), the discussion is at Wikidata:Property_proposal/Archive/39#P2093. − Pintoch (talk) 09:22, 18 November 2018 (UTC)

Questions about "scholarly articles"

I am the first to admit that I don't understand how batches of scholarly articles are selected to be added to Wikidata, but I am puzzled by some items I discovered yesterday. How do we end up with these kinds of items?

  • A "scholarly article" that is a book review of a 1973 book that does not have a Wikidata item, and whose author does not have a Wikidata item. See Joan of Arc (Q58606956). (I added the book, but I wonder how valuable these items are.)
  • A "scholarly article" that is chapter 22 of a book which did not have a Wikidata item (thus no "published in" statement). See Margery Kempe (Q58236291). (I added the book.)
  • A "scholarly article" that is a two-line death notice of a general practitioner who does not meet our notability criteria. (See Robert (“Bob”) Tennant. (Q55527198) and the obituary; another at Margaret Wilson (née Fyfe). (Q46255903).)
  • Most concerning, three items with <instance of> "scholarly article" and <title> "Algebra (English)". These are in fact online editions of multi-chapter algebra textbooks in German, two of which have useful and disambiguating subtitles. (Algebra (Q55869482), Algebra (Q56627314), and Algebra (Q56637998).) (I improved these, but I have not created "work" items to associate with these editions.)

Is this just an inevitable side effect of loading massive amounts of citation data? Are there process improvements that we could make to avoid these? - PKM (talk) 22:12, 15 November 2018 (UTC)

I think we should stop creating new scholarly articles by batches until there is consensus (and resources) to import established corpora with a meaningful scope and a reasonable metadata quality threshold. − Pintoch (talk) 22:29, 15 November 2018 (UTC)
 Support I also stumble upon these massive data imports, which have never been checked manually. Wikidata should not be used as a data dump. -- JakobVoss (talk) 09:10, 16 November 2018 (UTC)
Speaking in a personal capacity here (as opinions differ wildly in the community on this topic), I'd very much welcome a proposal ensuring that every large-scale data import is linked to documentation including: a specific statement of purpose, a clear impact story (why are we doing this, who's benefiting from this data), expected data quality/maintenance costs or issues, and a well-defined projection on the scope/completeness of the import (in terms of # of items and statements to be created). I have been expressing concerns in the past about the ingestion of sparse, non-random datasets that are not representative of any well-defined catalog. Inferences based on these datasets can be flawed unless there is a notion of completeness or scope built into them. Documenting these imports in an accessible way, and explaining the process behind them, would go a long way in providing visibility and a shared understanding of their purpose. This would be much more useful than a binary decision/RfC-style recommendation as to whether a specific dataset should be allowed to exist or not in Wikidata.--DarTar (talk) 19:47, 4 December 2018 (UTC)

Notifying the project as this is quite important. − Pintoch (talk) 10:48, 16 November 2018 (UTC)

I do not see any problem here. I would like to see as many articles as possible in Wikidata. From a WikiCite perspective, one major concern about Wikidata is its lack of comprehensiveness: it does not contain every paper and book. Scholars who visit Scholia would be disappointed to see that only 10% of their publications are there. If we want to create automated bibliographies on Wikipedia from Wikidata information, then the data should be there. I experience a minor issue when using Magnus Manske's sourcemd: it assumes that every DOI is a scholarly article. That is not the case. I particularly see errors in connection with Springer's book series, where the individual books, primarily conference proceedings, are miscategorized as scholarly articles instead of editions. That is, however, something I can live with. Perhaps we should focus on building a tool that will handle the Springer books and chapters. — Finn Årup Nielsen (fnielsen) (talk) 11:27, 16 November 2018 (UTC)
@Fnielsen: "I would like to see as many articles as possible in Wikidata" - well, that is a problem I think, because at the moment Wikidata really cannot handle importing these millions of DOIs. This is putting a significant strain on the servers and degrading the service. As far as I can tell there is no consensus for these batch imports, so they should stop. "One major concern about Wikidata is its lack of comprehensiveness: it does not contain every paper and book. Scholars who visit Scholia would be disappointed to see that only 10% of their publications are there" - that is not a major concern of Wikidata, it is a major concern of WikiCite. So I see two solutions:
  • Either we import the entire Crossref database (with the appropriate filters to get rid of pathological cases like the ones above) - but good luck with convincing the community and WMDE that this is something Wikidata can be used for - with the current hardware resources this seems unmanageable to me;
  • Or ad-hoc batch imports should stop and Scholia should not advertise Wikidata as a place where you can expect to find all your publications.
In any case it really does not make sense to keep adding batches of publications without a well defined scope, as far as I can tell.
Pintoch (talk) 11:46, 16 November 2018 (UTC)
@Pintoch: Thanks for bringing up these issues. I hope that you raise more issues and encourage others to raise more issues. There are lots of tough questions here which do not have easy answers. If you want direct answers then please ask shorter, single questions in their own sections.
This project, "WikiProject Source Metadata", has participants who upload lots of citations and also write the model for citations. There is a community which might be larger and more organized at meta:WikiCite which is actually seeking to address the challenge of sorting all the academic publications. Although there is overlap in the communities, the composition of the membership of these groups, their goals, and their editing strategies are different. Consider talking to both. The WikiCite community on meta has been organized enough to present 3 conferences, so seems to have some ability beyond this community discussion board.
You asked why those sources are in Wikidata, despite being short passages, single book chapters, or other odd publications. Those passages have an identifier like a doi. They are also part of a strategic subset that someone selected.
You asked about strain on the servers and service. The WikiCite community is treating this with urgency. Comment at Wikidata:WikiCite/Roadmap for how to fix this. Blue Rasberry (talk) 15:21, 16 November 2018 (UTC)
To me the problem is not the import of large numbers of bibliographic records, but the automatic import of large numbers of bibliographic records without intellectual quality control. Every time I selected a "strategic subset" and imported the data, I had to manually go through the list of imported items and correct ugly artifacts such as those noted above. -- JakobVoss (talk)
@Bluerasberry: "They are also part of a strategic subset that someone selected." Is there a page somewhere that lists these strategic subsets, and shows that more than one person finds them strategic? Do people file requests for bot tasks where these scopes are discussed? As far as I can tell, people just import their own publications, those of their friends and colleagues, or those of anyone who uses the #icanhazwikidata hashtag on Twitter… Is that the strategy?
I am quite active in both the Wikidata and WikiCite communities, and I am aware of the roadmap. It is great that this discussion is taking place. Sadly, I personally do not have anything to propose to scale WikiCite at the moment. What I find problematic is that while this discussion is taking place, random publication imports keep happening. There are many other ways to contribute to WikiCite: import journals, institutions, publishers, conferences, open access policies, notable researchers… why don't we focus on that instead while we figure out a solution for publications? Let's try to find more Donna Stricklands instead of importing our own publications, distinctions and awards! "The WikiCite community on meta has been organized enough to present 3 conferences, so seems to have some ability beyond this community discussion board." The discussion can happen elsewhere, but it must happen on Wikidata too as long as Wikidata is used to host the data. The fact that conferences are organized does not waive anything, I think. − Pintoch (talk) 18:31, 16 November 2018 (UTC)
@Bluerasberry: I can't find any discussions at meta:WikiCite about what should or should not be imported, or how. Are these discussions happening on the email discussion list? All I found was the statement "WikiProject Source MetaData is the place on Wikidata where coordination of these efforts happens." So this is where I posted. Lest anyone be confused, I am 100% supportive of the WikiCite effort. I just want quality entries that can be used as references and linked to their subjects, authors, and the books/journals in which they appear. If there are best practices for modeling scholarly articles, I'd like to see them and participate in improving them. I'll take your advice and post separate questions about specific cases. - PKM (talk) 22:04, 16 November 2018 (UTC)
Let me emphasize too that I love this project and I would be very, very happy if we could import the entire Crossref. But we need a viable plan for that and in the meantime we should not try to import everything we can get away with while the admins aren't looking. − Pintoch (talk) 09:03, 17 November 2018 (UTC)
"This is putting a significant strain on the servers and degrading the service." - Really? Do you have a citation for that? Or a statement from the dev team? Last I heard, they were quite sanguine about the volume of content being added. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 19:11, 20 November 2018 (UTC)
@Pigsonthewing: Well, I am only a user so I only have a partial view on this, but here are the few clues I have:
  • The main resource that needs to be shared is the editing throughput (number of edits made each minute), and Wikicite already takes up a fair share of that. We have seen various incidents recently with the dispatch lag going up, the query service going out of sync, the Wikidata API returning errors and other issues like that.
  • WMDE has had to hack custom code into the search profile to penalize scholarly articles in search results, so that they don't crowd the results for users who look for other items. That is not a good sign.
  • My understanding is that the size of Wikicite in the query service could also potentially slow down unrelated queries (which would therefore time out more frequently), but I don't know Blazegraph enough to be sure of that. Dario (WMF) wrote at Wikidata:WikiCite/Roadmap#Growing_pains that the rapid ingestion of content is taking a toll on the querying infrastructure, causing frequent timeouts.
If I am just making up these concerns then I would be very happy to be told so. In that case I will gladly file a bot request to import all journal articles from Crossref. − Pintoch (talk) 09:58, 21 November 2018 (UTC)
  • Completely agree with @Pintoch:. Dumping DOIs indiscriminately and inciting unnotable authors (like myself) to self-aggrandizement through #icanhazwikidata does not increase the value of Wikidata. It just increases the demoability of Scholia on isolated cases. But it creates usability problems and in time will lead to stricter Notability enforcement, imho.
    • don't assume that everything's roses at CrossRef; just look at their metadata completeness reports. I have done so as part of Tracking of Research Results (Q56259739), and their resolved authors (ORCID) and affiliations (GRID) are in single-digit percentages. So if we dump all journal articles from CrossRef (50M or so), who's going to clean them and resolve them?
    • Articles should not be dumped if the authors are not resolved. Magnus clearly states in the "author name string" proposal (https://www.wikidata.org/wiki/Wikidata:Property_proposal/Archive/39) that it's a stop-gap measure, only to be used for important/reference articles. Don't use it as an excuse to dump obituaries and other junk (after the important physician from the obituary is created, then the obituary could be created as a reference item, not before! Or just use the DOI link as a reference).

--Vladimir Alexiev (talk) 23:53, 3 December 2018 (UTC)

About the demoability of Scholia - I don't even think importing our own publications makes for better demos. Who would demo Wikidata by first showing an item about themselves? Does Multichill demo the wikiproject Sum of all paintings by showing the Wikidata items about his own works of art? − Pintoch (talk) 00:19, 7 December 2018 (UTC)

Please note that we have some ambiguity regarding items for so-called proceedings. If you scroll through these links you can see that all these documents from the Proceedings of the International Astronomical Union were imported as scholarly articles, but here you can read that "The IAU does not require that manuscripts of individual papers for the Proceedings have to be refereed. Editors are free to do so if they wish, e.g., with the help of their SOC as referees, as long as this does not delay the publication of their Proceedings".--Alexmar983 (talk) 00:58, 10 January 2019 (UTC)

We seem agreed that a really catholic library database has some serious advantages, but that its sheer size is a serious issue. This seems like a problem in need of a technofix.
WikiCite has been discussing some possible solutions at Wikidata:WikiCite/Roadmap, including splitting off a sister project. The discussion opens "The growth of WikiCite beyond its original scope has been causing a number of scaling issues, both technical and social, that need to be addressed". I also raised this issue in May 2015; Magnus Manske suggested a separate project, and Snipre argued it was too early for that.
I've heard, uncontradicted by my limited experience, that Postgres deals better with large datasets. Does anyone more knowledgeable have suggestions?
Crossref also has the odd accuracy problem, and proofreading bibliographic databases is tedious. I'd feel better about it if others also benefitted from my work. For this reason, I'd say integration with bibliographic software used by scholars is valuable. This rather pushes towards a broader database. I also have a use case that rather requires including poor-quality non-RS sources; flagging shill academic articles that are actually camouflaged ads with conflict-of-interest metadata (motivation: lots of these ads are, appallingly, cited as reliable, independent sources in WP). HLHJ (talk) 06:33, 12 January 2019 (UTC)
HLHJ, if I had ever been to a WikiCite conference, I would have said from the beginning that a separate platform was better suited to the long-term issue than Wikidata. But since things have continued in that direction, and considering its versatility, I have no big problem with the general framework nowadays. Basically, I only have to teach one platform, which is handy. In any case, I am here whatever the future roadmap is...--Alexmar983 (talk) 12:01, 13 January 2019 (UTC)

Cleanup subpages

The list of subpages is an unusable mess. See Wikidata:Requests_for_deletions#Bulk_deletion_request_of_outdated_WikiCite_Listeria_pages and help cleaning up outdated content. -- JakobVoss (talk) 09:10, 16 November 2018 (UTC)

The pages have been deleted, but the current subpages still contain a lot of material of unclear value. If pages were created in 2016 and have been dormant since, or the tasks listed there have been done, we should archive and summarize their content. -- JakobVoss (talk) 06:17, 18 November 2018 (UTC)
+1. Coming into this project four years in, it's hard to tell from the many subpages what has actually been completed and what would be helpful to work on. - PKM (talk) 23:27, 19 November 2018 (UTC)

Short death notices

Robert (“Bob”) Tennant. (Q55527198) and Margaret Wilson (née Fyfe). (Q46255903) are examples of two-line death notices of medical practitioners who do not meet our notability criteria. Should these items be flagged for deletion (and more importantly, if they are deleted, will some other process likely add them back in because their PubMed IDs are "missing" from Wikidata?) Is there value to including these items? - PKM (talk) 22:11, 16 November 2018 (UTC)

@PKM: I think the best place to start is "measure the value" rather than "is there a value". I hope that to start you will grant a value greater than 0, because we have a person, an obituary in a reliable source, and multiple facts about the person.
There are three notability criteria at Wikidata:Notability. #1 is linking to other wiki articles, like a Wikipedia article. This person does not have a Wikipedia article, so that is a fail. The other two criteria, "clearly identifiable" and "fulfills structural need", seem like passes to me. Fewer than 1% of physicians get an obituary in a medical journal. To me that makes these two seem like plausible candidates for being of high importance in their field. These Wikidata entries have some value now.
To get maximal value out of this in the long term we need Wikidata items for both the subject of the obituary and the obituary itself. The item for the person is much more valuable if we can fill out properties including year of birth and death, place of residence, occupation, and institutional affiliates for education and work. Since they were the subjects of obituaries, maybe they accomplished something significant in their lives, and maybe not, but at minimum with just the content in the obituary we get useful insights.
For example, if anyone queried the count of physicians who were prestigious enough to enter the Wikidata media record and who were practicing in the 1950s, then we can get insights into the ratios of how many globally at that time were female, what ethnicities get recognition, what fields of medicine were most represented, what locations put their physicians of that era into the media record, and what hospitals / schools have ties to historical personages. We can establish the permanent public global record of humanity here, and it probably is the case that some hospitals have records of physicians in decades past and some hospitals seemingly left no media footprint.
A near future plan for Wikidata is to query for a university or hospital and profile them to exhaustion for whatever everyone associated with them did, the demographic breakdown of whomever got media recognition representing them, and the demographic breakdown for whomever benefited from their output.
How would you measure the cost versus benefit of this? How would you feel about collecting every obituary in every academic journal? Blue Rasberry (talk) 15:24, 18 November 2018 (UTC)
@Bluerasberry:. Okay, you've convinced me that these items have potential value. As you say, unlocking that potential value is dependent on someone or some process teasing the biographical data out of these notices to create items for the physicians. - PKM (talk) 21:08, 18 November 2018 (UTC)
@Bluerasberry: You have not convinced me, sorry. If someone is actually interested in tracking those people, they'll create them first, before creating their obits. They'll go to https://www.bmj.com/content/333/7557/48.4, pay 30 EUR, and parse sentences like "Former general practitioner Highland area (b 1928; q Aberdeen 1951; DCH, DPH)" to create a person and some facts from his life. Now tell me do you truly believe this will happen within the next 20 years, and why does WD need such dead weight (no pun intended). It's easy to dump DOIs, it's much harder to do something useful with the data. --Vladimir Alexiev (talk) 00:06, 4 December 2018 (UTC)

A definition of "intelligence" I once had to learn was "intelligence is what the intelligence test measures". A scholarly article is what is published in scholarly publications. Now, some of these publications have little merit and some have a lot of merit; I am not a scientist and I am not here to judge. When I work on them, I link articles to authors or authors to publications, and generate both thanks to what ORCID knows of people's publications. In this way a fine web is woven. The problem described here is that people assume that individual articles or authors are assessed. They are not. I just worked on a chemistry award, and for those awardees with an ORCID identifier I submitted a job to add publications and co-authors. Literally hundreds of edits are made as a consequence. They are known good thanks to ORCID. They do and could include publications Wikipedia could use to prove its points, but they prevent one publication from being exclusively claimed when they are not. Thanks, GerardM (talk) 12:03, 9 December 2018 (UTC)

This is a Wiki and this Wiki will host all the citations of all Wikipedias

To all of you who talk about notability: one objective is to include all references of all Wikipedias. That is in itself a project that is going on, with a database separate from Wikidata, and it is being ingested in phases. All the issues raised above apply just as much to this subset, but being a subset, you will not see the wood for the trees. Leaving it as a subset means we will not see all papers on a subject, and it will make authors with retractions not show up. We will be left with information like a stamp collection.

The process of cleaning up data is largely based on available information from ORCiD. We import many, many authors and their publications from there. This process is involved and it does link publications to authors directly but only for authors who have a public record. As a consequence the Scholia information becomes comprehensive and the authors gain their notability through their work. When you look at properly processed Scholia information, you get a lot of information including co-authors, subjects, where people published, date lines and citations.

There is no such information available to us elsewhere. It is important to have this.

What I always find funny when people complain about problems is their lack of Wiki perspective. This is a Wiki, and it is allowed to be incomplete and not always correct. The point is that we acknowledge that Wikidata is a work in progress. We can all note that a lot of effort goes into the development of citation data, and there is a vision of why it makes sense to have it. To put it bluntly: thanks to all this work, scientists who are open about their work gain notability, they will be more likely to be cited in Wikipedia, and it will help us to find a neutral point of view, as we will know the literature on a subject, any subject. Thanks, GerardM (talk) 06:11, 4 December 2018 (UTC)

Beyond "Thank you all" I do not understand your sketch. --Succu (talk) 22:48, 4 December 2018 (UTC)
@GerardM: Your comment is contradictory: how can you think that WPs will use WD data if you claim "it is allowed to be incomplete and not always correct"? WP won't use WD until WD can prove that its data are correct and well maintained, so instead of trying to import ever more data, it would be better to stop data imports and curate the existing data, even if the dataset is quite small. And instead of speaking about WD's future, perhaps we can discuss the real use of WD data by the Wikipedias: the status is the same among the main Wikipedias - WD data is considered unreliable, due to bad data imports, a lack of protection against vandalism, and not fulfilling the local Wikipedias' rules - so no massive use of WD data is currently made. Just some examples: all RfCs on WP:en finished with no agreement to use WD, infoboxes using WD are regularly replaced on WP:fr,... Snipre (talk) 12:48, 5 December 2018 (UTC)
You're incorrect to assume data won't ever be used by any major Wikipedia (as you also did recently on PC), and I don't really understand where such strong statements are going. The problem with your approach is that you don't explain where you will find the manpower to curate data if … Wikipedians are not involved. Few contribute directly to Wikidata, and free datasets won't fall out of the sky. author  TomT0m / talk page 13:27, 5 December 2018 (UTC)
@TomT0m: Please explain to me where my reasoning is wrong, when the WPs have clearly mentioned the unreliability of WD as one of the major drawbacks to using it as a data source, while at the same time some WD contributors say that having wrong data in WD is not a problem.
The claim of GerardM is typically an argument for the opponents of WD use in WP, so we just prove right those who are looking for WD weaknesses. As long as we adopt this kind of strategy, how do you expect to convince WP to use WD?
Please read the results of the last RfC and explain why the major Lua infoboxes using WD data on WP:fr can be systematically replaced by infoboxes using local data, and not the other way around.
If you want to have Wikipedians curating WD data, you need to hear what their demands concerning data quality are and act accordingly. Please comment on the following sentence from the closing comment at the end of the RfC - "...if Wikipedia wants to use data from Wikidata, there needs to be clear assurances on the reliability of this data" - versus the claim of GerardM. Where am I wrong in saying that GerardM's position is one of the major problems to solve if we really want to provide what the WPs are requesting? Snipre (talk) 16:24, 5 December 2018 (UTC)
@Snipre: Your statements are far too strong; for example, frwiki uses Wikidata in infoboxes and for citing work items in bibliographies (fr:Modèle:Bibliographie has about five hundred inclusions, which is not bad considering that you have to find an item ID for the work, which is not user-friendly). But my point is: why would data need to be curated if you only allow imports from reliable databases? If the alternative is either "Wikidata is perfect and the Wikipedias use it" or "Wikidata is not perfect and Wikidata is not used", we are simply going nowhere. There is a middle point, and we are already somewhere in between. But I don't think your radical position helps to find it. author  TomT0m / talk page 16:40, 5 December 2018 (UTC)
@TomT0m: Please provide me a link to an RfC or any other community decision on WP:fr allowing WD data to be used unconditionally. Use of WD data is tolerated, not completely accepted. And your example is a very good one: can you provide me the link to the discussion which decided to use WD data in the infobox Modèle:Bibliographie? That discussion is mandatory according to the last RfC about the use of WD data. Your example is the correct description of WD use in WP: use which is restricted by some strong constraints, limited use, or use limited to special topics.
And please read again what I said: I never said we have to have perfect data, but we can't accept errors. We have to work correctly from the beginning and put the correct tools in place to avoid errors. This doesn't mean errors can't occur. A change of mindset has to happen: the time when everything could be imported is over; data has to be checked in some way before import, tools to curate data have to be developed, and people have to build quality into their contributions. I saw nothing corresponding to that in the claim of GerardM - I read the opposite instead. Snipre (talk) 10:45, 6 December 2018 (UTC)
@Snipre: As far as I know, quality has not been that much of a concern in the RfCs of frwiki, and the restrictions would be exactly the same if Wikidata had stricter policies. And I'm not talking about how things are supposed to be in theory, but how they are in practice. (Besides, data quality is not the concern with the "bibliographie" template; it's a usability one: if it's not transcluded in a template, it implies using a QID in the plain wikitext of the page, which is the main concern of the frwiki community.) author  TomT0m / talk page 11:08, 6 December 2018 (UTC)

Wikipedia is a Wiki and as such incomplete and not necessarily always correct. The notion that Wikidata will be useful only when it is complete and perfect is just an opinion. When Wikidata includes all citations of all Wikipedias, it will provide a substantially superior service, and not using it, or not considering its use, will be just silly. This is NOT about all the issues Wikipedians come up with; this is about sources and citations, so infoboxes are not a consideration. Thanks, GerardM (talk) 14:15, 5 December 2018 (UTC)

@GerardM: Sorry, but have you spent some time on the other WPs? When someone can show that WP quality is higher than WD quality, how can you expect them to use WD data? The WPs are working harder than us to provide better quality; they can integrate data directly from reference sources using bots, so the large WPs don't need WD.
If you want to sell a product, you have to be sure that your product corresponds to a demand. What is the main demand of the WPs? Just read that summary to see what Wikipedians are looking for. Snipre (talk) 16:24, 5 December 2018 (UTC)
So you want me to acknowledge the opinions some Wikipedians have about Wikidata.. Even though in this context it is not relevant.. Fine.
First, like Commons, Wikidata has a symbiotic relationship with Wikipedia. Like Commons, Wikidata is not only about Wikipedia; for both, Wikipedia represents a subset of what they have to offer. When Wikipedians, through their lack of appreciation of what a wiki is, reject Wikidata, they do not understand how Wikidata can help with their disambiguation. Particularly in lists there is an error rate of over 4%, and such issues could be found with tooling that has been suggested for years now.
Second, the Wikipedias, all of them, are a subset of what Wikidata has to offer. A substantial number of red links in a Wikipedia represent a lot of information. It is not offered to Wikipedia readers, and imho Wikipedia thereby does a disservice to the motto "the sum of all knowledge". Arguably this still reflects Wikipedia, but what happens with "these papers and authors" is that they are imported from ORCID; as a consequence, a "Scholia" is built up for authors and papers. I am building up Scholias particularly for scientists on Twitter and scientists in the news; my aim is to build awareness that the Scholia information, information that is free, reflects the merit of a scientist and the authors they cooperate with.
Third, bravo to the quality drives of Wikipedians. But we grieve the cost they incur, because it comes at the expense of cooperation and collaboration. When I read what English Wikipedia has to say in what it calls the "2018 state of affairs", I find little connection to what I perceive as the potential of Wikidata, and I see the blinders Wikipedians willingly wear.
For as long as the staff of Wikimedia consider Wikidata secondary to Wikipedia (the same is true for Commons, by the way), and Wikipedians have this inflated sense of the importance of Wikipedia, I do not care to "sell" Wikidata as a product. For me, Wikipedia, Commons and Wikidata are not products, and I am not going to sell them as such. Thanks, GerardM (talk) 06:09, 6 December 2018 (UTC)
PS @Snipre: Where is their self-reflection, and yours? Do they/you not understand that this attitude is parasitic? Thanks, GerardM (talk) 07:10, 6 December 2018 (UTC)
@GerardM: Sorry, but do you read what you write? "Wikidata has a symbiotic relation with Wikipedia." A symbiosis means both participants agree to work together, so please show me WP:en's agreement to use WD as an open source of data.
You can say what you want about the mentioned RfC and dismiss it as the opinions of some Wikipedians, but RfCs are currently the standard way of expressing the WP community's opinion, so please respect that process.
And if you don't understand that the WPs were able to work for years without WD, and that a majority of Wikipedians are ready to continue working like that because they don't see the advantages of WD, then I think we can close the discussion. Snipre (talk) 10:57, 6 December 2018 (UTC)
<grin> Do you know what service Wikidata provides to Wikipedia? </grin> Apparently not. From the start of Wikidata, all interwiki links have been organised in Wikidata; it provides a superior service to Wikipedia. As to my appreciation, tell me WHY I am wrong, not that I have to respect something I keep away from for good reasons. No, symbiosis does not mean agreement; it is how things effectively are. Thanks, GerardM (talk) 11:24, 6 December 2018 (UTC)

Plan S open metadata feedback request

The Plan S initiative aims to make academic articles open access, and their metadata CC0, as a condition of funding. The initiative is requesting feedback on these questions:

  1. Is there anything unclear or are there any issues that have not been addressed by the [Plan S] guidance document?
  2. Are there other mechanisms or requirements funders should consider to foster full and immediate Open Access of research outputs?

It seems to me as though people here may have useful answers. The Plan S draft requires specific metadata to be machine-readable and CC0-licensed; the precise nature of the data required is under discussion. Feedback is open until the 8th of February. There is an attempt to develop a wiki consensus statement for submission on the seventh. I would be surprised if none of you have thoughts to contribute.

The plan launched in September and has a large proportion of European research funders and a couple of US ones onside; if you are affiliated with a research funder, they might want to look into it. The best comment on Plan S I've heard so far comes from Elsevier (which doesn't really like the financial transparency provisions, for starters). An Elsevier spokesman said "If you think that information should be free of charge, go to Wikipedia" ("Als je vindt dat informatie gratis moet zijn: ga naar Wikipedia"). I'm not sure if he knew about the journals published here.

Could someone please also pass this along to WikiCite, through their off-wiki channels, as appropriate? HLHJ (talk) 02:19, 28 January 2019 (UTC)

Rapid grant to enhance the ProveIt gadget

I just requested a rapid grant to enhance the ProveIt gadget, a popular reference manager for Wikipedia. As people interested in reference technology, I thought you may be interested in leaving a question, comment, idea or endorsement. Thanks! Felipe (talk) 20:26, 4 February 2019 (UTC)

Employer start/end date problem - ORCID ingest

@Trilotat: @JesseW:

Hi all,

In recent weeks, I have worked on a large import of data to add employer (P108) statements where they are missing from existing items with an ORCID iD (P496). It has come to my attention that a problem has resulted in two or more start dates being added to some of these items. This was almost certainly caused by an oversight on my part.

This happened when adding batches via QuickStatements where, for example, the first and third employer are the same but with different start and end times. I think QuickStatements is behaving as expected: it creates a statement for employer 1, then a statement for employer 2, but for employer 3 the base statement already exists (it has the same item, property and value as employer 1), so the qualifiers are added to that existing statement instead of a new claim being created for employer 3. All of the data are correct but the structure is wrong. In this scenario, is there a way to make QuickStatements create a new claim for employer 3?
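To illustrate, a QuickStatements V1 batch of the shape described above might look like this (the item and employer IDs are made up, and the tab-separated fields are shown here with spaces):

  Q100000001   P108   Q200000001   P580   +2001-01-01T00:00:00Z/11   P582   +2005-12-31T00:00:00Z/11
  Q100000001   P108   Q200000002   P580   +2006-01-01T00:00:00Z/11   P582   +2009-12-31T00:00:00Z/11
  Q100000001   P108   Q200000001   P580   +2010-01-01T00:00:00Z/11

Because the third line has the same item, property and value as the first, QuickStatements matches the existing P108 → Q200000001 claim and attaches the extra start-time qualifier to it, rather than creating a second claim for the return to that employer. As far as I know, V1 commands offer no way to force a duplicate-value claim; the OpenRefine route mentioned further down, which takes qualifiers into account when matching, is one way around this.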

Unfortunately, duplicate dates can be found on approximately a quarter of the employer statements added during this piece of work - some 62,500 items in total - see this query. I can and will fix this, but I want to ask whether anyone with a bot account can help out, as removing the qualifiers containing the superfluous dates through the API is probably the most efficient solution here.
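For anyone considering a bot job, a minimal sketch of the clean-up using pywikibot might look like the following. The item ID is a placeholder, and which of the duplicate dates to keep is a judgement call; this simply keeps the first start and end time on each claim and drops the rest:

  import pywikibot

  site = pywikibot.Site("wikidata", "wikidata")
  repo = site.data_repository()

  def strip_extra_dates(qid):
      # Remove all but the first start time (P580) and end time (P582)
      # qualifier from each employer (P108) claim on the given item.
      item = pywikibot.ItemPage(repo, qid)
      item.get()
      for claim in item.claims.get("P108", []):
          for prop in ("P580", "P582"):
              for qualifier in claim.qualifiers.get(prop, [])[1:]:
                  claim.removeQualifier(qualifier)

  # Item IDs would come from the SPARQL query linked above; this one is a placeholder.
  strip_extra_dates("Q100000001")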

Going forward, we need to consider how we create and maintain good quality data for authors of the academic publications we are ingesting. I identified almost 500,000 items with an ORCID ID and no employer or affiliation. Many of these items are sparse in terms of statements. But, I estimate, we can easily retrieve good quality data from ORCID for around 50% of these items. Can we develop a tool to deal with the ingest at the point when an item is harvested from ORCID?

I am interested to hear your thoughts on this and I apologise for any inconveniences caused by this error. Simon Cobb (User:Sic19 ; talk page) 22:38, 11 February 2019 (UTC)

@Sic19: I think that about 50% of ORCID profiles are blank apart from a name and ORCID iD. Among the rest, the content can be sparse, duplicative, and incorrect, though there are also lots of good profiles with correct information.
You are not at fault because you are ingesting the best available data that the world has to offer, and it is messy for now.
Can you show what is wrong with having two start times for the same employer? If someone leaves and comes back, then there should be two start times, right? Blue Rasberry (talk) 22:51, 11 February 2019 (UTC)
Yes, very much so, but without a separate claim for each spell of employment it becomes difficult to determine which start and end dates are paired, and consequently the correct order of the employers. For example, Helen Freeman (Q60023412) started at the University of Leeds (Q503424) on 1 January 2013 and simultaneously held two positions from 1 April 2014 until 1 February 2016. I'm not sure it is possible to write a query that will produce the correct employment history from data like this - machine readability is probably a good test of whether the structure is acceptable. Simon Cobb (User:Sic19 ; talk page) 01:17, 12 February 2019 (UTC)
This discussion might be relevant (it links to a workaround for what I think is a similar problem). - PKM (talk) 07:12, 12 February 2019 (UTC)
Yes, it is the same problem - I'll take a closer look at the workaround soon. Simon Cobb (User:Sic19 ; talk page) 00:04, 14 February 2019 (UTC)

@Sic19: Very cool! What tool are you using to do this? It would be nice if we could get an ORCID-to-Wikidata tool that includes all the data up and running at some point. Mvolz (talk) 09:10, 12 February 2019 (UTC)

I use this query to find items that have an ORCID but no employer, then retrieve employment data from the ORCID API at https://pub.orcid.org/v3.0_rc1/[ORCID ID]/employments. I use either Excel or OpenRefine to retrieve and structure the data, and OpenRefine for reconciliation. Ping me if you need any help. Simon Cobb (User:Sic19 ; talk page) 00:04, 14 February 2019 (UTC)
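For what it's worth, the retrieval step described above can also be scripted. A rough sketch with Python's requests library follows; the ORCID iD used is ORCID's own sample record, and the JSON structure returned should be inspected against the live API before mapping anything to P108:

  import json
  import requests

  def fetch_employments(orcid_id):
      # Public ORCID API, employments section, requested as JSON.
      url = "https://pub.orcid.org/v3.0_rc1/{}/employments".format(orcid_id)
      response = requests.get(url, headers={"Accept": "application/json"}, timeout=30)
      response.raise_for_status()
      return response.json()

  # 0000-0002-1825-0097 is ORCID's sample record (Josiah Carberry), used here as a placeholder.
  print(json.dumps(fetch_employments("0000-0002-1825-0097"), indent=2))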
@Sic19: Have you considered using EditGroups to undo the faulty import batches? Also, uploading your edits directly via OpenRefine could potentially help here, as it behaves differently to QuickStatements with regard to statement matching (it takes qualifiers into account). See Wikidata:Tools/OpenRefine/Editing/Tutorials/Working_with_APIs for an example workflow to import employers. − Pintoch (talk) 10:14, 13 February 2019 (UTC)
Unfortunately, the affected edits are not consecutive, nor confined to a small number of batches. If the data were completely wrong I would certainly revert entire batches, but there are no problems with 150,000+ items in these batches, and even the employers with extra dates are correct - it's just the qualifiers that contain errors. Uploading from OpenRefine is a good suggestion. I tend not to do this for large batches because I want to keep using OpenRefine for preparing data, but I could do the uploading while I'm out at work or sleeping. A general problem I've encountered during this work is not being able to ingest data quickly enough. Simon Cobb (User:Sic19 ; talk page) 00:04, 14 February 2019 (UTC)

Update: This will be resolved soon. There are now only 1,800 remaining statements with duplicate dates, which I am in the process of removing, and I will be adding the employer claims again with a single value for the start and end time qualifiers. I will also continue to work on the sparse items created from ORCID and, particularly, I intend to focus on employment and education history if data are available from the API. If you spot any problems or have concerns about this work please send me a message. Simon Cobb (User:Sic19 ; talk page) 21:59, 19 February 2019 (UTC)

@Vladimir Alexiev: No, as a minimum I import the current/latest employment from ORCID. Ideally, I would import all employment records, but there are several difficulties. First, reconciliation is not always possible - many ORCID records list employment at faculty/department/school level, and we don't have that level of granularity for every institution at the moment. Similarly, many non-academic organisations appear in employment histories but do not have a Wikidata item - I have not attempted to create these items. Second, I do not attempt to import every career change at a single institution because of the complexity of the subclasses of academic (Q3400985), which vary according to country. Instead, I try to add a single claim for the total period of employment; if there is a significant gap, I add two claims to reflect this. Here is an example that I hope highlights the problem: http://orcid.org/0000-0002-0089-6930 - Carlos Costa (Q58709272) has worked at the University of Aveiro (Q29671) continuously since 1990-09-21 (allowing for the 8-week break in 1997), but that is not immediately clear from the ORCID data. Obviously this is an edge case, but data structured like this are common in the ORCID employment records. Third, time, my technical skills and computational power are all constraints on this work.
I see this work more as an initial attempt to improve the sparse items created by semi-automated processes rather than a complete data import. We are going to need a process to check these items for updates and this work will need to become much more automated. We also have something like 100,000 items which do not have employment records available from ORCID.
Concerning education, I have data for approximately 150,000 items that lack this claim and will start working on the reconciliation/ingest soon. Simon Cobb (User:Sic19 ; talk page) 20:15, 24 March 2019 (UTC)
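Below is a minimal sketch of the kind of merging described earlier in this thread: collapsing consecutive ORCID employment records at the same organisation into one spell, splitting only when the gap exceeds some threshold. The input shape, the one-year threshold and the dates are illustrative only, not what the actual import used:

  from datetime import date, timedelta

  GAP = timedelta(days=365)  # treat breaks shorter than this as continuous employment

  def merge_spells(records):
      # records: list of (organisation, start_date, end_date or None for ongoing),
      # assumed sorted by organisation and then start date.
      spells = []
      for org, start, end in records:
          last = spells[-1] if spells else None
          if last and last[0] == org and last[2] is not None and start - last[2] <= GAP:
              spells[-1] = (org, last[1], end)  # extend the previous spell
          else:
              spells.append((org, start, end))
      return spells

  print(merge_spells([
      ("University of Aveiro", date(1990, 9, 21), date(1997, 6, 30)),
      ("University of Aveiro", date(1997, 8, 25), None),  # short break, same employer
  ]))
  # -> one continuous spell from 1990-09-21, still ongoing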