Wikidata:Requests for permissions/Bot/SmartifyBot
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved--Ymblanter (talk) 17:11, 8 November 2021 (UTC)[reply]
SmartifyBot
SmartifyBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: Rob Lowe - Smartify (talk • contribs • logs)
Task/s: I am collaborating with the Yale Center for British Art to upload all their public domain artworks (not just paintings) to Wikidata and Commons. There are approximately 42,000 works to load.
Code: Uses pywikibot and is based on artdatabot.py and wikidata_uploader.py by Multichill (talk • contribs • logs). The code is rather specific to YCBA art at present, but I will attempt to generalise and release it.
Function details: See code mentioned above. But in more detail the code:
- Creates a new Wikidata item for the artwork ...
- Adds claims for:
- instance of
- image
- inception
- location
- title
- creator
- made from material
- collection
- inventory number
- width
- height
- depth
- copyright license
- described at URL
- Creates a linked Wikimedia Commons {{Artwork}} item
All the Commons data is derived from the Wikidata item except for the medium. YCBA often have very lengthy medium descriptions that are not easily expressed in Wikidata, e.g. Aquatint and etching on medium, slightly textured, cream wove paper. The full text is used in Commons while a subset of the terms is used for the Wikidata item.
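The split between the full Commons medium text and the subset of terms used on Wikidata could be sketched roughly as follows. This is only an illustration, not the bot's actual code: the `KNOWN_TERMS` vocabulary, its placeholder values, and the `split_medium` helper are all hypothetical.

```python
# Hypothetical vocabulary of medium terms that can be expressed on Wikidata.
# The values are placeholder labels, NOT real Wikidata item IDs.
KNOWN_TERMS = {
    "aquatint": "ITEM_FOR_AQUATINT",
    "etching": "ITEM_FOR_ETCHING",
    "wove paper": "ITEM_FOR_WOVE_PAPER",
}


def split_medium(medium_text):
    """Return (full text for Commons, recognised terms for Wikidata)."""
    lowered = medium_text.lower()
    # Keep only the vocabulary terms that occur somewhere in the description.
    terms = [term for term in KNOWN_TERMS if term in lowered]
    return medium_text, terms


full_text, wikidata_terms = split_medium(
    "Aquatint and etching on medium, slightly textured, cream wove paper"
)
print(full_text)       # goes to the Commons {{Artwork}} medium field unchanged
print(wikidata_terms)  # each term would become one made from material (P186) claim
```

Simple substring matching like this would miss qualifiers such as "slightly textured" and "cream", which is consistent with only a subset of the description reaching Wikidata.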
Although I had requested and received a bot flag for Commons I had not done so for Wikidata (sincere apologies for that), but I had already uploaded approx. 6000 items before that was noticed!
An example of an upload may best demonstrate what has been done, here in Wikidata:
and here in Commons:
More information about the project and in particular the use of CC0 for the copyright licence can be found on the request page for the SmartifyBot in Commons.
--Rob Lowe - Smartify (talk) 16:59, 18 May 2021 (UTC)[reply]
- The example you provided has a constraint violation. Optimally a bot would not be creating such violations. Also, I would be interested in knowing how you are preventing the bot from making duplicate items (say if we already have an item for this piece of art). You are using the wrong item for JPEG (JPEG (Q2195) instead of JPEG File Interchange Format (JFIF) (Q26329975)) in Brighthelmstone, England (Q106862449). And why do multiple items you created have the same value for Commons compatible image available at URL (P4765)? See Brighthelmstone, England (Q106859734) and Brighthelmstone, England (Q106862449). They also have the same inventory number, so should they be merged? BrokenSegue (talk) 12:31, 19 May 2021 (UTC)[reply]
- Hi BrokenSegue, thanks for your comments. I’ll address the simpler things first:
- - I’ve changed the code to add the necessary qualifiers to the P973 statements. In my defence, except for the Mona Lisa and a few other famous works it’s hard to find a use of described at URL (P973) that doesn’t have a warning triangle by it.
- - I’ve changed the code to make use of JPEG File Interchange Format (JFIF) (Q26329975) rather than JPEG (Q2195) to describe the JPEG files. But in any event these statements are transitory. The bot runs in two phases, it adds the artwork with Commons compatible image available at URL (P4765) and later loads the image to Commons, removes the P4765 and creates the image (P18) link to Commons.
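The two phases described above can be pictured with a minimal sketch. Plain dictionaries stand in for Wikidata items here, and the function names, the example URL, and the file name are hypothetical; the real bot would make these changes through pywikibot rather than on a dict.

```python
def phase_one(item, image_url):
    """Phase 1: record where a Commons-compatible image can be fetched."""
    updated = dict(item)
    updated["P4765"] = image_url  # Commons compatible image available at URL
    return updated


def phase_two(item, commons_filename):
    """Phase 2: after the upload to Commons, swap P4765 for image (P18)."""
    updated = dict(item)
    updated.pop("P4765", None)  # the transitory statement is removed
    updated["P18"] = commons_filename
    return updated


# Hypothetical artwork and file names, for illustration only.
artwork = {"label": "Example artwork"}
artwork = phase_one(artwork, "https://example.org/image.jpg")
artwork = phase_two(artwork, "Example artwork.jpg")
print(artwork)
```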
- - The far more serious issue is the one of duplicates. I thought I had taken a lot of steps to avoid duplicates, but it seems there is a problem. It’s worthwhile commenting on some history. In 2012 Google Arts and Culture added about 5000 images to Commons. Wikidata items were added for about 2100 of them in 2016. A few more Commons items have been added since then. To make sure I didn’t add duplicates to Commons I identified all the Yale works in both Wikidata and Commons and edited at least 150 manually to add accession number details so they could be identified reliably.
- The Smartify bot starts by using a SPARQL query to discover the existing works and avoid them. As a newcomer to Wikidata bots I wonder if I have made a rookie error in operating the bot. After adding a bunch of Wikidata items, how quickly are they visible to a subsequent SPARQL query? I think it is likely that I’ve stopped the bot and restarted it and the recently added records have not been discovered by the query and so duplicates have been produced. I’ll definitely tighten this up.
- Unfortunately about 400 duplicates have been produced. What is the best way to get them removed? Provide a list? A SPARQL query that lists them should also be possible. Rob Lowe - Smartify (talk) 13:46, 21 May 2021 (UTC)[reply]
- Duplicate items should be merged. You can do this manually or with a bot. I don't know of a tool that takes a list and does it automatically though. BrokenSegue (talk) 17:26, 21 May 2021 (UTC)[reply]
- I can understand a merge might be appropriate where the items have come from two different sources and you want to capture unique information from both. But here the duplicate has exactly the same author and information, created twice in error. Is deletion not better in this instance? I suppose you could overwrite the item in its entirety with information about a new artwork - the history would look a bit peculiar though. Rob Lowe - Smartify (talk) 17:28, 23 May 2021 (UTC)[reply]
- You can request bulk deletion at WD:RFD but it's probably better just to merge them. We simply don't know if any external entities picked up the Wikidata item ids, no matter how short their existence, and a merge allows them to be corrected. Bovlb (talk) 23:45, 25 May 2021 (UTC)[reply]
- @Bovlb: I've requested the deletion of 297 items, because I believe that's the correct way to proceed in this instance; my reasoning is on the WD:RFD page. But the request doesn't seem to have been actioned, nor has it been categorically rejected. So I'm not sure what to do... Rob Lowe - Smartify (talk) 18:00, 2 June 2021 (UTC)[reply]
- @Rob Lowe - Smartify: I replied again there to explain in more detail why I recommend merging and oppose deletion in this case. Cheers, Bovlb (talk) 19:52, 2 June 2021 (UTC)[reply]
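Whichever route is taken (merge or deletion), the duplicate pairs first have to be listed. A minimal sketch of one way to do that is to group item IDs by inventory number and keep the groups with more than one member. The `find_duplicates` helper and the inventory numbers are hypothetical placeholders; the rows are assumed to come from a query over collection and inventory number (P217).

```python
from collections import defaultdict


def find_duplicates(rows):
    """Group (qid, inventory_number) rows and return groups with >1 item."""
    by_inventory = defaultdict(list)
    for qid, inventory_number in rows:
        by_inventory[inventory_number].append(qid)
    # Sort each group by numeric ID so the oldest item comes first,
    # which is the usual choice of merge target.
    return {
        inv: sorted(qids, key=lambda q: int(q[1:]))
        for inv, qids in by_inventory.items()
        if len(qids) > 1
    }


# "INV-1" and "INV-2" are placeholder inventory numbers, not real YCBA values.
rows = [
    ("Q106862449", "INV-1"),
    ("Q106859734", "INV-1"),
    ("Q108665176", "INV-2"),
]
print(find_duplicates(rows))
```

Each resulting group gives a merge target (first entry) and the items to merge into it.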
- As I got tagged on this one: Can you please publish your code? Multichill (talk) 19:10, 21 May 2021 (UTC)[reply]
- I will sort something out. Rob Lowe - Smartify (talk) 17:28, 23 May 2021 (UTC)[reply]
- Hi @Multichill: sorry for the delay, other projects intervened. The source code is here. It hooks into the Smartify database to get the artworks - I haven't provided that code, but I hope it is fairly clear what is going on. It makes use of a modified version of your artdatabot.py code. I've extended it in a few places which I've marked with my initials, RML. The changes are to:
- - allow other types of artwork, not just paintings and multiple instance of (P31) statements, so something can be an etching but also a print
- - allow dates of the form 'after 1856'. I will probably need to add more date handling in due course.
- - allow multiple made from material (P186) statements, not just oil on canvas
- - allow multiple described at URL (P973) statements
- With regard to the duplicates mentioned above, you can see in smartifybot.py that it gets a list of all the existing works in Wikidata using a sparql query and avoids them; except it seems, not reliably. I was uploading works in batches of a few 100 at a time, and all I can think is that the sparql query is not returning recently added items so if I immediately started a second batch the existing works were not detected and a duplicate inserted. Could you advise on this, is it a possible scenario? Rob Lowe - Smartify (talk) 18:00, 2 June 2021 (UTC)[reply]
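As an illustration of the 'after 1856' date handling mentioned in the list of changes, a parser along the following lines would separate the qualifier from the year. This is a sketch under assumptions, not the code in the modified artdatabot.py: the accepted qualifiers and the shape of the returned dictionary are invented for the example.

```python
import re

# Accepts a bare four-digit year, optionally preceded by a qualifier word.
DATE_PATTERN = re.compile(r"^(?:(after|before|circa)\s+)?(\d{4})$", re.IGNORECASE)


def parse_date(text):
    """Return {'qualifier': ..., 'year': ...} or None if the form is unknown."""
    match = DATE_PATTERN.match(text.strip())
    if match is None:
        return None  # unrecognised forms are left for manual handling
    qualifier, year = match.groups()
    return {
        "qualifier": qualifier.lower() if qualifier else None,
        "year": int(year),
    }


print(parse_date("after 1856"))
print(parse_date("1856"))
print(parse_date("probably 1850s"))
```

Returning `None` for anything unrecognised keeps the bot from guessing at dates it cannot express, which leaves room for the further date handling anticipated above.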
- Rather than using P973, dedicated properties would be preferable. If they don't exist yet, they can be proposed at Wikidata:Property_proposal/Authority_control. --- Jura 08:06, 30 May 2021 (UTC)[reply]
- Jura makes a good point here. It deserves a response. Bovlb (talk) 21:34, 23 June 2021 (UTC)[reply]
- @Ymblanter: This request seems to be languishing. The OP has responded to almost all of the points raised. What do we need here to move forward? Bovlb (talk) 21:34, 23 June 2021 (UTC)[reply]
- I would like all the points in this discussion at least to be addressed (ideally, agreed upon, but this is sometimes impossible). The operator has not edited since 2 June.--Ymblanter (talk) 05:58, 24 June 2021 (UTC)[reply]
- Hi @Ymblanter:. I was rather hoping for a response to the code I posted and a question I asked before proceeding too much further. The question was this. After adding items to Wikidata using pywikibot how quickly are they accessible to a sparql query? Is there an appreciable delay? If there is that would explain the duplicates that have been created. I was adding batches of say 500 at a time. Starting a 2nd batch soon after the 1st may not have registered all the items added by the 1st and produced the duplicates.
- With regard to the described at URL (P973) statements I can amend the bot to use dedicated properties. The link to the artwork in Smartify is easy enough. The Yale one is more complicated. There already is Yale Center for British Art artwork ID (P4738) but as far as I'm aware links of this form have never been the correct way to access Yale artwork web pages (although they currently work with a redirect). But Yale are in the process of putting a new CMS and web site live (in the next month) and I was told in a recent meeting with Yale that P4738 links definitely won't work after that. The old links that will be preserved by the new site are of this form https://collections.britishart.yale.edu/catalog/tms:34. Do I add a new property for this? What happens to the old property and all the Wikidata pages that use it?
- @Rob Lowe - Smartify: Lag from Wikidata to the query service varies a lot, but is usually under an hour (see Grafana). In particular, high-rate editing across many items tends to drive it up.
- Regarding Yale Center for British Art artwork ID (P4738), be aware the identifier is more enduring than the specific URLs. We have a property formatter URL (P1630) that turns an identifier into a URL and we can easily change that, providing the underlying identifier scheme remains consistent. Do you know if that is the case here? CC @Jura1 Bovlb (talk) 20:16, 25 June 2021 (UTC)[reply]
- @Bovlb: Ok, a lag in the query service, even by a minute or two, would explain the duplicates the bot has created. I’ll change the bot to keep a record in the Smartify database of works linked to Wikidata items, so duplicates can be avoided.
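The idea of keeping a local record alongside the query results can be sketched like this. The function names and the in-memory dictionary are hypothetical stand-ins for the Smartify database; the point is only that the set of known works is the union of what the query service reports and what the bot itself has already created.

```python
def works_to_upload(candidates, sparql_known, local_registry):
    """Return inventory numbers that are in neither the query results nor the local record."""
    known = set(sparql_known) | set(local_registry)
    return [inv for inv in candidates if inv not in known]


def record_created(local_registry, inventory_number, qid):
    """Remember a newly created item immediately, without waiting for the query service."""
    local_registry[inventory_number] = qid


# Placeholder inventory numbers and item ID, for illustration only.
registry = {}
batch = ["INV-1", "INV-2"]

# First run: nothing is known yet, so "INV-1" is created and recorded.
assert works_to_upload(batch, sparql_known=[], local_registry=registry) == batch
record_created(registry, "INV-1", "Q_PLACEHOLDER")

# Second run started straight away: the query service still lags and returns
# nothing, but the local record prevents "INV-1" from being created again.
print(works_to_upload(batch, sparql_known=[], local_registry=registry))
```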
- With regard to use of Yale Center for British Art artwork ID (P4738) I have spoken further with Yale. Existing links of the form http://collections.britishart.yale.edu/vufind/Record/1667701 will still function for old works but are deprecated. New works should definitely use this form https://collections.britishart.yale.edu/catalog/tms:34. Is it possible to carry on using the existing P4738 and add additional formatter URL (P1630) and format as a regular expression (P1793) statements to it, setting the new statements to preferred? The description of P1630 seems to imply that you can, but I suspect all the old-style Wikidata artworks would just break and generate something hybrid like https://collections.britishart.yale.edu/catalog/tms:1667701. Rob Lowe - Smartify (talk) 20:01, 29 June 2021 (UTC)[reply]
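The concern about hybrid links follows from how a formatter URL (P1630) works: the stored identifier is substituted for the `$1` placeholder. The sketch below assumes the two formatter strings from the URL forms quoted above; they are not necessarily the values actually stored on P4738.

```python
# Assumed formatter URLs, derived from the two link forms quoted in the discussion.
OLD_FORMATTER = "http://collections.britishart.yale.edu/vufind/Record/$1"
NEW_FORMATTER = "https://collections.britishart.yale.edu/catalog/tms:$1"


def format_url(formatter, identifier):
    """Substitute the identifier for the $1 placeholder."""
    return formatter.replace("$1", identifier)


# An old-scheme identifier with the old formatter gives the deprecated link.
print(format_url(OLD_FORMATTER, "1667701"))

# The same old-scheme identifier with the new formatter gives the hybrid URL,
# which mixes the new path with an identifier from the old scheme.
print(format_url(NEW_FORMATTER, "1667701"))
```

Because the formatter applies to every value of the property, changing it only works when the identifier values themselves stay consistent across schemes.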
- @Rob Lowe - Smartify: You can certainly expect typical lags of at least a few minutes. And, in case my earlier remark was too cryptic, the act of operating a bot will tend to increase lag.
- http://collections.britishart.yale.edu/vufind/Record/1667701 is now a redirect to https://collections.britishart.yale.edu/catalog/tms:34 . It looks like the Yale Center have dropped the previous identifier scheme, and the two schemes are incompatible, but we can (at least for the moment) use the redirects to convert old values to new ones. At this point we could create a new property for the new scheme, but this raises questions: Was the old scheme ever intended to be a persistent identifier scheme? Is the new one? Is the "tms:" part of the identifier? Do the two schemes have names or version numbers? We need to find out more before we rush to create a new property. Bovlb (talk) 22:34, 29 June 2021 (UTC)[reply]
- @Multichill: In the interests of getting this unstuck, do you have any comment to offer on the code? I was about to point out that it's missing a licence (we prefer our bots to be some kind of open source, if at all possible), but I note that the code it is derived from also lacks a licence.
- @Jura1: It would obviously be better if the bot's contributions used an appropriate identifier property instead of described at URL (P973), but it sounds like Yale Center for British Art artwork ID (P4738) is a mess right now. Obviously someone needs to communicate with them, and we appreciate Rob's assistance in that regard, but it seems unfair to make them solely responsible for it as a precondition for bot operation. Bovlb (talk) 15:43, 2 July 2021 (UTC)[reply]
- If Yale Center for British Art artwork ID (P4738) isn't applicable, it shouldn't be used. There seem to be two other ones needed. It's fairly straightforward to propose a new external-id property. It can be created within a week. That's much quicker than it takes the operator to respond. --- Jura 10:01, 3 July 2021 (UTC)[reply]
- @Jura1: Apologies for delay in responding, I've been on holiday with no signal. I've added a proposal for the Smartify external-id. Can I confirm there is no way to add a new pattern to Yale Center for British Art artwork ID (P4738) and that a new external-id is required. Any suggestions as to what should it be called - 'Yale Center for British Art artwork ID - type 2'? Rob Lowe - Smartify (talk) 11:58, 23 July 2021 (UTC)[reply]
- Hi @Jura1: and @Bovlb:, my request for a property for Smartify Ids over a week ago has gleaned just one vote of support. How many are required? Perhaps you could add your support since it is at your request I'm asking for the property. Also should the new Yale ID be named as I suggested above, 'Yale Center for British Art artwork ID - type 2', is that appropriate? Rob Lowe - Smartify (talk) 19:18, 27 July 2021 (UTC)[reply]
- One is sufficient if there are no opposes nor arguments to be addressed. Usually I add "(former scheme)" to the old one. If Yale has names for the identifiers, use these. --- Jura 08:15, 29 July 2021 (UTC)[reply]
- @Jura1: Thanks for your advice, I've added a proposal for Yale Center for British Art artwork Lido ID. Note: YCBA use LIDO to provide details of their works in XML form. Rob Lowe - Smartify (talk) 22:13, 29 July 2021 (UTC)[reply]
- Hi @Ymblanter:, @Jura1: and @Bovlb:. The properties Smartify artwork ID (P9787) and Yale Center for British Art artwork Lido ID (P9789) have now been created. I have modified the bot code to add new artworks using these. In addition I have made changes to use a different JPEG property and to make sure the bot does not produce duplicates. Existing duplicates have now been merged. Is it OK to add a few artworks to test things and gain approval - even without a bot flag? Once I have the flag I will update the existing 6674 items to use the new properties before moving on to add more bulk content. Rob Lowe - Smartify (talk) 20:53, 22 August 2021 (UTC)[reply]
- Yes, sure, please make some test edits without a flag.--Ymblanter (talk) 20:56, 22 August 2021 (UTC)[reply]
- Hi @Ymblanter: I've just tried to create a single artwork and the bot is still blocked from 16th May. Could you remove the block or is there something else I need to do?
- Done--Ymblanter (talk) 19:06, 26 August 2021 (UTC)[reply]
- Hi @Ymblanter: Thanks for enabling the bot. Sorry for delay since, holidays intervened. I've used the bot to upload 5 more works as examples Tree Study (Q108665176), Minster (Q108665505), Study of a Girl (Q108665776), Cupid and Psyche (Q108296716) and Portrait Study of Martha, Lady Hayter (Q108665569). I hope that's enough to allow the bot flag to be allocated. The critical code change though was extra checks to stop duplicates being created. Rob Lowe - Smartify (talk) 15:18, 23 September 2021 (UTC)[reply]
- Given that there is now a property, the format of the references should be adapted as well: (sample change).
- Maybe @Multichill: wants to add something about the P31 value to use/description. --- Jura 22:31, 26 September 2021 (UTC)[reply]
- Hi @Jura1: and @Ymblanter: I can change all the references to use Yale Center for British Art artwork Lido ID (P9789) if you think it essential. But ... nobody else seems to be doing it. Of the 7051 paintings created/modified this September most of them, 6480, are using reference URL (P854), 556 are referencing the general collection, just 15 use a specific id property. Of the 6480 records using P854 most have an id property to use - they just don't bother. Rob Lowe - Smartify (talk) 09:58, 30 September 2021 (UTC)[reply]
- Which bots? --- Jura 08:41, 1 October 2021 (UTC)[reply]
- @Jura1: In clarification, I was looking specifically at references attached to the inventory number (P217), where a painting possesses one, and for records that had been created or modified in the past month, so they may have been created some time ago. They will have been created by a large number of bots and users. The Sparql query I used to look at last month's mods is below, so you can take a look at the bots for yourself. The first 15 records or so use an inventory number reference using a specific property id. The vast majority of the rest use reference URL (P854).
- But consider, even the most famous painting in the world, Mona Lisa (Q12418), has an inventory number reference that uses P854. The Mona Lisa has a huge number of identifiers associated with it, the most significant probably being the Atlas ID (P1212) and Joconde work ID (P347), but nobody seems to have bothered to use those, it's just good 'ole P854. And of course Multichill's code, which has been used to upload a huge number of collections, calls his artdatabot.py which uses ... P854 for all its references.
- Rob Lowe - Smartify (talk) 14:33, 1 October 2021 (UTC)[reply]
# References for paintings created/modified in the last month
SELECT ?artwork ?artworkLabel ?accession ?date ?refnodeLabel ?ref ?refLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?artwork wdt:P31 wd:Q3305213.
  ?artwork schema:dateModified ?date.
  FILTER(((YEAR(?date)) = 2021) && ((MONTH(?date)) = 9))
  ?artwork p:P217 ?statement.
  ?statement ps:P217 ?accession.
  ?statement prov:wasDerivedFrom ?refnode.
  OPTIONAL { ?refnode pr:P854 ?ref. }
  OPTIONAL { ?refnode pr:P856 ?ref. }
  OPTIONAL { ?refnode pr:P195 ?ref. }
  OPTIONAL { ?refnode pr:P248 ?ref. }
  OPTIONAL { ?refnode pr:P143 ?ref. }
}
ORDER BY ASC(?refLabel)
LIMIT 10000
- @Ymblanter: Hi Ymblanter, is there any progress in being granted a bot flag, or is it dependent on this discussion about the use of reference URL (P854). Rob Lowe - Smartify (talk) 09:53, 5 October 2021 (UTC)[reply]
- I am inclined to wait for a couple of days and approve the bot.--Ymblanter (talk) 18:45, 5 October 2021 (UTC)[reply]
- Can we sort this out first? --- Jura 10:59, 10 October 2021 (UTC)[reply]
- The question isn't so much whether there are reference URL properties in the reference section, but whether one bot or another incorrectly adds the "reference URL" property instead of (or in addition to) the dedicated Wikidata property. If you look at the two somewhat random samples at Wikidata:Project_chat#Is_there_a_good_reason_such_edits_are_technically_even_possible?, 9 out of 10 or 11 out of 12 do it correctly. --- Jura 10:59, 10 October 2021 (UTC)[reply]
- @Jura1: The quest for a bot flag has gone on long enough. If you think using Yale Center for British Art artwork Lido ID (P9789) for the references is better/safer than using reference URL (P854) I'll make the change and submit a couple more examples for your perusal. Rob Lowe - Smartify (talk) 16:23, 11 October 2021 (UTC)[reply]
- @Ymblanter: Hi, can I just confirm the SmartifyBot login has not been disabled or blocked again. I'm trying to upload some items to satisfy Jura1's queries. Thanks for your help. Rob Lowe - Smartify (talk) 21:00, 25 October 2021 (UTC)[reply]
- Hi @Ymblanter: and @Jura1: Please ignore connection issues mentioned above. I've uploaded two further works Como (Q109283283) and A Scene on Mount Olympus (Q109284699) that use Yale Center for British Art artwork Lido ID (P9789) rather than reference URL (P854) for the references. I hope that's what you were after. Rob Lowe - Smartify (talk) 13:20, 27 October 2021 (UTC)[reply]
- Hi @Ymblanter: sorry to bother you again. Could we make some progress here and allocate the bot flag, I've made all the changes requested but had no response. Rob Lowe - Smartify (talk) 16:46, 8 November 2021 (UTC)[reply]