Wikidata:Requests for permissions/Bot/William Avery Bot 3
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 10:55, 2 January 2022 (UTC)
William Avery Bot 3
William Avery Bot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: William Avery (talk • contribs • logs)
Task/s:
Developed in response to Wikidata:Bot requests § Import Treccani IDs. Please see that request for discussion and pointers to some test edits.
This task will import IDs relating to Treccani's Enciclopedia on line (Q65921422) and related web properties, working only on items that are instance of (P31) human (Q5).
Code:
- treccaniScraper.py and other code at https://bitbucket.org/WilliamAvery/wikipythonics
Function details:
The bot is fed QIDs of items that are instance of (P31) human (Q5) and that have one of the Treccani IDs but lack another. I source these from SPARQL queries, then store them in a table in a user database so that they can be processed in small batches. I have found that the Treccani website cross-references entries for human beings reliably. That is not the case for countries or cities, whose entries cross-reference preceding or succeeding entities that Wikidata treats as separate items. Humans are also by far the largest category of entities to which this overall requirement applies.
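As an illustration of the selection step, here is a minimal sketch of how such a SPARQL query and the batching might look. The function names `build_query` and `batches` are hypothetical, the in-memory list stands in for the user database table, and the query relies on the prefixes that the Wikidata Query Service predefines; none of this is taken from treccaniScraper.py.

```python
import re


def build_query(has_prop, lacks_prop, limit=100):
    """Build a SPARQL query for humans that have one property but lack another.

    Hypothetical helper: property IDs are checked so that only values
    such as "P3365" can be interpolated into the query text.
    """
    for prop in (has_prop, lacks_prop):
        if not re.fullmatch(r"P\d+", prop):
            raise ValueError(f"not a property ID: {prop!r}")
    return (
        "SELECT ?item WHERE {\n"
        "  ?item wdt:P31 wd:Q5 ;\n"
        f"        wdt:{has_prop} ?id .\n"
        f"  FILTER NOT EXISTS {{ ?item wdt:{lacks_prop} [] }}\n"
        "}\n"
        f"LIMIT {limit}"
    )


def batches(qids, size):
    """Yield successive slices of at most `size` QIDs (stand-in for the DB table)."""
    for start in range(0, len(qids), size):
        yield qids[start:start + size]
```

For example, `build_query("P3365", "P4223")` would select humans that have a Treccani ID (P3365) but no Treccani's Enciclopedia Italiana ID (P4223).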
- For each entity:
- Determine an existing Treccani ID property that the item has: one of Treccani ID (P3365), Treccani's Enciclopedia Italiana ID (P4223), Treccani's Dizionario di Storia ID (P6404), Treccani's Biographical Dictionary of Italian People ID (P1986), Treccani's Dizionario di Filosofia ID (P7993).
- Get the corresponding page from https://www.treccani.it/ and load it into a Beautiful Soup (Q2893296) object
- For each URL in the "Altri risultati per ..." section of the page
- Use a regular expression to determine whether it relates to one of the four other Treccani web properties under consideration.
- If it does:
- Retrieve the page given by the URL and load it into a Beautiful Soup (Q2893296) object
- Extract the relevant ID from the URL
- Find any existing property on the entity that holds this ID, or create a new one if none exists.
- Extract values for qualifiers from the retrieved page:
- publication date (P577) - Not always present.
- author name string (P2093) - Not applied if there are existing values for this qualifier or author (P50) on an existing property. Requires adjustments to replace dash separators with commas. Surnames, in all caps, are converted to title case. Not always present.
- subject named as (P1810) - Title given on the web page.
- volume (P478) - Appendix or volume. Not always present or applicable.
- Update the entity with new or amended property values.
- References [functionality added after comments below]
- A reference will be placed on each property added, with appropriate values of stated in (P248) and retrieved (P813).
- Where there is an existing Treccani ID property on an item that doesn't have such a reference, and this bot verifies that the stated Treccani ID is correct, a reference with those properties will be added.
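The URL classification and author clean-up steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in treccaniScraper.py: the URL shape (`/enciclopedia/<slug>_(<Work>)/`), the suffix-to-property mapping, the idea that the ID is the slug before the suffix, and the names `classify_url` and `normalise_authors` are all assumptions made for the example.

```python
import re

# Assumed mapping from the work named in the URL suffix to a Wikidata property.
WORK_TO_PROPERTY = {
    "enciclopedia-italiana": "P4223",
    "dizionario-di-storia": "P6404",
    "dizionario-biografico": "P1986",
    "dizionario-di-filosofia": "P7993",
}

# Assumed URL shape: /enciclopedia/<slug>/ or /enciclopedia/<slug>_(<Work>)/
ENTRY_URL = re.compile(
    r"^https?://www\.treccani\.it/enciclopedia/"
    r"(?P<slug>[^/()]+?)(?:_\((?P<work>[^()/]+)\))?/?$"
)


def classify_url(url):
    """Return (property_id, treccani_id), or None if the URL is not recognised."""
    match = ENTRY_URL.match(url)
    if match is None:
        return None
    work = match.group("work")
    if work is None:
        # No suffix: treated here as the general Treccani ID.
        return "P3365", match.group("slug")
    prop = WORK_TO_PROPERTY.get(work.lower())
    if prop is None:
        return None  # some other Treccani work, not under consideration
    return prop, match.group("slug")


def normalise_authors(raw):
    """Replace dash separators with commas and title-case all-caps words."""
    def fix_word(word):
        # Leave single letters and mixed-case words untouched.
        return word.title() if len(word) > 1 and word.isupper() else word

    names = re.split(r"\s+-\s+", raw.strip())
    return ", ".join(" ".join(fix_word(w) for w in name.split()) for name in names)
```

Under these assumptions, a cross-reference link ending in `_(Dizionario-Biografico)/` would map to P1986, and an author string such as `Mario ROSSI - Luigi BIANCHI` would become `Mario Rossi, Luigi Bianchi` before being used as an author name string (P2093) qualifier.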
There are further edits made using the script in interactive mode. --William Avery (talk) 15:09, 2 December 2021 (UTC)
- Not 100% sure I follow and haven't read the source. Could we see some examples? BrokenSegue (talk) 19:23, 6 December 2021 (UTC)
- Oh I see there are examples already. Looks good. I'd prefer to see a reference on the added claims (in particular a retrieved date for when you did the scraping) but not 100% necessary. BrokenSegue (talk)
- You are right - there should be references. I have amended my code to add references to new properties that I am adding, and add references to existing Treccani ID properties that I happen to be verifying as part of this process. I will do some more testing, amend the process description, commit my changes to my repository and ping you at some point, if that's OK. Here is a sample diff of what the script is doing now. William Avery (talk) 19:32, 7 December 2021 (UTC)
- @BrokenSegue: I have updated the bot script and added a note about references to the "Function details" above. I made ten test edits with my bot account. Sometimes it just adds a reference to an existing property, as here, other times it finds cross referenced articles on the Treccani website and adds a bunch of properties, as here. William Avery (talk) 14:13, 9 December 2021 (UTC)
- This has been sitting for a while without approval. I'm going to approve this in 48 hours if nobody else comments. BrokenSegue (talk) 18:53, 1 January 2022 (UTC)