Wikidata:Requests for permissions/Bot/Josh404Bot 2
The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 05:10, 11 May 2021 (UTC)
Josh404Bot 2
Josh404Bot • Operator: Josh404
Task/s:
Fill in missing TMDb person ID (Property:P4985) statements on items that have an associated IMDb ID (Property:P345), via the TMDb API.
Code:
https://github.com/josh/wikidatabots/blob/acc6d2015e5d4d1a3b16515657bf01d5ad5ad0fb/P4985.py
Function details:
This is a follow-up to Wikidata:Requests_for_permissions/Bot/Josh404Bot_1, which operated on TMDb movie ID (Property:P4947).
- Via SPARQL, find items that have an IMDb ID (Property:P345) but do not have a TMDb person ID (Property:P4985). Only consider those that are valid given P4985's type constraints.
- Use the TMDb API to look up the person ID via the IMDb ID.
- For any matches, add a new statement to the item.
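The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the linked `P4985.py`: the function names are hypothetical, the query leaves out the type-constraint filter, and it assumes the TMDb v3 `find` endpoint with `external_source=imdb_id`, whose response carries a `person_results` list. The output line assumes QuickStatements' tab-separated command format.

```python
from typing import Optional
from urllib.parse import urlencode

# Property IDs taken from the request above.
IMDB_ID = "P345"
TMDB_PERSON_ID = "P4985"


def build_query(limit: int = 100) -> str:
    """SPARQL for items with an IMDb ID but no TMDb person ID.

    Simplification: the type-constraint check is omitted here.
    """
    return f"""
SELECT ?item ?imdb WHERE {{
  ?item wdt:{IMDB_ID} ?imdb .
  FILTER NOT EXISTS {{ ?item wdt:{TMDB_PERSON_ID} [] }}
}}
LIMIT {limit}
""".strip()


def find_url(imdb_id: str, api_key: str) -> str:
    """URL for the lookup by IMDb ID (assumed TMDb v3 `find` endpoint)."""
    params = urlencode({"api_key": api_key, "external_source": "imdb_id"})
    return f"https://api.themoviedb.org/3/find/{imdb_id}?{params}"


def person_id_from_response(payload: dict) -> Optional[int]:
    """Return the TMDb person ID only when exactly one person matched."""
    results = payload.get("person_results", [])
    return results[0]["id"] if len(results) == 1 else None


def quickstatement(qid: str, tmdb_id: int) -> str:
    """One tab-separated QuickStatements line adding P4985 to the item."""
    return f'{qid}\t{TMDB_PERSON_ID}\t"{tmdb_id}"'
```

Keeping the network call out of these helpers means the query text, the response parsing, and the generated command can each be checked without touching the live APIs.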
Let me know if I should publish some test edits for consideration.
Thank you! Hopefully I've drafted this proposal better this time.
--Josh404 (talk) 00:59, 4 May 2021 (UTC)
- Generally looks good. There's a remote chance some item will have two IMDb IDs and your code will look them up and insert twice (depends how you do batching), so I would suggest deduplicating that. BrokenSegue (talk) 03:35, 4 May 2021 (UTC)
- Are you thinking about the race condition of the bot starting, then some other edit coming in before the batching runs? QuickStatements will at least prevent duplicate statements from being added. Josh404 (talk) 03:43, 4 May 2021 (UTC)
- No, it's not a race. Your only de-duplication is the DISTINCT in SPARQL, but you are applying DISTINCT with a random value and the IMDb ID included. Also, multiple IMDb IDs can map to the same TMDb ID (I think). I didn't realize QS prevents duplicate inserts, so my nitpick is withdrawn. Support BrokenSegue (talk) 10:47, 4 May 2021 (UTC)
- Ah, I see. Yeah, I wish there was a better way to do random samples. I'm not entirely sure about this, but if `RAND()` is only evaluated once by the query engine, duplicate QIDs and the constant seed will still MD5-hash to the same value, resulting in the same `random` binding. I think? But yeah, the random sort is a bit of a hack.
- Yeah, it's definitely possible for two items (QIDs) to have the same IMDb ID. A constraint conflict for sure, but they could exist. In that case I would still like to treat both separately. I don't want the bot to make any judgement calls on which is preferred. I think fixing those via constraint report review is the way to go.
- User:BrokenSegue, thanks for your thoughts! I appreciate someone looking at the actual code. Josh404 (talk) 16:01, 4 May 2021 (UTC)
- No, I mean one item can have two IMDb IDs, but those could both map to the same TMDb ID (on top of the problem of one item having two identical IMDb IDs). BrokenSegue (talk) 16:35, 4 May 2021 (UTC)
- For the first case, where an item has multiple IMDb ID statements: as I understand `wdt:`, `wdt:P345` will only match the top preferred statement. Any additional statements will never be considered by this bot's matching algorithm. Is that an incorrect understanding of how the direct property prefix works? Just trying to understand how the current query can return multiple results that share the same item QID.
- On the TMDb side, two distinct IMDb IDs cannot be associated with the same TMDb ID. It's only a single association. Additionally, two distinct TMDb records cannot share an IMDb ID. There's a unique database constraint over `tmdb_id` and `imdb_id`. Josh404 (talk) 16:52, 4 May 2021 (UTC)
- @Josh404: There might not be a single top preferred statement, though. See Vladimir Luxuria (Q258305). Your query will return them twice, I think. BrokenSegue (talk) 16:59, 4 May 2021 (UTC)
- Oh wow! You're totally right! Check this out: https://w.wiki/3H7q
- Just tested the bot on that specific case of Vladimir Luxuria (Q258305); only 119358 was found. But it's definitely possible the other could have matched. That would potentially make a second statement.
- What are your thoughts on the expected behavior, implementation aside? 1) Should the bot potentially submit 2 TMDb ID statements if both matched? They would at least be distinct. 2) Or pick just the first? Seems too arbitrary. 3) Or even bail out if there are multiple associated IDs? Josh404 (talk) 17:15, 4 May 2021 (UTC)
- @Josh404: I would suggest option 1, though I had previously assumed it was possible to have 2 IMDb IDs for one TMDb ID, and so doing 1 right would mean deduplicating. In any case this is all a minor nitpick. Didn't mean to draw this out so long. BrokenSegue (talk) 19:58, 4 May 2021 (UTC)
- :) No worries. Thanks for working through that with me. Josh404 (talk) 20:52, 4 May 2021 (UTC)
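For readers following the thread, here is a small sketch of the two points it settles on. First, `wdt:` returns every best-rank ("truthy") statement: all preferred ones if any exist, otherwise all normal ones, so one item can yield several rows. Second, option 1 with deduplication means collapsing the results to unique (item, TMDb ID) pairs. The function names and sample IDs below are hypothetical, and the lookup is a plain dictionary standing in for the TMDb API.

```python
def truthy_values(statements):
    """Mimic `wdt:` selection over (value, rank) pairs.

    rank is one of "preferred", "normal", "deprecated".
    """
    preferred = [value for value, rank in statements if rank == "preferred"]
    if preferred:
        return preferred
    # No preferred statement: every normal-rank statement is truthy.
    return [value for value, rank in statements if rank == "normal"]


def plan_statements(rows, lookup):
    """Turn (qid, imdb_id) query rows into unique (qid, tmdb_id) pairs.

    lookup maps an IMDb ID to a TMDb ID, or returns None when unmatched.
    """
    planned = set()
    for qid, imdb_id in rows:
        tmdb_id = lookup(imdb_id)
        if tmdb_id is not None:
            planned.add((qid, tmdb_id))
    return sorted(planned)


# Hypothetical data: one item with two normal-rank IMDb IDs.
statements = [("nm0000001", "normal"), ("nm0000002", "normal")]
rows = [("Q4115189", imdb_id) for imdb_id in truthy_values(statements)]
fake_tmdb = {"nm0000001": 10, "nm0000002": 10}
print(plan_statements(rows, fake_tmdb.get))
```

Using a set keyed on the pair keeps two distinct TMDb IDs for one item (option 1) while dropping exact repeats, without relying on QuickStatements to reject them.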
- I will approve the request in a couple of days, provided that no objections are raised. Lymantria (talk) 07:31, 8 May 2021 (UTC)