User talk:PangolinMexico/Outreachy 2

Hi @PangolinMexico: This looks great! task_two.py fulfils the requirements of Task 2, so you can mark this as completed for Outreachy now!

auto_add_name.py also looks great! I was running a similar script a while back [1]; however, it ran into problems with non-western names. For example, at Tak Kim (Q89869354) you added family name (P734)=Kim (Q157414), but there are also Kim (Q13083721) and Jin (Q718600), each of which is different in the local languages but translates to the same name in English. As a result, it's really difficult to figure out which one of those to add. Can you think of a way to avoid issues like that?

For Task 3, please use The South Pole Telescope (Q55893751); you can find the BibTeX for this article at [2].

(Also pinging @Pigsonthewing: as co-mentor.) Thanks. Mike Peel (talk) 09:52, 5 April 2022 (UTC)

Thank you so much for the feedback! :) I completely agree with you regarding the problems with non-western names... I have a couple of ideas for 'checks' that could be implemented to improve this:
1. If a page is missing only a given name or a family name, look at the name that is available and check its origin via the 'language of work or name' property (Property:P407), then check whether any of the candidate names for the missing part have the same value for this property.
2. If both names are missing, it might be worth finding different places where the author is cited and how they are cited there, either on Wikidata or in academic databases at large... this is obviously a much more complex task.
3. A lot of non-western names had 'suspicious names' or 'unfound names' parts; if these are flagged to the user immediately, that might provide information about what kind of name is being dealt with.
By the way, if it's OK to ask, I have a question about your script:
I noticed you're reading a file called 'populate_family_names_cache.csv', which you call the family names database. It is then used to find a family name from a page's labels (similar to my own script). What is this file? Does it contain all family names available in Wikidata? That would fix my issue with the limited search results! I'm curious what it contains and how to find it.
Thanks again! Looking forward to working on Task 3 ASAP. PangolinMexico (talk) 12:44, 5 April 2022 (UTC)
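Check (1) above could be sketched roughly as follows. This is only a minimal illustration, not code from either script: the function name pick_by_language is hypothetical, the P407 values are assumed to have been fetched already, and the language labels (LANG_A, etc.) are placeholders rather than the real values stored on those items.

```python
from typing import Dict, Optional, Set


def pick_by_language(known_languages: Set[str],
                     candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Return the only candidate that shares a P407 value with the known name.

    known_languages: P407 values of the name that is already on the item.
    candidates: maps each candidate name QID to its own P407 values.
    Returns None when zero or several candidates match, so the caller can skip.
    """
    matches = [qid for qid, langs in candidates.items()
               if langs & known_languages]
    return matches[0] if len(matches) == 1 else None


# Placeholder language values (assumptions for illustration only).
candidates = {
    "Q157414": {"LANG_A"},    # Kim
    "Q13083721": {"LANG_B"},  # Kim
    "Q718600": {"LANG_C"},    # Jin
}

print(pick_by_language({"LANG_B"}, candidates))  # exactly one match
print(pick_by_language({"LANG_D"}, candidates))  # no match, so skip
```

Returning None whenever the match is not unique keeps the check conservative: an ambiguous name is left alone rather than guessed.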
@PangolinMexico: Those sound like good approaches. With bot work like this, there are two ways to work: either completely automatically with no input, in which case you have to be >99.9% sure that the edits are good (e.g., by skipping any where it's not completely clear), or semi-automatically, where the code asks you to double-check before saving; that doesn't have to be as accurate, but running it can take up a lot of your time! Alternatively, you can put the output into a Wikidata game and get others to help resolve the cases. I think (1) and (2) could be done fully automatically; (3) could either skip suspicious/unfound names to work automatically, or ask the user about each of them to run semi-automatically. As for the cache file, it's generated using a Wikidata query; you can see the (Python 2) code for that at [3]. The idea was that it would save repeated queries. Thanks. Mike Peel (talk) 09:45, 11 April 2022 (UTC)
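The automatic versus semi-automatic distinction described above might look something like this in code. It is a sketch under assumptions, not the approach used in either script: choose_name and ask_on_console are hypothetical names, and treating a single candidate as safe to save without confirmation is a simplification made here.

```python
from typing import Callable, List, Optional

Confirm = Callable[[str, List[str]], Optional[str]]


def choose_name(label: str,
                candidates: List[str],
                confirm: Optional[Confirm] = None) -> Optional[str]:
    """Return the QID to save for this label, or None to skip the item.

    With confirm=None the function runs fully automatically and skips
    anything ambiguous. With a confirm callback it runs semi-automatically
    and asks about ambiguous cases.
    """
    if len(candidates) == 1:
        return candidates[0]      # unambiguous (assumed safe in this sketch)
    if not candidates or confirm is None:
        return None               # nothing found, or automatic mode: skip
    choice = confirm(label, candidates)
    # Accept the answer only if it is one of the offered candidates.
    return choice if choice in candidates else None


def ask_on_console(label: str, candidates: List[str]) -> Optional[str]:
    """Hypothetical confirm callback for semi-automatic runs."""
    answer = input(f"{label}: which of {candidates}? (blank to skip) ").strip()
    return answer or None


kims = ["Q157414", "Q13083721", "Q718600"]
print(choose_name("Kim", kims))                           # automatic: skipped
print(choose_name("Kim", kims, confirm=lambda l, c: c[0]))  # callback decides
```

Passing ask_on_console as confirm gives the interactive variant; leaving it out gives the conservative automatic one.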