Wikidata:Requests for permissions/Bot/VIAFbot 4
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved ·addshore· talk to me! 07:37, 21 July 2013 (UTC)
VIAFbot
Operator: Maximilianklein
Task/s: Populate labels of items using information from the "Alternate name forms" section found through following the VIAF ID (P214).
Function details: For each item containing VIAF ID (P214), load the "alternate name forms". For each alternate name, determine the source language, then perform one of two tasks:
- If the alternate name is in a language that already has a label, compare the two labels using the Levenshtein distance (Q496939) technique. If they are sufficiently different (using statistical machine learning), add an 'also known as'.
- If the alternate name is in a language that has no label, create the label for that language.
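The two cases above can be sketched roughly as follows. This is only an illustration, not the bot's actual code: the function names, the tuple-shaped result, and the normalisation used for the ratio (1 minus distance divided by the longer length) are assumptions, and a fixed cutoff stands in for the statistical machine learning step. The 0.3 default is the cutoff the operator reports later in this discussion.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for nothing shared."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def plan_edit(labels: dict, language: str, alternate_name: str, max_ratio: float = 0.3):
    """Decide what to do with one alternate name (hypothetical helper).

    labels maps language codes to the item's existing labels.
    """
    existing = labels.get(language)
    if existing is None:
        # No label in this language yet: the alternate name becomes the label.
        return ("add_label", language, alternate_name)
    if similarity_ratio(existing, alternate_name) < max_ratio:
        # Label exists and the names differ enough: add an 'also known as'.
        return ("add_alias", language, alternate_name)
    return ("skip", language, alternate_name)
```

For example, `plan_edit({"en": "Arthur Rimbaud"}, "he", name)` would propose a new label, because the item has no Hebrew label yet.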
Using just the Library of Congress records, preliminary research suggests that there would be this many success cases per language: "zh": 429507, "en": 5305, "ar": 25755, "he": 90316, "ko": 50496, "ja": 6151, "ru": 4474, "fa": 3635, "el": 3804. (Plus many languages with fewer than 3,000.) We can also use German National Library data to supplement. --Maximilianklein (talk) 17:06, 28 June 2013 (UTC)
- Sounds cool, can you do the test edits? --Zolo (talk) 19:16, 8 July 2013 (UTC)
- Test edits. These include translation differences and name variations. [1], [2], [3], [4], [5]
- Details. For now I'm only using AKAs from the Library of Congress, but can later do so for the Deutsche Nationalbibliothek. I'm only publishing AKAs that have a Levenshtein ratio < 0.3 from the label. I determined 0.3 from a machine learning training program I wrote and trained. Maximilianklein (talk) 20:27, 17 July 2013 (UTC)
- It seems to work fine.
- Apparently, the Library of Congress uses the <surname> <name> order, at least in some cases. It may make sense to switch them, or add both forms. Also, I think aliases could be added to other languages using the same alphabet. The valid aliases for Rimbaud in English or German are most probably the same as in French. --Zolo (talk) 12:48, 18 July 2013 (UTC)
- @Zolo: After more thinking, I am not going to add aliases that are perfect Firstname, Lastname / Lastname, Firstname permutations of each other. It seems like a waste of space on Wikidata unless there are at least some character differences. I also think that your point about sharing aliases across languages is a good feature. But it is complicated to work out which languages can share which aliases. I would rather do that as a separate task.
- I've scanned the first 1,000 Wikidata items now. Any more objections before a larger run? Maximilianklein (talk) 00:50, 19 July 2013 (UTC)
- Approved ·addshore· talk to me! 07:37, 21 July 2013 (UTC)
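The name-order filter mentioned in the discussion (skipping aliases that are pure Firstname, Lastname / Lastname, Firstname permutations of the label) could look something like the sketch below. It is a guess at one possible approach, not the operator's implementation: the helper names are hypothetical, and treating names as case-insensitive bags of word tokens is an assumption.

```python
import re


def name_tokens(name: str) -> list:
    """Split a name into lowercase word tokens, ignoring commas and spacing."""
    return sorted(token.casefold() for token in re.findall(r"\w+", name))


def is_pure_reordering(label: str, candidate: str) -> bool:
    """True when both names consist of exactly the same words in any order."""
    return name_tokens(label) == name_tokens(candidate)
```

With this check, "Rimbaud, Arthur" would be treated as a reordering of "Arthur Rimbaud" and skipped, while a form with an extra or differently spelled name part would still be considered as an alias.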