Wikidata talk:Lexicographical data/Archive/2021/03
This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.
English Wiktionary languages
As we finally got the occasion to index all languages of Wikisource, I figured I should do the same for Wiktionary. Looking at https://en.wiktionary.org/wiki/Category:All_languages it counts 5272 subcategories: https://petscan.wmflabs.org/?psid=18537705
- ca. 4500 categories now have an item and its language defined at Wikidata (with topic's main category (P910))
- Some ca. 1000 were created today; others were created in 2017 but had been missing their P910 until today.
- ca. 500 categories have an item, but its language isn't defined at Wikidata: https://petscan.wmflabs.org/?psid=18537703
- Some may already have their language defined on the Wiktionary category page with a Wikidata item (?!)
- ca. 180 categories have yet to be connected to an item (I skipped some of the proto-languages): https://petscan.wmflabs.org/?psid=18537695
wiktionary:Wiktionary:Statistics/generated has detailed stats and counts 4205 "language headers", 2800 of them with fewer than 10 "gloss definitions". --- Jura 00:48, 1 March 2021 (UTC)
Focus languages for improvements to the lexicographic extension of Wikidata and Abstract Wikipedia
Hi. We would like to find two or three language communities who would be good matches to help to start and guide some long-term improvements to the lexicographic data part of Wikidata, and the closely related work in the Wikifunctions wiki and the Abstract Wikipedia project, over the next few years. Participating communities will hopefully find that this project will lead to long-term growth in content in Wikipedia and Wiktionary in and about their language. See Wikidata:Lexicographical data/Focus languages for more information. Please help us identify potential good matches. More details are on that page. Thank you! Quiddity (WMF) (talk) 00:09, 4 March 2021 (UTC)
Demonyms created as items, instead of lexemes
Haligonians (Q104720536) is listed for deletion at Wikidata:Requests_for_deletions#Q104720536. It's currently an instance of demonym (Q217438). --- Jura 21:45, 8 March 2021 (UTC)
30 Lexic-o-days, events and challenges about lexicographical data
Hello all,
I'm glad to announce that 30 Lexic-o-days, a series of events, projects and challenges around lexicographical data, will start on March 15th. There will be discussions and presentations, but also activities like improving the documentation of Lexemes or editing challenges. The goal of this event is to bring together people who edit Lexemes so they can discuss the content and work together. You can find the schedule and all relevant links on this page.
This format is a first experiment and its content is powered by the community: if you have ideas or wishes for the discussions, you're very welcome to set up an appointment or to create a task on the related Phabricator board! We're also keeping an open list of ideas here. Discussions about Lexemes, or summaries of future discussions that will take place during the event, should be documented on the project page or its talk page.
If you have questions or need help to participate, feel free to contact me. I'm looking forward to your participation! Cheers, Lea Lacroix (WMDE) (talk) 12:32, 9 March 2021 (UTC)
Lexicographic coverage of Wikipedia
lang | Forms in Wikidata | Forms in Wikipedia | Tokens | Covered forms | (%) | Missing forms | Covered tokens | (%) | Missing tokens | Notes |
---|---|---|---|---|---|---|---|---|---|---|
en | 64494 | 963849 | 1529657229 | 41654 | 4.3 | 922195 | 1345425070 | 88 | 184232159 | |
et | 1606320 | 131017 | 16958786 | 72103 | 55 | 58914 | 13378652 | 78.9 | 3580134 | |
da | 31671 | 112469 | 31303226 | 17096 | 15.2 | 95373 | 24553257 | 78.4 | 6749969 | |
sv | 145092 | 224857 | 72905148 | 39899 | 17.7 | 184958 | 55812149 | 76.6 | 17092999 | |
fr | 17572 | 554837 | 480077448 | 25562 | 4.6 | 529275 | 356747841 | 74.3 | 123329607 | |
de | 21654 | 1035287 | 609589775 | 15705 | 1.5 | 1019582 | 381764040 | 62.6 | 227825735 | |
sk | 65104 | 119565 | 18106551 | 35858 | 30 | 83707 | 10763899 | 59.4 | 7342652 | |
ru | 909795 | 795720 | 292922734 | 158112 | 19.9 | 637608 | 136397968 | 46.6 | 156524766 | |
he | 324189 | 250049 | 77382629 | 53709 | 21.5 | 196340 | 35040268 | 45.3 | 42342361 | |
cs | 106648 | 273215 | 73639616 | 26526 | 9.7 | 246689 | 32663946 | 44.4 | 40975670 | |
es | 4069 | 450682 | 410174431 | 5763 | 1.3 | 444919 | 169742085 | 41.4 | 240432346 | |
it | 547 | 399799 | 285251736 | 860 | 0.2 | 398939 | 93932894 | 32.9 | 191318842 | |
pl | 15161 | 386124 | 118098912 | 7071 | 1.8 | 379053 | 37465675 | 31.7 | 80633237 | |
no | 5090 | 158509 | 49548917 | 2499 | 1.6 | 156010 | 12431637 | 25.1 | 37117280 | (w/ 10+ tokens) |
fi | 7067 | 294157 | 47543118 | 6404 | 2.2 | 287753 | 11613274 | 24.4 | 35929844 | |
nl | 183 | 277692 | 131747011 | 280 | 0.1 | 277412 | 25134658 | 19.1 | 106612353 | |
bn | 27490 | 649854 | 13474320 | 5478 | 0.8 | 644376 | 2109462 | 18.6 | 11364858 | selection slightly different |
pt | 226 | 244299 | 159495159 | 347 | 0.1 | 243952 | 25751182 | 16.1 | 133743977 | |
hi | 151 | 54671 | 18940269 | 106 | 0.2 | 54565 | 2859937 | 15.1 | 16080332 | |
ms | 358 | 58423 | 16381541 | 521 | 0.9 | 57902 | 2446411 | 14.9 | 13935130 | |
ca | 116 | 182535 | 109381764 | 97 | 0.1 | 182438 | 11036460 | 10.1 | 98345304 | |
hr | 54 | 137158 | 28734051 | 52 | 0 | 137106 | 1311223 | 4.6 | 27422828 | |
uk | 735 | 421935 | 115177791 | 710 | 0.2 | 421225 | 4254846 | 3.7 | 110922945 | |
lv | 38 | 60997 | 8034762 | 36 | 0.1 | 60961 | 264719 | 3.3 | 7770043 | |
bg | 166 | 124069 | 33507484 | 152 | 0.1 | 123917 | 347412 | 1 | 33160072 | |
ro | 24 | 134905 | 41483262 | 38 | 0 | 134867 | 337091 | 0.8 | 41146171 | |
ar | 202 | 248809 | 69904516 | 35 | 0 | 248774 | 245740 | 0.4 | 69658776 | |
id | 13 | 113157 | 40564473 | 13 | 0 | 113144 | 111944 | 0.3 | 40452529 | |
fa | 45 | 107982 | 45328313 | 27 | 0 | 107955 | 111378 | 0.2 | 45216935 | |
hu | 131 | 290408 | 65863565 | 102 | 0 | 290306 | 133238 | 0.2 | 65730327 | |
ko | 24 | 292283 | 34022746 | 22 | 0 | 292261 | 74177 | 0.2 | 33948569 | |
lt | 34 | 101819 | 13119499 | 20 | 0 | 101799 | 16277 | 0.1 | 13103222 | |
tr | 21 | 165383 | 30066752 | 35 | 0 | 165348 | 39628 | 0.1 | 30027124 | |
el | 9 | 132176 | 40493491 | 9 | 0 | 132167 | 4024 | 0 | 40489467 | |
sl | 2 | 115894 | 19773777 | 3 | 0 | 115891 | 2133 | 0 | 19771644 | |
sr | 9 | 201413 | 42242364 | 2 | 0 | 201411 | 682 | 0 | 42241682 | |
th | 3 | 28208 | 2090490 | 2 | 0 | 28206 | 634 | 0 | 2089856 | |
tl | 1 | 22614 | 3604273 | 2 | 0 | 22612 | 1092 | 0 | 3603181 | |
vi | 5 | 71119 | 75800297 | 6 | 0 | 71113 | 6510 | 0 | 75793787 | |
Wikidata:Lexicographical_coverage has some interesting stats. The table above is a summary of these. There is some discussion about it at Wikidata_talk:Lexicographical_coverage. --- Jura 13:28, 9 March 2021 (UTC)
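As a reading aid for the table: the percentage and "Missing" columns appear to follow directly from the other columns (covered count divided by the Wikipedia count, and Wikipedia count minus covered count). This relationship is inferred from the figures themselves, not taken from the linked page. Worked through for the "en" row:

```latex
\[
\text{covered forms (\%)} = \frac{41654}{963849} \times 100 \approx 4.3,
\qquad
\text{missing forms} = 963849 - 41654 = 922195
\]
\[
\text{covered tokens (\%)} = \frac{1345425070}{1529657229} \times 100 \approx 88,
\qquad
\text{missing tokens} = 1529657229 - 1345425070 = 184232159
\]
```

The same arithmetic reproduces the "et" row (72103 / 131017 gives about 55, and 13378652 / 16958786 gives about 78.9).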
Prospection around a « verb conjugation » gadget
Hi, I have a small project to create a gadget that shows verb conjugations using Wikidata data. I experimented in SPARQL, for example with a query that finds the forms of French verbs in different tenses.
Here is a first attempt for French:
select ?verbe ?verbeLabel ?tempsLabel ?LabelArticle ?repre {
  values (?temps ?rangTemps ?tempsLabel) {
    (wd:Q192613 1 "présent")
    (wd:Q442485 2 "passé simple") # passé simple
    (wd:Q17081589 2 "passé simple(fr)")
    (wd:Q1475560 3 "futur simple")
    (wd:Q1336020 5 "passé composé")
  }
  values (?article ?rangArticle ?LabelArticle) {
    (wd:Q51929218 1 "je")
    (wd:Q51929369 2 "tu")
    (wd:Q51929447 3 "il/elle")
    (wd:Q51929447 4 "il")
    (wd:Q52431970 5 "elle")
    (wd:Q51929290 6 "nous")
    (wd:Q51929403 7 "vous")
    (wd:Q51929517 8 "ils/elles")
    (wd:Q52432983 9 "ils")
    (wd:Q52433019 10 "elles")
  }
  ?verbe ontolex:lexicalForm ?forme ;
         dct:language wd:Q150 ;
         wikibase:lemma ?verbeLabel .
  ?forme wikibase:grammaticalFeature ?temps ;
         wikibase:grammaticalFeature ?article ;
         ontolex:representation ?repre
  # SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} order by ?verbe ?rangTemps ?rangArticle
(Either my query is bad or there are not many French verb forms right now.)
I wondered how to generalize this query to languages other than French, so I tried to build another query that finds the personal pronouns in all languages:
select ?trait_grammatical ?mot ?lang {
  ?lex wikibase:lemma ?mot ; dct:language ?lang .
  ?lex ontolex:sense/wdt:P5137 ?trait_grammatical .
  ?trait_grammatical wdt:P279* wd:Q690940 .
}
This is far from complete as well. Same question: is my query bad? It uses item for this sense (P5137) to link the lexeme senses to a grammatical person (Q690940). Is this a good approach? author TomT0m / talk page 15:47, 13 March 2021 (UTC)
- @TomT0m: thanks for this query, it allowed me to find and correct mistakes. first-person singular (Q51929218) should not be used as grammatical feature, it should be first person (Q21714344) + singular (Q110786) (like passé simple (Q17081589) should not be used). You can look at bouger (L10251) for a not-too-bad example (some details still need to be improved, see Talk:Q12547192 for instance). PS: third-person masculine plural (Q52432983) and third-person feminine plural (Q52433019) are indeed grammatical persons in French but they're not used for conjugation flexions. Cdlt, VIGNERON (talk) 11:24, 24 March 2021 (UTC)
- @VIGNERON: mmm I think I’d prefer to keep forms such as first-person singular (Q51929218), it would make the query easier (by putting them as subclass of the other non combined items you can achieve the same effect I guess) author TomT0m / talk page 11:35, 24 March 2021 (UTC)
- @TomT0m: well, for obvious reasons (no new items to create, translate and manage) and since both methods are equivalent, it has been decided to go with the easier one and not to use combined items, see Wikidata talk:Lexicographical data/Archive/2019/12#Use of combined grammatical features like “second-person plural” or Wikidata talk:Lexicographical data/Archive/2018/06#Missing items for grammatical features for instance. BTW, right now, we only have 47 persons; most of them don't have much data, not even labels in English or French :( and some others are not even in subclasses (I may improve them a bit but I already have a lot to do and days only have 24 hours). Cheers, VIGNERON (talk) 14:15, 24 March 2021 (UTC)
- @TomT0m: actually I don't like the approach with grammatical person (Q690940). you (L482) (and e.g. tu (L9096)) is not a grammatical person (Q690940) (which would make it a grammatical category (Q980357)) but a personal pronoun (Q468801), i.e. like thou (Q4466935). It just has some grammatical person (Q690940) like second-person singular (Q51929369). --Infovarius (talk) 21:07, 26 March 2021 (UTC)
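For illustration, the decomposed modelling VIGNERON describes in this thread (first person (Q21714344) + singular (Q110786) instead of a combined item) could be queried roughly as follows. This is only a sketch: it reuses the item IDs quoted in this thread, the variable names are arbitrary, and how many results it returns depends on how the French forms are actually tagged.

```sparql
# Sketch only: French forms carrying the separate features
# "first person" + "singular" rather than one combined item.
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?verbe ?lemma ?repre WHERE {
  ?verbe dct:language wd:Q150 ;          # French, as in the original query
         wikibase:lemma ?lemma ;
         ontolex:lexicalForm ?forme .
  ?forme wikibase:grammaticalFeature wd:Q192613 ,    # présent
                                     wd:Q21714344 ,  # first person
                                     wd:Q110786 ;    # singular
         ontolex:representation ?repre .
}
ORDER BY ?lemma
```

Because each feature is a separate statement on the form, the query needs one grammaticalFeature match per feature. Forms from other moods that carry the same three features would also match, so a gadget would probably need to filter on mood as well.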
Open up lexicographical data to Wiktionary
Given that the lexeme namespace in Wikidata (or lexemes as Wikibase entities) could benefit from more contributors, how could we open it up to Wiktionary contributors? --- Jura 10:59, 18 March 2021 (UTC)
- See Wikidata:Wiktionary and phab:T212843. Cheers, VIGNERON (talk) 16:54, 24 March 2021 (UTC)
radical (P5280) for non-CJVK languages
What's the status of this? special:ListProperties?datatype=wikibase-lexeme gives no hints but something mentioned at lexeme talk:L1 made me wonder. Arlo Barnes (talk) 00:18, 30 October 2020 (UTC)
- @Arlo Barnes: what is your question or proposal exactly? radical (P5280) is not meant for lexemes. The talk on L1 was old; two months later combines lexemes (P5238) was created, which can be used for that, I think. Cheers, VIGNERON (talk) 07:46, 1 March 2021 (UTC)
- @VIGNERON: Thank you, I think that answers any questions I had. Arlo Barnes (talk) 19:58, 1 March 2021 (UTC)