Wikidata talk:Lexicographical data/Archive/2021/03

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

English Wiktionary languages

Since we finally had the chance to index all languages of Wikisource, I figured I should do the same for Wiktionary. https://en.wiktionary.org/wiki/Category:All_languages has 5272 subcategories: https://petscan.wmflabs.org/?psid=18537705

ca. 4500 categories now have an item and have their language defined at Wikidata (with topic's main category (P910))

About 1000 of these items were created today; others were created in 2017 but had been missing their P910 until today.

ca. 500 categories have an item, but their language isn't defined at Wikidata: https://petscan.wmflabs.org/?psid=18537703

Some may already have their language defined on the Wiktionary category page with a Wikidata item (?!)

ca. 180 categories have yet to be connected to an item (I skipped some of the proto-languages): https://petscan.wmflabs.org/?psid=18537695

wiktionary:Wiktionary:Statistics/generated has detailed stats and counts 4205 "language headers", 2800 of them with fewer than 10 "gloss definitions". --- Jura 00:48, 1 March 2021 (UTC)

Focus languages for improvements to the lexicographic extension of Wikidata and Abstract Wikipedia

Hi. We would like to find two or three language communities who would be good matches to help to start and guide some long-term improvements to the lexicographic data part of Wikidata, and the closely related work in the Wikifunctions wiki and the Abstract Wikipedia project, over the next few years. Participating communities will hopefully find that this project will lead to long-term growth in content in Wikipedia and Wiktionary in and about their language. See Wikidata:Lexicographical data/Focus languages for more information. Please help us identify potential good matches. More details are on that page. Thank you! Quiddity (WMF) (talk) 00:09, 4 March 2021 (UTC)

Demonyms created as items, instead of lexemes

Haligonians (Q104720536) is listed for deletion at Wikidata:Requests_for_deletions#Q104720536. It's currently an instance of demonym (Q217438). --- Jura 21:45, 8 March 2021 (UTC)

30 Lexic-o-days, events and challenges about lexicographical data

Hello all,

I'm glad to announce that 30 Lexic-o-days, a series of events, projects and challenges around lexicographical data, will start on March 15th. There will be discussions and presentations, but also activities like improving the documentation of Lexemes or editing challenges. The goal of this event is to gather people editing Lexemes to have discussions around the content and work together. You can find the schedule and all relevant links on this page.

This format is a first experiment and its content is powered by the community: if you have ideas or wishes for the discussions, you're very welcome to set up an appointment or to create a task on the related Phabricator board! We're also keeping an open list of ideas here. Discussions about Lexemes, or summaries of future discussions that will take place during the event, should be documented on the project page or its talk page.

If you have questions or need help to participate, feel free to contact me. I'm looking forward to your participation! Cheers, Lea Lacroix (WMDE) (talk) 12:32, 9 March 2021 (UTC)

Lexicographic coverage of Wikipedia

lang	Forms in Wikidata	Forms in Wikipedia	Tokens	Covered forms	Covered forms (%)	Missing forms	Covered tokens	Covered tokens (%)	Missing tokens	Notes
en	64494	963849	1529657229	41654	4.3	922195	1345425070	88	184232159
et	1606320	131017	16958786	72103	55	58914	13378652	78.9	3580134
da	31671	112469	31303226	17096	15.2	95373	24553257	78.4	6749969
sv	145092	224857	72905148	39899	17.7	184958	55812149	76.6	17092999
fr	17572	554837	480077448	25562	4.6	529275	356747841	74.3	123329607
de	21654	1035287	609589775	15705	1.5	1019582	381764040	62.6	227825735
sk	65104	119565	18106551	35858	30	83707	10763899	59.4	7342652
ru	909795	795720	292922734	158112	19.9	637608	136397968	46.6	156524766
he	324189	250049	77382629	53709	21.5	196340	35040268	45.3	42342361
cs	106648	273215	73639616	26526	9.7	246689	32663946	44.4	40975670
es	4069	450682	410174431	5763	1.3	444919	169742085	41.4	240432346
it	547	399799	285251736	860	0.2	398939	93932894	32.9	191318842
pl	15161	386124	118098912	7071	1.8	379053	37465675	31.7	80633237
no	5090	158509	49548917	2499	1.6	156010	12431637	25.1	37117280	(w/ 10+ tokens)
fi	7067	294157	47543118	6404	2.2	287753	11613274	24.4	35929844
nl	183	277692	131747011	280	0.1	277412	25134658	19.1	106612353
bn	27490	649854	13474320	5478	0.8	644376	2109462	18.6	11364858	selection slightly different
pt	226	244299	159495159	347	0.1	243952	25751182	16.1	133743977
hi	151	54671	18940269	106	0.2	54565	2859937	15.1	16080332
ms	358	58423	16381541	521	0.9	57902	2446411	14.9	13935130
ca	116	182535	109381764	97	0.1	182438	11036460	10.1	98345304
hr	54	137158	28734051	52	0	137106	1311223	4.6	27422828
uk	735	421935	115177791	710	0.2	421225	4254846	3.7	110922945
lv	38	60997	8034762	36	0.1	60961	264719	3.3	7770043
bg	166	124069	33507484	152	0.1	123917	347412	1	33160072
ro	24	134905	41483262	38	0	134867	337091	0.8	41146171
ar	202	248809	69904516	35	0	248774	245740	0.4	69658776
id	13	113157	40564473	13	0	113144	111944	0.3	40452529
fa	45	107982	45328313	27	0	107955	111378	0.2	45216935
hu	131	290408	65863565	102	0	290306	133238	0.2	65730327
ko	24	292283	34022746	22	0	292261	74177	0.2	33948569
lt	34	101819	13119499	20	0	101799	16277	0.1	13103222
tr	21	165383	30066752	35	0	165348	39628	0.1	30027124
el	9	132176	40493491	9	0	132167	4024	0	40489467
sl	2	115894	19773777	3	0	115891	2133	0	19771644
sr	9	201413	42242364	2	0	201411	682	0	42241682
th	3	28208	2090490	2	0	28206	634	0	2089856
tl	1	22614	3604273	2	0	22612	1092	0	3603181
vi	5	71119	75800297	6	0	71113	6510	0	75793787

Wikidata:Lexicographical_coverage has some interesting stats. Above is a summary of these. There is some discussion about it at Wikidata_talk:Lexicographical_coverage. --- Jura 13:28, 9 March 2021 (UTC)

Prospection around a « verb conjugation » gadget

Hi, I have a little project to create a gadget that shows the conjugation of verbs using Wikidata data. So I tried a few things in SPARQL, like this query to find the forms of French verbs in different tenses.

Here is a first attempt for French:

select ?verbe ?verbeLabel ?tempsLabel ?LabelArticle ?repre {
  values (?temps ?rangTemps ?tempsLabel) { 
    (wd:Q192613 1 "présent") 
    (wd:Q442485 2 "passé simple") # passé simple
    (wd:Q17081589 2 "passé simple(fr)")
    (wd:Q1475560 3 "futur simple")
    (wd:Q1336020 5 "passé composé")
  }
  values (?article ?rangArticle ?LabelArticle) { 
    (wd:Q51929218 1 "je") 
    (wd:Q51929369 2 "tu")
    (wd:Q51929447 3 "il/elle")
    (wd:Q51929447 4 "il")
    (wd:Q52431970 5 "elle")
    (wd:Q51929290 6 "nous")
    (wd:Q51929403 7 "vous")
    (wd:Q51929517 8 "ils/elles")
    (wd:Q52432983 9 "ils")
    (wd:Q52433019 10 "elles")
  }
  ?verbe ontolex:lexicalForm ?forme ; 
         dct:language wd:Q150 ;
         wikibase:lemma ?verbeLabel .
  
  ?forme wikibase:grammaticalFeature ?temps ;
         wikibase:grammaticalFeature ?article ;
         ontolex:representation ?repre
  # SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
} order by ?verbe ?rangTemps ?rangArticle


(Either my query is bad or there are not many verb forms in French right now.)

I wondered how to generalize this query to languages other than French, so I tried to build another query that finds the personal pronouns in all languages:

select ?trait_grammatical ?mot ?lang {
  
  ?lex wikibase:lemma ?mot ; dct:language ?lang .
  ?lex ontolex:sense/wdt:P5137 ?trait_grammatical .
  
  ?trait_grammatical wdt:P279* wd:Q690940 .
}


Far from complete as well. Same question: is my query bad? It uses item for this sense (P5137) to link the lexeme senses to a grammatical person (Q690940). Is this a good approach? author TomT0m / talk page 15:47, 13 March 2021 (UTC)

@TomT0m: thanks for this query, it allowed me to find and correct mistakes. first-person singular (Q51929218) should not be used as a grammatical feature; it should be first person (Q21714344) + singular (Q110786) (likewise, passé simple (Q17081589) should not be used). You can look at bouger (L10251) for a not-too-bad example (some details still need to be improved, see Talk:Q12547192 for instance). PS: third-person masculine plural (Q52432983) and third-person feminine plural (Q52433019) are indeed grammatical persons in French but they're not used for conjugation inflections. Cdlt, VIGNERON (talk) 11:24, 24 March 2021 (UTC)
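As an illustration of the recommendation above, the first query could be adapted to match separate person and number features instead of combined items. This is only a sketch for the present tense: Q21714344, Q110786, Q192613 and Q150 come from this thread, while the items for second person, third person and plural are assumed from memory and should be checked before use. The variable names are arbitrary.

```sparql
# Sketch: French verb forms in the present tense, matched on separate
# person and number features rather than combined items.
SELECT ?verbe ?lemme ?personneLabel ?nombreLabel ?repre {
  VALUES (?personne ?rangPersonne ?personneLabel) {
    (wd:Q21714344 1 "first person")    # item named in the reply above
    (wd:Q51929049 2 "second person")   # assumed QID, verify before use
    (wd:Q51929074 3 "third person")    # assumed QID, verify before use
  }
  VALUES (?nombre ?rangNombre ?nombreLabel) {
    (wd:Q110786 1 "singular")          # item named in the reply above
    (wd:Q146786 2 "plural")            # assumed QID, verify before use
  }
  ?verbe dct:language wd:Q150 ;
         wikibase:lemma ?lemme ;
         ontolex:lexicalForm ?forme .
  # A form must carry the tense, one person and one number.
  ?forme wikibase:grammaticalFeature wd:Q192613 ,   # présent, as in the first query
                                     ?personne ,
                                     ?nombre ;
         ontolex:representation ?repre .
}
ORDER BY ?lemme ?rangNombre ?rangPersonne
```

Sorting by number and then person gives the usual je, tu, il/elle, nous, vous, ils/elles order. Forms in other moods that are also tagged with the present tense would match too, so a mood feature may need to be added as a further filter.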

@VIGNERON: mmm I think I’d prefer to keep forms such as first-person singular (Q51929218), it would make the query easier (by putting them as subclass of the other non combined items you can achieve the same effect I guess) author TomT0m / talk page 11:35, 24 March 2021 (UTC)

@TomT0m: well, for obvious reasons (no new items to create, translate and manage) and since both methods are equivalent, it has been decided to go with the easier one and not to use combined items; see Wikidata talk:Lexicographical data/Archive/2019/12#Use of combined grammatical features like “second-person plural” or Wikidata talk:Lexicographical data/Archive/2018/06#Missing items for grammatical features for instance. BTW, right now, we only have 47 persons; most of them don't have much data, not even labels in English or French :( and some others are not even in subclasses (I may improve them a bit but I already have a lot to do and days only have 24 hours). Cheers, VIGNERON (talk) 14:15, 24 March 2021 (UTC)

@TomT0m: actually I don't like the approach with grammatical person (Q690940). you (L482) (and e.g. tu (L9096)) are not a grammatical person (Q690940) (which would make them a grammatical category (Q980357)) but a personal pronoun (Q468801), i.e. like thou (Q4466935). They just have some grammatical person (Q690940), like second-person singular (Q51929369). --Infovarius (talk) 21:07, 26 March 2021 (UTC)
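One way to read the objection above is to select pronouns by their lexical category rather than through their senses. The following is a minimal sketch under that reading, not an agreed modelling: it assumes pronoun lexemes are tagged with personal pronoun (Q468801) as their lexical category, which may not hold for lexemes filed under a broader pronoun category.

```sparql
# Sketch: personal pronouns in all languages, selected by lexical category.
SELECT ?lex ?mot ?lang {
  ?lex wikibase:lexicalCategory wd:Q468801 ;   # personal pronoun, as named above
       wikibase:lemma ?mot ;
       dct:language ?lang .
}
ORDER BY ?lang ?mot
```

The grammatical person of each pronoun would then have to come from a statement on the lexeme or from the features of its forms. Which of the two to use is left open here.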

Open up lexicographical data to Wiktionary

Given that the lexeme namespace in Wikidata (or lexemes as Wikibase entities) could benefit from more contributors, how could we open it up to Wiktionary contributors? --- Jura 10:59, 18 March 2021 (UTC)

See Wikidata:Wiktionary and phab:T212843. Cheers, VIGNERON (talk) 16:54, 24 March 2021 (UTC)

radical (P5280) for non-CJVK languages

What's the status of this? special:ListProperties?datatype=wikibase-lexeme gives no hints but something mentioned at lexeme talk:L1 made me wonder. Arlo Barnes (talk) 00:18, 30 October 2020 (UTC)

@Arlo Barnes: what is your question or proposal exactly? radical (P5280) is not meant for lexemes. The talk on L1 was old; two months later combines lexemes (P5238) was created, which can be used for that, I think. Cheers, VIGNERON (talk) 07:46, 1 March 2021 (UTC)

@VIGNERON: Thank you, I think that answers any questions I had. Arlo Barnes (talk) 19:58, 1 March 2021 (UTC)
