Wikidata talk:Lexicographical data/Archive/2022/07

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Different spelling or different words?

mensa (L31224) = (L278590)? --Infovarius (talk) 19:19, 15 March 2022 (UTC)

@Infovarius: clearly the same lexeme. The diacritic (here a macron) is a modern notation; it's useful for pronunciation (to indicate a long vowel) but I'm not sure how it should be stored (see also Wikidata_talk:Lexicographical_data/Archive/2018/06#Arabic_diacritics where we mention this question). PS: the Uzielbot import had a lot of strange things that should be fixed (but I'm not sure where to start...). Cheers, VIGNERON (talk) 15:49, 9 May 2022 (UTC)

@Infovarius: merge

Done. But I'm not sure what to do with the form now... @Escudero, Uziel302: VIGNERON (talk) 09:02, 5 June 2022 (UTC)

I believe they should be merged too (as different variants of spelling). --Infovarius (talk) 21:15, 6 June 2022 (UTC)

VIGNERON, when I made the import there were very few Latin words, so I just imported everything that was in Whitaker's program, and the duplicates created by the import should be merged. As for other strange things, some of them might be in the source; you can check online [1]. If there is a difference from that source, I would love to hear about it. Uziel302 (talk) 21:00, 2 July 2022 (UTC)

@Uziel302: I know, and at that time, in the context of starting the lexemes, it was not a bad import, but now the standards are a bit higher and these are bad lexemes. For the cleaning, see below #Cleaning of Latin lexemes. And yes, I didn't see any difference (not yet) from the source, but William Whitaker's Words (Q533803) is known to be a not bad but not so good source. Cheers, VIGNERON (talk) 09:00, 3 July 2022 (UTC)

The Grammatical Person category and its representation

I'm looking at the representation of lexemes from the perspective of using them in the Abstract Wikipedia project. Here I want to raise some qualms I have with the Grammatical Person category.

First, contrary to other grammatical categories, there is no corresponding Grammatical Person property. This means that one cannot add statements regarding the grammatical person of pronouns, for instance. Can I simply create such a property?
Second, from a linguistic perspective, as a grammatical category the person category has (typically) three possible values: first person, second person and third person. Yet these items are not marked as instances of the category, but rather as subclasses of it. As instances of the category we find composite features such as third-person masculine plural and similar beasts. Linguistically speaking, these are not person categories, but rather descriptions of pronominal elements, thus combining several atomic grammatical features. In fact, it was agreed in a previous discussion not to use these combined features, but only the atomic ones. Is there a way to systematically clean this up? I would suggest the following steps:
- Make the atomic person categories instances of Grammatical Person.
- Either delete the composite person categories completely, or remove their instance of relation to Grammatical Person. In the latter case, the composite person categories can still be annotated as subclasses of the atomic person categories, if we understand the subclass of relation as a subsumption relation (but that's probably another discussion).

Does this seem reasonable? Is there some easy way to make this bulk change? AGutman-WMF (talk) 20:46, 9 June 2022 (UTC)
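The two steps suggested above can be sketched on toy data. This is only an illustration under stated assumptions: the `Item` class, its field names and the labels are hypothetical stand-ins for Wikidata items and their instance of / subclass of statements, and replacing (rather than keeping) the subclass of statement on the atomic categories is one possible reading of the first step.

```python
from dataclasses import dataclass, field

GRAMMATICAL_PERSON = "grammatical person"
# Assumption: the three atomic values named in the comment above.
ATOMIC_PERSONS = {"first person", "second person", "third person"}


@dataclass
class Item:
    """Hypothetical, simplified stand-in for a Wikidata item."""
    label: str
    instance_of: set = field(default_factory=set)
    subclass_of: set = field(default_factory=set)


def clean_up(items):
    """Apply the two suggested steps in place and return the items."""
    for item in items:
        if item.label in ATOMIC_PERSONS:
            # Step 1: atomic categories become instances of the category.
            item.subclass_of.discard(GRAMMATICAL_PERSON)
            item.instance_of.add(GRAMMATICAL_PERSON)
        else:
            # Step 2 (second variant): composite categories lose their
            # "instance of" relation but are not deleted.
            item.instance_of.discard(GRAMMATICAL_PERSON)
    return items


items = clean_up([
    Item("third person", subclass_of={GRAMMATICAL_PERSON}),
    Item("third-person masculine plural", instance_of={GRAMMATICAL_PERSON}),
])
```

A real bulk change would of course go through an editing tool or a bot rather than in-memory objects; the sketch only shows which statements would move.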

@AGutman-WMF: I'm not sure what you are referring to regarding the property issue - perhaps you can provide more detail on this and the existing properties you see in this area (I wasn't aware we had them)? Please don't create a new property without going through the property proposal process - see Wikidata:Property proposal. On the second point, yes, this makes sense to me. I haven't really looked into it, but it sounds like cleanup would be helpful. ArthurPSmith (talk) 13:48, 10 June 2022 (UTC)

Thanks for the pointer!

Some grammatical categories are modeled both as items and as properties:

Grammatical Gender: item and property.
Grammatical Aspect: item and property
Noun class: item and property

The properties are useful as they allow specifying an entry-level statement for a lexeme (not tied to a specific form). For instance, the lexeme she could have a statement Grammatical Person: third-person, instead of repeating this feature once for every form. The same is actually also true for the Grammatical Number category, which doesn't have a corresponding property either.

AGutman-WMF (talk) 11:51, 13 June 2022 (UTC)
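To make the point about entry-level statements concrete, here is a minimal sketch that computes which grammatical features are shared by every form of a lexeme and could therefore be stated once on the lexeme. The dictionary layout, the function name and the feature labels for "she" are illustrative assumptions, not the actual Wikibase data model or the actual content of the lexeme.

```python
def lexeme_level_features(forms):
    """Return the features that appear on every form of a lexeme."""
    feature_sets = [set(form["features"]) for form in forms]
    if not feature_sets:
        return set()
    return set.intersection(*feature_sets)


# Illustrative forms for the pronoun "she" (labels are assumptions).
she_forms = [
    {"representation": "she", "features": ["third person", "singular", "nominative"]},
    {"representation": "her", "features": ["third person", "singular", "oblique"]},
]

shared = lexeme_level_features(she_forms)
```

Here `shared` contains "third person" and "singular", the features that would not need to be repeated on each form if matching lexeme-level properties existed.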

Update: Looking into this, I see that there has already been a proposal for creating a Grammatical Person property, but it has been rejected on the ground that the person property is specific to forms and not to lexemes. While this is true for verbs, it is certainly not true for pronouns (where the person property is a property of the lexeme). Should I create a new proposal or revive the old one?

The same question is relevant also for the rejected proposal to create a Grammatical Number property. There too it was rejected on the ground that it is only relevant to forms, and not lexemes, while in fact there are lexemes (such as pronouns, but also plurale tantums or mass nouns) which have a fixed number category, which should IMHO be stated on the lexeme level.

AGutman-WMF (talk) 12:25, 13 June 2022 (UTC)

@AGutman-WMF: thanks a lot for these comments! I'm not entirely sure why you can't use the grammatical persons already indicated as grammatical features, but feel free to make a proposal to explain your idea in more detail (with examples and ideally also some external references). Especially as the last proposal was from 2018, *before* the creation of the first lexemes; a lot has changed since.

For the atomic person categories, I think it's mostly clean on lexemes themselves (no systematic cleaning that I know of, not sure it's needed; maybe we could have constraints for grammatical features?). And we can't delete the composite ones as it seems they are needed outside lexemes; removal as P31 might be a good idea (but indeed we probably need another discussion).

For the « bulk change », not sure we have the tool but we have some tools (did you already take a look at Wikidata:Tools/Lexicographical data?).

Cheers, VIGNERON (talk) 18:29, 13 June 2022 (UTC)

Thanks! I've created now a property proposal for Grammatical Person. The logic behind it is that if a grammatical feature applies to the lexeme itself (i.e. all forms) it should be stated on the lexeme rather than on each form individually.

Every now and then I run into English verbs that use the third-person singular feature so I think there is an opportunity for a systematic clean-up here. Maybe this is something we could tackle in the upcoming Wikidata Quality Days.

In general, I think constraints on grammatical features would be very welcome. Logically, even features of forms, appearing as grammatical features, should be tied to grammatical properties and obey these constraints (as if they were statements), but apparently this is difficult to achieve in the existing data model. AGutman-WMF (talk) 15:42, 17 June 2022 (UTC)

@VIGNERON It seems the proposal now has some support, and no objections. Who takes it from here to actually create the property? AGutman-WMF (talk) 10:44, 4 July 2022 (UTC)

Japanese words

In Japanese, and probably other languages, some kanji can be read in several ways. For example, « 四 » in the sense of « 4 » can be read « yon » or « shi », and we have two lexemes for the two cases: 四/よん (L625228) and 四/し (L641752). They have the same sense and usually are just one line in dictionaries, as far as I can tell. Is it a good idea to have two lexemes or should we find another solution?

It’s impossible to put the two variants in the same lexeme as lemmas because they share the same language code, ja-hira, and you can have only one lemma per code. Maybe there is another language code for variants? I see that in Chinese languages there can be up to 5 in Wikidata now (母語/bó-gí/bó-gí (L305218) for example). They can be added as forms, of course, but is that the way to do it? author TomT0m / talk page 19:58, 21 June 2022 (UTC)

@TomT0m: This sounds similar to the problems we're facing in Vietnamese. Maybe the same workaround can be used for Japanese for now. Minh Nguyễn ^💬 01:02, 22 June 2022 (UTC)

@TomT0m: That problem is common in Japanese. 私/わたし (L676) is written "わたくし" as well as "わたし" in Hiragana. I did not know how best to resolve this. Afaz (talk) 02:06, 22 June 2022 (UTC)

This example is a sound change. The original form is "わたくし（watakushi）," and "わたし（watashi）" is a truncated form. However, "watashi" is often used in modern Japanese, and "watakushi" has the added nuance of being polite. In 2008, the Ministry of Education established that there are two kun-yomi of "私": "わたくし" and "わたし". Afaz (talk) 22:46, 22 June 2022 (UTC)

@TomT0m: I'm not entirely sure, it's a weird case here (at first, I thought it was an onyomi vs. kunyomi problem but it does not seem to be the case and anyway, I'm not sure that the readings alone are enough to make it two different lexemes). Loominade and Shisma, could you tell us why you created two lexemes here? And, more importantly, why is there no indication to distinguish these two lexemes? Indeed, we really need to think more about how to deal with Asian languages. Cheers, VIGNERON (talk) 21:49, 22 June 2022 (UTC)

I'm not an expert, but as far as I understand よん and し have different etymologies which should make them distinct lexemes. So if you ask me they are homograph lexemes that happen to have the same meaning. – Shisma (talk) 05:53, 23 June 2022 (UTC)

here is what I think is going on:

四輪車/よんりんしゃ (L678963) is derived from 四/よん (L625228) is derived from 🤷

四季/しき (L678968) is derived from 四/し (L641752) is derived from 四 (L656234)

but please correct me – Shisma (talk) 07:09, 23 June 2022 (UTC)

@Afaz, TomT0m: do you disagree? – Shisma (talk) 16:51, 26 June 2022 (UTC)

@Shisma: There is no Japanese dictionary that lists them as homographs. However, it is correct to say that they have different etymologies. "よん" is from the Japanese native word "よ" and this is called kun-yomi. The word "し" is from the Chinese sound of the Chinese word "四", and this is called the on-yomi. The problem is that both words with multiple kun-yomi and words with multiple on-yomi exist. Afaz (talk) 18:07, 26 June 2022 (UTC)

I think this is precisely where we differ: I don't think of lexemes as having a reading. For me a reading is a property that a lexeme doesn't have. The kanji has a reading, in this case 四 (Q3594955), while the lexeme(s) are the subject of each reading: the actual word. The kanji is merely a representation of the word. For a Japanese-only dictionary it might be sensible to dismiss this layer of abstraction, but if we want to use Wikidata to map etymology across languages, I guess we need it. But ultimately I'm not an expert, neither in Japanese nor in linguistics. 🤷 @AGutman-WMF: do you have thoughts on this? Is there an expert in both? – Shisma (talk) 18:15, 27 June 2022 (UTC)

@Shisma « if we want to use wikidata to map etymology across languages, I guess we need it »: I don’t think so. We need to reference senses for etymology, not lexemes, for this. A lexeme typically has several senses, and not all of them carry over when the word is derived in a new language. author TomT0m / talk page 18:18, 27 June 2022 (UTC)

Assuming that each etymology has exactly one sense? – Shisma (talk) 06:30, 28 June 2022 (UTC)

@Shisma I’m not sure it’s worth trying to align our whole data model to such a constraint … In a lot of cases the etymology will be missing, unknown or incomplete. So this means that whole lexemes (ids, senses, forms) may have to be reorganised (with forms duplicated, which leads to duplicated data, maybe more than if we duplicated the etymology in each sense), and so a potentially big disruption for data users, even wiktionaries, each time we discover a new etymology, or even over etymology disputes (how do we handle conflicting etymology data? One lexeme for each hypothesis?).

Lexeme id stability seems much more important to me than ease of use of etymology data. Etymologies are an important part of a dictionary of course, and important for our data, an opportunity to structure etymology … but they do not seem to me to be the most important use of the lexeme data. So we had better weigh the problems that splitting items may cause for the main reusers.

If you want to avoid duplication we may find other solutions, like etymology statements not in the senses but as main lexeme statements, with qualifiers like « apply to sense » to make them correspond to senses. Or creating items for etymologies and referencing those items in the senses. Or putting the etymology in one sense and adding a property « has the same etymology as sense […] » to the other senses. author TomT0m / talk page 07:59, 28 June 2022 (UTC)

Often the origin of a word influences its grammatical gender (P5185) and paradigm class (P5911), which in turn also influences its forms. That's why all these properties belong to the lexeme rather than the sense, because that's not what senses are for. Lexemes that look identical (in at least one representation) use homograph lexeme (P5402) and we already have ~15000 of them. About ~4000 are even in the same lexical category; check out some French ones. Would you merge those too? – Shisma (talk) 10:00, 28 June 2022 (UTC)

@Shisma For example, for « tour » I would not merge those with different grammatical genders (un tour vs. une tour). Gender is an important feature of French lexemes that can change the form when they are inflected.

For those with the same gender, maybe (tour: a tool to make circular objects, and tour, a full turn; I’m not sure why there are two lexemes).

I checked several examples of your query and it seems to be the case in most examples.

As far as I can see most don’t really have etymology information yet, so how do you guarantee they won’t be further split in the future? Or would some of them be merged because, after all, they all derive from a close historical word sense? author TomT0m / talk page 10:36, 28 June 2022 (UTC)

The « poisson » example seems more compelling to me poisson (L11978) poisson (L455419), in that it has two entries in an important lexicographical resource in french, cf. https://www.cnrtl.fr/definition/poisson/1 and https://www.cnrtl.fr/definition/poisson/0 .

But can what works for French, a language that has been well documented for quite a long time, work for other languages? I’m not sure. And considering the goals of the lexicographical data, my conviction is that this should be a point to take into account. author TomT0m / talk page 11:27, 28 June 2022 (UTC)

Let me give you an example of a language I actually speak 😅. At first sight Ausdruck (L296956) and Ausdruck (L296957) look like they are the same. One is apparently derived from a translation of French expression (L12883) and it means, among others, expression, language style or swear word. The other is a nominalisation of the verb ausdrucken (L680080) and it means printout or process of printing. Both have the same grammatical gender but all plural forms are different: Ausdrucke/Ausdrücke. That's why dictionaries list them as different lexemes: Duden 1 & Duden 2; dwds 1 & dwds 2. These words are different in essence. So are よん and し. It's just even less obvious because they don't have multiple forms (as far as I know). If hiragana didn't exist and we didn't know how to pronounce these words, we would have to consider them to be a single lexeme – Shisma (talk) 15:35, 28 June 2022 (UTC)

This question goes to the core of what we understand as "Lexeme". In my opinion, when we are dealing with living (oral) languages, the basic units of language are the spoken forms and not their written representation. So, if we have two completely different spoken forms, as in this case, they should be represented as two different (though synonymous) lexemes. The fact that the two words can be written using the same Kanji character (and indeed, the fact that this is due to the history of how Kanji characters were introduced and read) should not confuse us. So I am in favor of keeping the current representation as two distinct lexemes. Of course, to enrich the data you can link the two lexemes as synonyms and also add statements clarifying the etymology of each lexeme (derived as On'yomi or Kun'yomi), using mode of derivation (P5886). AGutman-WMF (talk) 11:31, 28 June 2022 (UTC)

@AGutman-WMF « Lexeme » is used in some linguistic schools as a semantic unit. If you are talking of the « form » here, am I correct that you take a more « lemmatic » definition of lexeme (a lemma is, in Wikidata, a chosen representation of a lexeme)?

Comparing with the « etymological » definition we discussed, this is a completely different viewpoint am I correct ? For example if we adopt the convention of the cnrtl poisson (L11978) poisson (L455419) are said exactly the same, but have an unrelated meaning.

Note that the current documentation (see the last table in the section Wikidata:Lexicographical_data/Documentation#Data_Model) suggests all of these should be decided language by language. It does not seem to give criteria at this point. author TomT0m / talk page 11:53, 28 June 2022 (UTC)

Definitions of the term "Lexeme" may differ between linguistic schools of thought, but in general I think it is agreed that the Lexeme is a sub-case of the Saussurean Linguistic Sign, i.e. a linking between a sound pattern (the signifier) and a concept (the signified). Thus, if we change either of these two, the pronunciation or the meaning, we get a different lexeme. In the case of Poisson we have clearly two different meanings, so it warrants two lexemes, while in the Japanese case discussed here we have two different pronunciations, thus two lexemes. The difficult cases are when two pronunciations or senses are quite close to each other, in which case they may be seen as variation within the same lexeme.

The etymology itself does not play a crucial role: one lexeme may be (in rare cases) issued from a merger of two etymologies, while one etymological item may lead to the emergence of two different lexemes. On the other hand, differing grammatical features (e.g. gender, part-of-speech) may be a reason to favor multiple lexemes, but these are typically also coupled to a difference in meaning.

As for the table you mentioned, I think it is a bit misguided, since it allows entering differing pronunciations as different forms, which goes, in my opinion, against the idea that the lexeme forms should represent the inflection paradigm of the lexeme. AGutman-WMF (talk) 14:05, 28 June 2022 (UTC)
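The criterion described above (a lexeme as a pairing of a sound pattern and a concept) can be sketched as a grouping key. This is a simplified illustration only: the field names, the informal romanised pronunciations and the "sense A" / "sense B" placeholders are assumptions, and the sketch deliberately ignores the difficult cases where two pronunciations or senses are close enough to count as variation within one lexeme.

```python
from collections import defaultdict


def group_into_lexemes(words):
    """Group written words by (pronunciation, meaning), i.e. by linguistic sign."""
    lexemes = defaultdict(list)
    for word in words:
        key = (word["pronunciation"], word["meaning"])
        lexemes[key].append(word["written"])
    return dict(lexemes)


words = [
    # Same writing and meaning, different pronunciations: two lexemes.
    {"written": "四", "pronunciation": "yon", "meaning": "4"},
    {"written": "四", "pronunciation": "shi", "meaning": "4"},
    # Same writing and pronunciation, different meanings: two lexemes.
    {"written": "poisson", "pronunciation": "pwason", "meaning": "sense A"},
    {"written": "poisson", "pronunciation": "pwason", "meaning": "sense B"},
]

lexemes = group_into_lexemes(words)
```

Under this key the four rows yield four lexemes, whereas grouping by the written form alone would yield only two.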

@AGutman-WMF: Some months ago, I had tried this approach for Vietnamese, but the use of synonym (P5973) proved problematic, as it became impossible to distinguish between senses of two lexemes that were semantically synonymous from those that were differentiated only by the transcription method. It would not have been possible to tease out instances of the latter from the larger group of synonyms on the basis of pronunciation alone, as written dialectal variations and spoken dialectal variations were often only partially coincident. In my second attempt at your suggested approach the other day, I wound up using translation (P5972) instead, expanding the meaning of that property to include transcriptions. However, I remain interested in introducing more nuanced properties specific to transcriptions. Minh Nguyễn ^💬 04:38, 29 June 2022 (UTC)

I do not agree with the idea of keeping 四/よん (L625228) and 四/し (L641752) as separate lexemes. They are the same lexicon, even if pronounced differently. Afaz (talk) 05:12, 29 June 2022 (UTC)

I honestly want to understand, why you think so. I have an assumption based on what you said so far. Please tell me if I am right or wrong:

When you look at a japanese dictionary intended for real people (not nerds like us 😅), you see an entry along the lines of

四【し；よん；よ】4番目の正の整数

You wouldn't expect multiple entries like:

四【し】 4番目の正の整数
四【よん】 4番目の正の整数
四【よ】 4番目の正の整数

Because that's obvious and redundant.

Does this summarise your problem? Or is it something else? – Shisma (talk) 15:04, 29 June 2022 (UTC)

I have come to believe that the problem is that there is more than one lemma.

- Stop using ja-Hira, ja-Kana, and ja-Hrkt. → Use only one Lemma (ja).
- Use name in kana (P1814) to describe the reading of the lexemes.
- Describe hiragana and katakana word forms in Forms instead of Lemma.

Example

Lemma: 四
name in kana (P1814): "よん" for kun-yomi
name in kana (P1814): "し" for on-yomi
Form: "四", "よん", "し", "ヨン", "シ", etc

The Japanese dictionary called UniDic is divided into three levels: Lemma, Forms, and Orthographic. Since Japanese has three scripts, the Orthographic level describes all the Kanji, Hiragana, and Katakana spellings of each word form. If we want to realize this in Wikidata, it would be enough to describe all the orthographic variants in Forms. Afaz (talk) 17:33, 29 June 2022 (UTC)
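The example above can be written out as a small data structure to see what it implies for lookups. This is a rough sketch, not the Wikibase JSON format: the dictionary keys and the `find_lexemes` helper are hypothetical, and only the values come from the example.

```python
def find_lexemes(lexemes, written):
    """Return every lexeme that lists `written` among its forms."""
    return [lexeme for lexeme in lexemes if written in lexeme["forms"]]


# One lexeme with a single lemma (ja), readings as statements,
# and every script variant listed as a form.
four = {
    "lemma": {"ja": "四"},
    "name_in_kana": [("よん", "kun-yomi"), ("し", "on-yomi")],
    "forms": ["四", "よん", "し", "ヨン", "シ"],
}

matches = find_lexemes([four], "ヨン")
```

With this layout a search for any kanji, hiragana or katakana spelling lands on the same lexeme, at the cost of no longer telling apart which form belongs to which reading unless the forms carry that information themselves.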

that's interesting. @AGutman-WMF: why is UniDic structured like this? And why isn't Wikidata? – Shisma (talk) 18:29, 29 June 2022 (UTC)

I'm not very knowledgeable about the UniDic representation, but from what I can gather it is especially geared towards written Japanese corpus linguistics. If the basic unit is understood to be the written form, then one may collapse different spoken lexemes to a single written lexeme, if they have the same orthography (here a Kanji character). However, if we take the basic units of language to be the spoken form, this makes less sense. AGutman-WMF (talk) 12:28, 30 June 2022 (UTC)

@AGutman-WMF It’s never been formalized which kind of form we are supposed to take as the master in Wikidata, has it? As far as I understand, the main applications will be textual in the foreseeable future.

Is it worth pondering whether some representation would be easier to handle considering the main uses of the data we can envision?

Interestingly, as far as I can tell there is actually very little phonological information on Wikidata.

My intuition is that as long as the information is linked correctly it does not matter much if we conflate several « lexemes » on the same page. But it seems important, for a structured project, that the structure is consistent. I don’t think we have much idea of how consumers would use the data to guide us, unfortunately … author TomT0m / talk page 14:13, 30 June 2022 (UTC)

"However, if we take the basic units of language to be the spoken form, this makes less sense."

which is what wikibase is designed to do, and what all other languages in wikidata currently do. Is it? I assume it would be unwise to make an exception for japanese. @Afaz, TomT0m, AGutman-WMF: Can we agree on that? – Shisma (talk) 17:15, 30 June 2022 (UTC)

@Shisma I'm not so sure about the practice adopted by each language on Wikidata, as it’s not really documented anywhere and there are more than 200 languages. The page Wikidata:Lexicographical data/Documentation even suggests there may be different ways of doing it: « In some cases or languages, there may be multiple entities for related words, in others just one. The below table provides an overview how they may be linked: » author TomT0m / talk page 17:46, 30 June 2022 (UTC)

On the other hand, there are only 42 languages (you can’t make this up) with more than 1000 lexemes and 359 with just 1 lexeme … (full list of lexemes by language) author TomT0m / talk page 19:25, 30 June 2022 (UTC)

@TomT0m, @Shisma

The main use case I'm aware of for Wikidata's lexicographical knowledge is for use in Abstract Wikipedia. Admittedly, this use-case is currently for generation of written language (though it may change in the future). Still, I would prefer organizing the lexemes according to spoken representations, since it is linguistically more sound. It is also easier to lump together related spoken lexemes (using some property) rather than splitting written lexemes into different spoken representations. Another possible use-case may be exporting lexemes from Wikidata to Wiktionary; the latter seems to represent both the two readings of the Kanji and the Kanji character itself as distinct entries (though they are of course interrelated). This also directs us into the direction of using distinct lexemes for each reading. AGutman-WMF (talk) 13:07, 1 July 2022 (UTC)

Not all wiktionaries, for example the japanese one has one entry for both yon and shi : https://ja.wiktionary.org/wiki/%E5%9B%9B#%E5%90%8D%E8%A9%9E author TomT0m / talk page 14:22, 1 July 2022 (UTC)

@AGutman-WMF: In any case, I think the point of the story is that it’s hazardous to assume all languages will follow the same organisation at this point.

Why : I’m not sure every contributor will be aware of the guidelines, the contributors communities may be too segmented by language, the guidelines are not very clear at this point and if they stay as comments in a talkpage in english it will for sure not be enough to ensure a strong coordination. Especially if the community starts to really grow.

Maybe we need at some point a more formal discussion like an RfC, involving as much language diversity as possible in the writing process? author TomT0m / talk page 14:28, 1 July 2022 (UTC)

Wiktionary's use case is not to be underestimated: there's a real need for structured representation of those wikis' contents, and a tantalizingly close solution in Wikidata's lexicographical data, the name of which suggests a focus on dictionary-making. The Wiktionary community's collective experience defining a wide variety of languages will be an asset to this project. Linguistic soundness is important, but so is some connection to longstanding convention. Minh Nguyễn ^💬 02:22, 2 July 2022 (UTC)

If there is a lesson to be learned from the wiktionaries, I think it’s that several entry points seem useful. Jawikt and enwikt both have pages for wikt:en:四 wikt:ja:四 wikt:en:よん wikt:ja:よん wikt:en:し wikt:ja:し.

How do we get to the data? From a search point of view, I guess the interface to get in is at least as important as the structure. Not much work has been done on this at the moment, I think; on the lexicographical data, it may co-evolve with the structuring of the data and help our decisions. author TomT0m / talk page 09:02, 2 July 2022 (UTC)

Since the Wiktionary creates a page for each word form, it makes no sense to map a page to a lexeme.

There are pages for "wikt:en:やっぱり" (yappari), "wikt:en:やっぱし" (yappashi), and "wikt:en:やっぱ" (yappa), but these are only variant forms of "wikt:en:やはり" (yahari). Afaz (talk) 16:04, 2 July 2022 (UTC)

One interesting thing with the current modelling is that all the lemmas are shown, for example when we use the {{L}} template, so in the Japanese case those who read kana can immediately get an idea of how it is said, even if incomplete. This is some help, as there is no chance to guess that from the kanji alone. author TomT0m / talk page 18:42, 29 June 2022 (UTC)

@Afaz: how does UniDic handle lexemes that have no kanji representation? – Shisma (talk) 18:56, 29 June 2022 (UTC)

UniDic's lemma are not limited to kanji. If only kana characters are available, they will be in kana characters. Afaz (talk) 14:32, 30 June 2022 (UTC)

This is a web service to search UniDic. https://cradle.ninjal.ac.jp/. All the forms 4 and IV are also grouped together under the lemma "四". Afaz (talk) 14:50, 30 June 2022 (UTC)

Maybe I misunderstand something: Yes, both have the lemma "四" but they seem to be two distinct entities: 四/よん and 四/し. Can you link to the entity that includes both readings? – Shisma (talk) 16:38, 4 July 2022 (UTC)

@Shisma There is a discussion above on whether words with different etymologies deserve different lexemes, so I’m not sure it’s a sufficient reason to have two lexemes. author TomT0m / talk page 17:52, 27 June 2022 (UTC)

Where is this discussion? – Shisma (talk) 18:16, 27 June 2022 (UTC)

See #Splitting of L1131 above. author TomT0m / talk page 18:19, 27 June 2022 (UTC)

Multiple Lexical categories per lexeme?

In English, German and Russian (and probably others), homograph lexemes can have different lexical categories. For instance, there are:

sound (L4695) (noun, the sound is wonderful)
sound (L510) (verb, this sounds wonderful)
sound (L27925) (adjective, it appears to be sound)

these are considered to be individual lexemes. They also come with their own set of forms:

sounds (plural of noun)
sounder (comparative of adjective)
sounded (simple past of verb)

This also occurs in Japanese. A good example might be:

また (L605) (conjunction)
また (L680885) (adverb)
又/また (L680886) (prefix)

(Please fix my translations in case they are off 😅) Only the prefix form actually usually uses the kanji representation 又 (I'm not sure what that means, please enlighten me 🙂). Now wikidata assumes that each lexeme should have only one lexical category and I guess that's alright 🤷. But it also collides with the assumption that all Japanese homographs should be treated as a single lexeme.

I'm sorry I had to create a subthread but I couldn't handle the indentations anymore 😭 – Shisma (talk) 16:18, 3 July 2022 (UTC)
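The collision mentioned above can be made visible with a small sketch that groups the listed lexemes by lemma and checks whether one lemma spans several lexical categories. The tuple layout and function name are illustrative assumptions; the ids and categories are the ones listed in this thread, and the kana spelling is used as the lemma for L680886.

```python
from collections import defaultdict


def categories_by_lemma(lexemes):
    """Map each lemma to a {lexical category: lexeme id} dictionary."""
    grouped = defaultdict(dict)
    for lexeme_id, lemma, category in lexemes:
        grouped[lemma][category] = lexeme_id
    return dict(grouped)


lexemes = [
    ("L4695", "sound", "noun"),
    ("L510", "sound", "verb"),
    ("L27925", "sound", "adjective"),
    ("L605", "また", "conjunction"),
    ("L680885", "また", "adverb"),
    ("L680886", "また", "prefix"),
]

grouped = categories_by_lemma(lexemes)
# Lemmas that could not be a single lexeme if each lexeme has one category.
conflicts = [lemma for lemma, cats in grouped.items() if len(cats) > 1]
```

Both "sound" and "また" end up in `conflicts`, which is the tension between one lexical category per lexeme and one lexeme per written form.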

Some senses have both « item for this sense » and « predicate for ». How are their values linked ?

I wanted to toy with that question, so here is the query that lists them and finds all predicates that link them: https://w.wiki/5Mhk

There are relatively few pairs of items that match these criteria on Wikidata so far.

Surprisingly, we find that walk (Q25443024) and walking (Q6537379) are currently unlinked on Wikidata.

Here is a list of the predicates used, thanks to listeriaBot :

This list is periodically updated by a bot. Manual changes to the list will be removed on the next update!

select (?pitem as ?item) ?property ?propertyLabel (count(?property) as ?count) {
  ?sens wdt:P5137 ?pratic ;
        wdt:P9970 ?action .
  ?lexeme ontolex:sense ?sens ;
          dct:language ?lang ;
          wikibase:lemma ?lemma .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  { ?pratic ?pred ?action . } union { ?action ?pred ?pratic }
  ?property wikibase:directClaim ?pred
  optional {
    ?property wdt:P1629 ?pitem .
  }
}
group by ?property ?pitem ?propertyLabel
order by desc(?count)

?item	?property	?propertyLabel	?count
cause	has cause has effect has immediate cause	has cause has effect has immediate cause	21 1
effect	has effect	has effect	21
Wikimedia duplicated page	permanent duplicated item	permanent duplicated item	3
Wikimedia permanent duplicate item	permanent duplicated item	permanent duplicated item	3
permanent duplicated item	permanent duplicated item	permanent duplicated item	3
subclass of	subclass of	subclass of	1
aspect	facet of	facet of	1
research	is the study of	is the study of	1
academic discipline	studied in	studied in	1
eponym	named after	named after	1
memorial society	named after	named after	1
namesakes	named after	named after	1
has parts of class	has part(s) of the class	has part(s) of the class	1

End of automatically generated list.

author TomT0m / talk page 17:36, 27 June 2022 (UTC)
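For readers less used to SPARQL, the core of the query above (find senses that carry both P5137 and P9970, then count the properties linking the two items in either direction) can be mimicked on in-memory data. This is only a sketch: the item ids, the triples and the function name are hypothetical, and it skips the label service and the P1629 lookup of the real query.

```python
from collections import Counter


def linking_properties(senses, triples):
    """Count properties that link a sense's P5137 item and P9970 item."""
    counts = Counter()
    for sense in senses:
        item = sense.get("P5137")    # item for this sense
        action = sense.get("P9970")  # predicate for
        if item is None or action is None:
            continue  # the query only keeps senses that have both
        for subject, prop, obj in triples:
            # Either direction counts, like the UNION in the query.
            if (subject, obj) in ((item, action), (action, item)):
                counts[prop] += 1
    return counts


# Hypothetical senses and item-to-item statements.
senses = [
    {"P5137": "Q_running", "P9970": "Q_run"},
    {"P5137": "Q_walking", "P9970": "Q_walk"},  # pair with no linking statement
    {"P5137": "Q_noun_only"},
]
triples = [
    ("Q_running", "has cause", "Q_run"),
    ("Q_run", "has effect", "Q_running"),
]

counts = linking_properties(senses, triples)
```

A pair such as the hypothetical walking/walk one contributes nothing to the counts, which is how unlinked pairs stay invisible in the table unless they are looked for separately.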

walk (Q25443024) and walking (Q6537379) should probably be merged. We need people who speak Czech to sort that out. In my opinion, predicate for (P9970) should only be used on verbs and item for this sense (P5137) should only be used on nouns. -- Finn Årup Nielsen (fnielsen) (talk) 20:04, 4 July 2022 (UTC)

What about go (Q5574688)? --Infovarius (talk) 09:42, 6 July 2022 (UTC)

go (Q5574688) is more like a « word/lexeme item »: it’s an article about the word, its conjugation, etymology and so on, and not really about its meaning, just as a lexeme has forms for the conjugation and plural, and statements for the etymology.

If there is a relationship between a lexeme and such a « word-item » it should be « item for this lexeme » as a top level statement. author TomT0m / talk page 11:36, 6 July 2022 (UTC)

Japanese する (suru)-form of nouns

For the lexeme challenge I created some lexemes that are listed by enwiki as verbs associated with some nouns, for example for division: https://en.wiktionary.org/wiki/%E9%99%A4%E7%AE%97 . I however have a doubt about whether I should have done that. It seems they rarely, if ever, have a page on jawikt, for example.

Is it just akin to transforming a noun like « division » into « doing a division » in English, for example, and as such does not really deserve a lexeme page, except in some particular places? Should they just be removed from enwiki?

If they do not deserve a lexeme on Wikidata, do they deserve a form on the noun-lexeme page? It would be weird to have a form that is not of the same grammatical class as the main lexeme, I think … author TomT0m / talk page 16:44, 23 June 2022 (UTC)

@TomT0m: I'd say those lexemes of the sort you question which exist in databases such as JMdictDB (and thus have at least one external identifier) or in dictionaries with definitions given in Japanese (such that they may be properly referenced via described by source (P1343) qualified with a page number) can stay. Mahir256 (talk) 17:36, 23 June 2022 (UTC)

@Mahir256 That doesn’t really answer my question, I guess. My understanding is that in other languages, French for example, we can sometimes change the kind of word by adding a suffix.

For example we can go from the adjective "possible" to the adverb "possiblement" by adding the suffix "ment". Or in English "true => truly". In Japanese there seems to be a similar mechanism: "本当" can mean "true", and "truly" can be said "本当に", by adding a "ni" particle. Whereas in French or in English it seems clear that both forms have their pages, it does not seem to be the case in Japanese. My point is that some of these adverbs, like « 上手に », do not seem to have an entry in, for example, JMdictDB and by your rule should not be lexemes, but seem to be used in real expressions. That would not occur in languages like French or English, I guess … as far as I know, a similar pair of words such as « habile / habilement » in French always both have entries in French dictionaries.

Of course it’s different in Japanese, but dictionaries in Japanese usually seem to consider that the added particles do not account for a new lexeme and are just forms.

The problem is, on Wikidata I think there is only one main grammatical category associated with a lexeme. As I said, in that model it seems weird that a form has a different category than the main lexeme.

How can we account for these differences, which seem mainly cultural and not totally « linguistic », in Wikidata? Is it really a problem to have a lexeme for them? author TomT0m / talk page 19:30, 23 June 2022 (UTC)

we currently have

本当/ほんとう (L680773): Lexical category: na-adjective (Q1091269)
本当に/ほんとうに (L671856): Lexical category: adverb (Q380057)

and

元気/げんき (L2454): Lexical category: na-adjective (Q1091269)
元気に/げんきに (L2454-F4): Grammatical features: conjunctive form (Q2888577)

For some reason, this is even reflected in JMdictDB. My intuition is that nouns and adjectives should always be separate lexemes, maybe because in my language they are easy to tell apart. But in Japanese the line seems to be blurry 🤷. Maybe again @AGutman-WMF: can help us here? – Shisma (talk) 13:35, 9 July 2022 (UTC)

In many languages there is no clear distinction between nouns and adjectives, in that adjectives can often be used as nouns, e.g. Hebrew חכם/חָכָם (L65269) and (L210912) are in fact the same word, which can be used either as the adjective "wise" or as the noun "wise man". It seems that in Japanese there is a whole class of such words, in which the distinction is only marked by means of the syntactic particle -na, so it is reasonable to collapse the adjective and noun lexemes together. As for the adverbial forms, insofar as they are completely productive and regular, there is no need to list them at all. However, if there is some idiosyncrasy, I would list them. AGutman-WMF (talk) 09:22, 11 July 2022 (UTC)

Looking for input re: a sense specific to a plural form

I've started adding lexemes related to street furniture in Punjabi, and have run into some senses which only apply to the plural form of a word with multiple senses. For example, Lexeme:L680584 can mean light source(s), generically. The plural form can be used to specifically describe traffic lights. What would make more sense?:

sense for traffic lights, with subject form linking to the plural
separate "plurale tantum" lexeme with the traffic lights sense

I am leaning to the former, in which case I am also wondering if "plurale tantum" can be added as a statement to the specific sense rather than at the lexeme level.

Any thoughts are appreciated --Middle river exports (talk) 15:36, 3 July 2022 (UTC)

I see similar situations in Russian. I am too lazy to create a separate lexeme, so I just mark the specific sense as "pluralia tantum". But the separate-lexeme approach is probably cleaner. --Infovarius (talk) 18:59, 4 July 2022 (UTC)

I'm in favour of the first proposition, separate lexemes will just create redundant data and a lot of problems. Cheers, VIGNERON (talk) 13:59, 9 July 2022 (UTC)

Splitting of L1131

Hi y'all,

Looking at the example above, do we all agree that key (L1131) should be split, as (at least) L1131-S5 belongs to a different lexeme? @Mxn: WDYT? VIGNERON (talk) 11:08, 19 June 2022 (UTC)

Yes it looks like it - wikt:key has a separate etymology for that sense. ArthurPSmith (talk) 14:12, 20 June 2022 (UTC)

I was not aware we were supposed to split lexemes by etymology. At first thought it seems like a potentially very big headache, as we don’t always really know the etymology for each sense? And splitting items is not an easy task, as you have to move the references to the senses/forms as well, since they have no identifiers of their own … author TomT0m / talk page 16:36, 20 June 2022 (UTC)

Yeah, unfortunately gadget Move doesn't support moving senses and forms yet (and my plea was left without answer...). --Infovarius (talk) 20:01, 20 June 2022 (UTC)

@TomT0m: There's no need to split if etymology is not known, but if you have two distinct ones then how do you represent that in our current lexeme model? The etymology is at the top level of the lexeme, but if you have several and each applies only to some senses then that gets complicated. Better to split I think. ArthurPSmith (talk) 13:52, 23 June 2022 (UTC)

@ArthurPSmith: thanks, I'll create a new item for the split soon (S5 and F3-F4).

@TomT0m: you're absolutely not « supposed to split lexemes by etymology »; etymology is just a clue that there are two lexemes here and not one (counter-example: in some rare cases, one lexeme can have two etymologies, and in such a case we obviously don't split). But yes, this could be a headache, both lexicographically (identifying two lexemes with the same lemma, lexical category and language is not always easy) and technically (see Infovarius' idea). That said, they do have identifiers: L1131-S5, L1131-F3 and L1131-F4 here.

@Infovarius: good idea, I supported your request.

Cheers, VIGNERON (talk) 21:24, 22 June 2022 (UTC)

@VIGNERON: how about a possibility to create an (almost) duplicate Lexeme (when most of the content is similar)? Please discuss here (or here). --Infovarius (talk) 07:30, 23 June 2022 (UTC)

@ArthurPSmith, TomT0m, Infovarius: It's done (painfully manually), the second lexeme is key (L684194). I also re-created the USPS abbreviation (Q30619513); it's not optimal, but until we have a solution it's better than nothing, and at least the information is not lost. Cheers, VIGNERON (talk) 09:42, 14 July 2022 (UTC)
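
The sub-entity identifiers mentioned in this thread (L1131-S5, L1131-F3, L1131-F4) follow a regular pattern, so a script preparing such a split could at least parse them and group them by source lexeme. The following is only a minimal sketch under that assumption; the function names `parse_lexeme_entity_id` and `plan_split` are made up for illustration, and the sketch does not touch Wikidata or move anything.

```python
import re

# A lexeme ID, optionally followed by a sense (S) or form (F) suffix,
# as in the IDs quoted in the thread: L1131, L1131-S5, L1131-F3.
_ENTITY_ID = re.compile(r"^(L[1-9]\d*)(?:-([SF])([1-9]\d*))?$")


def parse_lexeme_entity_id(entity_id):
    """Split an ID such as 'L1131-S5' into (lexeme_id, kind, index)."""
    match = _ENTITY_ID.match(entity_id)
    if match is None:
        raise ValueError(f"not a lexeme, sense or form ID: {entity_id!r}")
    lexeme_id, letter, number = match.groups()
    if letter is None:
        return lexeme_id, "lexeme", None
    kind = "sense" if letter == "S" else "form"
    return lexeme_id, kind, int(number)


def plan_split(sub_entity_ids):
    """Group the senses and forms to move by the lexeme they belong to."""
    plan = {}
    for entity_id in sub_entity_ids:
        lexeme_id, kind, _ = parse_lexeme_entity_id(entity_id)
        if kind == "lexeme":
            raise ValueError("expected sense or form IDs only")
        plan.setdefault(lexeme_id, []).append(entity_id)
    return plan
```

Calling `plan_split(["L1131-S5", "L1131-F3", "L1131-F4"])` groups all three under `"L1131"`; the actual move of statements and references would still have to be done by hand or by a tool that supports it.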

Cleaning of Latin lexemes

Hi,

I talked about it a few times but we really need to clean the lexemes in Latin (Latin (Q397), Ordia); there are a lot of things to do, but here is a start with some easy suggestions:

remove redundant/unnecessary grammatical features (BTW, I think this general rule applies to all lexemes)
- the lexical category when it's already as a lexical category
- the grammatical gender when it's already in grammatical gender (P5185)
- (maybe others that I missed?)
move ablative in Latin (Q4668057) to ablative case (Q156986)
remove all the la-x-Q533803 representations and add the root (part of the string before the dot) of these representations as word stem (P5187) (the root is generally unique; if not, then add a qualifier to the corresponding forms)
move all the statements described by source (P1343) = described by source (P1343) from form level to lexeme level

To give a clearer view, I did it by hand on bellum (L260469): before and after.

What do you all think? Is it okay, do you see anything more to clean or things to clean differently?

PS: after this 'easy' part (in the sense that it can be automated), there will be a lot more work to do, like checking and correcting all the strange things from Whitaker (e.g. belle as L:L260469#F4 and F5?, where the regular form is bella). VIGNERON (talk) 12:19, 19 June 2022 (UTC)

Support Looks good to me. ArthurPSmith (talk) 14:15, 20 June 2022 (UTC)

Support, I basically imported the data close to the source (Whitaker), I accept that some of the data is not needed or not accurate in the context of Wikidata. Uziel302 (talk) 19:20, 13 July 2022 (UTC)
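
The la-x-Q533803 step of the proposal above (drop that representation and keep the part before the dot as the stem) can be sketched on plain data. This is only an illustration: the helper names are invented, the forms are assumed to be simple dictionaries mapping language codes to strings, and the sample strings in the usage note are hypothetical rather than copied from any real lexeme.

```python
WHITAKER_CODE = "la-x-Q533803"  # ad hoc language code named in the proposal


def split_whitaker_representation(representations):
    """Return (cleaned_representations, stem) for one form.

    `representations` maps language codes to strings. The stem is the part
    of the la-x-Q533803 string before the first dot, or None when that
    representation is missing or contains no dot.
    """
    cleaned = {code: text for code, text in representations.items()
               if code != WHITAKER_CODE}
    raw = representations.get(WHITAKER_CODE)
    stem = None
    if raw is not None and "." in raw:
        stem = raw.split(".", 1)[0]
    return cleaned, stem


def stems_for_lexeme(forms):
    """Collect the stem of each form and report whether exactly one
    distinct stem was found across the whole lexeme."""
    stems = {}
    for form_id, representations in forms.items():
        _, stem = split_whitaker_representation(representations)
        if stem is not None:
            stems[form_id] = stem
    return stems, len(set(stems.values())) == 1
```

With hypothetical input such as `{"F1": {"la": "bellum", "la-x-Q533803": "bell.um"}}`, the first form keeps only its `la` representation and yields the stem `bell`. When `stems_for_lexeme` reports more than one distinct stem, the per-form mapping is what would be needed to qualify the stem statements by form, as the proposal suggests.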

Multifaceted language variants in representations

I've been using ad hoc language codes to represent language variants of Vietnamese, for which Wikibase lacks recognized codes. For example, in hủy bỏ/huỷ bỏ (L679211), vi-x-Q55856374 "bánh mỳ" is for Northern Vietnamese (Q55856374) and vi-x-Q11994045 "bánh-mì" is for syllabification (Q11994045), namely the dated practice of spelling compound words with hyphens instead of spaces. But I would also like to indicate that "bánh-mỳ" is the result of combining the two. There are other combinations too, such as final-vowel tone mark placement (Q112681980) with syllabification (Q11994045). Unfortunately, it doesn't seem possible to string together two QIDs in a language code, as phab:T236593#5610378 seemed to suggest was the original intent. How do folks currently work around this limitation? Create items for combinations of variants? Minh Nguyễn ^💬 20:41, 28 June 2022 (UTC)

@Mxn: I'm pretty sure I didn't understand the specifics, but wouldn't a general solution be to use statements instead of trying to put everything in the language code? Also, wouldn't hyphenation (P5279) be useful here? Cheers, VIGNERON (talk) 09:48, 14 July 2022 (UTC)

@VIGNERON: Sorry I missed your ping until now. That would be the general solution, but unfortunately it isn't possible for statements or qualifiers to qualify a lemma or a representation of a form. Minh Nguyễn ^💬 05:42, 23 September 2022 (UTC)
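
To make the limitation concrete: an ad hoc code of the kind used above (vi-x-Q55856374) carries exactly one QID, so a combination of variants can only be expressed if an item exists for the combination itself. Below is a minimal sketch of that workaround, which the original question raises only as a possibility. The helper names and the `combination_items` lookup table are hypothetical; nothing here reflects how Wikibase validates language codes.

```python
import re

# Base language code, the private-use marker "x", then a single QID.
_AD_HOC = re.compile(r"^([a-z]{2,3})-x-(Q[1-9]\d*)$")


def parse_ad_hoc_code(code):
    """Return (base_language, qid); qid is None for a code without a QID."""
    match = _AD_HOC.match(code)
    if match is None:
        return code, None
    return match.group(1), match.group(2)


def code_for_variants(base_language, variant_qids, combination_items):
    """Build an ad hoc code for one or more variant items.

    A code can hold only one QID, so a combination of variants is looked up
    in `combination_items`, a caller-supplied mapping from a frozenset of
    variant QIDs to the QID of an item standing for that combination.
    """
    variants = frozenset(variant_qids)
    if not variants:
        return base_language
    if len(variants) == 1:
        (qid,) = variants
    else:
        qid = combination_items.get(variants)
        if qid is None:
            raise LookupError(f"no item for the combination {sorted(variants)}")
    return f"{base_language}-x-{qid}"
```

A single variant maps straight to its code, e.g. `code_for_variants("vi", ["Q11994045"], {})` gives `"vi-x-Q11994045"`, while two variants raise `LookupError` until a combination item has been registered in the table.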

New Lexeme creation page available for testing

Tracked in Phabricator
Task T313113

Tracked in Phabricator
Task T313166

Tracked in Phabricator
Task T195469

Hi everyone,

The lexicographical data part of Wikidata is still in need of some love. Over the last few weeks we have worked on this in 2 areas. The first one was Lua access to Lexemes. We have rolled this out and all Wikimedia projects can now access not just Item data but also Lexemes. See the announcement for more details. The second one is coming today. We have reworked the Special:NewLexeme page. Lexicographical data is still hard to understand for people not familiar with lexicography. The new Special:NewLexeme page has a number of tweaks that we hope will make it more understandable and easier to use. This includes an information panel that gives a bit of context about what Lexemes are as well as a lexical category selector that ranks appropriate Items higher. (Better ranking of the Items in the language selector will come soon as well.) Additionally we have put the page on a better technical base.

The information panel on the new special page includes an example Lexeme, which uses live data of a real Lexeme on Wikidata. That Lexeme is selected by the wikibaselexeme-newlexeme-info-panel-example-lexeme-id interface message in the current user interface language. The idea is that you can override this message on Wikidata to select suitable example Lexemes for various languages (e.g., set the German version of the message to the ID of some suitable German example Lexeme).

Today we’d love for you to test the new page and give feedback.

Test it: Special:NewLexemeAlpha

This page will be there in parallel to the current version during this testing period. It creates proper Lexemes and you can use it for your regular Lexeme creation work. We currently plan to replace Special:NewLexeme with this new version on August 3rd. At that point we also plan to turn off the temporary Special:NewLexemeAlpha page.

If you have feedback or questions please let us know here. Additionally Lydia is looking for a few people for some short calls to get individual feedback from you. If you are up for that please let me know and we’ll schedule something.

Cheers, -Mohammed Sadat (WMDE) (talk) 13:43, 14 July 2022 (UTC)

The "Lemma" and "Lexeme's language" fields should show an example in the user's interface language. I have no idea what "ama" means or whether it's a full word, since the field says it's the "base word" and, as a beginner, I don't know if that means a full word or not. Lectrician1 (talk) 14:09, 14 July 2022 (UTC)

@Lectrician1: I agree with you, but from what I understand this is exactly what is stated in the message above! Could someone change MediaWiki:Wikibaselexeme-newlexeme-info-panel-example-lexeme-id/br to L62? And for other languages, I guess the smallest Lid in the language is probably a good choice for a start (caveat: languages of the interface are not exactly the same as languages of the lexemes). Cheers, VIGNERON (talk) 16:22, 14 July 2022 (UTC)

Yes exactly! That's what we meant with that in the announcement. For English we recommend L344. Lydia Pintscher (WMDE) (talk) 16:23, 14 July 2022 (UTC)

I created edit requests for the English and German version of the message (en talk, de talk). @VIGNERON, I suggest you create an edit request for the Breton version of the message as well, so it shows up in the tracking category (though I’m not sure how many people watch that category, to be honest). Lucas Werkmeister (talk) 15:54, 18 July 2022 (UTC)

Why not use translatewiki.net? Afaz (talk) 20:21, 18 July 2022 (UTC)

I can't seem to find much difference between this and the current version. Also, the absence of the * in the fields makes it look like filling those fields is optional.

A gist about what lexemes are is also a plus. Musahfm (talk) 14:56, 14 July 2022 (UTC)

@Musahfm: I see a lot of small but great improvements. Agreed on the mandatory field indication (both the *, which I never saw, and the red highlight when you leave a field empty). @Mohammed Sadat (WMDE): is this absent because of the test? If not, could it be added? And yes, a definition of what a lexeme is could be useful, but so far we don't have one (that's the downside of lexicographers: most of us already know what it is but can't really define it). Cdlt, VIGNERON (talk) 16:22, 14 July 2022 (UTC)

@Musahfm, VIGNERON, Thanks for letting us know that the mandatory field indicators are useful for editors. I created a ticket so we can add them. Regarding a definition for Lexemes, can the community come up with one? -Mohammed Sadat (WMDE) (talk) 08:03, 15 July 2022 (UTC)

Why is it using a different font from the rest of the site?

The help link for spelling variants should not point to Help:Monolingual text languages. That's nothing to do with lexeme languages and won't help people understand what a spelling variant is.

When the spelling variant field is shown, it sometimes shows a warning saying "This Item has an unrecognized language code is. Please select one below."... and sometimes doesn't. (I know what's going on, because I know the internals of how it's implemented, but I don't expect it to make any sense to most people)

- Nikki (talk) 18:16, 14 July 2022 (UTC)

I filed phab:T313166 for the font issue.

The link we can definitely change. Do we have a good page to point it to?

For the spelling variant issue: Can you give me an example code please? Thanks! --Lydia Pintscher (WMDE) (talk) 17:32, 16 July 2022 (UTC)

I can't see one needed feature: checking whether the lexeme already exists. It would be super useful (especially for newer editors) to know if there is already such a lexeme.
Could "real" languages (e.g. with P31=Q34770 or P31/P279* =Q34770 if possible) be preferred in the drop-down list? E.g. one needs to enter at least 5 Cyrillic characters to select Q7737. --Infovarius (talk) 14:35, 15 July 2022 (UTC)

Support on the drop-down list, maybe it should filter on anything that has one of the ISO language code properties (P218, P219, etc.)? ArthurPSmith (talk) 16:20, 15 July 2022 (UTC)

Yes, languages will be prioritized soon. This should happen with one of the next deployments, as Mohammed wrote in the initial announcement.

As for checking for existence: Noted. We have phab:T195469 for it but we were not able to include it yet. I hope we can get to it in the next iteration. --Lydia Pintscher (WMDE) (talk) 17:32, 16 July 2022 (UTC)
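
The drop-down suggestion made earlier in this thread (prefer items that carry one of the ISO language code properties, P218, P219, etc.) can be illustrated on plain data. This is a sketch of the idea only, not of how Special:NewLexeme ranks or will rank its results; the candidate structure (a dict with an `id` and a set of property IDs under `claims`) and the function name are assumptions made for the example.

```python
# The two properties named explicitly in the thread; the "etc." there
# suggests the real list would be longer.
ISO_CODE_PROPERTIES = ("P218", "P219")


def rank_language_candidates(candidates):
    """Order search results so items with an ISO language code come first.

    Each candidate is a dict with an 'id' and a 'claims' collection of
    property IDs. sorted() is stable, so the original search order is kept
    within each of the two groups.
    """
    def has_iso_code(candidate):
        return any(prop in candidate["claims"] for prop in ISO_CODE_PROPERTIES)

    return sorted(candidates, key=lambda candidate: not has_iso_code(candidate))
```

Because the sort only moves items with an ISO code ahead of the rest, it changes nothing about which items are found, only how early a language such as Q7737 would show up in the list.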

