Wikidata talk:Lexicographical data/Archive/2021/06

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Error when trying to add a lexeme: Gloss language codes must be valid

Hi, I wanted to add some lexemes in languages of Indonesia. I tried to add Lexeme:L497375 as "bke", but it returns "Gloss language codes must be valid". The same happens with "liw": "Unknown language "liw" for term "air" in field "lemmas" at "liw"." How can I enter those language codes? Bennylin (talk) 20:09, 2 June 2021 (UTC)

Error when trying to merge

I tried to merge janggut/jenggot (L303173) and L303174, but was unable to do so. How should I merge them? Bennylin (talk) 23:27, 3 June 2021 (UTC)

@Bennylin: there is a specific constraint when merging lexemes: you can't have two lemmata with the same language. Before merging, you need to change one language tag (to id-x-QXXX for instance, with the relevant QXXX obviously, like you did on kucing (L498558)). Cheers, VIGNERON (talk) 11:37, 4 June 2021 (UTC)
OK, thanks for the answer! Bennylin (talk) 13:13, 4 June 2021 (UTC)
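The check described in this thread can be sketched in a few lines of Python. This is only an illustration: `conflicting_lemma_tags` is a hypothetical helper and not part of any Wikibase API, the lemma data is made up, and the sketch assumes that a tag shared by both lexemes only blocks the merge when the spellings differ.

```python
# Hypothetical helper, not part of any Wikibase API: lemmas are modeled as
# a plain dict mapping a language tag to the lemma spelling.

def conflicting_lemma_tags(lemmas_a, lemmas_b):
    """Return the language tags that both lexemes use with different spellings."""
    shared = set(lemmas_a) & set(lemmas_b)
    return sorted(tag for tag in shared if lemmas_a[tag] != lemmas_b[tag])


# Illustrative data only; the real lemmas of L303173 and L303174 may differ.
lexeme_a = {"id": "janggut"}
lexeme_b = {"id": "jenggot"}
print(conflicting_lemma_tags(lexeme_a, lexeme_b))  # ['id']

# Retagging one spelling with a private-use tag (QXXX stays a placeholder).
lexeme_b_retagged = {"id-x-QXXX": "jenggot"}
print(conflicting_lemma_tags(lexeme_a, lexeme_b_retagged))  # []
```

An empty result means that, under the assumption above, no lemma tag stands in the way of the merge.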

Need help with query

Need help from the experts here. I have the query below; if I don't want to display the alternative spellings (e.g. L498556 and L498558), how can I do that?

The following query uses these:

  • Items: Indonesian (Q9240)
    SELECT ?lexeme ?lemma ?category ?categoryLabel WHERE {
      ?lexeme dct:language wd:Q9240; 
              wikibase:lemma ?lemma;
              wikibase:lexicalCategory ?category;
              wikibase:lemma [].
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],ml". }
    }
    ORDER BY ?categoryLabel ?lemma
    

Bennylin (talk) 21:00, 5 June 2021 (UTC)

@Bennylin: Hi, I think you can filter on the language of the lemma:

The following query uses these:

  • Items: Indonesian (Q9240)
    SELECT ?lexeme ?lemma ?category ?categoryLabel WHERE {
      ?lexeme dct:language wd:Q9240; 
              wikibase:lemma ?lemma;
              wikibase:lexicalCategory ?category;
              wikibase:lemma [].
      FILTER(LANG(?lemma) = "id")
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],ml". }
    }
    ORDER BY ?categoryLabel ?lemma
    

Lepticed7 (talk) 23:21, 5 June 2021 (UTC)

Amazing! Thanks, Lepticed7! Bennylin (talk) 16:00, 6 June 2021 (UTC)
@Bennylin: you could also group the alternative spellings together on the same line, like this:
SELECT ?lexeme ?category ?categoryLabel (GROUP_CONCAT(?lemma ; separator=" / ") as ?lemmata) WHERE {
  ?lexeme dct:language wd:Q9240; 
          wikibase:lemma ?lemma;
          wikibase:lexicalCategory ?category.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],ml". }
}
GROUP BY ?lexeme ?category ?categoryLabel
Cheers, VIGNERON (talk) 15:24, 8 June 2021 (UTC)
@VIGNERON: Isn't the final "wikibase:lemma []" redundant (and doesn't it lead to doubling of entries when there's more than one spelling)? ArthurPSmith (talk) 17:09, 8 June 2021 (UTC)
@ArthurPSmith: indeed, it's not needed and it doubles the entries; corrected above. My bad, I reused the earlier example, where it is also redundant but didn't cause any problem; the grouping is what makes this problem surface. Cheers, VIGNERON (talk) 17:40, 8 June 2021 (UTC)
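The doubling mentioned above follows from how triple patterns are joined: each pattern matches every lemma independently, so a second lemma pattern multiplies the solutions rather than filtering them. The Python sketch below only simulates that behaviour for a single lexeme; `rows_for_lexeme` is a hypothetical function, the spellings "a" and "b" are placeholders, and a real SPARQL engine does not guarantee the row order shown here.

```python
from itertools import product


def rows_for_lexeme(lemmas, redundant_pattern):
    """Simulate the ?lemma values returned for one lexeme.

    With the extra `wikibase:lemma []` pattern, every ?lemma binding is
    paired with every binding of the anonymous node, so rows are repeated.
    """
    if redundant_pattern:
        # ?lemma binding x anonymous [] binding; only ?lemma is projected.
        return [lemma for lemma, _anon in product(lemmas, lemmas)]
    return list(lemmas)


two_spellings = ["a", "b"]  # placeholder spellings of one lexeme
print(" / ".join(rows_for_lexeme(two_spellings, redundant_pattern=True)))
# a / a / b / b
print(" / ".join(rows_for_lexeme(two_spellings, redundant_pattern=False)))
# a / b
```

With a single spelling the extra pattern adds nothing, which is consistent with the problem only becoming visible for lexemes with several lemmata once the values are concatenated.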

Gender of L502814

Hi y'all,

I stumbled on a weird case for the gender of Lexeme:L502814 (thanks to Fralambert).

This word has two variants, one common with a "k", one rare with a "c" (so far, nothing exceptional; some people say it should be two lexemes, but I think the consensus is to put them on the same lexeme; also, I see that rare form (Q55094451) is sometimes used in instance of (P31) and sometimes used as a grammatical feature…).

Then, the main variant is masculine, again no problem.

But the rare variant is said to be masculine according to one dictionary and feminine according to another (both dictionaries being generally trustworthy).

I'm not sure how to best model this situation. For the moment, I settled for this model: https://www.wikidata.org/w/index.php?title=Lexeme:L502814&oldid=1440129406

What do you think?

Cheers, VIGNERON (talk) 15:25, 12 June 2021 (UTC)

@VIGNERON: Qualifier subject form (P5830) is intended to be used only with usage example (P5831). I have a better solution. I have a whole list of nouns with ambiguous gender in Slovak. Some of these lexemes already have forms. As you can see in džínsy (L460043) and čary (L449500), I just add gender as grammatical feature the same way it's done in adjectives. Downstream dictionaries are expected to render two separate form tables, one for each gender. Another approach, which will be used for Slovak masculine nouns with both animate and inanimate senses (to be merged when in sense qualifier is available), is to add animate (Q51927507) and inanimate (Q51927539) as grammatical features to differing forms. In this case, gender in grammatical features will not match any of the lexeme genders, but anyone (or any code) with knowledge of Slovak language will understand it. — Robert Važan (talk) 09:47, 13 June 2021 (UTC)
@Robert Važan: I thought about something similar to your idea (grammatical feature is indeed an obvious default choice) and I kind of like it, but I see several flaws and problems.
First, this is not explicit and would confuse most users (and almost all bots), except for the small percentage who understand the language (even English, the most widely understood language, is understood by a small number of people: less than 20% of the world population ;) ).
Then, this is not referenced (which is already bad in itself, everything should be referenced) and grammatical features can't be referenced. Here, since the references disagree and contradict each other, it's even more important to include them.
Finally, your examples are not usual but not that unusual either. The case here is more exotic. It's not just the lexeme having two genders, it's one form having two genders (unlike your example, where the lexeme has two genders but each form has only one).
Since it's a test lexeme, I added the genders as grammatical features (see L:L502814#F1), but it looks very weird and is probably not understandable even for most French speakers.
PS: for the first part, good point but this is a separate problem, I started a separate discussion on Property talk:P5830.
Cheers, VIGNERON (talk) 12:14, 13 June 2021 (UTC)
@VIGNERON: Note that grammatical features are combined by intersection. Intersection of masculine and feminine gender is empty. There should be two forms, one masculine, one feminine, with the same representation.
In Wikidata, some facts are stored while others are inferred. Storing all inferred facts is impossible, because there are too many of them. Inference rules are inevitably language-specific, which means that all bots currently have to be language-aware. Wikifunctions will be able to express inference rules, which will allow development of language-agnostic bots and user interfaces once inference rules for all languages are added to Wikifunctions.
As for references, isn't it sufficient to place them on the grammatical gender (P5185) statement? You could also add described by source (P1343) statements on the forms. — Robert Važan (talk) 12:42, 13 June 2021 (UTC)
@Robert Važan: is it always an intersection? Is it documented somewhere? (I was looking for this when working on Breton, where there are strange things like "plural of plural", but I couldn't find it.) Plus, masculine and feminine are not really exclusive in French; a form can be both (all professions ending in -logue, for instance). We could enter the same form twice, once with masculine and once with feminine, but this feels like unnecessary duplication and is not grammatically correct: there is not one masculine form and one feminine form, there is only one form, identical in feminine and masculine. Putting this information, which is true for all forms, at the lexeme level in grammatical gender (P5185) seems simpler, more logical and better. No?
Not sure why you talked about inference, I see none here (and indeed I think we should probably remove at least a billion pieces of "inferred data" from Wikidata, but this is another subject). Here, I'm presenting an example of basic data: this word is both masculine and feminine according to different sources; how should we best store this basic fact?
Cheers, VIGNERON (talk) 13:08, 13 June 2021 (UTC)
@VIGNERON: I talked about inference, because you are trying to choose representation of facts based on how easy they will be to understand to random users. Those random users however wouldn't be looking at raw stored facts. They will be viewing inferred facts (e.g. a nicely formatted form table), which will be much easier to understand. Ditto for bots. Raw stored facts will be seen only by editors, usually native speakers, and complicated cases like this will be exclusively handled by experienced editors. Those experienced editors will certainly comprehend gender in grammatical features.
I take back what I said about intersection of grammatical features. Do whatever works best for the language. Bots can be made smart enough to assume union wherever intersection would be empty.
I do however still recommend use of grammatical features in preference to qualifiers. Both approaches are logically correct, but grammatical features are easier to query and usually easier to edit. I am speaking from experience with heavily inflected language. A list of 12+ qualifiers (forms of Slovak nouns) would be unwieldy. The case for grammatical features might not be so strong in less inflected languages, so do whatever seems best for the language. — Robert Važan (talk) 15:11, 13 June 2021 (UTC)
@VIGNERON: PS: A form is unique not just when it has unique representation, but also when it has unique set of statements (attested in (P5323) comes to mind) or when it needs to have an ID to be referenced in statements. — Robert Važan (talk) 15:17, 13 June 2021 (UTC)
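The two readings of grammatical features debated in this thread (intersection across categories, union when one category carries several values) can be made concrete with a small Python sketch. It is an assumption-laden illustration rather than how any existing bot works: the `CATEGORY` table and the `expand_features` function are hypothetical, and plain English labels stand in for the Q-items.

```python
from itertools import product

# Hypothetical mapping from a feature label to its grammatical category.
CATEGORY = {
    "masculine": "gender",
    "feminine": "gender",
    "singular": "number",
    "plural": "number",
}


def expand_features(features):
    """Expand one form's features into the feature combinations it covers.

    Values of different categories are combined (intersection); several
    values of the same category are read as alternatives (union).
    """
    by_category = {}
    for feature in sorted(features):
        by_category.setdefault(CATEGORY[feature], []).append(feature)
    categories = sorted(by_category)
    return [tuple(combo) for combo in product(*(by_category[c] for c in categories))]


print(expand_features({"masculine", "feminine", "singular"}))
# [('feminine', 'singular'), ('masculine', 'singular')]
```

Under this reading, a single form tagged with both genders would be rendered once in each gender's form table, without storing the form twice.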

Are senses and forms subsets of lexemes?

This is a somewhat abstract ontological question, but it has a profound impact on how statements are structured. I am sorry for the long post, but this needs some explaining to even understand the question.

I have proposed an "in sense" qualifier, arguing that a statement like "the sense is an instance of mass noun" sounds wrong, but is it really wrong? If lexeme is a unit of meaning and sense is also a unit of meaning, could senses be just subsets of lexemes? If senses are subsets of lexemes, does that mean any lexeme property could be in principle used on senses and vice versa? Could we meaningfully say that "the sense is countable" or "the lexeme is pejorative"?

If lexemes and senses are sets, what are members of those sets? Lexeme can be extensionally defined as a set of all occurrences of a word or phrase (in text, speech, etc.). Both senses and forms can be seen as subsets of those occurrences that meet certain criteria. This way the subset relationship seems reasonable. It feels odd, because intensional definition (e.g. forms defined by representation and grammatical features) makes lexemes, senses, and forms look like fundamentally different concepts. So if, from extensional viewpoint, we can say that lexemes, senses, and forms are all sets of occurrences, does that mean they are in principle interchangeable in statements? Could we say that "the sense is plurale tantum" or "the form has neuter gender" or "the lexeme is archaic" or even "the two lexemes are mutual translations"?

The extensional viewpoint can be taken further. Senses can be grouped via subsenses. Forms can be grouped by grammatical features or spelling and useful statements can be said about such groups, e.g. "comparative forms are rare" or "the spelling is non-standard since 1960". Qualifiers can be used to create statements about set intersections (e.g. "translation of present participle form of this sense of the verb") and inferred sets (e.g. "future tense is rare" where the future tense is an inferred compound form).

But before this apparent expressiveness can be exploited, we need to answer the basic question: are senses and forms subsets of lexemes? If that is not the case now, should they be treated as such for the sake of expressiveness and convenience? Would it break anything?

@VIGNERON, Fnielsen, Jon Harald Søby: This issue has impact on lexemes you have created.

@Lea Lacroix (WMDE), Lydia Pintscher (WMDE): This is more fundamental than usual content discussion. You might want to chime in. Fully realizing this idea might require code changes (notably property datatype that is a supertype of lexeme, form, and sense).

Robert Važan (talk) 00:24, 14 June 2021 (UTC)

Interesting philosophical question. I'll let others answer, but I can already correct one point and give my point of view.
« If lexeme is a unit of meaning »: here in Wikidata (different people use this word with different definitions), a « lexeme is a lexical unit » (including but not limited to meaning; it obviously also includes morphology and forms, etymology, pronunciation, etc. In a nutshell: any lexical data).
Yes, we could say "the sense is plurale tantum" (a bit strange and shortened; I would rather say "this sense only uses plurale tantum forms") or "the form has neuter gender" (I may be missing something, but yes, obviously: gender is a very common grammatical feature of a form) or "the lexeme is archaic" (often it is forms that are archaic, but a lexeme in itself can be archaic, and a sense could be too) or even "the two lexemes are mutual translations" (senses are translatable; lexemes can be considered translatable if and only if all their senses are mutual translations).
If I understand correctly, you are trying to move things that belong at the lexeme level out of the lexeme level (cf. the discussion above).
Cheers, VIGNERON (talk) 07:15, 14 June 2021 (UTC)
I think perhaps what Robert Važan is expressing here is that our data model for lexemes does not quite match the way language works. Not that anybody has a clear understanding of how language actually works anyway... I've been working on English proper nouns recently, and have run into various perplexing things that I don't know how to address. en:Proper and common nouns admits that "The detailed definition of the term is problematic". When they are plural are they still proper nouns (generally a pluralized proper noun refers to a group or class of entities that have some relation to the main identified entity; perhaps just sharing the same name)? Demonyms like Filipino are regarded as common nouns, but the same lexeme (?) is often a language which is a proper noun. So should that be two lexemes or one? If one, can we apply the grammatical categories at the sense level rather than the top level? What about when one sense only applies to the plural form, etc.? ArthurPSmith (talk) 15:10, 14 June 2021 (UTC)
@ArthurPSmith: This is an interesting example. Different lexical category usually means different lexeme even if etymology is the same. If however lexical category is just noun (Q1084) and subcategorization is done via instance of (P31) proper noun (Q147276)/common noun (Q2428747) statements, then Filipino would be a single lexeme with instance of (P31) statements moved to sense level, assuming we can treat senses as lexeme subsets. — Robert Važan (talk) 17:49, 14 June 2021 (UTC)
@ArthurPSmith, Robert Važan: I'm not sure I see the problem. Any proper noun can have a plural (whether a grammatical or a morphological plural doesn't matter here; for example, « England yesterday/rural and England today/urban are two Englands »), except maybe for names that are already plural, but again not always: the singular Netherland for the Netherlands also exists, for instance. And for language/people, I would say it's two lexemes (at least, it's clearly two in French, as the case differs: Français, with an uppercase F, is the people but français, with a lowercase f, is the language, and some other data may also differ). Cheers, VIGNERON (talk) 19:31, 14 June 2021 (UTC)
@VIGNERON: Well the issue is what exactly is the definition of a proper noun in English - according to that enwiki article it's a word that "identifies a single entity", so as soon as you have multiple things identified it's no longer a proper noun. However, elsewhere I find definitions that seem to allow for plurals, as long as you are referring to specific identifiable entities (so in that case your "Englands" would be allowed). However if the term is being used generically, to mean countries that are like England in some way, or places named after England, then it would be considered common? It's actually rather confusing. ArthurPSmith (talk) 19:47, 14 June 2021 (UTC)
@ArthurPSmith: well, nothing is ever really unique, especially when it comes to language: anything can be *said* to be multiple (again, lexically or morphologically). So I wouldn't bother too much with this Wikipedia article (or at least not without checking more references; when in doubt, always follow the sources). "Englands" exists and Lexemes need to describe it (or at least be able to describe it); the English Wiktionary has an entry for it and says it's a proper noun. And yes, the lexeme has multiple senses (I just added two on England (L156362)). Cheers, VIGNERON (talk) 20:07, 14 June 2021 (UTC)
@VIGNERON: Hmm, do we trust wiktionary or wikipedia more? I notice that this edit changed "Filipino" from having a proper noun and common noun meaning to just a single common noun entry. There seems to be widespread disagreement on all this! ArthurPSmith (talk) 20:16, 14 June 2021 (UTC)
@ArthurPSmith: touché, but I would tend to trust the Wiktionary a bit more when it comes to lexicography. Unless there are definitive and clear references, the solution is probably to just use the general noun (Q1084) as lexical category (at least temporarily; worst case scenario, people would complain it's too general, but at least it's not wrong). Plus, the fact that it's a proper noun can be inferred from the sense, so maybe there is no need to store it explicitly. Cheers, VIGNERON (talk) 07:20, 15 June 2021 (UTC)
@VIGNERON: This reminds me that one of the motivations behind this proposal is lexeme granularity. Splitting lexemes, so that they can have slightly different statements or forms, is convenient and I used it a lot in Slovak lexemes to accelerate editing, but it is a hack that results in fake homographs. Lexemes should be split only if they are true homographs, i.e. they differ in coarse-grained lexical category or they have different etymology (or whatever is commonly considered homograph in that particular language). I am therefore in the process of merging a lot of Slovak lexemes. Unified lexemes are inevitably more complicated, because escaping that complexity was the reason to split them in the first place. This unification creates pressure on expressiveness. Hence my desire to put lexeme statements on sense level or restrict them by grammatical features to subsets of forms. The idea of treating forms and senses as subsets of lexemes holds promise of a conceptually clean solution. It is an alternative to more hacking and patching, which is how I now see my "in sense" proposal - as another hack, a workaround for flawed ontology. — Robert Važan (talk) 20:53, 14 June 2021 (UTC)
@VIGNERON: Practical implications of this discussion are extensive and probably not worth discussing in detail before the basic ontology question is answered. I do however have practical applications in mind, notably the issue you raised in Property talk:P5830 and my "in sense" proposal. I will just briefly summarize impact of treating senses and forms as subsets of lexemes. Properties accepting lexeme as their subject (remember statements are subject-property-object triples) would also accept sense and form, so that statements can be moved to sense/form level. Only two qualifiers are needed, "sense" and "form" (and perhaps "grammatical feature", "spelling", and others for supersets of senses and forms), because the goal is always the same: specify a subset of statement's subject that the statement applies to (complementary qualifiers would be needed to subset statement's object). Property datatype would be less important, because lexeme-valued statements can be qualified to narrow them to sense/form while sense/form-valued statements can be duplicated to cover the whole lexeme. Property datatype would be a practical rather than ontological choice, optimized for brevity and other non-ontological goals. — Robert Važan (talk) 17:49, 14 June 2021 (UTC)
@Robert Važan: sorry, but I understood almost nothing and I guess I disagree with the little I think I got. Statements are not just triples, BTW (qualifiers are one good example of how statements extend beyond the claim, which is the true RDF triple here), and I feel like your "in sense" proposal is just subject sense (P6072) (or at least close enough to extend this existing property). Regards, VIGNERON (talk) 19:31, 14 June 2021 (UTC)
@VIGNERON: The above summary is too terse, so let me try with an example instead. An earlier proposal for Wikidata lexemes is based on the lemon model, a sort of ontology for the domain of lexicography. It defines senses and forms as independent objects linked to lexeme only by special property. There is no "subset of" nor part of (P361) relationship between senses/forms and lexemes. So if the lexeme is a noun, nothing in the model says that its senses and forms are nouns too. Under lemon model, lexical categories are classes of lexemes, so it is actually incorrect to claim that some sense is a noun, because sense is not a lexeme but rather something linked to lexeme. Under lemon model, which was realized in Wikibase/Wikidata, adding instance of (P31) mass noun (Q489168) statement to a sense, e.g. L5355-S1 of tea (L5355), is wrong, because the sense is not a noun, the lexeme is. This flaw of the model motivated the original "in sense" proposal. What I am considering here is to treat both sense and its lexeme as a set of occurrences, assume "subset of" relationship between them, and treat noun as a class of such sets that meet certain criteria. With this modification, saying that tea (L5355) is a noun implies that L5355-S1 is a noun too and it is therefore meaningful to say instance of (P31) mass noun (Q489168) on the sense.
Proposed "in sense" qualifier would be unnecessary, at least for simple cases like this. It would still be of use in more complicated cases, but the set view of senses makes it obvious it would be equivalent to subject sense (P6072), something that was obscured by intensional definition of both qualifiers. Property subject sense (P6072) could then be renamed "subject sense" while object sense (P5980) would be renamed "object sense" (following the pattern of subject has role (P2868) and object has role (P3831) and other such pairs). There would be little reason to provide further sense-valued qualifiers. — Robert Važan (talk) 23:04, 14 June 2021 (UTC)
@Robert Važan: I'm not a specialist of the lemon model, but this official documentation page https://lemon-model.net/lemon-cookbook/node44.html and in particular the sentence « We then define the set of senses in the lexicon,  » seems to indicate that *there is* a "subset of" relationship, no? I think I understand the tea (L5355) example and I think I would put the massness at the lexeme level, with a qualifier indicating the sense where it applies (same for the countability, and probably the "collectiveness" too). Cheers, VIGNERON (talk) 07:11, 15 June 2021 (UTC)
@VIGNERON: The expression you quoted merely says that sense is a relation between lexemes and Q-items. It actually confirms that sense is not a subset of lexeme in the lemon model. Why would you choose to keep statements on lexeme level with qualifiers instead of putting them on sense level without qualifiers? Isn't the latter easier to edit and query? — Robert Važan (talk) 09:18, 15 June 2021 (UTC)
@Robert Važan: I'm lost, doesn't the subset symbol mean subset? (And this quote definitely can't be talking about Q-items; it was written before Q-items were invented ;) ) I also found this even more explicit quote: « Secondly, a sense can be regarded as a subset of all the uses of the given lexical entry that refer to the concept in question. » (source).
I do think it's easier to query (trivial in SPARQL and still quite easy in Lua or with the API) and to edit and maintain (especially when there is a lot of senses or forms), but much more importantly I think that data for the whole lexeme should be statements at the lexeme level.
Cheers, VIGNERON (talk) 09:34, 15 June 2021 (UTC)
@VIGNERON: in is the set of all senses, not an individual sense. is not an individual lexeme but rather the set of all relations between ontology entities (what Wikidata calls Q-items) and lexemes.
The paper you linked is an interesting read and it does show that lemon authors thought of the set-of-occurrences interpretation of senses, but the model itself does not reflect that view. You can easily see that, because no property in the model can migrate between lexeme and its senses and forms.
There are indeed cases when lexeme-level statement with qualifiers is the simpler solution. It's perhaps too early to discuss preferred representation. I think it depends somewhat on property, language, and use case. One example when having everything on lexeme level does not work so well is wikt:juice, which needs countable/uncountable statement on every sense and also usually uncountable statement on lexeme as a whole (which can be modeled with nature of statement (P5102) mainly (Q91013007)). Expressing all that on lexeme level would require 3-4 statements about countability.
But before preference can be even considered, there's still the basic question of whether moving statements between lexeme and its senses and forms is ontologically correct. Would it somehow fundamentally break something on theoretical or practical level? Will I dig myself a trap if I start adding sense statements with properties/classes that are usually found on lexeme level? Is there a gotcha down the road? I can sure encounter resistance to extending property scope to senses/forms, but is there a foreseeable rational basis to such opposition? You say that "data for the whole lexeme should be statements at the lexeme level", but is it really data for the whole lexeme when you are listing senses it applies to and Wiktionary shows the data next to senses? Having two ways to express the same fact indeed makes the data harder to use, but I think that is a more general problem with Wikidata that will be eventually solved with inference. — Robert Važan (talk) 06:46, 16 June 2021 (UTC)

One possible gotcha I can foresee myself is that since there will not be an item for every sense, senses might end up serving as ad hoc Q-items, clarifying meaning with properties that were originally intended for use on items. This can go as far as turning senses into new concepts that are non-trivially linked to Q-items via statements. Theoretically, properties/classes borrowed from items might clash with properties/classes borrowed from lexeme level. I suspect reasoners can already deal with such multi-faceted entities, but I don't know enough about the field to be sure. — Robert Važan (talk) 10:19, 16 June 2021 (UTC)

That's an interesting discussion. I think you want to structure the senses at a larger level, but that is not possible in Wikidata Lexemes. As modeled here, if the grammatical data show a difference, these are separate lexemes. You are not supposed to gather senses under a lexeme and point grammatical data to a sense. Senses are not concerned by syntactic data. You may rather need a super-level to gather the lexical entries: the lexicographical entry. I suggest the Lexicog module of the OntoLex model. It was developed exactly to address this issue. It keeps syntactic data and semantic data separate, and it's much better this way. In my opinion, mixing those (and pragmatic considerations) is a terrible idea. Noé (talk) 22:02, 16 June 2021 (UTC)

@Noé: I took only a brief look at Lexicog, but it seems to be intended for conversion of existing dictionaries, not to improve modeling of language. Splitting lexemes over small grammatical differences will result in a lot of boilerplate. Stored data should be concise to ease editing. Ease of consumption is the job of query engines, reasoners, and derived datasets.
I agree that brevity can go too far. Having a sense that is both a noun and a building is indeed a bad idea. It is less clear with specific properties like sex or gender (P21) used in addition to gender-neutral item for this sense (P5137). In any case, if I were to choose, I would prefer senses to carry lexical information rather than ontology information (both of which can be semantic). — Robert Važan (talk) 06:33, 17 June 2021 (UTC)
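The set-of-occurrences view raised in this thread can be illustrated with a short Python sketch. Everything in it is assumed for the sake of the example: the `Occurrence` record, the `extension` helper, the three attested occurrences, and the sense ID L5355-S2, which may not exist on tea (L5355).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    text_id: int  # hypothetical identifier of one attested occurrence
    lexeme: str
    sense: str
    form: str


# Made-up corpus of occurrences of the word "tea".
corpus = {
    Occurrence(1, "L5355", "L5355-S1", "tea"),
    Occurrence(2, "L5355", "L5355-S1", "tea"),
    Occurrence(3, "L5355", "L5355-S2", "teas"),
}


def extension(**criteria):
    """Return the set of occurrences whose fields match all given criteria."""
    return frozenset(
        o for o in corpus
        if all(getattr(o, key) == value for key, value in criteria.items())
    )


lexeme = extension(lexeme="L5355")
sense_1 = extension(sense="L5355-S1")
plural_form = extension(form="teas")

print(sense_1 <= lexeme)           # True: the sense is a subset of the lexeme
print(plural_form <= lexeme)       # True: so is the form
print(len(sense_1 & plural_form))  # 0: no plural occurrence has this sense
```

In this extensional picture, a qualifier that restricts a statement to a sense or a form amounts to a set intersection, which is the operation shown on the last line.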

How to model augmentative/diminutive inflection for nouns?

Currently, adjective forms are marked with the grammatical features to identify their inflection. For example, "happy" → positive (Q3482678), "happier" → comparative (Q14169499), "happiest" → superlative (Q1817208).

However, it's not clear how to model the similar inflection for nouns. For example, in Portuguese we can have "carro", "carrinho", "carrão", and so on for most nouns. While there's diminutive (Q108709) and augmentative (Q1358239), respectively, for the latter two forms, there doesn't seem to be a way to identify the base form (similar to positive (Q3482678) for adjectives). Is there an existing item that could be used for this which I'm not aware of, or should we create one? (Pinging EnaldoSS from previous discussion.) --Waldyrious (talk) 10:29, 20 June 2021 (UTC)

@Waldyrious: I don't know if I fully understood your question. You didn't mention it, but sometimes diminutive (Q108709) and augmentative (Q1358239) can also be used with Portuguese adjectives (see bonito (L474536), rico (L474615) and rápido (L474618)). So, if these two degrees can be used with both nouns and adjectives, why couldn't positive (Q3482678) be used with nouns as well? Enaldodiscussão 11:40, 20 June 2021 (UTC)
I'm not saying that positive (Q3482678) definitely cannot be used for nouns, but I didn't find any source referring to the regular inflection of nouns (Portuguese or otherwise) as being in the "positive" form. Without such "prior art", I'm afraid it would be original research at best, and misleading/incorrect at worst, to use positive (Q3482678) as a qualifier for nouns. So I guess what I'm seeking here is a little reassurance to validate that practice, or an alternative solution that may be eluding us. --Waldyrious (talk) 15:41, 21 June 2021 (UTC)
@Waldyrious, EnaldoSS: Modeling of diminutives and augmentatives in Wikidata is not consistent. Sometimes, diminutive (Q108709) is a grammatical feature (mostly Portuguese, some Dutch, rare few in other languages). Sometimes it's a value of language style (P6191) or instance of (P31) on senses (Czech). I guess most people (including myself) just create new lexemes for diminutives and augmentatives. Separate lexemes can be linked via derived from lexeme (P5191) with mode of derivation (P5886) set to diminutive (Q108709) or augmentative (Q1358239), but nobody is doing that at the moment.
I am not a professional linguist, but AFAIK placing diminutive (Q108709) and augmentative (Q1358239) in the list of grammatical features lacks theoretical grounding. Diminutives and augmentatives are considered separate lexemes, because they are not automatically available for all nouns (contrary to number and case) and because of the shift in meaning. Lack of theoretical grounding shows in absence of suitable Q-items to identify neutral form (positive (Q3482678) is definitely incorrect) and for double diminutive. It also shows in absence of corresponding grammatical category (similar to comparison (Q577714) or grammatical number (Q104083)). — Robert Važan (talk) 21:44, 21 June 2021 (UTC)
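The separate-lexeme modeling mentioned above, with derived from lexeme (P5191) qualified by mode of derivation (P5886), can be sketched as plain data in Python. This is a simplified illustration, not the Wikibase JSON format: the statement dicts and the `derived_lexemes` helper are hypothetical, and lemma strings stand in for lexeme IDs, which are not given in the discussion.

```python
# Each statement is modeled as a plain dict; lemma strings replace the
# real lexeme IDs, which are unknown here.
statements = [
    {"subject": "carrinho", "property": "P5191", "value": "carro",
     "qualifiers": {"P5886": "Q108709"}},   # mode of derivation: diminutive
    {"subject": "carrão", "property": "P5191", "value": "carro",
     "qualifiers": {"P5886": "Q1358239"}},  # mode of derivation: augmentative
]


def derived_lexemes(base, mode):
    """Return lexemes derived from `base` with the given mode of derivation."""
    return sorted(
        s["subject"] for s in statements
        if s["property"] == "P5191"
        and s["value"] == base
        and s["qualifiers"].get("P5886") == mode
    )


print(derived_lexemes("carro", "Q108709"))   # ['carrinho']
print(derived_lexemes("carro", "Q1358239"))  # ['carrão']
```

With this layout the base lexeme needs no "neutral" grammatical feature, since the diminutive and augmentative are found by following the derivation links.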

Term for gender antonyms in grammar?

Is there a specific term for denoting words of opposite gender (en:wikt:Wiktionary:Semantic relations), especially in languages that have gendered nouns? For example: "man" vs "woman" or "dude" vs "dudette". In Malayalam, it is termed "എതിർലിംഗം" gender antonym (Q107209963), and it is given its own identity and the same importance as antonyms, synonyms and homonyms. I have seen that antonym is often used for this purpose. But is there a specific word for this relation? Vis M (talk) 18:18, 23 June 2021 (UTC)

@Vis M: Note there is a difference between morphological pairs (waiter-waitress) and semantic pairs (brother-sister). They mostly overlap, but there are exceptions: for example, king-queen is semantic only, while rodič-rodička in Slovak and Czech is morphological only. Also see the previous discussion about "feminine form". I think that for morphological pairs, everybody is just waiting to see how far we can get with combines lexemes (P5238) and derived from lexeme (P5191), especially when mode of derivation (P5886) is set to something like gender inflection (Q1124523) (a misnomer: it's derivation, not inflection). For semantic pairs, see the suggestion by Liamjamesperritt in the linked discussion: antonym (P5974) qualified with criterion used (P1013) = gender binary (Q5530970) (although IMO gender (Q48277) would suffice). Some of the gender pairs also qualify as relational antonyms. Antonyms along multiple axes could be rendered in Wiktionaries similarly to how it's done in the mother entry. — Robert Važan (talk) 20:02, 23 June 2021 (UTC)
Thank you! Vis M (talk) 21:32, 24 June 2021 (UTC)

Sense and form qualifiers

@VIGNERON, ArthurPSmith, Fnielsen: I am proposing a generalization of the four existing sense/form qualifiers. The generalized interpretation (and property description) is "qualified statement applies only to listed forms/senses of statement's subject/object", which is clear and, importantly, unambiguous. VIGNERON already proposed this for subject form (P5830). A similar argument holds for the other three qualifiers. I've suggested alternatives, but they will not work in every case. This proposal is informed by the above discussion about the nature of senses/forms, but it is valid even if you don't subscribe to the subset view of senses/forms. If this change is implemented, my earlier "in sense" proposal will be withdrawn as a duplicate of "subject sense". There will be little need to introduce other sense/form-valued qualifiers. The subject/object naming pattern is inspired by other such property pairs.

Qualifier | Renamed to | Datatype | Subject type (allowed-entity-types constraint (Q52004125)) | Object type (not enforced)
subject sense (P6072) | subject sense | Sense | Wikibase lexeme (Q51885771) + Wikibase form (Q54285143) | *
subject form (P5830) | subject form | Form | Wikibase lexeme (Q51885771) + Wikibase sense (Q54285715) | *
object sense (P5980) | object sense | Sense | * | Wikibase lexeme (Q51885771) + Wikibase form (Q54285143)
object form (P5548) | object form | Form | * | Wikibase lexeme (Q51885771) + Wikibase sense (Q54285715)

  •  Comment Please add comments/votes. — Robert Važan (talk) 17:11, 18 June 2021 (UTC)
    @Robert Važan: Can you provide a specific example of what you're trying to do? I'm not following the purpose here. I don't think we constrain the domain of qualifiers generally do we, the issue would be what is the domain of the underlying property they are qualifying (usage example (P5831) for the "demonstrates" cases)? And I'm not understanding your "range" issues at all - isn't "range" = datatype for these? But these don't match the labels ??? ArthurPSmith (talk) 13:49, 21 June 2021 (UTC)
    @ArthurPSmith: I thought it would be understood that domain/range applies to subject/object of the statement, not the qualifier. I've changed the table to make it more obvious. Examples for "subject sense" are listed in my "in sense" proposal. Examples for "subject form" are in VIGNERON's comment on subject form (P5830) talk page. Both "object form" and "object sense" are already used to qualify two different properties even though their current names suggest otherwise. I am also considering more uses for "object form", but I don't want to get into that here. In any case, it is good to change the four qualifiers together, so that they form a complete set and so that this doesn't have to be revisited every time someone finds new use for one of the qualifiers. — Robert Važan (talk) 15:10, 21 June 2021 (UTC)
  •  Support Ah, got it, thanks for the explanation and clarifying renaming! ArthurPSmith (talk) 16:48, 21 June 2021 (UTC)
  • @VIGNERON: Would this solve the problem you described in Property talk:P5830? — Robert Važan (talk) 20:06, 23 June 2021 (UTC)

@VIGNERON, ArthurPSmith: ✓ Done I have modified the properties according to the above table. — Robert Važan (talk) 15:48, 26 June 2021 (UTC)
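
The generalized interpretation proposed above ("qualified statement applies only to listed forms/senses of statement's subject") can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout is a simplified, hypothetical stand-in rather than the real Wikibase JSON, and the lexeme and form IDs (L1, L1-F1, L1-F2) and the example text are made up.

```python
# Property IDs taken from the table above.
SUBJECT_FORM = "P5830"
SUBJECT_SENSE = "P6072"


def applies_to(statement: dict, entity_id: str, qualifier: str) -> bool:
    """Return True if the statement applies to the given form or sense.

    A statement without the qualifier applies to the whole lexeme,
    that is, to every form or sense. With the qualifier, it applies
    only to the listed forms or senses.
    """
    listed = statement.get("qualifiers", {}).get(qualifier, [])
    return not listed or entity_id in listed


# Hypothetical usage example (P5831) on lexeme L1, restricted to form L1-F2.
qualified = {
    "property": "P5831",
    "value": "made-up example sentence",
    "qualifiers": {SUBJECT_FORM: ["L1-F2"]},
}
# The same statement without any qualifier.
unqualified = {"property": "P5831", "value": "made-up example sentence"}

print(applies_to(qualified, "L1-F2", SUBJECT_FORM))    # listed form
print(applies_to(qualified, "L1-F1", SUBJECT_FORM))    # form not listed
print(applies_to(unqualified, "L1-F1", SUBJECT_FORM))  # no restriction
```

The object sense (P5980) and object form (P5548) qualifiers would follow the same rule, applied to the statement's value instead of its subject.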

Need help with query (2)

Hi, in the absence of some kind of monitoring page for all new lexemes (or recently changed lexemes) in a specific language, we can only rely on external tools such as the Wikidata Query Service. How can I change the query below to sort from newest to oldest? I tried to sort by ?lexeme, but it seems to be treated as a string sort, not an integer sort. Could anyone help with this simple query? Thanks! Bennylin (talk) 14:52, 24 June 2021 (UTC)

The following query uses these:

  • Items: Javanese (Q33549)
    SELECT ?lexeme ?lemma ?category ?categoryLabel WHERE {
      ?lexeme dct:language wd:Q33549; 
              wikibase:lemma ?lemma;
              wikibase:lexicalCategory ?category;
              wikibase:lemma [].
      FILTER(LANG(?lemma) = "jv")
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],ml". }
    }
    ORDER BY ?lexeme
    LIMIT 100
    
@Bennylin: This is what I use for Slovak language:
SELECT ?lexeme ?lemma ?modified
WHERE {
   ?lexeme dct:language wd:Q9058; wikibase:lemma ?lemma; schema:dateModified ?modified.
}
ORDER BY DESC(?modified)
LIMIT 1000
It kind of works, but there's no way to retrieve comments and user names using SPARQL alone (which would be IMO useful for a lot of other use cases). — Robert Važan (talk) 16:33, 24 June 2021 (UTC)
@Bennylin: You can use SPARQL string manipulation to BIND a numeric value and sort on that:
SELECT ?lno ?lexeme ?lemma ?category ?categoryLabel WHERE {
  ?lexeme dct:language wd:Q33549; 
          wikibase:lemma ?lemma;
          wikibase:lexicalCategory ?category .
  FILTER(LANG(?lemma) = "jv")
  BIND(xsd:integer(substr(str(?lexeme), 33)) as ?lno)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],ml". }
}
ORDER BY DESC(?lno)  # DESC lists the highest-numbered (newest) lexemes first
LIMIT 100
ArthurPSmith (talk) 17:15, 24 June 2021 (UTC)
Marvelous, friends. Thank you very much! Wikidata should make a gallery/list of useful queries like these! Bennylin (talk) 07:26, 25 June 2021 (UTC)
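
The offset 33 in the BIND expression above works because the lexeme URI prefix http://www.wikidata.org/entity/L is 32 characters long and SPARQL's substr counts from 1. A minimal Python sketch of the same idea, run on a handful of lexeme IDs mentioned on this page plus two short made-up ones (L9 and L10), shows why a plain string sort misorders them. The helper name lexeme_number is hypothetical.

```python
ENTITY_PREFIX = "http://www.wikidata.org/entity/L"


def lexeme_number(uri: str) -> int:
    """Extract the numeric part of a lexeme URI, e.g. .../L498558 -> 498558."""
    if not uri.startswith(ENTITY_PREFIX):
        raise ValueError(f"not a lexeme URI: {uri}")
    # Python slices are 0-based, so index 32 matches SPARQL position 33.
    return int(uri[len(ENTITY_PREFIX):])


uris = [
    "http://www.wikidata.org/entity/L9",       # made-up short ID
    "http://www.wikidata.org/entity/L498558",
    "http://www.wikidata.org/entity/L10",      # made-up short ID
    "http://www.wikidata.org/entity/L303173",
]

# Plain string sort: compares character by character, so L10 sorts before L9.
string_order = sorted(uris)

# Numeric sort, newest (highest number) first.
newest_first = sorted(uris, key=lexeme_number, reverse=True)

print(len(ENTITY_PREFIX))
print([u.rsplit("/", 1)[1] for u in string_order])
print([u.rsplit("/", 1)[1] for u in newest_first])
```

Sorting by the numeric ID only approximates creation order; sorting by schema:dateModified, as in the Slovak query above, tracks recent edits instead.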

Klingon nouns and forms

I am not a Klingonist (Q41496390); this is a linguistic question. As far as I understand, nouns can have 5 different types of suffixes, in a specific order.

Examples
Stem | Suffix | Type | Meaning | English approximation
DI'raq | (none) | 1 | base form | sheep
DI'raq | ʼaʼ | 1 | augmentative (Q1358239) | a flock of sheep
DI'raq | Hom | 1 | diminutive (Q108709) | a small sheep
DI'raq | vav | 1 | endearment suffix | my friend sheepy
DI'raq | mey | 2 | plural (depends on whether it is sentient) | multiple sheep
DI'raq | Hey | 3 | indicates it is not really a sheep | ‟sheep”
DI'raq | qoq | 3 | dubitative (Q1263049) | the ehm… sheep? 🤷‍♂️
DI'raq | naʼ | 3 | indicates the speaker is certain what it is | a real sheep!
DI'raq | raj | 4 | who owns it (depends on the person and number of the owner and on whether it is sentient) | their sheep
DI'raq | vaD | 5 | benefactive case (Q664905) | for the sheep

These suffixes can be combined to create a specific meaning (but only one of each type).

For instance, DI'raqHomHeyrajvaD could mean "for their alleged small sheep".

Now my question is, does each valid combination of suffixes constitute a form of a lexeme? --Shisma (talk) 18:24, 28 June 2021 (UTC)

@Shisma: That's a polysynthetic language. There are real languages that are polysynthetic. Affixes and other morphemes can be modeled as Wikidata lexemes with the lexical category set to suffix (Q102047) or similar. As for the lexemes representing the main meaning, if it is impractical to list all forms, you can list just the basic forms (without affixes), some core set of key forms, or only attested forms. Choose whatever best fits the language. — Robert Važan (talk) 09:12, 29 June 2021 (UTC)

@Robert Važan: Thanks for your help. I was hoping somebody would help me with such decisions. Since Klingon has no irregular forms (or does it?), I would propose that Klingon lexemes have no forms and that affixes have their own lexemes. Would you agree? --Shisma (talk) 15:04, 29 June 2021 (UTC)

@Shisma: Do whatever fits the language. At least one form (identical to the lemma) is still required. Otherwise the lexeme shows up on reports as incomplete. — Robert Važan (talk) 16:41, 29 June 2021 (UTC)
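
To give a feel for why listing every form may be impractical, here is a rough Python sketch of the suffix rule described above (at most one suffix per type, attached in type order). It assumes only the nine suffixes from the example table; the real inventory is larger, and restrictions such as sentience are ignored. The names SUFFIXES_BY_TYPE, build_form and all_forms are made up for this illustration.

```python
from itertools import product

# Suffixes from the example table above, grouped by type (incomplete on purpose).
SUFFIXES_BY_TYPE = {
    1: ["ʼaʼ", "Hom", "vav"],
    2: ["mey"],
    3: ["Hey", "qoq", "naʼ"],
    4: ["raj"],
    5: ["vaD"],
}


def build_form(stem: str, chosen: dict) -> str:
    """Attach the chosen suffixes (type -> suffix) to the stem in type order."""
    for suffix_type, suffix in chosen.items():
        if suffix not in SUFFIXES_BY_TYPE.get(suffix_type, []):
            raise ValueError(f"{suffix!r} is not a known type {suffix_type} suffix")
    # A dict holds one value per key, so at most one suffix per type is possible.
    return stem + "".join(chosen[t] for t in sorted(chosen))


def all_forms(stem: str) -> list:
    """Enumerate every combination, where each type is either absent or used once."""
    options = [[""] + SUFFIXES_BY_TYPE[t] for t in sorted(SUFFIXES_BY_TYPE)]
    return [stem + "".join(combo) for combo in product(*options)]


example = build_form("DI'raq", {5: "vaD", 1: "Hom", 4: "raj", 3: "Hey"})
forms = all_forms("DI'raq")

print(example)     # suffixes come out in type order regardless of input order
print(len(forms))  # (3+1) * (1+1) * (3+1) * (1+1) * (1+1) combinations
```

Even this reduced inventory yields 128 combinations per noun (including the bare stem), which supports modeling the affixes as lexemes of their own instead of enumerating forms.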

What is a paradigm? And why is it assumed to be either inflection or conjugation class?

Wikidata claims that first declension noun in -ā (Q3921592) is an instance of inflected form (Q4423888). This is a consequence of first declension noun in -ā (Q3921592) being an instance of inflection class (Q56633378), which is a subclass of paradigm (Q1428334), which is a subclass of inflected form (Q4423888). It's obviously wrong, but looking for a way to fix this makes me wonder what exactly is a paradigm. Articles linked from paradigm (Q1428334) look like lengthy disambiguation pages that define paradigm as any of the following:

  1. A set of all forms of a particular lexeme, sometimes called a morphological paradigm. For example, the paradigm of go includes go, goes, going, gone, and went.
  2. A subset of a lexeme's forms. For example, the set of all conditional forms of a particular verb.
  3. A set of forms that are typical of some lexical category. For example, the paradigm of English nouns has two forms, singular and plural.
  4. A pattern for constructing lexeme forms, usually by adding affixes to a word stem. For example, English verbs have the paradigm -0, -s, -ing, and -ed.
  5. A model lexeme demonstrating a particular pattern for constructing lexeme forms. For example, the feminine noun model žena in Slovak generates different forms than the model ulica.
  6. A grouping of lexemes based on common inflectional traits without prescribing any inflection rules. This may be as broad as a "class of irregular verbs".
  7. A concept similar to a synonym set, orthogonal to syntagma, and used together with syntagma.
  8. Any set of lexemes sharing some grammatical or semantic trait.

There is no single concept that covers all of these definitions, but most of them can be summed up by saying that a paradigm instance is a set of forms (modeled in Wikidata as a lexeme entity), a paradigm class is a set of form classes (modeled in Wikidata as paradigm (Q1428334)), and most of the meanings listed above are subclasses of paradigm (Q1428334). So IMO, paradigm (Q1428334) has part(s) of the class (P2670) inflected form (Q4423888) rather than being a subclass of inflected form (Q4423888). And individual inflection/conjugation classes should be subclasses of paradigm (Q1428334) rather than instances of it, because they merely tighten constraints on the possible forms in the paradigm.
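
The unwanted inference described above can be reproduced with a small in-memory model. This is only a sketch: it does not query Wikidata, it encodes just the three statements quoted in this post (which may have changed since), and the function names are made up. The "proposed" variant drops the subclass link from paradigm (Q1428334) to inflected form (Q4423888), since a has part(s) of the class (P2670) statement is not followed when inferring instance membership.

```python
def superclasses(cls, subclass_of):
    """Return cls plus every class reachable through subclass-of links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def is_instance_of(item, target, instance_of, subclass_of):
    """True if item is a direct or inferred instance of target."""
    return any(
        target in superclasses(cls, subclass_of)
        for cls in instance_of.get(item, ())
    )


# first declension noun in -ā (Q3921592) is an instance of inflection class (Q56633378).
INSTANCE_OF = {"Q3921592": {"Q56633378"}}

# Modeling as described in the post: Q56633378 -> Q1428334 -> Q4423888.
CURRENT_SUBCLASS_OF = {
    "Q56633378": {"Q1428334"},
    "Q1428334": {"Q4423888"},
}

# Proposed modeling: paradigm (Q1428334) is no longer a subclass of inflected form.
PROPOSED_SUBCLASS_OF = {"Q56633378": {"Q1428334"}}

current_result = is_instance_of("Q3921592", "Q4423888", INSTANCE_OF, CURRENT_SUBCLASS_OF)
proposed_result = is_instance_of("Q3921592", "Q4423888", INSTANCE_OF, PROPOSED_SUBCLASS_OF)

print(current_result)   # the declension class is inferred to be an inflected form
print(proposed_result)  # the inference disappears once the subclass link is removed
```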

Then there's the assumption that paradigms are either inflection or conjugation classes, which is embodied in the properties paradigm class (P5911) and conjugation class (P5186). The Slovak language has no concept of inflection class. It has only model lexemes (vzor in Slovak), which cover all regular declension patterns. I can define them as direct subclasses of paradigm (Q1428334), but the paradigm class (P5911) property requires inflection class (Q56633378) as its value, not to mention that it is inappropriately named. Slovak verbs do have conjugation classes, but they are further subdivided into individual model verbs. I can link the conjugation class via conjugation class (P5186), but then I lose the information about the model verb. Or I can use conjugation class (P5186) to link to the model verb, but then the property seems to be used inappropriately. I think we need a more general property, perhaps named "lexeme paradigm", that would allow any paradigm (Q1428334) as its value. If I repurpose paradigm class (P5911) as this new "lexeme paradigm" property, would it break anything?

@Yurik, Jon Harald Søby, Kriomet, Fnielsen, VIGNERON, Nikki: Pinging heavy users of paradigm class (P5911) and conjugation class (P5186).

Robert Važan (talk) 13:57, 30 June 2021 (UTC)