Wikidata talk:Lexicographical data/Archive/2023/01

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Lexeme to merge

Hi,

Our rules for separating/merging lexemes are not ultra-clear, so I want to check here.

Is it ok to merge twenty-four seven (L44134) and (L44135)? For me "24-7" and "24/7" are just variants of the same lexeme (same language, same lexical category, same meaning, two identifiers in Merriam-Webster, but they are said to be variants of the same lexeme).

@SixTwoEight, Rachmat04, UWashPrincipalCataloger: who edited these two entities.

Cheers, VIGNERON (talk) 12:10, 26 November 2022 (UTC)

I'm not sure about English, but at least in Swedish "24/7" has the additional (or I would rather say original) sense of "July 24", while "24-7" does not. The separator (slash or dash) is thus significant in some languages. In Wikidata, I see that "24/7" appears in multiple instances as a proper name or title of a work, and while proper names maybe typically aren't listed with multiple senses, some cases such as Georgia (L254165) exist. There may also be lesser known uses of either in scientific contexts (ESO 24-7 (Q80020742) etc). I suggest keeping them apart also in English, both to play safe and for consistency with other languages. SM5POR (talk) 14:10, 26 November 2022 (UTC)

Very true, I didn't think about that. Indeed in most languages 24/7 could be a date, but I think it's a separate lexeme, not a sense of the (L44135) (and I'm not sure dates are lexemes, but this is yet another subject). Plus, even merged, there would be separate forms, so there would be little to no confusion. Finally, let's go back to sources: at least three dictionaries, Merriam-Webster, Collins and Dictionary.com, say they are variants (and list other variants); do other sources say otherwise? Cheers, VIGNERON (talk) 18:43, 26 November 2022 (UTC)

The date and the always-open expression have to be different lexemes because they belong to different lexical categories or differ on some other significant property affecting the selection of forms. That happens to regular words as well, more often in languages with multiple grammatical genders, such as German or Swedish, than in English.

"24/7" (or "24-7") is a rare case of a numerical expression having become an adverb through abbreviation (of "24 hours a day, seven days per week"), so the observation that this adverb has only a single sense cannot be generalized into the choice of typographic delimiter in a compound lexeme being independent of its sense (somewhat like "a non-rectangular flag is always red, white and blue, regardless of what nation it represents" when there is only one such nation). Until you have made a thorough search of all the world's literature, you cannot claim with certainty that there isn't a single case of a written expression with two senses differing only in the delimiter, and you can never be sure one won't appear in the future.

For this reason, even lexemes with only a single sense come equipped with an Lxxxxx-S1 sub-entry for all the sense-related statements; you don't move those statements up to the base lexeme entry just because there is currently no risk of confusion with a second sense. If and when a second sense appears, you simply add an Lxxxxx-S2 entry without having to rebuild the structure of the entire lexeme.

Whether reputable sources consider the expressions "variants" or not is of minor importance to me; the criterion should be what is practical to do in Wikidata. I consider the choice of delimiter similar to that of letter case; HERTZ, Hertz and hertz are different renderings of the same name/word depending on written context, but only the unit of frequency is ever written "hertz", while the ones with an uppercase 'H' can refer to either the unit, the family, or the car rental firm. Do they make up one, two, or three lexemes? Other examples: "3", "3rd", "iii" and "III"; are they ordinal or cardinal numerals? The letter "Å", the Swedish noun "å" (meaning small river), the locality "Å" and the angstrom unit abbreviated "Å"? Tom & Jerry vs Tom and Jerry? The white house vs the White House? 2,022 vs 2022? SM5POR (talk) 00:21, 27 November 2022 (UTC)

@VIGNERON Merging these is a good idea, and I would even go as far as to suggest making the headword/lemma “twenty-four seven,” with both 24/7 and 24-7 as alternative forms.

While the adverb is not as commonly written out in full like this, I think it is important to keep in mind that lexicographical data represents spoken as well as written language. The only way we can say this expression is “twenty-four seven” and not “two four seven” or “twenty four over seven” and so on. That we can format the numbers in different ways does not really change what they are supposed to represent, and without the full words on the same lexeme it is not really indicated that this consists of the words twenty-four and seven, not two, four, and seven. I do not think we need to be concerned that 24/7 looks like a date because this is incidental. It could also be a fraction, or a house number, or type of airplane, etc. but none of this is really related to the context of what 24/7 represents here.

I would also add that as far as consistency with other languages is concerned, it would be easier to make connections between languages from a single “twenty-four seven” lexeme to others. Many languages have adjectives or adverbs with similar or equivalent meaning that are not typically abbreviated, like Punjabi اٹھپہرا which is derived from اٹھ eight and پہر which is a unit of time equivalent to three hours. عُثمان (talk) 22:05, 27 November 2022 (UTC)

I would also add that “twenty-four seven” exists as two parts of speech and should have an additional adjective lexeme. (As in, “twenty-four seven business,” “twenty-four seven service,” etc.)

The adverb and possibly also the adjective have an additional figurative sense. As in, “he talks 24/7,” where just like with “all the time” and “constantly,” this is probably an emphatic exaggeration and the subject does actually stop talking to sleep. عُثمان (talk) 22:21, 27 November 2022 (UTC)

@عُثمان: thanks, merge done (of L:L44134, L:L44135, L:L201811 and L:L201812).

@SM5POR: I'm not sure I understand all your examples, but two lexemes can have the same lemma, and the same lemma can also belong to two lexemes (and vice versa for different lemmas). For instance, you can have one lexeme with "hertz" (unit) and another lexeme with "Hertz" (name); there is no problem with that, and it is not similar to the case here (storing lexeme variation). ~~For Å, letters are not usually considered lexemes (see previous discussions on that topic)~~, and again you can have several lexemes, some with the same form or not (for instance "Å" for the place with only one form and "angstrom" with two forms, "angstrom" and also "Å"). I'm not sure "Tom and Jerry" or "White House" are lexemes (and "white house", as just a house that is white, is clearly not a lexeme). Again, all these examples seem very loosely related to the case here.

Regards, VIGNERON (talk) 10:50, 8 January 2023 (UTC)

@VIGNERON: No objection; I was merely thinking aloud. I have since realized that it's easy to add multiple lemmas, averting some of my concerns. As for White House (L43450), that's recognized as a proper noun, though I was surprised to find out that the only sense of that lexeme refers to a city in Tennessee, not the presidential residence in District of Columbia. Neither "Tom & Jerry" nor "Tom and Jerry" seem to be lexemes, though both appear as animated cartoon and comic titles. Is there a formal definition of "lexeme" somewhere indicating how it relates to compound expressions such as proper nouns, commercial and scientific terms with multiple parts/words ("Charlie Chaplin", "New York", "hot dog", "proper noun", "part of speech"), Latin expressions ("lingua franca", "curriculum vitae"), popular sayings, greetings ("no smoke without fire", "good day", "Merry Christmas", "May the Force be with You") etc?

The connection I see with the question at hand is that digits, punctuation and other non-alphabetic typographical symbols sometimes have a role in the definition of what constitutes a "word" and thus perhaps also a "lexeme"; consider the Russian practice of using a hyphen when transliterating foreign compound names ("Нью-Йорк"). There are symbols made up by multiple letters or other graphemes ("&", "@", "$", "£", "%"). In Swedish, compound nouns are simply joined without spaces forming singular words such as privilege of formulating a problem (Q10639082) ("problemformuleringsprivilegium"), meaning that I hardly ever encounter multi-word lexemes in my native language. Hence my concern over things like "24/7" vs "24-7". SM5POR (talk) 15:30, 8 January 2023 (UTC)

@SM5POR: no problem. For the definition of "lexeme", a lot has been written on that concept since the 1940s, and on Wikidata we have this page: Wikidata:Lexicographical data/Notability (sadly, still a draft). And yes, punctuation could have a meaning and it could lead to different lexemes ("French" and "french" in English for instance), but it doesn't automatically mean they are different lexemes. Bonus: here is a list of 119 lexemes currently in Swedish with a space in the main lemma: https://w.wiki/6CcW . Cheers, VIGNERON (talk) 16:21, 8 January 2023 (UTC)

Yes, and we have this popular saying "Wikidata is always right". :-)

As to the list of 119 Swedish "lexemes", that's the kind of thing that makes me ask. I expanded the query to see the lexical category, and while there are indeed both "nouns" and "verbs" in the list, they are multi-word expressions to me, often formed by an adjective + a noun ("digitalt objekt", "invasiv art") or a verb + a preposition or adverb ("ta i", "stanna kvar"). Sometimes the category is labelled "adverbial phrase" or "verb phrase"; that's better than "adverb" or "verb" in my opinion. The Swedish term for "part of speech" is "ordklass", or literally the class of the word (singular); there is no way I will recognize "invasiv art" ("invasive species") as a noun only, since it consists of two words: an adjective followed by a noun. It may pass as a "noun phrase", but that's all, and I don't see why it should be considered a lexeme. And invalid ID (L593757) sure isn't a noun; it's an idiom or a phrase vartåt det barkar (L593756) (consisting of an adverb, a pronoun and a verb). As is "Wikidata is always right", I guess. SM5POR (talk) 18:10, 8 January 2023 (UTC)
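For readers who want to reproduce this kind of list, here is a minimal sketch in Python. It is not the query behind the short link above, which is not reproduced here. It assumes the usual Wikidata Query Service predicates for lexemes (dct:language, wikibase:lemma, wikibase:lexicalCategory) and Swedish (Q9027); the function names are made up for illustration.

```python
# Sketch only: builds a SPARQL query for lexemes whose lemma contains a space,
# together with their lexical category. Function names are hypothetical.
SWEDISH = "Q9027"  # Wikidata item for the Swedish language


def build_multiword_query(language_qid: str) -> str:
    """Return a SPARQL query string for lexemes with a space in the lemma."""
    return f"""
SELECT ?lexeme ?lemma ?categoryLabel WHERE {{
  ?lexeme dct:language wd:{language_qid} ;
          wikibase:lemma ?lemma ;
          wikibase:lexicalCategory ?category .
  FILTER(CONTAINS(STR(?lemma), " "))
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
ORDER BY ?lemma
""".strip()


def is_multiword(lemma: str) -> bool:
    """Offline equivalent of the FILTER above: does the lemma contain a space?"""
    return " " in lemma.strip()


if __name__ == "__main__":
    # Paste the printed query into the query service to run it.
    print(build_multiword_query(SWEDISH))
    for lemma in ["invasiv art", "ta i", "problemformuleringsprivilegium"]:
        print(lemma, is_multiword(lemma))
```

Including the lexical category in the result makes it easy to see which multi-word entries are labelled "noun" or "verb" and which are labelled as phrases.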

Lexemes outside their usual category

Hi,

It's not uncommon for a lexeme to be used outside its "usual" category. My question is: how to store this data in Lexemes (and should it even be? it's more grammatical than lexical).

More precisely, in Breton, verbs don't really have an infinitive (it's sometimes called a verbal noun, but not always... and it doesn't seem to be exactly the same as verbal noun (Q1350145)). This "infinitive" in Breton has a double value, both as a verb and as a noun. For now, I store the data like this on troazhañ (L941685), with an S1 for the verb part and an S2 for the noun part, but I'm not entirely convinced. Does anyone have another idea? (I heard Arabic might be in a similar situation, is it?)

Cheers, VIGNERON (talk) 09:31, 8 January 2023 (UTC)

I know neither French nor Breton; I guess Breton/Grammaire/Conjugaison/Infinitif doesn't provide a clue then? What's the problem with leaving it as a verb form only; is it the linking via item for this sense (P5137) that bothers you?

As I have been wondering for some time what we should do with those maybe 50,000 instances of term (Q1969448) that I think mess up the otherwise language-independent part of Q-space, in my opinion they should be moved over to the lexeme section, statements and everything, and be told to stay there only; then they can link back to the things they represent via item for this sense (P5137), the property that implements the map–territory relation (Q1963130) (because the word is not the thing).

One problem: There will hardly be enough items in Q-space to correspond to every sense of a lexeme in even a single language (while 100,000,000 items may seem sufficient, a very large portion of those aren't lexemes at all as they denote more complex concepts). Many are nouns, but I suspect that there are a lot of adjectives and verbs without any item representation at all.

Rather than create items for all those adjectives and verbs, which will be a chore, I'm thinking about how to best reuse items for multiple senses and parts of speech. The item liquid water (Q29053744) may thus serve the semantics of not only the nouns water (L3302), wet (L330410) and wateriness (L330333), but also the verbs water (L3303), wet (L25304), adjectives watery (L342461), wet (L3316), adverb waterily (L203644) and so on. It wouldn't surprise me if it's already done this way in a number of languages. Fitting your Breton infinitive verb-gone-noun into this pattern would be a piece of cake. Can we cooperate on this somehow? I imagine we could use some aspect modifiers to tell senses of the same part of speech linking to the same item apart, say by means of a qualifier subject has role (P2868) ("identical to", "visually similar to", "cause of" etc) attached to the item for this sense (P5137). And then we could even implement qualifiers for 15 different qualities of snow (Q7561) or whatever they have in Greenlandic (Q25355) according to some linguistic urban legend I have heard... SM5POR (talk) 19:33, 8 January 2023 (UTC)

@VIGNERON: By the way, please see my proposal in Help talk:Deprecation#Refers to lexicographic sense which is part of this plan. SM5POR (talk) 19:46, 8 January 2023 (UTC)

In fr:v:Breton/Grammaire/Conjugaison/Infinitif, we can read (translated from French) « As in French, the infinitive is sometimes used as a noun: (debriñ, evañ) an debriñ hag an evañ (eating and drinking); (aozañ) ober un aozañ (to make a repair). » which is a simplification (it's a wiki page, I'll fix that later) but indeed it points to my question. In Q112161504 (section 345, page 168) Francis Favereau (Q3081429) has a more in-depth explanation (in short, translated from French): « Verbal noun: this name is, in fact, preferable to that of infinitive, because it accounts much better for the specificity of Breton, in particular the syntax of this verbal form, sometimes so close to that of the noun. Indeed, the verbal noun has a double value, that of a verb of course [...] but its syntax is very often that of a noun » (other grammarians have more or less the same point of view). Thus it is not « verb-gone-noun » (in that case, I would simply create two lexemes, like sleep (L1342) and sleep (L27105)) but more « form of verb-and-noun ». It is this "and", which is unusual (but not that unusual), that I would like to know how to model exactly. Cheers, VIGNERON (talk) 20:15, 8 January 2023 (UTC)

What I mean is, when you link the (still single) Breton sense of a verb to its chosen item in Q-space, you should invoke a modifier that may be specific to this Breton verb-and-noun part of speech that a Breton language engine can interpret to determine the semantic relationship between the sense and the item. I'm not sure if this makes sense (!) since I neither speak Breton nor have any practical experience trying to make an AI engine "read" natural language and translate it into semantic data structures. But I think it should relieve you from the chore of creating multiple senses just to deal with a grammatical ambiguity; just link to one item, state it's "Breton-ambiguous" and move on to the next lexeme. SM5POR (talk) 20:55, 8 January 2023 (UTC)

To spell out what I mean:

Create a Q-item (if there isn't one already) for this Breton-specific grammatical concept, label it "Breton verb-noun ambiguity" or whatever you find appropriate in your preferred language.
For every Breton verb that is "afflicted" with this concept and you have found a semantic item for, add the statement L941685-S1 → item for this sense (P5137) → urine (Q40924), with the qualifier subject has role (P2868) → Breton verb-noun ambiguity, to the only sense normally found. There you have your model. Don't create a second sense merely to link to a different item about essentially the same subject with different grammar. The dual (?) grammar is in your qualifier on the lexeme side, not in the Q-item.

SM5POR (talk) 21:42, 8 January 2023 (UTC)
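The suggestion spelled out above can be sketched as a small data structure. This is only an illustration in Python, not the actual Wikibase JSON format: the class and function names are invented here, and the "Breton verb-noun ambiguity" item has no Q-id because it is only proposed in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ITEM_FOR_THIS_SENSE = "P5137"  # property named in the thread
SUBJECT_HAS_ROLE = "P2868"     # qualifier property named in the thread


@dataclass
class Qualifier:
    property_id: str
    value_label: str
    value_id: Optional[str] = None  # None: the item does not exist (yet)


@dataclass
class SenseStatement:
    sense_id: str
    property_id: str
    value_id: str
    qualifiers: List[Qualifier] = field(default_factory=list)


def link_sense(sense_id: str, item_id: str, role_label: str) -> SenseStatement:
    """Link a single sense to one item and record the grammatical role as a qualifier."""
    role = Qualifier(SUBJECT_HAS_ROLE, role_label)
    return SenseStatement(sense_id, ITEM_FOR_THIS_SENSE, item_id, [role])


if __name__ == "__main__":
    # One sense, one item, one qualifier; no second sense is created.
    statement = link_sense("L941685-S1", "Q40924", "Breton verb-noun ambiguity")
    print(statement)
```

The point of the shape is that the grammatical duality lives in the qualifier on the lexeme side, while the Q-item stays language-independent.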

Oh, and I forgot: You do know that there is a property model lexeme (P11464) now, similar in intent to model item (P5869) that was introduced years ago but we are currently trying to improve (see Property talk:P5869#Quality of model items for further details)? SM5POR (talk) 20:41, 8 January 2023 (UTC)

Only now did I discover has semantic argument (P9971) which does part of what I want to achieve with item for this sense (P5137) and subject has role (P2868). I still need to think this over. SM5POR (talk) 09:48, 9 January 2023 (UTC)

@VIGNERON: In light of this, I'd like to retract my specific suggestion above and refer you to the property has semantic argument (P9971) with documentation at Wikidata:Lexicographical data/Thematic relations instead. I still see no reason to create a second sense for each Breton verb. Distinct senses are meant for homographs belonging to the same lexical category but having different (maybe even mutually unrelated) semantics, not for different grammatical aspects. --SM5POR (talk) 18:06, 10 January 2023 (UTC)

@SM5POR: that's a lot of comments, thanks. I'm not sure how to answer them all.

For the Q-item, as I said, it seems to be verbal noun (Q1350145) (or something close enough; I could create a subclass for Breton, but is it really a good idea? And before anything, I need to compare more grammar books to better reference it, as Breton is a highly unstandardized language).

It's not some verbs, all verbs in Breton behave like this.

For « Don't create a second sense », it's indeed probably a good idea.

Another question is whether to have separate lexemes or not; it seems that dictionaries sometimes see only one word and sometimes they see two (not sure if there is a logic here; it probably depends on how often a verbal noun is used in one of its values; I'll also look this up).

That sounds like a good idea, but I'm not sure I understand how you want to use has semantic argument (P9971) here (especially not inside only one sense); could you show us? For instance on sandbox (L123) (+ ping @Mahir256: who knows this property well).

Cheers, VIGNERON (talk) 19:35, 10 January 2023 (UTC)

I retract also my suggestion to create another Q-item until I understand better how has semantic argument (P9971) is meant to work. If this issue indeed pertains to all Breton verbs (as I suspected), then a separate qualifier seems even less appropriate, and we need to compare verbs from different languages with the Breton ones to find out whether there is a modeling problem in the first place.

I'm at a disadvantage as Google Translate doesn't support Breton. You may however be able to help me by identifying Breton lexemes (actually senses, to be precise) that are close translations of the following English ones:

--SM5POR (talk) 20:24, 10 January 2023 (UTC)

@VIGNERON: Here is a short section on en:Breton language#Verbal aspect. Does it relate to the verbal noun form we are discussing, or is it something else? Is there anything missing that you would like to see included in this description? --SM5POR (talk) 09:25, 11 January 2023 (UTC)

@SM5POR: build (L16-S1) is sevel (L630460-S3) and building (L3870-S1) is either sevel (L630460-S3) or savadur (L942312) (to simplify, not mentioning synonyms).

Verbal aspect is something entirely different (it's conjugation, a bit like "I do"/"I'm doing" in English).

In a nutshell, what I want to introduce here is that the "infinitive/verbal noun" of a verb is also a noun in Breton.

Cheers, VIGNERON (talk) 17:55, 11 January 2023 (UTC)

Thank you. Looking at the verb, I see that it has only two forms sével (L630460-F1) and sevel (L630460-F2) listed so far, and they are both indicated as "infinitive". I guess "é" vs "e" means they differ in pronunciation. Do they differ in some other respect as well, such as time, case or person? You said earlier that the infinitive serves as a noun; is that true for both of these forms, or just one of them?

I assume there are other forms as well that are not infinitive, such as past time or imperative; you don't have to list all of them, but just one or two more forms would be helpful as illustration. The English verb build (L16) lists five, but I'm not convinced those are all that exist. The corresponding Swedish verb bygga (L38399) lists nine and the Norwegian ("nynorsk") bygga (L743415) lists 15...

But the important Breton form here is the infinitive, if I got that right? Can you explain what becomes a problem if sevel (L630460) remains listed as a verb only? Does it confuse statements about grammar or word order, and can you give an example of such a statement? Because that's the problem I suppose has to be solved. SM5POR (talk) 20:19, 11 January 2023 (UTC)

@VIGNERON: By the way, this discussion is growing pretty long, and I'm concerned it may be of limited interest to the lexicographic editor community at large, unless, as you mentioned, there could be similar problems in other languages such as Arabic? We don't have to occupy this forum, as there are over 100 million items with potential talk pages, and this topic would probably be suitable for Talk:Q23541305 (which may be created as the conjugation in Breton (Q23541305) talk page). But I won't decide on my own; that will be up to you and others here to suggest. --SM5POR (talk) 20:57, 11 January 2023 (UTC)

I'll stop here for now and wait for other points of view, and still mark the verbal noun as infinitive for now (and no, infinitives don't have time, case or person in Breton; AFAIK no language has inflection for the infinitive based on those aspects, as that would defeat the point of "infinitive"; and yes, there are hundreds of forms for verbs in Breton, I may add them later). Cheers, VIGNERON (talk) 17:20, 13 January 2023 (UTC)

@VIGNERON Perhaps there are categories other than "infinitive" or "verbal noun" that may fit better? Punjabi verbs do not have an obvious infinitive form, and they have multiple forms which function as verbal nouns, so I use more specific values to differentiate them. The citation form varies and the one used on lexemes is actually not the most prevalent, so they just use the form most similar (but not equivalent) to the citation form of Hindustani verbs (not my decision).

For example, ਲੰਘਣਾ/لنگھݨا (L942901-F9) is a gerund (one type of non-finite verbal noun) which has no value for aspect. Verbal nouns in Punjabi do inflect for case just like "regular" nouns but the main thing that would disqualify them from having separate noun lexemes is that they do not have a gender. ਲੰਘਣ/لنگھݨ (L942901-F10) would be the oblique case form of the gerund, which is the most common form used as head words in dictionaries. (The Arabic word for Semitic roots, مصدر is used to describe this in books in the language, despite the irrelevance of this term to Punjabi. In Urdu/Hindustani it is used to mean "infinitive" but applied to forms which are neither the infinitive nor the equivalent of the Punjabi مصدر.)

ਲੰਘਿਆਂ/لنگھیاں (L942901-F13) is a different type of non-finite verbal noun, the absolute construction, which does have aspect, and no longer has case. (Locative case inflections of these were part of Old Punjabi so I add them for historical interest as in ਲੰਘਿਐ/لنگھِئَے (L942901-F8)).

Maybe neither "absolute construction" nor "gerund" fits and there is a source that mentions a different type that is a good fit for Breton, but looking into articles about those topics could lead to more information in that direction. عُثمان (talk) 22:13, 14 January 2023 (UTC)

Loanwords that have been adapted into a language with slight variations. Same lexeme?

For instance ホッチキス (L943113) is an adaption of Hotchkiss (L943117). But there seem to be no strict rules on how words have to be adapted so jmdictdb lists two spellings/pronunciations:

ホッチキス；ホチキス

I think cases like this should be treated like a spelling variant (Q115819543), even though it is more of a spelling+pronunciation variant. Hence it should be one lexeme with two forms. The most common variant should be used as a lemma. Opinions? -- Loominade (talk) 09:13, 13 January 2023 (UTC)
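As an illustration of the "one lexeme with two forms" proposal, here is a minimal Python sketch. It is not the Wikibase data model: the class names and the generated form IDs are hypothetical, the "ja-kana" language code is an assumption, and the order of the variants is taken as given, since which spelling is actually more common is not settled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Form:
    form_id: str         # hypothetical ID, generated for this sketch
    representation: str
    language_code: str


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    forms: List[Form]


def lexeme_with_variants(lexeme_id: str, variants: List[str], language_code: str) -> Lexeme:
    """Build one lexeme whose forms are the given spelling variants.

    The first variant is used as the lemma; the caller decides which
    variant counts as the most common one.
    """
    if not variants:
        raise ValueError("at least one variant is required")
    forms = [
        Form(f"{lexeme_id}-F{index}", spelling, language_code)
        for index, spelling in enumerate(variants, start=1)
    ]
    return Lexeme(lexeme_id, variants[0], forms)


if __name__ == "__main__":
    stapler = lexeme_with_variants("L943113", ["ホッチキス", "ホチキス"], "ja-kana")
    print(stapler.lemma, [form.representation for form in stapler.forms])
```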

This is related to how we should document the origin of a word. I suppose derived from lexeme (P5191) is meant for this. I see at least a theoretical problem here, as it has data type lexeme, and is constrained to be placed on a lexeme only. But loanwords usually carry semantics, not just their lexical similarity, and I have a case of the Swedish word sky (L593197) where the sense sky (L593197-S1) (juice) is derived from French jus (L24814-S1) and sky (L593197-S2) (heaven above) from English sky (L3632-S1) (I think; it sure looks that way although the pronunciation in both senses is like the French word).

This isn't necessarily a practical problem as I can place both origins on the lexeme base with subject sense (P6072) and object sense (P5980) qualifiers. Also, I think these should perhaps even be different lexemes in Swedish, since the plural forms (which have been excluded here) probably differ ("skyer" vs "skyar"; I hardly ever find reason to use these words in plural).

Swedish has also imported the word juice/jos (L406892-S1) from English which in turn imported it from French, so it's related to sky (L593197-S1), but since they differ in spelling, pronunciation and even sense, this makes it a more general situation where characterizing it as a spelling issue would at best be insufficient, and at worst irrelevant.

Maybe these examples of loanwords can be of help to you, I don't know. --SM5POR (talk) 13:28, 13 January 2023 (UTC)

I'm sure ホッチキス and ホチキス have exactly the same origin and semantics. It is literally the same word pronounced differently. Loominade (talk) 13:48, 13 January 2023 (UTC)

I have to correct myself; it's actually the English sky (L3632-S1) that is derived from Old Norse ský (similar to Icelandic; can't find the lexemes but see Wiktionary) meaning "cloud". As this happened maybe 1,000 years ago (several centuries before the Swedish word acquired its additional sense from French) and pronunciation has evolved differently in English and Swedish, they don't sound at all similar. --SM5POR (talk) 14:49, 13 January 2023 (UTC)

Notified participants of WikiProject Japan Loominade (talk) 09:13, 13 January 2023 (UTC)

Hello. Both staples are still used today in Japanese conversation. The Japanese Standards Association has adopted the ホッチキス as the standard document (is the above answer sufficient for once?). Araisyohei (talk) 13:31, 13 January 2023 (UTC)

the question is: do you agree with this modelling? Loominade (talk) 13:53, 13 January 2023 (UTC)

Newspapers use "ホチキス". This is because that is the way the glossary for journalists is written. It has not been determined which is the nominal form. Afaz (talk) 00:34, 14 January 2023 (UTC)

Also, some forms use spelling variant (Q115819543) as a grammatical feature. This seems to be a smart idea to me. How about we create an item for adaption variant (name tbd) and do the same? -Loominade (talk) 09:24, 13 January 2023 (UTC)

That could actually work for Swedish too, as juice/jos (L406892-S1) is sometimes spelled "jos" with no change in history, semantics or pronunciation. It will require listing 16 forms of the noun though, rather than just eight. Hypothetically, if there were to be three (or more) spelling variants, how would you mark that? If both were marked "adaption variant" you couldn't automatically tell which forms belonged to the same variant. SM5POR (talk) 15:12, 13 January 2023 (UTC)

@Josve05a, @Belteshassar, @Kriomet, @Moonhouse, @So9q, @Fnielsen, @Loominade, @Araisyohei: However, since the pronunciation is the same (as far as I know) and only the spelling differs in my case, I added the spelling variant to each of the eight forms (and the lexeme base lemma) instead, using "sv-x-Q1033526" (for phonemic orthography (Q1033526)) as the language code. I hope this doesn't violate some written or unwritten best practice guideline (now pinging a number of editors having a part in the history of this particular lexeme or the alternate spelling issue in general; my apologies if this is old stuff to you but I was actually surprised that "jos" hadn't made it into a lexeme yet).
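The approach described here, one form carrying several spellings keyed by language code, can be sketched as follows. This is a simplified Python illustration rather than the real Wikibase JSON; the form ID "L406892-F1" and the function name are assumptions, and only the lemma form is shown (the other seven forms would be handled the same way).

```python
from collections import defaultdict
from typing import Dict, List

PLAIN = "sv"                # ordinary Swedish spelling
PHONEMIC = "sv-x-Q1033526"  # variant code used in this thread (phonemic orthography)

# Each form maps language code -> spelling. The form ID is hypothetical.
forms: Dict[str, Dict[str, str]] = {
    "L406892-F1": {PLAIN: "juice", PHONEMIC: "jos"},
}


def spellings_by_variant(all_forms: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Group spellings by language code, so each variant's forms stay together."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for form_id in sorted(all_forms):
        for code, spelling in all_forms[form_id].items():
            grouped[code].append(spelling)
    return dict(grouped)


if __name__ == "__main__":
    print(spellings_by_variant(forms))
```

Because the language code identifies the variant, a third or fourth spelling would simply be another key on each form, and the grammatical features would not need to be duplicated.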

Come to think of it, couldn't you also (in the Japanese case above) add multiple pronunciations to the same form using the variant language code as the value of some distinguishing qualifier, rather than duplicating all the grammatical features on two "forms of one form"? --SM5POR (talk) 20:00, 13 January 2023 (UTC)

we apply pronunciation audio (P443) and IPA transcription (P898) to forms. I think there should be only one possible pronunciation per form, if not qualified with a dialect. also ホッチキス and ホチキス are both ja-kana. There is no special set of rules behind it like in sv-x-Q1033526. – Shisma (talk) 09:47, 14 January 2023 (UTC)

I do not think this is a great idea--there is nothing grammatical about spelling variation. I am also hesitant to over-emphasize spelling when this is not really relevant to the spoken language that lexemes are also meant to represent.

I would say the best thing to do is figure out why there is a variation in spelling and/or pronunciation and demonstrate that in some way, either with an additional custom language code representation or a form qualified with "variety of form" or similar. I realise that this is not always easy to do, but even small differences in how a loan is adapted can have quite specific reasons. عُثمان (talk) 21:51, 14 January 2023 (UTC)

What I've seen @Jon Harald Søby: do regarding Bokmål and Nynorsk lexemes is to have separate lexemes for different spellings whenever that spelling pervades the lexeme's paradigm: all right (L984735) and ålreit (L451802); ålreit (L984737) and all right (L984736). Mahir256 (talk) 17:20, 16 January 2023 (UTC)

« there is nothing grammatical about spelling variation. »

That is true. If anything it should be a statement of the form. I just discovered: alternative form (P8530). This should be used instead I guess.

« I would say the best thing to do is figure out why there is a variation in spelling and/or pronunciation and demonstrate that in some way, either with an additional custom language code representation or a form qualified with "variety of form" or similar. »

I don't think there is a better reason than: somebody thought hochikisu sounds closer to hotchkiss than hocchikisu does 🤷. Why not use forms in this case? -- Loominade (talk) 08:58, 17 January 2023 (UTC)

@Loominade Using separate forms with alternative form (P8530) makes sense if that is the case, or even if the reason is uncertain.

I am not familiar with Japanese phonology, but in the languages I have been adding lexemes in the tendencies of loan variants are regular enough that you can reliably identify their origin in many cases. All I mean is situations like this: the English word "data" has been loaned as both "ڈیتا" (DeTa) and "ڈاتا" (DATa) into Punjabi. However, the long vowel in ڈاتا dATa is out of the ordinary for English to Punjabi loans, regardless of the variation in English pronunciation. We see "bat" become "beT" and we see "plate" become "paleT." However, the one place you do see long vowels in loan words in Punjabi is in loans from Hindi/Urdu. So very likely the version ڈاتا "DATa" is a Hindi loan rather than a direct English one. We see further evidence in the fact that Urdu uses DeTa and Hindi and Punjabi writers in India use DATa. This is because despite Urdu and Hindi being registers of the same language, Punjabi is the language of the majority in Pakistan where Urdu is used, but Hindi is the language of the majority in India. That is why "Wikidata" is spelled differently in the Pakistani and Indian versions of the Punjabi translations even though it is the same language. I would be inclined to create separate lexemes in situations like this to demonstrate which forms are a result of indirect borrowing. عُثمان (talk) 19:21, 17 January 2023 (UTC)

Glosses from online dictionaries and compliance with copyright rules

I have a question regarding copyright of some data that can be added to Wikidata lexicographical data.

Some online dictionaries show the common messages "All rights reserved" or "© Copyright 2023" at the bottom of their website and they don't mention compliance with CC licenses anywhere. Some of such dictionaries are:

DLE online dictionary (Q106170961). Their terms of use are written in Spanish and can be found at this link.
Longman Dictionary of Contemporary English Online (Q115773557). Their terms of use are written in English and can be found at this link.
Merriam-Webster (Q585329). Their terms of use are written in English and can be found at this link.

Is this enough to say that the glosses that they show for words can't be added as gloss quote (P8394) (even when stating the source of those glosses in the references)?

-- Rdrg109 (talk) 00:34, 20 January 2023 (UTC)

For the record, I have asked the same question in the Telegram group of Wikidata Lexicographical data (direct link to message). -- Rdrg109 (talk) 00:36, 20 January 2023 (UTC)

@Rdrg109: if they are indeed copyrighted, you normally can't copy them. See also m:Wikilegal/Lexicographical_Data. That said, it depends on the scale: a mass import is obviously a big no, but a one-time short quote is probably ok (under the right to quote (Q2083900) exception).

Besides the copyright issue, I think an important question is the worth of the quote in itself: does a specific quote really bring something to the lexeme where it's used (something more than e.g. a link to said quote or a similar quote from another source)?

Cheers, VIGNERON (talk) 09:35, 28 January 2023 (UTC)

Japanese adjective modeling

hey, where should I put the information that 小さい/ちいさい (L33484) is an い-adjective? should I put it as the lexical category? is it a paradigm class (P5911) or should I add it as instance of (P31)? what is the item for い-adjective? I was only able to find adjectival noun (Q1091269) 🤷

Notified participants of WikiProject Japan – Shisma (talk) 14:48, 30 January 2023 (UTC)

All Japanese lexemes whose lexical category is adjective (Q34698) (形容詞) are "い-adjectives". Adjectives in other languages are equivalent to both 形容詞 (adjectives; い-adjectives) and 形容動詞 (adjectival nouns; な-adjectives) in Japanese. --Okkn (talk) 04:24, 31 January 2023 (UTC)

thanks for the clarification. So for instance:

小さい/ちいさい (L33484) → Lexical category: adjective (Q34698) (い)

好き/すき (L307232) → Lexical category: adjectival noun (Q1091269) (な)

– Shisma (talk) 17:51, 31 January 2023 (UTC)
