Wikidata talk:Lexicographical data/Archive/2018/07

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

incomplete lexemes

It would be great to have a tool to easily find incomplete lexemes per language: those with no forms and (later, when ready) those with no senses. Currently many lexemes are created with just a language and lexical category, without any form. Such a tool would make it easier to maintain and complete them. KaMan (talk) 07:24, 2 July 2018 (UTC)

Good idea KaMan. A simple first way to do it would be to list all lexemes under a certain size, under 500 bytes for instance. Later, tools, constraints and queries will be needed. Cdlt, VIGNERON (talk) 11:09, 2 July 2018 (UTC)

I guess the question is how much redundancy we want. At this stage, it might be interesting to link the corresponding items for nouns and possibly add a statement about its declension. Eventually at least one form should be filled by the software (when development gets there).
For verbs, I think the interface isn't really suitable. Once there are a series of forms, it's hard to see what's missing, e.g. Lexeme:L47.
--- Jura 11:37, 2 July 2018 (UTC)
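
The check proposed at the top of this thread (lexemes with no forms and, later, with no senses) could be sketched roughly as follows. This is only an illustration, assuming the lexemes have already been loaded as dictionaries with "id", "language", "forms" and "senses" keys; the function name and the sample records are hypothetical placeholders, not real data.

```python
def find_incomplete(lexemes, language):
    """Group the IDs of lexemes in one language by what they are missing."""
    report = {"no_forms": [], "no_senses": []}
    for lexeme in lexemes:
        if lexeme.get("language") != language:
            continue
        # An absent key and an empty list are both treated as "missing".
        if not lexeme.get("forms"):
            report["no_forms"].append(lexeme["id"])
        if not lexeme.get("senses"):
            report["no_senses"].append(lexeme["id"])
    return report


# Hypothetical sample records (placeholder IDs): Q1860 is English, Q809 is Polish.
sample = [
    {"id": "L1", "language": "Q1860", "forms": [{"id": "L1-F1"}], "senses": []},
    {"id": "L2", "language": "Q1860", "forms": [], "senses": []},
    {"id": "L3", "language": "Q809", "forms": [], "senses": []},
]

print(find_incomplete(sample, "Q1860"))
# {'no_forms': ['L2'], 'no_senses': ['L1', 'L2']}
```

A size threshold, as suggested in the reply, would be a cruder stand-in for the same check when the structured data is not at hand.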

Lexemes should have the ability to have Wiki-links

In the current version the lexeme feature doesn't allow interwikilinks to be added to lexemes. Given that there are some Wikipedia articles that are about individual words, I think the ability to create those links is valuable. Otherwise, whenever Wikipedia has an article about a concept and another article about a name for the concept, we have a mess in our ontology. With the ability to add sitelinks to interwikilinks we can also clean up problems like the human (Q5)/Homo sapiens (Q15978631) duplication. ChristianKl ❪✉❫ 15:48, 30 June 2018 (UTC)

@ChristianKl: items can have sitelinks, and soon it will be possible to link lexemes and items; wouldn't that solve the problem? (And more elegantly, I think: Homo sapiens (Q15978631) is not a word, it is a concept that can be represented by thousands of words.) Cdlt, VIGNERON (talk) 16:07, 30 June 2018 (UTC)

The problem is that both "human" and "homo sapiens" refer to the same concept. Currently, that means Albert Einstein (Q937) isn't instance of (P31) of Homo sapiens (Q15978631) and as a result it's not possible to infer that Albert Einstein (Q937) is a primate (Q7380). It leads to further questions whether the taxon in which a "human bone" exists is "human" or "homo sapiens". It's messy. The ability to link lexemes with items doesn't help at all with the problem. ChristianKl ❪✉❫ 11:23, 1 July 2018 (UTC)

@ChristianKl: First, "human" and "homo sapiens" are not *exactly* the same concept (if they were, we would only have one item). But, yes, you're right, they are very close and not well managed ontology-wise right now. But I don't understand how linking en:Human to the lexeme "human"@en and en:Homo sapiens to the lexeme "Homo sapiens"@en will solve the situation. Plus, if we really want links on Lexemes, they should be to wiktionaries, not to Wikipedias (we already have items for that). More importantly, I fail to see why you want to directly link en:Human to the lexeme "human"@en when there will probably be an indirect link: lexeme "human-S1"@en to human (Q5) and human (Q5) to en:Human. Could you explain a bit, please? Cdlt, VIGNERON (talk) 17:22, 1 July 2018 (UTC)

I didn't advocate linking en:Homo sapiens to the lexeme "Homo sapiens"@en but to link it to human (Q5).

Every Wikipedia article is supposed to be linked to exactly one object in Wikidata. This is necessary because it's the way Wikidata provides sitelinks for Wikipedia. I would like to keep that guarantee but say that sometimes that link is to an object in the Q-namespace and sometimes it's to an object in the L-namespace.

If we only link one of the two to Wikidata, we say that human (Q5) is instance of (P31) common name (Q502895). Saying that Albert Einstein (Q937) is instance of (P31) of something that's instance of (P31) a name, feels like it violates ontological assumptions.

To me Albert Einstein (Q937) feels like something that should be instance of (P31) of something that can have properties like a heart rate.

For the human/homo sapiens case this might seem like a hack. However, there are articles like https://en.wikipedia.org/wiki/While on Wikipedia that are clearly about individual words. It makes sense to link articles like that directly to lexemes. The relationship between house@enwiki and house@wiktionary is not the same relationship as the relationship between while@enwiki and while@enwiktionary.

If we have items like while (Q7993606) in the Q-namespace then it's hard to explain why similar items for other words shouldn't be notable in the Q-namespace. ChristianKl ❪✉❫ 18:18, 1 July 2018 (UTC)

Oh, now I understand you better and I totally agree that the P31 of Albert Einstein (Q937) (or any individual living being, e.g. Bear JJ1 (Q492389) has for P31 a taxon (Q16521)) can be problematic. When we do queries, we have to take detours to get correct results (which can be a bit speciesist BTW), but it's not undoable and it's possible to infer that Albert Einstein (Q937) is a primate (Q7380) (through human (Q5) said to be the same as (P460) Homo sapiens (Q15978631) and then through parent taxon (P171)). We may need to improve this. What I don't get is how having sitelinks on lexemes would help. while (Q7993606) is an exception (in fact, AFAIK, it's the only item about a word of a specific language); we shouldn't build the entire structure around one unique case, but true, the system should take exceptions into account. Cdlt, VIGNERON (talk) 23:15, 1 July 2018 (UTC)

@VIGNERON: "only item about a word of a specific language"? Hm. How about these 173 (and probably many more on deeper levels):

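# Items that have a language (P407) and are instances (P31) of word (Q8171)
# or of one of its subclasses, up to two subclass (P279) levels deep.
SELECT ?item ?itemLabel ?lang ?langLabel
WHERE
{
  ?item wdt:P407 ?lang.
  { ?item wdt:P31 wd:Q8171. } UNION
  { ?item wdt:P31/wdt:P279 wd:Q8171. } UNION
  { ?item wdt:P31/wdt:P279/wdt:P279 wd:Q8171. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en,ru,de,fr". }
}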


--Infovarius (talk) 21:40, 9 July 2018 (UTC)

@ChristianKl: I believe that this discussion is out of scope of lexicographical project (because human (Q5) corresponds to many thousands of lexemes, which is much more than 2 mentioned items), and rather belongs to Wikidata:Wikiproject Taxonomy (which is more tough, I must admit). --Infovarius (talk) 21:40, 9 July 2018 (UTC)

Some thoughts on what's missing

Since I've been adding a lot of English lexemes the last few weeks, I've had some thoughts about what is still missing (perhaps we need additional properties, or something else?) - a short list (not including the obvious things like merging, searching, etc):

A way to indicate that two lexemes are distinct despite having the same label, language and category - this is for the French tour case: Lexeme:L2330, Lexeme:L2331 and Lexeme:L2332 but it happens in English too, for example lie: Lexeme:L4180 and Lexeme:L4181. Once we have merging, we need a way to be clear these are cases that should NOT be merged. In item space we have different from (P1889), so a similar property for Lexemes perhaps?
Some understanding/agreement on what it means for a single form to have multiple "Grammatical features". For Lexeme:L4180, I've indicated form L4180-F2 with the features "simple present" and "third-person singular", as it is the form when both of those conditions apply. However, for form L4180-F4 I've indicated both "simple past" and "past participle", as it is the form when either of the conditions apply. Is this ok? Can we think of this consistently as a Boolean "OR" when the grammatical features are in the same category (tense for F4) but a Boolean "AND" when in different categories (tense vs number for F2)? Should this be better formalized somehow?
Some way of indicating that a particular form is part of the language, but used rarely. For example the "thou" forms of English verbs are a part of the language but used only very rarely - should we add "liest" as a form for Lexeme:L4180? How should it be indicated that this is a rare form? What about lexemes that are rare as a whole, perhaps we need some kind of frequency measure attached to lexemes and/or forms?
It might be helpful to indicate that a lexeme can have multiple categories with essentially the same meaning. This is true for a number of comparable adjectives in English that have the same form as adverbs - can we indicate they are both adjective and adverb at once? Or adjectives which are also nouns, nouns which are also adjectives, etc. The Oxford English Dictionary often lists multiple lexical categories on a single entry. Is there some way of doing this here (maybe this has already been discussed previously here??)

That's all for now - comments appreciated! I may mull on this a bit more and propose another property or two, if I don't hear better suggestions.... ArthurPSmith (talk) 01:07, 2 July 2018 (UTC)

@ArthurPSmith:

As for indicating that two lexemes should not be merged, we actually have the property homograph lexeme (P5402), which can partially cover markup for this problem.
As for multiple grammatical features, my view is that there should not be an "OR" relation. Separate forms should be created for such cases even if they are identical. If in the future we can add example sentences showing usage of a form, then each example will need to be assigned according to its grammatical features. With "OR" this assignment could be ambiguous.
As for rare forms, we already have a solution. Use instance of (P31) with rare form (Q55094451) in the statements of the form; that item was created exactly for this purpose (you can find more in Template:Lexicographical properties). We can attach a frequency measure with qualifiers and add a referenced source for this claim. There is also the case where one of the senses of a lexeme is rare.

KaMan (talk) 06:48, 2 July 2018 (UTC)

@KaMan: Thanks, I hadn't noticed that line in the template (I thought it was just about properties, but that's very useful thanks!). On the "OR" question - I would think "example sentences of usage" would be for senses, not for the forms. I do think the way I've been doing it is logically consistent and sustainable. Under your approach, for the English verb "put", you would enter it 3 times then, one for simple present, one for simple past, one for past participle? Also I'm not sure what under this suggestion would be the right way to handle most English verbs that have just one form for all but third-person singular - I've been assuming that just putting "simple present" there is sufficient, with the "simple present" + "third-person singular" on the other form as an overriding condition, is that your understanding also? ArthurPSmith (talk) 14:28, 2 July 2018 (UTC)

@ArthurPSmith: yes, for "put" (put (L4464)) there should be 3 forms identical in spelling but with different grammatical features. The Wikidata Lexeme Forms tool works on this assumption: it always creates separate forms, regardless of identical spelling. As an extreme example, in Kot (L2876) the single spelling "Kot" is repeated 14 times as forms with different grammatical features. That's also how it is described in external resources, and it's not something I came up with myself. I read about describing forms this way here, but I don't remember the thread. KaMan (talk) 07:14, 3 July 2018 (UTC)

@KaMan: I really don't believe that's the correct approach to word forms. Every source I can find, for example here suggests that by "word form" is meant a particular "shape" of a lexeme, and there's only one form per shape. If two forms are spelled the same but pronounced differently, I can see that being a reason to have two entries rather than one (for example English "read" in present tense vs past). But otherwise it seems to me it's a single word, i.e. the same set of letters between two spaces, pronounced the same, it doesn't make sense to have multiple entries. Quoting from the above reference - "The point about "crown", for example, is that as a transitive verb it would get one entry despite the existence of four different shapes in which it appears: crown, crowns, crowned, crowning. These different shapes spell out word forms that belong to the verb lexeme crown." That's 4 word forms, not 5 or more that distinguishing grammatical tenses would require. ArthurPSmith (talk) 14:31, 3 July 2018 (UTC)

@ArthurPSmith: That's four shapes; the source doesn't state how many forms they can represent. Actually, I can't see there any statement or implication that generally there's only one form per shape. On the contrary, in the end of the article the author mentions a possibility of distinct word forms that have the same shape (... while he says at the same time he won't count them for the specific purpose of finding the English word with most "forms"). I think his approach is understandable for English and other analytic and isolating languages which have (if any) a rather limited possibility of inflection, but it is not so suitable for other types of languages.--Shlomo (talk) 06:19, 4 July 2018 (UTC)

@ArthurPSmith: ok, so take a look at this real-life example with marchew (L5595). The form "marchwi" is repeated a few times. Let's take two:

L5595-F3 "marchwi" grammatical features: singular (Q110786), dative case (Q145599)
L5595-F9 "marchwi" grammatical features: plural (Q146786), genitive case (Q146233)

With your "OR" version of grammatical features, together with a single occurrence of the form, there would be this form:

grammatical features: singular (Q110786), plural (Q146786), dative case (Q145599), genitive case (Q146233)

How can one query what were the original features from this "or"ed list? KaMan (talk) 11:54, 4 July 2018 (UTC)
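
The information loss described in this comment can be illustrated with a small sketch. The form IDs and feature items are the ones quoted above; the data layout, the find_forms helper and the "L5595-merged" ID are hypothetical and only serve to show the effect.

```python
SINGULAR, PLURAL = "Q110786", "Q146786"
DATIVE, GENITIVE = "Q145599", "Q146233"

# Separate forms: each feature set is read as "all of these apply" (AND).
separate = [
    {"id": "L5595-F3", "representation": "marchwi",
     "features": frozenset({SINGULAR, DATIVE})},
    {"id": "L5595-F9", "representation": "marchwi",
     "features": frozenset({PLURAL, GENITIVE})},
]

# One merged form carrying the union of both feature sets (the "OR"ed list).
merged = [
    {"id": "L5595-merged", "representation": "marchwi",
     "features": frozenset().union(*(f["features"] for f in separate))},
]


def find_forms(forms, wanted):
    """Return the IDs of forms whose feature set contains every wanted feature."""
    wanted = set(wanted)
    return [f["id"] for f in forms if wanted <= f["features"]]


print(find_forms(separate, [SINGULAR, DATIVE]))  # ['L5595-F3']
print(find_forms(separate, [PLURAL, DATIVE]))    # []
print(find_forms(merged, [PLURAL, DATIVE]))      # ['L5595-merged']
```

With separate forms, a query for "plural dative" correctly finds nothing among these two forms, whereas the merged form matches it, because the original pairing of number and case can no longer be recovered from the flat list.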

Hmm, I suppose in a case like that I would advocate using combined features - "singular dative" OR "plural genitive". But I see there are issues here. Perhaps the better solution would be, in the English cases with the same "simple past" and "past participle" form, to use a combined feature there, as it's quite common. Maybe there's something already defined for that? I'm not a linguist! If anybody has a proposal for clearly explaining how we should use the "grammatical feature" aspect of forms I'd love to see it! ArthurPSmith (talk) 14:19, 4 July 2018 (UTC)

@ArthurPSmith: another example where two identical forms are needed is when one of them is an instance of (P31) rare form (Q55094451). There is no way to assign a claim on the form to just one of the grammatical features in the "OR"ed version. I had such an example today in coś (L5916), where forms L5916-F1, L5916-F3 and L5916-F5 are identical but only form L5916-F3 needs markup as a rare form. KaMan (talk) 12:51, 5 July 2018 (UTC)

@ArthurPSmith: As for "Example sentences of usage would be for senses, not for the forms": not necessarily. It's true when we use the example to pin down the sense (and to distinguish it from a similar sense or even another lexeme). Sometimes, however, we need an example to show the difference in using different variants of an inflected form, and in that case the right place for the example would be in the forms section.--Shlomo (talk) 06:38, 4 July 2018 (UTC)

New tool: graph builder

Yesterday evening I spent some time assembling the etymology of L129, and then I wanted to see the result graphically, so I hacked together a version of the Wikidata Graph Builder that works for lexemes: the Wikidata Lexeme Graph Builder. (It actually supports items and properties as well, but for those you might as well use the Wikidata Graph Builder, with its extra features and all.) The website is really rudimentary (I’ll add a title, instructions, etc. later), and it only supports forward searching, but it’s enough for simple graphs (start from one or more entities and follow statements of a certain property). --Lucas Werkmeister (talk) 10:23, 29 May 2018 (UTC)

@Lucas Werkmeister: what about inverse properties graph? (If I want to know all words derived from some Proto-European lexeme...) --Infovarius (talk) 14:35, 9 June 2018 (UTC)

@Infovarius: You can find those lexemes using Special:WhatLinksHere, but I don’t plan to integrate that into the tool. --Lucas Werkmeister (talk) 20:05, 10 June 2018 (UTC)

@Lucas Werkmeister: It's up to you, of course. But generally it seems more interesting to see a tree than a line (German nouns like L129 are an exception, I suppose). And yes, I ask for an upgrade of the tool :) --Infovarius (talk) 09:55, 12 June 2018 (UTC)

@Infovarius:

Done, turned out to be not so difficult to implement after all :) here are the words derived from *uber, for example: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L2087&predicate=P5191 --Lucas Werkmeister (talk) 17:54, 15 June 2018 (UTC)

Great! Thank you! --Infovarius (talk) 12:46, 16 June 2018 (UTC)

@Lucas Werkmeister: would it be possible to have several properties at the same time? For instance for having both derived from lexeme (P5191) and combines lexemes (P5238) (since Aberystwyth (L4730) derived from lexeme (P5191) Aberystwyth (L4729) and Aberystwyth (L4729) combines lexemes (P5238) aber (L4732)+ Ystwyth (L4731) would do a nice graph on https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L4730&predicate=P5191P5238 ). Cdlt, VIGNERON (talk) 07:06, 27 June 2018 (UTC)

@VIGNERON: sure, should be possible. I’ve filed #7, but I won’t have the time to implement this for a few days at least, unfortunately. --Lucas Werkmeister (talk) 20:58, 29 June 2018 (UTC)

@VIGNERON: Done, you just have to separate the property IDs with a comma (just like for the entity IDs), so the correct link for your example is https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L4730&predicates=P5191,P5238. --Lucas Werkmeister (talk) 13:02, 4 July 2018 (UTC)
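
A link of this shape can also be assembled programmatically. The following is a minimal sketch, assuming only the "subjects" and "predicates" query parameters shown in the links in this thread; the graph_url helper is a hypothetical name, not part of the tool.

```python
from urllib.parse import urlencode

BASE_URL = "https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/"


def graph_url(subjects, predicates):
    """Build a graph builder link from entity IDs and property IDs."""
    query = urlencode({
        "subjects": ",".join(subjects),      # entity IDs, comma-separated
        "predicates": ",".join(predicates),  # property IDs, comma-separated
    })
    return f"{BASE_URL}?{query}"


print(graph_url(["L4730"], ["P5191", "P5238"]))
# https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L4730&predicates=P5191%2CP5238
```

urlencode percent-encodes the comma as %2C, the same encoding that appears in one of the links further down this thread.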

@Lucas Werkmeister: wonderful, a small suggestion, could it have different colours and/or textures for different properties? Cdlt, VIGNERON (talk) 11:45, 5 July 2018 (UTC)

@VIGNERON: better now? --Lucas Werkmeister (talk) 15:30, 8 July 2018 (UTC)

@Lucas Werkmeister: yessss ! now I just have to add more lexemes to have a nice graph. Cdlt, VIGNERON (talk) 17:06, 8 July 2018 (UTC)

@VIGNERON: try this: https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L5573&predicates=P5191%2CP5238 KaMan (talk) 17:16, 8 July 2018 (UTC)

Minor update: support for forms should be much better now – see e. g. https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L123&predicates=P5188. --Lucas Werkmeister (talk) 14:50, 4 July 2018 (UTC)

@Lucas Werkmeister: could it be possible to automatically scale graph so it fits screen? I've just created https://lucaswerkmeister.github.io/wikidata-lexeme-graph-builder/?subjects=L6291&predicates=P5191 based on wiktionaries and cannot fit it in view area. KaMan (talk) 12:38, 10 July 2018 (UTC)

@KaMan: sorry, I’m not sure what the problem is – can you perhaps take a screenshot? (And does it help if you zoom out and then reload the page?) --Lucas Werkmeister (talk) 14:02, 10 July 2018 (UTC)

@Lucas Werkmeister: yes, zoom out and reload solved the problem, thanks. KaMan (talk) 07:48, 11 July 2018 (UTC)

misspelling (Q1984758)

@VIGNERON: and others. Are forms supposed to contain wrong forms like in aurochs/auroch (L5143)? KaMan (talk) 13:25, 1 July 2018 (UTC)

Hi KaMan, thank you for asking.

First, in general, it depends on what we call « wrong ». An obvious mistake made just once should not be in Lexemes, but what about a very common mistake (so common that sometimes the error is more common than the right form)? I see no reason not to have a common form, no matter if it's right or wrong.

Then, for the specific case of "aurochs", it's especially tricky: it was considered by most dictionaries to be a misspelling (Q1984758) (and still is), but because it was so common, since the orthographic corrections of French in 1990 (Q486561) it's not really a misspelling (Q1984758) any more (but the 1990 reform is kind of optional and most French-speaking people don't know about it, so "aurochs" is still the ). I have no precise idea of how to model that... (it's probably a bit too complex for the system), which is why I chose (choosed) a simple way that is not perfect. I'm open to any suggestion.

Cdlt, VIGNERON (talk) 15:09, 1 July 2018 (UTC)

@VIGNERON: what about new property for forms "commonly misspelled as" with text value? KaMan (talk) 16:08, 1 July 2018 (UTC)

KaMan why not, but why? I see no need for a property and I can see some disadvantages, like making it harder to query forms if not all forms are stored in forms. Cdlt, VIGNERON (talk) 17:08, 1 July 2018 (UTC)

It is an issue of broader scope than just misspelling (Q1984758). The forms can also be "wrong" due to their incorrect inflection, vocalization, accentuation, capitalization etc. Besides, as VIGNERON wrote, the "wrong" qualification is not a simple boolean value, but it has many colo(u)rs and shades. Some languages have a normative grammar that says what's correct and what's wrong. Other languages have just recommendations, and yet others only descriptions of what is used in which layer of the language. Also, the "wrongness" can be limited to a certain context (region, time, style, even sense) while in another context the form is considered correct. For these reasons I consider creating a separate record for the "wrong" form, and describing the type and scope of its wrongness using appropriate statements, to be a good solution for most cases.--Shlomo (talk) 06:26, 2 July 2018 (UTC)

@VIGNERON, Shlomo: Ok, I'm convinced to use separate forms but I feel uncomfortable that misspelling (Q1984758) is placed among grammatical features. I think it would be better to use instance of (P31) in form declarations with qualifiers describing "wrongness" and with references to the external statements where it is stated that's "wrong". KaMan (talk) 07:09, 2 July 2018 (UTC)

I agree that a statement in the form section is a better practice. I'm not sure about using instance of (P31), I think we need a specific property for this. Maybe several ones.--Shlomo (talk) 05:34, 11 July 2018 (UTC)

@VIGNERON: I've changed aurochs/auroch (L5143) so it now contains a reference to orthographic corrections of French in 1990 (Q486561) (though I'm unsure whether applies to part (P518) is the most suitable here). KaMan (talk) 12:43, 4 July 2018 (UTC)

@KaMan: IMHO it isn't, though I'm not sure which one is. Maybe determination method (P459), statement is subject of (P805) or start time (P580)? I'm not 100% happy with any of them. Or maybe let's make it a reference and link it through stated in (P248) to a specific publication.--Shlomo (talk) 05:34, 11 July 2018 (UTC)

@Shlomo: I agree that something better has to be chosen, but I know nothing about the nature of orthographic corrections of French in 1990 (Q486561), so I think @VIGNERON: is better placed to choose the right markup. KaMan (talk) 07:44, 11 July 2018 (UTC)

Another issue: If the misspelling (or other attribute) is related to the word's stem, should it be stated in the lexeme section, or in the form sections of every single form? Or maybe in both of them? Also, should a word with a commonly misspelled lemma have a separate lexeme, or should the "wrong" lemma be given as an additional representation (with corresponding code)? Or should we have only the "correct" lemma, and the misspelled variants just as forms?--Shlomo (talk) 05:51, 11 July 2018 (UTC)

@Shlomo: I would say it depends on how common the mistake is. But yes, in theory, the mistake should be on all forms. If the mistake is very common, it could be in a specific lexeme (just as common mistakes have their own Wiktionary entry, see fr:wikt:aréoport for aéroport). For me, the goal is having all words of the world in Lexemes, no matter how "correct" they're supposed to be (especially as "correct" can be a tricky and subjective point of view). Cdlt, VIGNERON (talk) 08:20, 11 July 2018 (UTC)

@VIGNERON, Shlomo: I've created a property proposal in relation to this discussion: Wikidata:Property proposal/correct form KaMan (talk) 15:12, 11 July 2018 (UTC)

@KaMan: thanks, a property could be useful but I'm not sure exactly whether it is a good idea and how it should be used... (to be discussed on the proposal). Cdlt, VIGNERON (talk) 15:40, 11 July 2018 (UTC)

present participle (Q13923816), present participle (Q24133704), and present participle (Q24577575)

Is there some difference between them, or should they be merged?--Shlomo (talk) 06:01, 9 July 2018 (UTC)

I'd be in favor of merging, but the Danish labels on the last two seem to differ, so maybe a Danish speaker should look at this? ArthurPSmith (talk) 13:47, 9 July 2018 (UTC)

@Fnielsen: Mahir256 (talk) 14:13, 9 July 2018 (UTC)

I have already posted the question here. Repeating: 'It is unclear whether these two items should be merged. The Danish article talks about general "present participle" in Danish and English, while the Turkish and Russian (AFAI can read) talks about English only.' The Danish article that links to present participle (Q24577575) mostly describes the Danish verb form (with -ende postfix), but also briefly mentions the English verb form (with -ing postfix). Both the Turkish and the Russian Wikipedia articles linked to present participle (Q24133704) seem to describe only the English form, while present participle (Q13923816) seems to be only about Dutch. I wouldn't mind merging them. I do not know if the Dutch agree with that. In Danish, "lang tillægsform" = "Nutids tillægsform" = "præsens participium" [1] and these are also similar to "lang tillægsmåde" = "nutids tillægsmåde" — Finn Årup Nielsen (fnielsen) (talk) 19:38, 9 July 2018 (UTC)

And what about this one: present participe in French (Q1763348)? That seems to be about French only. — Finn Årup Nielsen (fnielsen) (talk) 19:45, 9 July 2018 (UTC)

Yet another: present active participle (Q430255). Shouldn't both present active participle (Q430255) and present participe in French (Q1763348) use "subclass of" property rather than "instance of" class? — Finn Årup Nielsen (fnielsen) (talk) 19:47, 9 July 2018 (UTC)

I have now taken the liberty to merge present participle (Q13923816) and present participle (Q24577575). — Finn Årup Nielsen (fnielsen) (talk) 19:48, 9 July 2018 (UTC)

@Fnielsen, Shlomo: Thanks! The Russian link was only to a redirect to an article English grammar; the Turkish article was English-specific, but I think the idea is really the same general one that the Danish and Dutch articles cover, so I have merged those two items. I also modified the French and Latin ones to be subclasses rather than instances; they do seem to be somewhat distinct concepts. However - I wonder now if we have a bot that can fix the links to present participle (Q24133704) in the grammatical features of Lexemes? ArthurPSmith (talk) 13:05, 10 July 2018 (UTC)

derived from lexeme (P5191)

This property takes a lexeme. I think that's wrong: it should take a form of the lexeme as input. There is often a situation where a word comes from an inflected form rather than the basic form. For example, I linked marchew (L5595) with derived from lexeme (P5191) to *mrchy (L5600), but in fact it should be linked to form L5600-F2, the accusative singular. Are we able to link to forms, or should it be requested in Phabricator? KaMan (talk) 12:16, 3 July 2018 (UTC)

You would have to propose a new property ("derived from form") for this case, I think. Phabricator wouldn't help, it's something we can decide here on our own. ArthurPSmith (talk) 14:19, 3 July 2018 (UTC)

@ArthurPSmith: Ok, I have proposed a new property, but what I was referring to with Phabricator is that the interface has no suggester for picking a form. There seems to be only a picker for lexemes. KaMan (talk) 12:26, 4 July 2018 (UTC)

Form selection does work for form-valued properties. If you try the sandbox item Lexeme:L123 you will see the Sandbox-Form property where you can try this out. ArthurPSmith (talk) 14:14, 4 July 2018 (UTC)

Thanks, that's exactly what I was looking for. KaMan (talk) 12:41, 5 July 2018 (UTC)

@KaMan, ArthurPSmith: I agree that forms should be involved in derivation. And going a bit further, shouldn't "derive" be a property between form and form? At least for suppletive lexemes it seems needed, unless there is another way. I think we need to focus on the big picture here. Cdlt, VIGNERON (talk) 14:46, 19 July 2018 (UTC)

Duplicates

Tracked in Phabricator
Task T198105

How do we deal with duplicates for now? We have a series of lexemes from the Wikimania workshop, and (L7339), manger (L7395), (L7397) and (L7370) are already described in run (L279), manger (L309), ser (L5140) and libro (L317). KaMan (talk) 13:49, 22 July 2018 (UTC)

Before creating a new lexeme you should always check whether it already exists by using the auto-complete search on lexeme-valued properties (the test lexeme sandbox (L123) has a test property for this). ArthurPSmith (talk) 23:59, 22 July 2018 (UTC)

@ArthurPSmith: There is also the useful Ordia tool to detect duplicates, but the question is not how to detect them but what to do with them once they have occurred, as shown above. KaMan (talk) 10:29, 23 July 2018 (UTC)

@KaMan: I believe we are waiting on the development team to add a "merge" capability. I have "repurposed" a few duplicates (if the same string can act as both "noun" and "verb" for example, changing the type if that is otherwise missing) but I think this is frowned upon, so we need to just leave them for now. ArthurPSmith (talk) 11:39, 23 July 2018 (UTC)

@KaMan: Is it only my impression, or does the Ordia tool really show only the first seven lexemes found? If so, it soon won't be of much use to those searching for duplicates. In fact, it already isn't for some very short words...--Shlomo (talk) 19:54, 25 July 2018 (UTC)

@Shlomo: It seems you are right. I didn't know about this limitation. KaMan (talk) 06:24, 26 July 2018 (UTC)

I started adding P31=Wikimedia duplicated page. The problem is that this isn't visible when looking them up.
--- Jura 11:53, 27 July 2018 (UTC)

Given that they start adding up and external uses are somewhat limited due to the absence of query server, I'd just delete them for now.
--- Jura 12:14, 27 July 2018 (UTC)
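
For listing candidate duplicates like the ones discussed in this section, one possible approach is to group lexemes by lemma, language and lexical category. The sketch below is only an illustration: the record layout and the candidate_duplicates helper are hypothetical, and the language and category values attached to the lexeme IDs from this thread are assumptions made for the example.

```python
from collections import defaultdict


def candidate_duplicates(lexemes):
    """Group lexeme IDs by (lemma, language, category); keep groups of 2 or more."""
    groups = defaultdict(list)
    for lexeme in lexemes:
        key = (lexeme["lemma"], lexeme["language"], lexeme["category"])
        groups[key].append(lexeme["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Language and category values are assumed for illustration:
# Q150 French, Q1321 Spanish, Q24905 verb, Q1084 noun.
sample = [
    {"id": "L309", "lemma": "manger", "language": "Q150", "category": "Q24905"},
    {"id": "L7395", "lemma": "manger", "language": "Q150", "category": "Q24905"},
    {"id": "L317", "lemma": "libro", "language": "Q1321", "category": "Q1084"},
]

print(candidate_duplicates(sample))
# {('manger', 'Q150', 'Q24905'): ['L309', 'L7395']}
```

A shared key only marks a candidate: genuinely distinct homographs, such as the French "tour" lexemes mentioned earlier on this page, would land in the same group and still need a human decision.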

Forms for German verbs

Since I think this topic could be interesting for other languages as well, I would prefer to keep it accessible to people who don’t understand German. However, I recognize that I can’t force everyone who might reply to use {{LangSwitch}}.

I’ve started drafting what the set of forms of a lexeme for a German verb could look like, in the form of a template for the Wikidata Lexeme Forms tool at User:Lucas Werkmeister/Wikidata Lexeme Forms/German#deutsches Verb. I’ve tried to keep the number of forms down, so I omitted everything that derives from adding auxiliary verbs to the infinitive or past participle („ich werde tragen“, „ich würde tragen“; „ich werde getragen“, „ich wurde getragen“, „ich hatte getragen“, „ich war getragen worden“, etc.); however, merging e. g. first and third person plural forms seems excessive to me. I’ve also completely skipped the present participle („tragend“, „tragendes“, „tragender”, etc.), as well as all the inflected forms of the past participle („getragenes“, „getragener“, etc.), because I’m not sure what to do with these yet, since they kinda move into adjective territory, and I haven’t tackled adjectives yet either.

What are your thoughts on this? If you think it looks alright, I can add the template to the Wikidata Lexeme Forms tool and we can start creating some verbs with these forms, and then see how well the model fares in practice, I guess.

--Lucas Werkmeister (talk) 22:04, 21 July 2018 (UTC)

(With regard to the English translation,) the common concern regarding infinitives and past participles sounds much like those in the second bullet point of Arthur's thoughts above. Mahir256 (talk) 14:07, 23 July 2018 (UTC)

@Mahir256: thanks, I hadn’t seen that discussion; but I don’t think it’s the same case: I’m not merging two or more forms into one, and joining all their grammatical features into one set – I’m adding just one basic form, and then omitting the ones derived from it, not adding their grammatical features anywhere. If we were to add separate forms for those derived versions, I’m not sure what their representations should be: „er wird tragen“ is a complete sentence in future tense, but so is „Tim wird tragen“ (using a name instead of a pronoun), so clearly „er wird tragen“ can’t be the representation for 3rd person future tense. But should it be „wird tragen“ (with auxiliary verb) or just „tragen“?

I'm afraid there are a few more issues.

1. The hint sentences of the type "Ich [trage]" are in fact not very helpful. From the "ich" the editor knows that a 1st person singular form is expected, but has no clue whether it should be present or past, indicative or subjunctive.
2. In addition to the participles (I & II) there should also be gerundives ("zu tragende"), all of them in weak, strong and mixed declension, and the participles also undeclined. The good part is that we probably won't need declined forms of Partizipium II ("past participle") and Gerundivum for intransitive verbs, but even the declined forms of Partizipium I alone would make for quite a long page that is hard to survey. I guess nobody will bother to fill out all of the requested forms unless there is at least some automated tool that prepares the forms and lets the editor check and correct them. The same, by the way, would probably apply to adjectives. Still, this is only a technical issue, which will probably be solved by allowing bot imports.
3. I support the idea of omitting compound forms created with the help of auxiliary verbs. Still, I would prefer a broader discussion about it. This problem exists not only in German but in many other languages too, and I can imagine there could be good reasons for including the compound forms in some of them. The basic question is: do we want a unified system (as far as possible) across all lexicographical data on Wikidata, or is it OK for every language to take a different approach to this and possibly other issues? BTW, omitting compound forms doesn't always guarantee a small number of forms...
4. What about the reflexive forms (like "sich tragen")? Many languages consider the reflexive variant of a verb a separate lexeme; some German grammars and most dictionaries, on the other hand, assume that one verb (as a lexeme) can generally appear in three "Genera": active, passive and reflexive. Duden DUW describes only the active and passive Genus in the introduction, but subsumes the reflexive forms under the "non-reflexive" lexemes anyway. To make the situation more complicated, there are also so-called "echte Reflexivverben", which appear exclusively in reflexive forms (like "sich schämen"). I'd propose not to include the reflexive forms in the database and to treat them like other compound forms (although "mich", "dich", "sich" etc. aren't auxiliary verbs stricto sensu...). In this case, it should be indicated somewhere whether the verb has reflexive forms or not. There should probably be a similar indication for both of the passive forms (Zustandspassiv & Vorgangspassiv). I'm not sure what to do with the "reflexive-only" verbs, though.

That's it for now; I think there's enough to discuss already ;) --Shlomo (talk) 22:08, 25 July 2018 (UTC)

@Shlomo: oof, that’s a lot to think about… my general approach is to start slow and avoid some of these issues for now, I’m afraid :)

re 1: I disagree, since you also see the placeholder („trage“ in your example), which tells you the tense (as long as the placeholder verb isn’t one with an excessive amount of homographs).

re 2: I’m sorry, I don’t quite follow your point here… you’re saying that I should add additional forms, and then arguing that this will probably mean no one will bother filling them all in? But that’s why I don’t want to add them in the first place :)

I assume we’ll eventually have tools to automatically generate these forms, but I don’t want to go too far ahead with that, so for now I prefer to have the user fill in a subset of the forms manually. We can always add the rest later, when we have more experience and better tooling.

re 3 and 4: good points, but with no immediate effect on the tool and template, if I understand your arguments correctly. --Lucas Werkmeister (talk) 11:56, 29 July 2018 (UTC)

Maybe "sein" would be a better example? That one is more irregular and has different forms for Indikativ and Konjunktiv I, unlike "tragen". - Nikki (talk) 13:34, 29 July 2018 (UTC)

@Nikki: hm, but some of those forms feel odd, especially the past participle („ich werde gewesen“?). --Lucas Werkmeister (talk) 21:54, 30 July 2018 (UTC)

Hm, I’ve kinda changed my mind on point 1 – changing „Ich [trage].“ to „Ich [trage] heute.“ or „Heute [trage] ich.“ doesn’t hurt anyone, really… it’s a fairly small addition that already makes the sentence clearer. --Lucas Werkmeister (talk) 21:57, 30 July 2018 (UTC)

I think a simple list for verbs would be hard to follow, and I would suggest grouping the forms by tense/mood. Then there could be an example sentence for each group, and the rest is clear from the pronouns. For example, "Indicative present / e.g. ich trage es jetzt / ich ... / du ... / ..." or "Subjunctive II / e.g. Wenn ich sowas trüge / ich ... / du ... / ...". I would also prefer simple labels (in other words, they shouldn't be more complicated than they need to be). In my opinion it doesn't help to describe all the forms as "active" when we're not asking for the passive forms, or to include "Präteritum" when describing "Konjunktiv II" when we only want the form without auxiliary verbs anyway. - Nikki (talk) 11:48, 26 July 2018 (UTC)

@Nikki: fair enough – grouping the forms would require some new tool features, but I can at least simplify the form labels. Is Special:Diff/717659127 an improvement? --Lucas Werkmeister (talk) 11:56, 29 July 2018 (UTC)

Yeah, I think that's better. I would have probably left "Präsens" in though and used "Konjunktiv I" instead of "Konjunktiv". - Nikki (talk) 13:34, 29 July 2018 (UTC)

@Nikki: good point, changed both. Feel free to edit the template yourself while it’s a draft btw ;) --Lucas Werkmeister (talk) 21:53, 30 July 2018 (UTC)
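The grouping by tense/mood suggested in this thread can be sketched in a few lines of Python. This is only an illustration of the idea, not a feature of the Wikidata Lexeme Forms tool: the slot tuples, the `group_label` helper and the sample forms of "tragen" are all assumptions made for the example. Labels are kept simple, so "Konjunktiv II" carries no tense.

```python
# Hypothetical form slots: (pronoun, mood, tense, example form of "tragen").
FORM_SLOTS = [
    ("ich", "Indikativ", "Präsens", "trage"),
    ("du", "Indikativ", "Präsens", "trägst"),
    ("ich", "Indikativ", "Präteritum", "trug"),
    ("du", "Indikativ", "Präteritum", "trugst"),
    ("ich", "Konjunktiv II", None, "trüge"),
]


def group_label(mood, tense):
    """Build a simple group label; the tense is left out when it adds nothing."""
    return mood if tense is None else f"{mood} {tense}"


def group_slots(slots):
    """Group (pronoun, form) pairs under their tense/mood label, keeping input order."""
    groups = {}
    for pronoun, mood, tense, form in slots:
        groups.setdefault(group_label(mood, tense), []).append((pronoun, form))
    return groups


if __name__ == "__main__":
    for label, members in group_slots(FORM_SLOTS).items():
        print(label)
        for pronoun, form in members:
            print(f"  {pronoun} {form}")
```

With one heading per group, a single example sentence per group would be enough, and the pronouns distinguish the forms within it.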

On Special:NewLexeme, should "Language of Lexeme" not allow only languages?

On the empty Special:NewLexeme page, it should not be possible to type words other than languages in the language field. Indeed, if you put another word there, e.g. house, the software recognizes that it is not a language, puts it automatically in the field language of lexeme, and adds a field language of lemma for which only languages are available in the drop-down list. Should the field language of lexeme not behave the same way? Or is there something I missed here? Robby (talk) 17:13, 23 July 2018 (UTC)

@Robby: Did you search the archives? Not all languages are accepted either, see e.g. Wikidata talk:Lexicographical data/Archive/2018/05#Choice of the language and Wikidata talk:Lexicographical data/Archive/2018/06#Rare and pra-languages. See fjǫrðr/ᚠᛁᚰᚱᚧᚱ (L7699) for an example of a lexeme in a language that is not accepted in the language input field. --Njardarlogar (talk) 08:04, 29 July 2018 (UTC)

Wikidata:Lexicographical data/Statistics

I have created this page.--GZWDer (talk) 13:52, 30 July 2018 (UTC)

Thanks for creating this page!

FYI, it looks like the way it's implemented will not scale to the large number of Lexemes we'll have in the future. Hopefully we'll later have the Query Service ready to run this kind of query :) Lea Lacroix (WMDE) (talk) 14:15, 30 July 2018 (UTC)

@GZWDer: Did you break something with {{L}} such that all uses of it are giving Lua errors? (Or maybe @Lucas Werkmeister (WMDE): might know of some backend change?) Mahir256 (talk) 14:16, 30 July 2018 (UTC)

@Mahir256: Fixed.--GZWDer (talk) 14:18, 30 July 2018 (UTC)

Can you add a link to the Lexemes? That would help to correct lexical categories like article (Q191067) (= text that forms an independent part of a publication). --Kolja21 (talk) 17:38, 30 July 2018 (UTC)

@Kolja21: Module:Lexeme/data does not include IDs of any specific lexeme. However you can try this.--GZWDer (talk) 05:02, 31 July 2018 (UTC)

Thanks, but the numbers of lexemes are not always comparable, because some contributors in some languages just create lexemes without any content apart from lexical category and language. Would it be possible to compute statistics by average lexeme size per language? KaMan (talk) 08:18, 31 July 2018 (UTC)
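The statistic asked for here is simple to compute once per-lexeme sizes are available. The Python sketch below assumes the input is a list of (language code, size in bytes) pairs; how those pairs would be obtained is not covered, and the sample values are invented purely for illustration.

```python
from collections import defaultdict


def average_size_by_language(lexemes):
    """Return the mean lexeme size per language from (language, size_in_bytes) pairs."""
    totals = defaultdict(lambda: [0, 0])  # language -> [sum of sizes, lexeme count]
    for language, size in lexemes:
        totals[language][0] += size
        totals[language][1] += 1
    return {language: total / count for language, (total, count) in totals.items()}


# Invented sample data, not real Wikidata figures.
SAMPLE = [("pl", 2400), ("pl", 1600), ("de", 300), ("de", 500), ("de", 400)]

if __name__ == "__main__":
    for language, mean in sorted(average_size_by_language(SAMPLE).items()):
        print(f"{language}: {mean:.0f} bytes on average")
```

A low average would point to a language with many lexemes that carry little more than a language and a lexical category.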

Lexeme.getDetailLemma

As an attempt at advancing phab:T195382#4227355, I have modified getDetailLemma a fair amount to display both the language code of a given lemma and its lexical category in a manner somewhat consistent with the way entries in Special:Search are displayed whenever labels or descriptions not in your interface language need to be displayed in those results. See, for example, how piw ^{brezhoneg (interdialectal Breton orthography)}/piou ^{brezhoneg (academic Breton orthography)}/piv ^brezhoneg _{interrogative word} looks. The modifications are intended to make it more of an acceptable substitute for getLemma in all templates that use it, although in its current state there's much to improve and resolve, including ~~resolving ISO codes via language names present on Wikidata~~ (CLDR is good enough for now) and the placement of the langcode and category. Try it out (using the appropriate #invoke: syntax) and modify it how you will. Mahir256 (talk) 16:40, 30 July 2018 (UTC)

Thanks, but personally I prefer showing details the way it is already done in the interface: in a tooltip when the cursor hovers over the short label of a lexeme. KaMan (talk) 08:26, 31 July 2018 (UTC)

Periodic table of lexemes

I thought I would share my current work in progress on Polish lexemes for chemical elements, made just for fun: wikt:pl:Wikipedysta:KaMan/pierwiastki. KaMan (talk) 17:28, 31 July 2018 (UTC)