Wikidata talk:Lexicographical data/Archive/2020/12

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Author on a definition (scientific definitions) and constraint issues

So I've been trying to apply more semantic meaning to the Lexicographical model and see where things work, where they break, or where we need to improve documentation. In doing so, I added 2 important Senses for "epigenetics" to cover contemporary and historical contextual meanings and tried to apply authorship appropriately, but it seems to break down?

https://www.wikidata.org/wiki/Lexeme:L32914

Here's the source info paragraphs for those 2 Senses I added on that Lexeme: https://en.wikipedia.org/wiki/Epigenetics#Contemporary

Let me know your thoughts, and once we have some agreement, I'll improve the Lexicographical data docs. --Thadguidry (talk) 00:39, 27 November 2020 (UTC)

In regard to epigenetics (L32914), I am wondering how much text we can take from other copyrighted works. For gloss quote (P8394) I have so far kept to old books, e.g., Fremmedordbog (Q59155323) (published in 1882). I originally thought of gloss quote (P8394) as copying the definition text from dictionaries, hence the name "gloss", but I suppose one could also use definitions from the body text of non-dictionary works? — Finn Årup Nielsen (fnielsen) (talk) 14:40, 3 December 2020 (UTC)

Values for lexicographic category

The lexical category of a lexeme is unfortunately not a property, so we have no means of applying property constraints to it. The values Wikidata editors use for it are diverse, see, e.g., in Ordia (Q63379419): https://ordia.toolforge.org/lexical-category/ There might be a few obvious errors, for instance, invalid ID (L346619) has the lexical category village (Q532), but most are values selected by Wikidata editors that could/should be lexical categories. I think we should reduce the number of lexical categories in use and pick values that denote "as general a concept as possible". For narrow annotation, we can instead use instance of (P31). For instance, I suggest:

  1. Do not use family name (Q101352) as a lexical category. Instead use proper noun (Q147276) and move family name (Q101352) to instance of (P31).
  2. intransitive verb (Q1166153): Instead use verb (Q24905) and move intransitive verb (Q1166153) to instance of (P31).
  3. toponym (Q7884789). Use proper noun (Q147276) instead and move toponym (Q7884789) to instance of (P31).
  4. Instead of infix (Q201322), suffix (Q102047), prefix (Q134830) use affix (Q62155) and move to instance of (P31).
  5. Instead of Malayalam numeral (Q28771286) use numeral (Q63116) and move to instance of (P31).

Finn Årup Nielsen (fnielsen) (talk) 18:40, 2 December 2020 (UTC)
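The migrations proposed above can be located with a query against the Wikidata Query Service. A sketch for the family-name case (the standard WDQS prefixes such as ontolex:, wikibase: and wd: are assumed to be predefined):

```sparql
# Lexemes currently using family name (Q101352) as their lexical category;
# per the proposal these would move to proper noun (Q147276),
# with family name (Q101352) restated as instance of (P31)
SELECT ?lexeme ?lemma WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          wikibase:lexicalCategory wd:Q101352 ;
          wikibase:lemma ?lemma .
}
```

The same pattern, with the Q-identifier swapped out, covers the other four cases.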

  • It's unclear what instance of (P31) really should be doing on lexemes. Maybe we should rather stop using P31 on lexemes entirely. It does indeed overlap with lexical categories and making it some kind of optional sub-lexical category is hardly going to help.
    I don't see what your amalgamations are meant to fix, nor how they would fix the problem you outlined, i.e. that we have only limited constraint support for lexical categories. The solution for that would rather be to expand those. BTW, let's do away with numeral (Q63116) in lexical category. --- Jura 19:12, 2 December 2020 (UTC)
I think User:Fnielsen's suggestion is a good one, to limit the number of different lexical categories in use. With hindsight I'm not so sure that the special lexicographical features (like lexical category) are worth much relative to properties, because we have a lot more tools like constraints with properties... ArthurPSmith (talk) 21:57, 2 December 2020 (UTC)
« I think we should clean up the number of lexical categories and use values that denote "as general a concept as possible". » Yes, absolutely! I'm in 1000%, and not only for the lexical categories (the same principle of using atomic values applies elsewhere, especially for grammatical features). Didn't we already agree on that in a previous discussion here?
« we have no means of applying property constraints » Well, we still have many tools available: SPARQL queries and schemas, for instance. And also statistics on Ordia or Wikidata:Lexicographical data/Statistics/Count of lexemes by lexical category. I do some checks and corrections from time to time.
For your 5 proposals, I agree, except maybe for affix (Q62155). It doesn't apply to your specific example, but instance of (P31) is not always the best property; the information can sometimes be stored with another, more specific property (which would be even better — for instance, first conjugation impersonal verb (Q53768605) should go in conjugation class (P5186)).
PS: a terminology detail: it's "lexical category", not "lexicographic category".
Cheers, VIGNERON (talk) 09:28, 3 December 2020 (UTC)
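A sketch of the kind of SPARQL check mentioned above, counting lexemes per lexical category (the predefined WDQS prefixes are assumed; Ordia and the statistics page present similar counts):

```sparql
# Count of lexemes per lexical category, largest first
SELECT ?category ?categoryLabel (COUNT(?lexeme) AS ?count) WHERE {
  ?lexeme a ontolex:LexicalEntry ;
          wikibase:lexicalCategory ?category .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
GROUP BY ?category ?categoryLabel
ORDER BY DESC(?count)
```

Rarely used or obviously wrong categories (like village (Q532) above) surface at the tail of the result.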
I read this whole discussion, but I'm unsure what the best way forward is. I assume that it's not easy to introduce lexical category as a property? No? Constraints sound wonderful to me, but it's probably easier to just read through the list of lexical categories in use and weed out the unwelcome ones. I never did that though.--So9q (talk) 18:55, 6 December 2020 (UTC)
I looked through the work of fnielsen here https://www.wikidata.org/w/index.php?title=Special%3AWhatLinksHere&target=Q62155&namespace=146 and I must say it convinced me that affix + P31 for specifying is a viable way to go. Very nice work! I think we should make it an official policy to choose the broadest lexical category when adding new lexemes.--So9q (talk) 19:09, 6 December 2020 (UTC)
I agree with all 5 points except affixes. Also some properties (which are already in P31) can be modelled with has characteristic (P1552). --Infovarius (talk) 20:38, 7 December 2020 (UTC)

Can someone help import these?

It would be nice to do this import and also add "combines" statements with "series ordinal" qualifiers during it. Where lexemes are missing, the import should create those as well. https://en.wiktionary.org/wiki/Category:English_words_suffixed_with_-hood --So9q (talk) 14:04, 9 December 2020 (UTC)
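Before importing, one could check which of these lexemes already exist. A sketch (wd:Q1860 is English; the predefined WDQS prefixes are assumed):

```sparql
# English lexemes whose lemma ends in "hood" —
# candidates to match against the Wiktionary category before import
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q1860 ;
          wikibase:lemma ?lemma .
  FILTER(STRENDS(STR(?lemma), "hood"))
}
```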

Glossing non-English words in English (and vice versa)

I'm trying to figure out if I can use Wikidata to improve language learning resources on Wikibooks and Wikiversity, specifically by using lexemes as the basis for creating vocabulary lists and flash cards.

However, in looking over the way senses are used in the existing object graph, I had a few questions about the preferred style for the gloss property. For example:

  • The Irish word "uisce" and the German word "wasser", both meaning "water", are both glossed in English as "clear liquid H₂O"
  • The English word "water" is glossed in English as "common liquid substance"
  • The German word "Buch", meaning book, and the English word "book", are both glossed in English as "document"
  • The Irish word "féasóg", meaning beard, is glossed in English as "facial hair"

The documentation suggests that the "gloss" property should be used for a definition, which makes sense for an English gloss of an English word. But is that also the goal of a "gloss" from one language to another? Or would it make more sense for "uisce" and "wasser" to be glossed in English as "water" and "buch" to be glossed in English as "book"?

Basically, I'm trying to find out if these weird glosses I'm seeing are the result of bots being bots and I should just fix them? Or are they the intended use of this field and the actual translation of a sense should only appear in a sense translation statement?--Chapka (talk) 20:25, 6 December 2020 (UTC)

@Jacek Janowski: regarding the water and beard glosses. Mahir256 (talk) 20:38, 6 December 2020 (UTC)
Yeah, I think glosses in other languages than the language of the lexeme (and perhaps some close languages that are more widely used, for rare languages) are not very useful in general. ArthurPSmith (talk) 19:18, 7 December 2020 (UTC)
I would oppose using one-word translations into languages other than the language of the lexeme (especially into polysemous English), and favor multi-word glosses equivalent to a gloss in the lexeme's own language. --Infovarius (talk) 20:41, 7 December 2020 (UTC)

I want to make a case for short, plain language glosses in other languages. I think they could be extremely useful for foreign language learning applications.

  1. The purpose of a gloss as I understand it is to explain to a reader of the glossed language what an unfamiliar word means. With that stipulation, "Book" is the shortest possible gloss of that sense of "Buch".
  2. Not every sense of every lexeme will have an appropriate translation. For example: if I create a lexeme for the Irish seanfhocal (proverb) "Giorraíonn beirt bóthar" there won't be a sense for any existing English expression that exactly matches, and there doesn't seem to be a compelling reason to create a new one. Instead the simplest solution seems to be an English gloss: "Two people shorten a road".
  3. Single-word glosses have many practical applications that multi-word translated glosses don't. I'm looking for a solution for flash cards, but also imagine a foreign-language text where you could hover over unfamiliar words for English glosses. Which would be a useful gloss for "uisce": "Water" or "liquid H₂O"? I can't think of any application where a one-word gloss would be useful and accurate but a translation of a native-language gloss would be more useful.

Either way there needs to be a policy, because clearly there is no uniformity at the moment and the current documentation doesn't really address the issue. Perhaps I should draft an RFC on this setting forth a proposed policy and draft documentation?--Chapka (talk) 09:54, 8 December 2020 (UTC)


Here are some statistics to help understand the current situation with glosses:

  • we have 96 285 glosses (query); this is very low — as a reminder, we have 343 678 lexemes, 5 149 181 forms and 85 033 senses
  • the longest is 524 characters long (Lexeme:L70551#S3) and many are only one character long (which is understandable for glosses in Chinese, like L:L7973#S1); the average is 39 characters and the most common length is 16 characters (with 2706 glosses). Here is a query for a simplified bar chart (lengths divided by ten to be more readable)
  • for languages, we have a lot of glosses in Basque and English (respectively 31k and 26k), followed by many other languages under 5k glosses (bar chart query), and below is a log X/Y graph of length by language (only for lengths occurring more than 10 times, for readability)
  • as far as I can tell, and except for Basque, most glosses have been added by hand.

Cheers, VIGNERON (talk) 09:29, 10 December 2020 (UTC)
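The headline numbers above can be reproduced with a query along these lines (a sketch; glosses are stored as skos:definition on sense nodes in the WDQS RDF mapping):

```sparql
# Gloss count and average gloss length per language
SELECT ?glossLanguage (COUNT(?gloss) AS ?count) (AVG(STRLEN(?gloss)) AS ?avgLength) WHERE {
  ?sense a ontolex:LexicalSense ;
         skos:definition ?gloss .
  BIND(LANG(?gloss) AS ?glossLanguage)
}
GROUP BY ?glossLanguage
ORDER BY DESC(?count)
```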

Etymology of Lexeme (Inception, Creator, Period)

Hi All, regarding the etymology of words (specifically the period of time when they were first created or introduced into a language), which property should we use, as a best practice, in a Lexeme statement (following the data model)? Would that be inception (P571) when definitely known, plus valid in period (P1264)? Unfortunately the doc pages and data model visualization only say "period". The use case is dinosaur (L31823), which was created by Richard Owen (Q151556) in 1841 to describe the fossils of reptiles. He combined the Greek "deinos" and "sauros".

Then there is the question of which property to use to capture who coined the term or Lexeme? Let me know! --Thadguidry (talk) 15:18, 12 December 2020 (UTC)

That's a good question. I remembered that we also have time of earliest written record (P1249), which might be useful in this respect too. Ainali (talk) 17:09, 12 December 2020 (UTC)

Suggestion for policy: no images on lexeme senses

See how widespread the practice is: Wikidata:Lexicographical_data/Statistics/Count_of_usage_examples_and_sense_images_by_language.

I suggest we remove all images from our senses after making sure that the Q-items linked have all got at least one image.

It's not a good idea to have redundant data in Wikidata IMO, and tools like Ordia can easily fetch the image from the item for this sense (P5137)-linked Q-item instead. I have found Danish lexemes with an image on a sense while the linked Q-item was missing an image. Ping @fnielsen, mahir256: who I know contributed to this. WDYT?--So9q (talk) 16:54, 13 December 2020 (UTC)

The reason why I add images to senses is that the sense is more aligned to the language. For instance, "postkasse" would in Danish contexts be a red box, while in other contexts it may look different. It is possible to make a SPARQL query for the Q-items that are missing an image but where one or more linked lexeme senses has one. — Finn Årup Nielsen (fnielsen) (talk) 16:16, 14 December 2020 (UTC)
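The query fnielsen describes could look roughly like this (a sketch, assuming sense statements are exposed through the usual wdt: truthy predicates):

```sparql
# Q-items without an image (P18) whose linked lexeme sense does carry one
SELECT DISTINCT ?item ?senseImage WHERE {
  ?sense wdt:P5137 ?item ;    # item for this sense
         wdt:P18 ?senseImage .
  FILTER NOT EXISTS { ?item wdt:P18 [] . }
}
```

The hits could then feed a semi-automatic tool like the one So9q proposes below the query.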
Thanks for participating and sharing your argument :). I had not thought about that at all. I accept your argument and your example, but does this mean you only add an image when the Q-item does not have a suitable image, or do you always add one? Maybe the latter is the preferred way forward?
We/I could create a tool to semi-automatically add image from QIDs to Lexemes and vice versa.--So9q (talk) 05:04, 16 December 2020 (UTC)

Lexeme, belongs to list of words for a certain CEFR level: how to define

This question has already been discussed here. There's currently no "on word list" property (see this proposal), so could one use part of (P361) instead? I've created an item to represent "List of words of Estonian language, corresponding to A1 level" (see List of Estonian words, corresponding to CEFR level A1 (Q104382002)); can you please confirm it's all right to progress with this? User:ArthurPSmith User:Nikki User:So9q. Sorry for mentioning, I am *very* new to Wikidata, so would not like to cause any harm to the community and database by my erroneous actions.. thanks in advance! --62mkv (talk) 16:03, 20 December 2020 (UTC)

@62mkv: Using part of (P361) and the item you created sounds like a good start at least to me. ArthurPSmith (talk) 18:04, 21 December 2020 (UTC)
@ArthurPSmith: Thanks! --62mkv (talk) 19:35, 21 December 2020 (UTC)
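Once lexemes carry the part of (P361) statement, the word list can be retrieved with a query like this sketch (the predefined WDQS prefixes are assumed):

```sparql
# Lexemes on the Estonian CEFR A1 word list item (Q104382002)
SELECT ?lexeme ?lemma WHERE {
  ?lexeme wdt:P361 wd:Q104382002 ;
          wikibase:lemma ?lemma .
}
```

This is also the shape a flash-card or vocabulary-list tool would use to pull the list.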