Wikidata talk:Lexicographical data/Archive/2020/10

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Abbreviations

Yesterday, I added all of the USPS abbreviation (Q30619513) entries for geographic directionals and street suffixes to lexemes as forms with the grammatical feature USPS abbreviation (Q30619513). ArthurPSmith pointed out that this is a suboptimal way to represent abbreviations, because they don't necessarily have different grammatical properties than their spelled-out forms. I had been following an existing example of an abbreviation form in Special:PermanentLink/1230867700#F3, but I see Arthur's point that a property would be more flexible. It certainly would've shortened the federated query I was building in OpenStreetMap:SPARQL examples#Abbreviated street addresses in Cincinnati.

How should we model these abbreviations? Should I piggyback on Wikidata:Property proposal/Alternative form, qualifying alternative form statements with determination method (P459) set to USPS abbreviation (Q30619513)? Or should there be a separate property corresponding to USPS abbreviation (Q30619513)?

 – Minh Nguyễn 💬 17:59, 3 August 2020 (UTC)

So the alternative I was vaguely thinking of here was having a property on a form (for example the existing form entry for "street") that has as value the standard abbreviation ("ST"), rather than having it as a form on its own. But maybe the separate form version is ok. It wouldn't be a property for a regular item like street (Q79007) though because it's specifically related to the English word it replaces, not its conceptual meaning. ArthurPSmith (talk) 18:57, 3 August 2020 (UTC)
Would such a property best belong to a form, or to a sense? Taking the street address "St" as an example, the corresponding Swedish abbreviations would be "g" for "gatan" (street) or "v" for "vägen" (road). Swedish street names are usually compound words expressed in definite form, such as "Storgatan" (Main Street) or "Byvägen" (Village Road), though when named after people they appear as separate words in indefinite form, with the person's name in genitive ("Olof Palmes gata", "Sernanders väg"). The abbreviations are the same anyway, regardless of form ("Storg", "Byv", "Olof Palmes g", "Sernanders v"). They are however restricted to street addresses only, never used in other contexts, senses or grammatical constructs involving said words. --SM5POR (talk) 02:20, 10 August 2020 (UTC)
I think attaching to the form is better; if there happen to be two different meanings for the same word in an address it would still have the same abbreviation, and the abbreviation for plural forms may be different from that for singular forms. ArthurPSmith (talk) 18:11, 10 August 2020 (UTC)
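For reference, the form-based modelling being discussed can be inspected with a query like the following. This is a sketch for the Wikidata Query Service, assuming English lexemes and the feature item USPS abbreviation (Q30619513) as described above (the dct:, wikibase: and ontolex: prefixes are predefined on the endpoint):

```sparql
# List English lexeme forms tagged with the grammatical feature
# USPS abbreviation (Q30619513), i.e. the current form-based modelling.
SELECT ?lexeme ?lemma ?abbrev WHERE {
  ?lexeme dct:language wd:Q1860 ;           # English
          wikibase:lemma ?lemma ;
          ontolex:lexicalForm ?form .
  ?form wikibase:grammaticalFeature wd:Q30619513 ;
        ontolex:representation ?abbrev .
}
```

A property-based modelling would instead attach the abbreviation as a statement on the spelled-out form, which is what would make queries like the federated OpenStreetMap one mentioned above shorter.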

2

User:Jsamwrite recently added some abbreviations as synonyms to main Lexeme: example. I suppose it's better to have them as forms of such Lexeme, what do you think? --Infovarius (talk) 18:33, 10 August 2020 (UTC)

Ah, days of the week in Polish... I agree that abbreviations shouldn't have separate lexemes, just as mentioned in the case starting this thread (unless they have become words in their own right, such as "laser", or perhaps even pronounced acronyms like "DNA", "TV", "PhD"); as is pointed out above, they rarely (if ever) have different grammatical features than the fully spelled out versions.
But forms? There isn't a grammatical form "abbreviative" (and I don't think we should invent one). The abbreviations we are talking about here are merely written variants of the same words for use in special contexts; they are never pronounced that way (if they happen to be in some language, please educate me). If we begin adding abbreviations to lexemes without having dedicated support for that (such as a separate property), I'm concerned the simple lexeme model will either explode in expressiveness (but not necessarily in usability) or become an entangled mess of overlaid interpretations, as there are other written variants than abbreviations to consider: multiple writing systems (Latin or Cyrillic), numeric notations ("twelve", "12" or "XII"), separate graphemes ("and" or "&", "copyright" or "©", "dollar" or "$") and so on (I'll make a point about that expression "and so on" later, btw). If a spelled-out word and its abbreviation don't differ in pronunciation, I fail to see how they can have different plurals, cases or genders. Adding them as forms would either result in a combinatorial explosion with existing forms, or override all other forms with a non-grammatical variant character sequence. --SM5POR (talk) 10:17, 11 August 2020 (UTC)
I think the explanation we had received for pl is that acronyms can also have countless forms in that language. I don't know if it also applies to other abbreviations in pl.
As for abbreviations in most languages, I'd add them as form on the entity with the unabbreviated form. --- Jura 17:57, 15 August 2020 (UTC)
Then I came across SPQR (L285219) ;) --- Jura 18:04, 15 August 2020 (UTC)
In my opinion "TV" may be a lexeme covering all terms that are abbreviated as TV.--GZWDer (talk) 15:33, 8 September 2020 (UTC)
My only experience with lexemic editing is Lexeme:L252247#F5, which seems pertinent to this discussion. But then, it was for generational suffix (P8017) which usually would take the form of an abbreviation or numeral...so would that property need to change pending the outcome of this discussion? Arlo Barnes (talk) 18:30, 3 October 2020 (UTC)

Items missing a definition

I just discovered a couple of items which are being heavily used in the Lexeme namespace, but they have neither any sitelinks nor any statements; in other words, they need a proper definition. Can someone please have a look and add some statements?

Thank you, —MisterSynergy (talk) 09:27, 6 October 2020 (UTC)

@Uziel302: who created most of them. Cheers, VIGNERON (talk) 12:27, 8 October 2020 (UTC)
MisterSynergy, most are just instances of possessive (Q2105891) and the last is an instance of imperfective (Q371427). I added instance of statements to all of them. Uziel302 (talk) 16:27, 10 October 2020 (UTC)

Almost empty lexemes: how to detect and what to do with them?

Hi,

I see (mainly through the maintenance queries on Wikidata:Lexicographical data/Ideas of queries) that a lot of lexemes are created almost empty, with just the three mandatory pieces of information: lemma, language and lexical category.

But these data are not enough (two different lexemes can share this same information, for instance; see discussion supra) and their absence can be problematic in many ways (it is not possible to check the Wikidata:Lexicographical data/Notability, and they are difficult to improve as they could be various different lexemes; the intention of the creator is thus unclear).

My questions are:

  • how can we easily detect them? (SPARQL queries - see link supra - are great but focus on some specific aspects - languages, missing forms, etc. - and not the whole picture - they often time out if we try - so some might go unnoticed; I thought Special:ShortPages could filter by namespace but apparently not :/ maybe an SQL query on Quarry)
  • how should we deal with them? Especially: with the unclear initial purpose of these entities, should we assume one purpose? (I don't like repurposing entities, but here there is not much choice :/); should we delete them and start over clean? (a bit too brutal); any other idea, remark or suggestion?

Cheers, VIGNERON (talk) 10:26, 8 October 2020 (UTC)
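One way to detect such lexemes with SPARQL while avoiding the timeouts mentioned above is to restrict the query to a single language. A sketch, assuming French (Q150) as the example language:

```sparql
# Lexemes that carry only the mandatory data: no senses and no forms.
# Run per language (here French, Q150) to keep the query tractable.
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q150 ;
          wikibase:lemma ?lemma .
  FILTER NOT EXISTS { ?lexeme ontolex:sense ?sense . }
  FILTER NOT EXISTS { ?lexeme ontolex:lexicalForm ?form . }
}
```

This still misses lexemes that have empty placeholder forms or senses, so it is a lower bound rather than the whole picture.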

Connecting translations automatically

Is there any automated way to connect translation (P5972) between multiple senses of lexemes? Like, if A is claimed to be a translation of B, and B is a translation of C, then automatically create the claim that A is also a translation of C (and vice versa). This would also be useful for glosses, to avoid repetition... Or should I just ignore this and focus on item for this sense (P5137)? --Luk3 (talk) 01:31, 1 October 2020 (UTC)

@Luk3: I'm not clear on exactly what you are asking for, but those relations could certainly be derived through a SPARQL query, for example. ArthurPSmith (talk) 12:00, 1 October 2020 (UTC)
@ArthurPSmith: what I mean is: I created the lexeme 'y (L238278), which means "water" in Tupi. I then made the statement ''y (L238278) translation (P5972) water (L3302)', which is "water" in English. Now, I need to add all the other remaining translations to 'y (L238278), and I wanted to know if there is a tool to import the translations from water (L3302) automatically, since they are the same thing and would share translations with all the other languages. Maybe that's not necessary since a SPARQL query could find all the translations anyway, but I hope to have been clearer on the question.--Luk3 (talk) 17:02, 1 October 2020 (UTC)
A while back I suggested deleting that property, or at least that we stop using it on lexemes. It doesn't scale, and the best way to derive translations is via SPARQL over senses with P5137. I can recommend using MachSinn to add these semi-automatically. It should be used with care though, as it has some quirks.--So9q (talk) 05:32, 30 October 2020 (UTC)
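To illustrate the P5137-based approach suggested above: instead of storing pairwise translation (P5972) statements, all mutual translations fall out of a query over senses linked to the same item. A sketch for the Wikidata Query Service, using water (Q283) as the shared sense item:

```sparql
# All lexemes with a sense linked to water (Q283) via
# item for this sense (P5137) - effectively mutual translations.
SELECT ?lexeme ?lemma ?langLabel WHERE {
  ?lexeme ontolex:sense/wdt:P5137 wd:Q283 ;
          wikibase:lemma ?lemma ;
          dct:language ?lang .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

With this modelling, adding one P5137 statement to a new lexeme connects it to every existing lexeme for the same concept, with no pairwise statements to maintain.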

Duplicates

After a discussion in the Telegram chat with @Nikki: I wrote a small script to check for duplicate lexemes. The script is still too strict, but I found a first set:

There seems to be quite a few more. What should we do about them? Manually merge? Where do I post the rest of the list when I get to it? --DVrandecic (WMF) (talk) 16:30, 2 October 2020 (UTC)
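A duplicate check along these lines can also be sketched in SPARQL, grouping lexemes by the three mandatory fields. Note that this is likely to time out over the whole lexeme namespace, so a script (or adding a dct:language restriction) is more practical in practice:

```sparql
# Groups of lexemes sharing lemma, language and lexical category;
# any group with more than one member is a merge candidate.
SELECT ?lemma ?language ?category (COUNT(?lexeme) AS ?count) WHERE {
  ?lexeme wikibase:lemma ?lemma ;
          dct:language ?language ;
          wikibase:lexicalCategory ?category .
}
GROUP BY ?lemma ?language ?category
HAVING (COUNT(?lexeme) > 1)
```

As noted below, matches are only candidates: genuine homographs share all three fields and should not be merged.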

This list doesn't seem too long to handle manually (at least for now). The ones I looked at definitely looked identical, having the same lemma, language, and lexical category, and no other identifiers or other features to distinguish them. However, we might want somebody who actually knows these languages to look at them first to confirm, before merging? There are certainly cases in English where the same lemma and lexical category belong to two or more quite distinct lexemes (based on etymology and sometimes even pronunciation). ArthurPSmith (talk) 20:47, 2 October 2020 (UTC)
I think it's fine to merge them - they're not actually representing distinct lexemes until there's enough data to tell them apart. In the cases where there should be multiple lexemes, people can always create a new lexeme (and add enough data to distinguish it from the existing one) later. - Nikki (talk) 21:43, 2 October 2020 (UTC)
@DVrandecic (WMF), Nikki: Ok, these are all merged (some of them were done before I got to them). ArthurPSmith (talk) 15:49, 5 October 2020 (UTC)

@ArthurPSmith: Thank you! Awesome!

I now have a list with ~3,700 more potential candidates, but these also include proper homographs. Looking through a few of them, I found the following property: homograph lexeme (P5402), e.g. Lexeme:L6227#P5402

I guess we should require that all such lexemes, if they are proper homographs and should not be merged, be linked through homograph lexeme (P5402). What do you think?

Furthermore, I will upload that list to the wiki once it is done, but I don't think we should go through it manually. We can at least automatically check whether all forms are the same, and put those into the potential merge list. Those that seem not to be the same should probably go through an "add P5402" list. What do you think? --DVrandecic (WMF) (talk) 23:41, 5 October 2020 (UTC)

There are some more annoying cases: Lexeme:L139965 - the word wikt:пайке may refer to two lexemes, but neither of them is in lemma form. They cannot simply be merged.--GZWDer (talk) 07:20, 6 October 2020 (UTC)

The list is here for now: Wikidata:Report on potential duplicate Lexemes. I don't think we should go through it manually yet, but rather filter it automatically better as suggested above. --DVrandecic (WMF) (talk) 17:19, 6 October 2020 (UTC)

One step towards cleaning this up would be to make sure that P5402 is used symmetrically. I see that the symmetry constraint currently breaks on lexemes (this needs to be fixed); this query should be empty, but it currently has ~85 results. --DVrandecic (WMF) (talk) 19:08, 6 October 2020 (UTC)
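For reference, the symmetry check described here amounts to the following sketch of a query finding one-directional homograph lexeme (P5402) links (the linked query may differ in details):

```sparql
# Lexeme pairs where A links to B via homograph lexeme (P5402)
# but B does not link back to A.
SELECT ?a ?b WHERE {
  ?a wdt:P5402 ?b .
  FILTER NOT EXISTS { ?b wdt:P5402 ?a . }
}
```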

@DVrandecic (WMF): thanks for the query. I added the symmetric statements for the 6 lexemes in French; each and every time it was a pair of noun/adjective, and therefore clearly not duplicates. Cheers, VIGNERON (talk) 16:42, 9 October 2020 (UTC)
@DVrandecic (WMF): I think homograph lexeme (P5402) should only be required if they have the same lemma, language, and lexical category. Otherwise there are likely to be too many. Also, if you could filter your list for cases where the lexical categories differ, that would be helpful... ArthurPSmith (talk) 16:58, 9 October 2020 (UTC)
@VIGNERON, ArthurPSmith: That is a perfectly valid point. I changed the query to filter if the lexical category is different, and the number of offending Lexemes went down from 75 to 14. Query: https://w.wiki/gbJ Thanks! --DVrandecic (WMF) (talk) 19:50, 12 October 2020 (UTC)