Wikidata talk:Lexicographical data/Archive/2023/08

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Genders and lexemes

While taking a look at Lehrer (L34167)/Lehrerin (L34168), I noticed that the latter is essentially being treated as a compound of the former lexeme with an additional ending. This appears to work fine for German, but then I began wondering about other languages that don't conveniently come up with an 'endingless' base lexeme like in this case.

If we take for example Lithuanian, we would have mokytojas (masc.) and mokytoja (fem.). Creating a base lexeme mokytoj (just the root) would probably not make any sense. Have there already been some thoughts on this issue?

The general lexicographical practice for Lithuanian would be an entry like this: mókytojas (-is), -a, mokýtojas (-is), -a, mokytójas, -a.
We essentially have one lexeme ('mokytojas'), and the other forms (fem. forms, forms with alternative masc. endings, alternative forms with different stress patterns, …) are part of the same lexeme. I also don't think it would make sense to create eight (if I counted correctly) separate lexemes for this very example here on Wikidata. What kind of solutions does Wikidata provide to get it all into one lexeme? And isn't the solution with the 'combines lexemes' property used for German generally a bit half-baked if it concerns merely a distinction between grammatical gender? Vogone (talk) 21:30, 28 July 2023 (UTC)

I'd support just using a single lexeme for cases like this. Probably should be done in German also. ArthurPSmith (talk) 17:55, 31 July 2023 (UTC)

The way I understand it (for my mokytojas example), the best way forward would be:

Create a lexeme mokytojas.
Add all inflected forms for mokytojas, namely: mokytojas (masc. main), mokytoja (fem.), mokytojis (masc. alt.) plus all forms derived from these.
Add accentuation information to all these forms (see also the topic #Pitch in European languages on that). Add multiple accentuation qualifiers for forms where multiple ways of accentuating are possible. In case of the example at hand, namely mókytojas, mokýtojas, mokytójas, I now assume we would be treating it in Wikidata as one form (mokytojas) with three possible accentuations.

Does that seem alright? I am currently working on the documentation for Lithuanian, which is why I am asking these annoying questions. Thanks for your help. Vogone (talk) 14:08, 4 August 2023 (UTC)

Hmm, I'm not an expert by any means but I would have thought the accentuation would be included as part of the form (it affects meaning, right?); I don't think we have a way to store "accentuations"? So three separate forms, not one form with 3 accentuations, is what I think would be right here. Maybe somebody else can comment? ArthurPSmith (talk) 16:09, 4 August 2023 (UTC)

In this very case it does not affect meaning. These are three parallel forms of the same word. Similarly to how the English word either has two possible pronunciations. Regarding the way to store "accentuations", see my comments in the #Pitch in European languages section below. I believe it is possible to store this information with these properties, but I'm not sure how to go about expanding these properties to allow their use for Lithuanian. Vogone (talk) 17:12, 4 August 2023 (UTC)

Let me propose the scheme already used for Russian words, which relies on pronunciation (P7243); see the example of different accentuations at творог (L43821). --Infovarius (talk) 22:34, 5 August 2023 (UTC)

Ok, this approach makes sense to me then - one form with the pronunciations as values of a property on it. ArthurPSmith (talk) 20:26, 7 August 2023 (UTC)
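The approach agreed on above (one form, with each accentuation stored as a value of pronunciation (P7243) on that form) can be sketched as plain data. This is only an illustration under stated assumptions: the `Lexeme` and `Form` classes are hypothetical and are not the Wikibase data model or any Wikidata API, and the grammatical features are placeholder strings rather than item IDs.

```python
from dataclasses import dataclass, field
from typing import Dict, List

PRONUNCIATION = "P7243"  # pronunciation property mentioned in the thread


@dataclass
class Form:
    """Hypothetical stand-in for a lexeme form (not the Wikibase model)."""
    representation: str
    features: List[str]  # placeholder labels, not real item IDs
    statements: Dict[str, List[str]] = field(default_factory=dict)

    def add_pronunciation(self, accented: str) -> None:
        # Each accentuation becomes one more value of the same property.
        self.statements.setdefault(PRONUNCIATION, []).append(accented)


@dataclass
class Lexeme:
    """Hypothetical stand-in for a lexeme holding its forms."""
    lemma: str
    language: str
    forms: List[Form] = field(default_factory=list)


lexeme = Lexeme(lemma="mokytojas", language="lt")

# One form, three parallel accentuations (taken from the discussion above).
nom_sg = Form("mokytojas", ["nominative", "singular"])
for accented in ("mókytojas", "mokýtojas", "mokytójas"):
    nom_sg.add_pronunciation(accented)

lexeme.forms.append(nom_sg)
```

The point of the layout is that the number of forms stays at one no matter how many stress variants exist; the variants live on the form as statements.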

Thanks for the tip! That still leaves the question of whether and how to mark the intonation classes. In dictionaries, class numbers (1 to 4, with a fair share of subgroups for category 3 nouns) are usually provided for each noun. Depending on the class, you can then theoretically derive the intonation pattern from the base form. tone or pitch accent class (P5426) sounds like it should be suitable for such information, but it is currently restricted to tone type information (you are supposed to provide only Tone 1, Tone 2 or Tone 3 as the value). While Lithuanian does in fact distinguish three tones, the exact tone depends on the form. One class is not necessarily restricted to one tone, as the property configuration currently implies. It would probably make much more sense to just have the name of the accentuation class as the value of the property, in line with what the property label actually suggests. I don't have a great idea yet how to resolve this problem. Vogone (talk) 11:46, 8 August 2023 (UTC)

I don't think we should pretend all languages behave the same way. I can't comment on Lithuanian but the German feminine words are derived from the masculine words by adding a suffix, it's not declension like for adjectives, so I don't think it's correct to treat them as the same lexeme, and doing so would make things a lot more complicated. The pairs are often separate words in the sources we have external identifiers for, so we would end up with lots of words with multiple identifiers. The masculine words can have other meanings which the derived feminine word doesn't have (e.g. Messer (L407202), Drucker (L226907)) and the derived feminine word can have meanings which the original masculine one doesn't (e.g. Männin (L729013)). The masculine and feminine words have different values for paradigm class (P5911) and their senses can have different values for things like language style (P6191). Describing which senses and forms go together and which can be used for which natural genders would be harder (whereas right now we use semantic gender (P10339) on the sense and that's all we need). - Nikki (talk) 20:58, 23 August 2023 (UTC)

Pitch in European languages

Several European languages have minimal pairs where the pitch of a stressed syllable determines meaning. One of these languages is Lithuanian, where pitch and accent position are consistently marked in dictionaries. Looking at the available properties, it appeared to me that position of accent nucleus (P5427) could be applied to Lithuanian. tone or pitch accent class (P5426) could be used to store information about the accentuation classes, which all happen to have established names in Lithuanian, too. However, these properties are currently restricted to a small list of languages. Is there anything speaking against expanding the properties in question to allow languages like Lithuanian? In case you are unfamiliar with Lithuanian, feel free to take a quick glance at w:Lithuanian accentuation, which uses the most common accent markers that you would also find in dictionaries. Entering this information into Wikidata based on syllable count (like it is done for Japanese, apparently) feels a bit unusual, but should in principle be possible. Vogone (talk) 13:56, 4 August 2023 (UTC)

@Vogone The tone or pitch accent class (P5426) property was originally used just for Japanese, and I extended it for use with Punjabi after some discussion here. If you think it is appropriate to add more languages you should do so. There are several ways, all of them language specific, to document pitch, accent, and prosodic properties, but I can explain what I have been doing in case it would be helpful. Based on this article, it looks like there is more than one prosodic system that can influence the use of a word, which is also the case for Punjabi.

Tone (or pitch accent) can be classified at the form level or the word level. Every word form in Punjabi can be described in terms of relative pitch: low, level, or high tone occurring on the vowel of the accented syllable. However, I avoid this when possible in favor of word level tone categories. The pitch and accent can differ between inflected forms, but with the word level categories we can assign values to patterns in those differences rather than the actual pitches. Take the word ਲੱਭਣ لبھّݨ “labbhaṇ” (to find):
- Basic form /ˈlɐ́b.bəɳ/ (high tone, stress on syllable 1)
- Causative form /ləˈbɔ̀ɳ/ (low tone, stress on syllable 2)

Instead of marking each form based on its pitch, we can say that labbhaṇ is a “Tone 3” word, and Tone 3 words by definition have the same change in pitch and accent position corresponding to these forms.
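The word level idea can be illustrated with a small lookup: the tone class is stored once on the word, and the pitch and stress of each form are read off the class. This is a rough sketch only. The names `Accent`, `TONE_CLASS_PATTERNS` and `accent_for` are hypothetical, the form labels "basic" and "causative" are placeholders, and the table contains just the two "Tone 3" values given in the example above, not a full description of Punjabi.

```python
from typing import Dict, NamedTuple


class Accent(NamedTuple):
    tone: str               # relative pitch on the accented syllable
    stressed_syllable: int  # 1-based syllable position


# Hypothetical table: tone class -> form type -> accent pattern.
# Only the two "Tone 3" entries from the labbhaṇ example are filled in.
TONE_CLASS_PATTERNS: Dict[str, Dict[str, Accent]] = {
    "Tone 3": {
        "basic": Accent(tone="high", stressed_syllable=1),
        "causative": Accent(tone="low", stressed_syllable=2),
    },
}


def accent_for(tone_class: str, form_type: str) -> Accent:
    """Derive a form's accent from the word level tone class."""
    return TONE_CLASS_PATTERNS[tone_class][form_type]


# The word carries only its class; the forms need no per-form marking.
labbhan_class = "Tone 3"
basic = accent_for(labbhan_class, "basic")
causative = accent_for(labbhan_class, "causative")
```

Storing the class on the word keeps the per-form data small, at the cost of needing the class table to be complete before form level details can be inferred.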

Besides tone, there are other factors, based on prosodic properties of the stem, which influence verbs in particular. Some stems, for example, incur a “vocalic release” at junctions with suffixes, and I have been using the more generic property paradigm class (P5911) to add values such as [ʊṭːh(ə)] (21) (Q121789721), which describes one of the conditions under which a vocalic release is expected. (There are 72 of these prosodic classes in Punjabi, and I have only created items for a few so far. Once they are more complete, I would like to be able to use them to make inferences about individual word forms and their syllables rather than have to mark every detail explicitly.)

عُثمان (talk) 21:24, 23 August 2023 (UTC)
