Wikidata:Making sense

This essay attempts to create a straw-dog proposal on how to model senses, and collect some resources on the topic.

Background

Current usage on Wikidata

The following query can be used to show which properties are used in statements on the senses of lexemes with a given part of speech.

The following query uses these:

  • Items: verb (Q24905)
    select ?p ?pLabel ?count {
      {
        select ?prop (count (*) as ?count) {
          ?lexeme wikibase:lexicalCategory wd:Q24905 . # change the POS here
          ?lexeme ontolex:sense ?sense .
          ?sense a ontolex:LexicalSense .
          ?sense ?prop ?sth .
        } group by ?prop 
      }
      ?p wikibase:directClaim ?prop .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    } order by desc(?count)
    

This query gets you the different parts of speech and their frequency: https://w.wiki/38Dx

Rough results

Observations

Whereas we are frequently linking to items, we are definitely not linking much between the individual senses, even though other work (see below) has identified such links as relevant. There are two explanations for that, and my guess is that both play a major role:

  1. some of the relations, like some of those in WordNet, are already well expressed in our ontology. We wouldn't say "window holonym house" because, unlike WordNet, we have a whole ontology describing windows and houses. Repeating that in the lexicon seems superfluous. The same goes for hyponyms.
  2. the current UX makes it very difficult to connect to a sense. That is likely slowing down the creation of statements that have other senses as the value, e.g. translations, synonyms, antonyms, etc. This should be a high priority to fix, and fixing it is expected to increase the usage of these properties.

Some resources and papers

Proposal

Suggestions

Suggestion #1: Fix the UX for editing of senses as values

As stated above, editing values that are senses is currently not a great experience and likely slows down the addition of many possible statements considerably. This skews the data and makes it hard to understand what model could possibly work. Any suggestions that considerably change the way we are doing things should probably be postponed until we have good data on how a better editing interface changes behaviour and data quality.

Besides that, there are still a few further suggestions to be made.

Suggestion #2: have two properties, one exact, one not, that links senses to items

The Wikibase Lexeme Data Model states in a note that if two Lexemes have senses that refer to the same concept, this should not imply that the two lexemes are synonyms (or translations). It gives the example that the Lexemes hot (L3299) and cold (L3296) could both have a sense that refers to temperature (Q11466).

The suggestion is to have two properties, one that allows for the inference that all senses connected to the same item through this property are indeed synonymous (if in the same language) or translations of each other, and one that does not. I don't care much about which one of these item for this sense (P5137) would be - the exact one, allowing inference, or the fuzzy one - but we should have both. One suggestion would be to rename the exact one to "denote" and the fuzzy one "evoke". A similar proposal failed previously.

Since this is a straw-dog anyway, the suggestion is to make item for this sense (P5137) the exact property and introduce a new one that is not.
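
The inference that an exact property would license, and that a fuzzy one would not, can be sketched roughly as follows. This is a minimal Python sketch over toy data, not a description of any existing tool: the property names `denotes` and `evokes` follow the renaming floated above and do not exist on Wikidata, and the sense IDs (including the placeholder `L-heiss-S1`) are assumptions made for illustration.

```python
from collections import defaultdict

# Toy statements: (sense, property, item).
# "denotes" (exact) and "evokes" (fuzzy) are hypothetical property names;
# the sense IDs are illustrative and not taken from Wikidata.
statements = [
    ("L3299-S1", "evokes", "Q11466"),        # hot  -> temperature (fuzzy)
    ("L3296-S1", "evokes", "Q11466"),        # cold -> temperature (fuzzy)
    ("L3299-S1", "denotes", "Q28128222"),    # hot  -> hot (exact)
    ("L-heiss-S1", "denotes", "Q28128222"),  # hypothetical sense in another language
]

def equivalent_senses(statements, exact_property="denotes"):
    """Group senses that share an item through the exact property only."""
    by_item = defaultdict(set)
    for sense, prop, item in statements:
        if prop == exact_property:
            by_item[item].add(sense)
    # Only groups with at least two senses license an inference.
    return {item: senses for item, senses in by_item.items() if len(senses) > 1}

groups = equivalent_senses(statements)
```

In this toy data, the two senses sharing an item through the exact property end up in one group, while hot and cold, which share temperature (Q11466) only through the fuzzy property, are never grouped.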

Suggestion #3: have two properties for synonyms and two properties for translations, both exact and fuzzy

Sometimes translations or synonyms hold between words whose meanings are similar enough but not exact. In particular, such relations cannot be assumed to be transitive. Exact translation and exact synonymy, by contrast, should be transitive, also across each other. So if the sense L1-S1 is synonymous with a sense L2-S1, which translates to L3-S1, which translates to L4-S1, which is synonymous with L5-S1, then all of these senses should be synonyms or translations of each other, provided every property in the chain is exact.
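
The transitivity of the exact properties amounts to computing connected components over the exact synonym and translation statements. The following is a minimal Python sketch using a union-find structure; the function name `equivalence_classes` and the edge list are assumptions for illustration, and the sketch deliberately ignores fuzzy statements, since those must not be chained.

```python
def equivalence_classes(edges):
    """Return the connected components of an undirected graph of senses."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for sense in list(parent):
        classes.setdefault(find(sense), set()).add(sense)
    return list(classes.values())

# The chain from the text: synonym, translation, translation, synonym.
exact_edges = [
    ("L1-S1", "L2-S1"),  # exact synonym
    ("L2-S1", "L3-S1"),  # exact translation
    ("L3-S1", "L4-S1"),  # exact translation
    ("L4-S1", "L5-S1"),  # exact synonym
]
classes = equivalence_classes(exact_edges)
```

All five senses land in a single class. Whether two members of a class are synonyms or translations of each other then depends only on whether their lexemes share a language.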

Suggestion #3 (alternative): assume that synonyms and translations are exact unless they have qualifiers

Instead of introducing new properties, we state that synonyms and translations are always exact unless they have qualifiers. The qualifiers need to explain the fuzziness. (This seems potentially error-prone and problematic.)

Suggestion #4: once more than four senses are connected as exact translations or exact synonyms, an exact defining property should be used instead

Whenever we have five senses mutually connected as exact translations or exact synonyms, we should switch to using a common definition instead. This is meant both as a threshold against overcreating items that are hard to keep apart, and as a way to avoid an unmanageable number of synonymy / translation relations that are hard to keep track of. In the suggestions so far we have only introduced one exact defining property, an exact version of item for this sense (P5137). But we already have others, such as demonym of (P6271). Any of these is good as an exact defining property.
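
To make the threshold and its motivation concrete, here is a small Python sketch. It assumes that groups of exactly equivalent senses have already been computed, and that linking n senses pairwise means one statement per ordered pair; both are assumptions of this sketch, not something the proposal fixes. All function names are hypothetical.

```python
THRESHOLD = 5  # "more than four senses", as proposed above

def groups_needing_shared_definition(groups, threshold=THRESHOLD):
    """Return the groups of exactly equivalent senses that reach the threshold."""
    return [group for group in groups if len(group) >= threshold]

def pairwise_statement_count(n):
    """Statements needed to link n senses pairwise, in both directions."""
    return n * (n - 1)

def shared_definition_statement_count(n):
    """Statements needed if each sense points at one shared value instead."""
    return n

groups = [
    {"L1-S1", "L2-S1", "L3-S1", "L4-S1", "L5-S1"},
    {"L6-S1", "L7-S1"},
]
flagged = groups_needing_shared_definition(groups)
```

Under these assumptions, five senses need 20 pairwise statements but only 5 statements pointing at a shared value, and the gap grows quadratically with the size of the group.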

Suggestion #5: mark exact sense-defining properties

Every property that can be used as an exact defining property for a sense should be marked as such, using an instance of (P31) statement on the property, connecting it to a new item for "Wikidata exact sense-defining property". We would mark the exact version of item for this sense (P5137) as well as demonym of (P6271) and others.

This would allow us to both grow the ontology for senses and still have a common pattern to find all translations.

Note that not all of these properties have to link to items. If they define a sense, they can also link to other values, such as numbers, quantities, etc.

Note that these properties can have very varying meanings. They don't need to mean "this sense of a lexeme denotes exactly this item", it could also mean "this sense denotes the lack of this item", etc.

A property "this sense denotes an instance of this item" would be very interesting and potentially helpful, but usually not exact sense-defining. E.g. it could link cold (L3296) with temperature (Q11466), but so would hot (L3299) - still helpful, but not defining.

One alternative might have been to make these properties subproperties of a common superproperty instead of marking the sense-defining properties. Due to the semantics of superproperties, this actually doesn't work out: if we rely on the superproperty to find translations, we would actually mix together senses using different subproperties linking to the same item.
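
The difference between marking properties and collapsing them into a superproperty can be illustrated with a short Python sketch. P5137, P6271 and Q212 are taken from this page; the sense IDs are hypothetical placeholders, and `group_senses` is a name invented for this sketch.

```python
from collections import defaultdict

# Toy statements: (sense, property, value). The sense IDs are hypothetical.
statements = [
    ("Lx1-S1", "P5137", "Q212"),  # a sense denoting the country
    ("Lx2-S1", "P5137", "Q212"),  # the same, in another language
    ("Lx3-S1", "P6271", "Q212"),  # a demonym sense for that country
]

def group_senses(statements, collapse_properties=False):
    """Group senses by (property, value), or by value alone when collapsed."""
    groups = defaultdict(set)
    for sense, prop, value in statements:
        key = value if collapse_properties else (prop, value)
        groups[key].add(sense)
    return dict(groups)

per_property = group_senses(statements)
collapsed = group_senses(statements, collapse_properties=True)
```

Grouping per marked property keeps the demonym sense apart from the two senses that denote the country. Collapsing the properties, as a query over a common superproperty would, puts all three senses into one group.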

Suggestion #6: make translations easily visible and accessible

Now that we have two different ways to express translations (and also synonyms), we should make translations easily visible and accessible to our users. This should probably start as follows:

  1. a tool similar to Ordia to make the translations visible
  2. and ideally accessible through an API
  3. a Wikidata widget that displays the translations inline on the site (using that API)
  4. eventually an integration of this in default Wikidata, both providing an API and in the UX of Senses
  5. ability to display that list in the Wiktionaries (once they can display content from Wikidata in the first place)

Specific cases

Whereas the current inventory of items is particularly useful for nouns, we don't seem to have a great inventory for the meanings of verbs and adjectives. But given the above suggestion, we can start building one - and by following the suggestion we won't be creating items for these senses spuriously, but only when we actually identify a verb meaning cross-linguistically in a sufficient number of languages, and if there is no other property that can be used for defining a sense.

There are certain specific cases that have been discussed previously (see, for example, here or here). The hope is that many of these cases can be solved using the suggestions above. Let's dive in.

One interesting query to understand the current state is this: https://w.wiki/38Zj (you can change the POS, fix the language, and change the property connecting senses with the item to explore).

Relational adjectives

Relational adjectives state that something is somehow related to a specific item, e.g. human (L58772) to human (Q5), medieval (L37280) to Middle Ages (Q12554), plastic (L9636) to plastic (Q11474), alpine (L29565) to Alps (Q1286). Currently, these all use item for this sense (P5137). Maybe that's good enough for such relational adjectives? Do we need a new property?

Color terms seem to fall into this as well, and seem to work very well.

For place names in particular, we also have a wide usage of demonym of (P6271). This resolves the ambiguity whether we are referring to the place or the people from the place (i.e. Ukraine (Q212) vs Ukrainians (Q44806)), as it would always use the place.

We also have the property pertainym of (P8471), but that connects to a sense, which makes it unsuitable for a sense-defining property (since the senses are language specific, we would never have a translation using this property directly).

We could create a new property for "adjective sense that relates the described object closely to the following item or an instance of given item" instead of item for this sense (P5137), also in order to see if item for this sense (P5137) is used in other circumstances too.

Adjectives along dimensions

Some adjectives are along a specific dimension, and then have a polarity on that dimension, e.g. slow / fast / mercurial / quick all on speed, cold / hot / freezing / warm / lukewarm all on temperature, etc.

Stating where each of these senses would fall on a given dimension is tricky. Would we have one scale of high / low or positive / negative? This polarity is difficult to define. Some qualities are also complex and do not reduce to a single dimension, e.g. taste.

Whereas we could possibly model dimensions and polarity on the senses, it seems that for most of these we already have items actually. There's hot (Q28128222) and cold (Q270952), sourness (Q2795949) and umami (Q202637), celerity (Q101813351) and slowness (Q24249439).

It might be sufficient to use them with item for this sense (P5137)?

Verbs as acts of doing an item

Some verbs describe the act of doing or creating a specific item, e.g. rire (L9409) to laughter (Q170579), panic (L14444) to panic (Q208450). These seem to be often connected using item for this sense (P5137), which seems to look fine.

Looking at the examples of such usage, we can see some problematic entries: manger (L309)-S1 to food (Q2095), cloud (L11072)-S1 to cloud (Q8074), skirt (L13295)-S1 to skirt (Q2160801), etc., but almost all of the ones I checked seemed to be errors committed through the use of the MachtSinn tool. But these errors seem quite frequent.

It would be interesting to find out where, besides these errors, the usage is problematic. The food example, e.g. is repeated a few times: manger (L309) (French), eat (L1340) (English), makan/ماکن (L692) (Malay), їсти (L43602) (Ukrainian), makan (L6588) (Indonesian), comer (L230705) (Spanish), (L312662) (Mandarin), /chia̍h/tsia̍h (L348903) (Southern Min). All of these seem to stem from a single QuickStatements run, and should thus also not be seen as representative.

It is likely that we need to find other patterns for verbs that do not fit well into this pattern, and it is unclear how to do that.

We could introduce a new property for "verb senses that describe creating or enacting instances of the item" instead of using item for this sense (P5137), also in order to see if item for this sense (P5137) is used in different ways as well.

Gendered professions

We currently seem to be inconsistent regarding the representation of professions in those languages that gender the respective words. In English this is rarely the case - English has actress (L7012) and actor (L7011), but there's only one word for teacher (L5219), whereas many other languages, such as German, separate Lehrerin (L34168) from Lehrer (L34167).

If we were exact, Lehrer (L34167) should probably have (at least) two senses, one for male teachers and one for teachers generically (the latter usage is becoming increasingly dated, see also the German term "mitgemeint"). The dated sense would be an exact translation of the English word's sense, but the first sense would not have a Lexeme in English.

There are several possibilities to resolve this:

The one-Lexeme solution would not have a grammatical gender on the Lexeme, but instead have the gender as a grammatical feature on the respective forms. In that case, the problem of representing Lexemes for gendered professions goes away.