Wikidata:Lexicographical data/Development/Proposals/2015-05


Previous plans

Start date  Primary author(s)
2013-02     JAn Dudík; This, that and the other; Darkdadaah
2013-06     Denny (Denny Vrandečić (WMDE))
2013-07     Micru; Francis Tyers
2013-08     Denny
2013-09     Ivadon
2013-10     Bigbossfarin
2014-10     GPHemsley
2015-05     Denny

Data Model

Terminology

Unfortunately, the terminology around dictionaries and lexical resources is easily confusing. Therefore we provide a terminology that should be used strictly and consistently throughout the proposal. In order to make it obvious, we will use the technical terms throughout in italics, like this.

  • A lexeme, also known as word or lexical entry, is what is described on one page in the lexical part of Wikidata. A lexeme consists of a lemma, a lexical category, a language, a set of forms, a set of senses, and a set of statements.
    • The lemma is the canonical form or dictionary form of the lexeme, e.g. for verbs this is usually the infinitive form, for a noun the nominative singular, etc.
    • The lexical category, also known as the part of speech or word class, defines the lexeme to be either a noun, or a verb, or an adjective, etc. The set of possible values is open and taken from the Wikidata items.
    • The language of a lexeme is taken from Wikidata items, and thus an open set.
    • A form is a specific, fully conjugated or inflected form of the lexeme. A form consists of a representation, a set of lexical properties, and a set of statements. A form always belongs to one (and exactly one) lexeme.
      • A representation is the actual string value realizing a given form, e.g. the string value "wrote" for the past tense of the lexeme for "write". All representations are indexed for search.
      • A lexical property describes the form, e.g. tense or number for verbs, case for nouns, etc. This is an open set and points to Wikidata items.
    • A sense is described by a gloss and has a set of statements. A sense always belongs to one (and exactly one) lexeme (and lexemes belong to one language only). Senses are not independent of lexemes.
      • A gloss is a short description (translatable in all languages of the Wikidata UI) of one sense of the given lexeme.
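
The terminology above can be read as a nested record structure. The following is a minimal sketch in Python, not the actual Wikibase implementation. The class and field names (`Lexeme`, `Form`, `Sense`, `Statement`, `indexed_representations`) are assumptions chosen to mirror the terms defined above, and statements are simplified to property/value pairs without qualifiers. The item ids in the usage lines are the ones given in the example entry of this proposal.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    # Simplified assumption: real Wikidata statements also carry qualifiers.
    prop: str
    value: str


@dataclass
class Form:
    representation: str  # the actual string, e.g. "wrote"
    lexical_properties: list[str] = field(default_factory=list)  # item ids
    statements: list[Statement] = field(default_factory=list)


@dataclass
class Sense:
    glosses: dict[str, str] = field(default_factory=dict)  # UI language -> gloss
    statements: list[Statement] = field(default_factory=list)


@dataclass
class Lexeme:
    lemma: str
    language: str          # Wikidata item id
    lexical_category: str  # Wikidata item id
    forms: list[Form] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)
    statements: list[Statement] = field(default_factory=list)

    def indexed_representations(self) -> list[str]:
        """Representations of all forms, which the proposal indexes for search."""
        return [form.representation for form in self.forms]


# Usage: part of the "apple" entry from this proposal.
apple = Lexeme(
    lemma="apple",
    language="Q1860",
    lexical_category="Q1084",
    forms=[Form("apples", ["Q146786"])],
    senses=[Sense({"en": "fruit of the apple tree", "de": "Frucht des Apfelbaumes"})],
)
print(apple.indexed_representations())
```

Forms and senses are plain members of the lexeme here, which reflects the rule that each of them belongs to exactly one lexeme.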

The terms Wikidata item, property, string value, qualifier, statement, and claim are taken from the Wikidata glossary and have the same meaning here as there.

Notes

  • Transliterations in other scripts can be handled either by two separate lexemes or by a single lexeme with a statement on each form, using the transliteration property pointing to a string value, with a qualifier describing the script. If the latter, transliterations will be indexed for search too.
  • Orthographic variants can be handled either as two separate lexemes or by a single lexeme with statements on the appropriate level and qualifiers explaining the variant. If the latter, the variants will also be indexed for search.
  • Translations can be expressed either from sense to sense, or by a sense referencing a common Wikidata item. If the latter is done, the translations will be automatically displayed and kept up to date. This is only possible when the translation is symmetric and transitive, which is often not the case, but it holds frequently enough to merit a specific implementation.
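
The note on translations through a common Wikidata item can be illustrated with a short sketch. It assumes a hypothetical mapping from sense ids to a lemma, a language code and the item the sense references. The ids `S7001`, `Q_FRUIT` and `Q_TREE` are placeholders invented for illustration, and `translations` is not an existing API.

```python
# Hypothetical data: sense id -> (lemma, language code, referenced item).
SENSES = {
    "S1989": ("apple", "en", "Q_FRUIT"),
    "S9000": ("Apfel", "de", "Q_FRUIT"),
    "S7001": ("pomme", "fr", "Q_FRUIT"),
    "S2011": ("apple", "en", "Q_TREE"),
}


def translations(sense_id, senses):
    """Return (language, lemma) pairs of all senses sharing the same item."""
    _, own_language, own_item = senses[sense_id]
    return sorted(
        (language, lemma)
        for lemma, language, item in senses.values()
        if item == own_item and language != own_language
    )


print(translations("S1989", SENSES))
print(translations("S9000", SENSES))
```

Because every sense pointing at the same item sees all the others, the derived translation relation is symmetric and transitive by construction. That is why this route only fits translations that really have those properties.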

Example entry

  • (lexeme) L123 (won't be displayed)
  • (lemma) apple
  • (language) English (i.e. Q1860)
  • (lexical category) noun (i.e. Q1084)
  • (statement) pronunciation → IPA /ˈæpl̩/
  • (statement) syllable → "ap-ple"
  • (form) F272 (won't be displayed)
    • (representation) apples
    • (lexical property) plural (i.e. Q146786)
    • (statement) rhymes with → grapples (F404)
  • (sense/meaning) S2011 (won't be displayed)
    • (gloss) (en) tree of the genus Malus
    • (gloss) (de) Baum der Gattung Malus
  • (sense) S1989 (won't be displayed)
    • (gloss) (en) fruit of the apple tree
    • (gloss) (de) Frucht des Apfelbaumes
    • (statement) translation → Apfel (i.e. S9000, which is connected to L234, which has the lemma 'Apfel' and the language 'German')
    • (statement) hypernym → fruit (i.e. S239)
  • (linguistically related words)

etc.

Note that this is a single entry, i.e. forms and senses do not have their own pages but are part of the lexeme they depend on.

Tasks

Task 1: Wiktionary interwiki links

The Wiktionary interwiki links for entries would not be handled by Wikidata in the way that Wikipedia interwiki links are handled. Because the granularity of pages on Wiktionary differs from the planned granularity of Lexemes in Wikidata, sitelinks need a different solution than for Wikipedia. Instead, we need a new, central component that keeps track of all Wiktionary pages in the main namespace, and a client on each Wiktionary that queries this central list to display interwiki links on the given Wiktionary.

Interwiki links in Wiktionary mostly connect pages with the same name to each other, which makes a special-case tool rather easy to create: an extension that builds the set of interwiki links for a given page in the configured namespaces (usually the main namespace on Wiktionaries) by looking for pages with the same name on the other Wiktionaries (or, more generally, on other projects within a configured namespace). Links mentioned locally in the wikitext are then added and override the generated ones.
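
The same-name lookup described for this task can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any existing extension: `central_index` stands in for the central page list as a hypothetical mapping from wiki code to the set of main-namespace titles, and `local_links` stands in for links written in the page's wikitext.

```python
def interwiki_links(title, home_wiki, central_index, local_links=None):
    """Compute interwiki links for one page.

    central_index: wiki code -> set of page titles in the configured namespace.
    local_links:   wiki code -> title, taken from the page's own wikitext.
    """
    # Step 1: link to every other wiki that has a page with the same name.
    links = {
        wiki: title
        for wiki, titles in central_index.items()
        if wiki != home_wiki and title in titles
    }
    # Step 2: local links are added and override the generated ones.
    links.update(local_links or {})
    return links


# Hypothetical central list covering three Wiktionaries.
index = {
    "en": {"apple", "arm"},
    "de": {"arm", "Apfel"},
    "fr": {"apple", "arm"},
}
print(interwiki_links("arm", "en", index))
print(interwiki_links("apple", "en", index, local_links={"fr": "Apple"}))
```

No Items or other Entities are involved in this lookup, which is consistent with the links being handled outside of Wikidata.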

Note that this might also be used for the User-namespace on other Wikimedia-projects, thanks to the finalized SUL.

This new component would not create any new Items or other Entities in Wikidata. The majority of Wiktionary interwiki links would be handled outside of Wikidata.

This task, unlike most of the other tasks, can probably also be tackled by volunteers or outside of the main development team. It can also be done long before the rest is started.

==> This task is now achieved, using Extension:Cognate!

Task 2: Switch on Phase 1 for Wiktionary

Once the new component from Task 1 is switched on, the usual Wikidata Phase 1 can be enabled for Wiktionary. This will make it possible to create interwiki links for those pages that would not be connected through the new component, e.g. to connect the tea room with le questions sur les mots, etc.

The Task 1 component thus has to be additive with the functionality of Wikidata (so-called Phase 1), which can then be used to connect pages not in the main namespace to each other. Switching on Phase 1 for Wiktionary (i.e. providing these sitelinks from Wikidata) must happen after this extension has been enabled (or else people will create hundreds of thousands of items on Wikidata for the purpose of providing these trivial sitelinks).

==> This task is documented on Wikidata:Wiktionary/Sitelinks

Task 3: Lexeme entity type

Has a single Label (not per language as for Items), Language, Word type, and Statements, but no Description or Sitelinks.

Note that two words in two different languages that happen to be spelled the same (e.g. arm@en and arm@de) are two different lexemes. Likewise, two words within one language that have different grammatical properties are described in two different lexemes (e.g. walk@en as a noun and as a verb).

The Lexeme entity type would not have Q-Ids (these are reserved for Items), but L-Ids.

Task 4: Embedded entity type

Form and Sense are conceptually Entities, but they don’t have their own wiki page; they are embedded in their hosting Lexeme page. This might require a bit of refactoring in existing code.

Task 5: Form entity type

Has a (single, not per language) Label (monolingual text), Grammatical markers, and Statements, but no Description or Sitelinks.

Task 6: Sense entity type

Has a Gloss (multilingual text, like a Label or Description for Items) and Statements, but no Label or Sitelinks.

Task 7: Extending search

Search on Lexemes uses Language and Word type for the autodescription, followed by a disambiguator if needed (e.g. the two lexemes for See@de would be “See // German noun (1)” and “See // German noun (2)”). Alternatively, it could use the first sense description to disambiguate. Search also triggers on Forms (just as it currently triggers on Aliases for Items, e.g. type [went], find “go // English verb // Past tense: went”).
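
The autodescription and the form-triggered search result can be sketched as below. This is only an illustration of the display strings described above. The function names, the tuple layout, and the `FORMS` lookup table are assumptions, not part of any existing search backend.

```python
from collections import Counter


def autodescriptions(lexemes):
    """lexemes: list of (lemma, language label, word type label) tuples.

    Returns one display string per lexeme, adding a numeric disambiguator
    only when several lexemes would otherwise get the same string.
    """
    totals = Counter(lexemes)
    seen = Counter()
    labels = []
    for key in lexemes:
        lemma, language, word_type = key
        label = f"{lemma} // {language} {word_type}"
        if totals[key] > 1:
            seen[key] += 1
            label += f" ({seen[key]})"
        labels.append(label)
    return labels


# Hypothetical index: representation -> (lexeme autodescription, form description).
FORMS = {"went": [("go // English verb", "Past tense")]}


def search_forms(query, forms):
    """Return display strings for lexemes that have a form matching the query."""
    return [f"{label} // {description}: {query}" for label, description in forms.get(query, [])]


print(autodescriptions([("See", "German", "noun"), ("See", "German", "noun")]))
print(search_forms("went", FORMS))
```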

Task 8: Arbitrary access

Allow Wiktionary clients (i.e. the current Wiktionary projects) to access arbitrary data from Wikidata, so the clients can do whatever they want with it (e.g. create content for Wiktionary such as flection tables, etc., or even larger parts of entries for languages that are otherwise not well supported in a given project, etc.).
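
As an illustration of what a client could do with such data, the sketch below renders a very simple flection table from the forms of one lexeme. It is written in Python for readability only. The input layout (a list of representation/lexical-property pairs using plain labels instead of item ids) and the function name `flection_table` are assumptions.

```python
def flection_table(forms):
    """forms: list of (representation, set of lexical property labels).

    Returns one text row per form, in the order given.
    """
    rows = []
    for representation, properties in forms:
        rows.append(f"{', '.join(sorted(properties))}: {representation}")
    return "\n".join(rows)


# Hypothetical forms of the English noun "apple".
print(flection_table([("apple", {"singular"}), ("apples", {"plural"})]))
```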

Task 9: Link Wiktionary from Wikidata

Display appropriate links to Wiktionary, based on the central list of Wiktionary pages. Appropriate places are likely Lexemes and Forms. Note that these links are not saved in Wikidata, but generated and displayed.

Task 10: Assess further needs after deploying interwiki links extension

Check which interwiki links remain in Wiktionary and figure out whether more needs to be done. Probably the community will have told us by then what more needs to happen, but if it hasn’t, ask and listen. This might create further tasks. This discussion about additional needs should happen after Task 1, Task 7 and Task 8 are done, or else the current situation would be discussed instead of the newly created one.

Task 11: Compact view for Forms

The Forms section is far too big in default view. Instead, introduce a more compact default view for Forms that can be expanded inline. See mock-ups below.

Task 12: Compact view for Senses

The Senses section is quite big in default view. Instead, introduce a more compact default view for Senses, that can be expanded inline. See mock-ups below.

Task 13: Supercompact view for Forms

The compact Forms view can still be quite big (especially for Finnish verbs and similar Lexemes). Introduce a supercompact default view for Forms, that can be expanded inline to the compact view. See mock-ups below.

Task 14: Handle multiple representations

For languages like Serbian, Uzbek, Kazakh, Chinese, etc, that use several scripts, the new structure is not ideal (but optimizing for these few languages would be detrimental for the other languages). Once we are here (say, once Task 7 is done), we need to solve the issue of multiple Representations, possibly by adding some special handling to Forms or by automatic transliteration mechanisms. The actual solutions need to be assessed and discussed with the wider community at this point (but not much earlier, so that their fit in the overall architecture can be meaningfully discussed).

Mockups

Extended view

Compact view

Supercompact view

Acknowledgements

This document has been discussed and extended in discussions with Lydia Pintscher and Daniel Kinzler. Further acknowledgements are given in the 2013-08 proposal.