Wikidata:Lexicographical data/Layout

This is a specification about how data can be modeled (organized) in Wikidata. Everyone can edit this page before Wiktionary support is deployed (and ideally Wiktionary support should not be deployed before the page is completed).

The page also includes properties needed for organizing lexeme data. All terms in <> refer to proposed properties, with the datatype (which may be omitted) following in parentheses; terms in {} refer to proposed items.

Lemma

Part of Lexeme

The lemma is a human-readable representation of the lexeme. For languages whose script has two cases, the entry name usually begins with a lowercase letter. For example, use work for the English noun and verb, not Work. Words that begin with a capital letter in running text are exceptions. For prefixes, suffixes and other morphemes in most languages, place the character "-" where the morpheme links with other words.

Most symbols, such as # or |, are allowed in a lemma and should be written as such. The title of a matched-pair entry consists of both the left and right symbols, with a space in between.

Lexical category / Part of speech

Part of Lexeme

This is a reference to a concrete Item.

Allowed POS headers:

  • Parts of speech: Adjective, Adverb, Ambiposition, Article, Circumposition, Classifier, Conjunction, Contraction, Counter, Determiner, Ideophone, Interjection, Noun, Numeral, Participle, Particle, Postposition, Preposition, Pronoun, Proper noun, Verb
  • Morphemes: Circumfix, Combining form, Infix, Interfix, Prefix, Root, Suffix
  • Symbols and characters: Diacritical mark, Letter, Ligature, Number, Punctuation mark, Syllable, Symbol
  • Phrases: Phrase, Proverb, Prepositional phrase
  • Lojban-specific parts of speech: Brivla, Cmavo, Gismu, Lujvo, Rafsi

Transliteration

As Statement of Form

See [1].

Alternative forms

As Form of Lexeme, or Representation of Form

Alternative forms to be provided as other Forms:

  • hyphenation/compounds: tea cup, tea-cup, teacup
  • style variation: naiveté, naïveté
  • uncertain capitalization: laser, LASER

Alternative forms to be provided as another language/variant of Representation in a Form:

  • regional variations: color, colour; center, centre
  • different scripts: реч, reč (Serbo-Croatian for word)

Unclear:

  • historical variations: anæmia, anaemia; coördinate, coordinate

Description

As Statement of Lexeme or Item

<glyph description> (multilingual text) is a visual description of the current symbol. See Wiktionary:Votes/2016-08/Description.

Glyph origin

As Statement of Lexeme or Item

We could perhaps have a single property "glyph derived from", but modeling compound characters requires a more complex layout.

Layout 1

  • <glyph origin> (item)
  1. {borrowing}, qualifier <of lexeme>/<of form>/of (P642) e.g. A (Latin letter) => Α (Greek letter)
  2. {combination}, qualifier <of lexeme>/<of form>/of (P642) e.g. DZ => D and Z
  3. {modification}, qualifier <of lexeme>/<of form>/of (P642) e.g. G => C
  4. {pictogram}, qualifier <glyph interpretation> (multilingual text) e.g. 日 => sun
  5. {(simple) ideogram}, qualifier <glyph interpretation> e.g. 一 => one
  6. {compound ideographs}, qualifier <of lexeme>/<of form>/of (P642) and <glyph interpretation> e.g. 武 => 止 and 戈; army going on expedition
  7. {phono-semantic compound}, qualifier <semantic part> (lexeme/form/item) and <phonetic part> (lexeme/form/item) e.g. 菜 => semantic 艹 + phonetic 采
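
As a rough illustration of Layout 1, the sketch below models a <glyph origin> statement as a value plus its qualifiers. The class name GlyphOrigin, its field names and the plain-string labels are hypothetical and chosen only for this example; they are not existing Wikidata properties or part of any API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GlyphOrigin:
    """Hypothetical record for one <glyph origin> statement under Layout 1."""
    mode: str  # label of the proposed item used as the value, e.g. "combination"
    of: List[str] = field(default_factory=list)  # <of lexeme>/<of form>/of (P642) qualifier values
    interpretation: Dict[str, str] = field(default_factory=dict)  # <glyph interpretation> by language code
    semantic_part: Optional[str] = None  # <semantic part> qualifier
    phonetic_part: Optional[str] = None  # <phonetic part> qualifier


# Examples taken from the numbered list above.
origins = {
    "DZ": GlyphOrigin(mode="combination", of=["D", "Z"]),
    "武": GlyphOrigin(mode="compound ideographs", of=["止", "戈"],
                     interpretation={"en": "army going on expedition"}),
    "菜": GlyphOrigin(mode="phono-semantic compound",
                     semantic_part="艹", phonetic_part="采"),
}

for glyph, origin in origins.items():
    print(glyph, origin.mode, origin.of or (origin.semantic_part, origin.phonetic_part))
```

Which qualifiers are filled in depends on the mode, which is why every qualifier field is optional in this sketch.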

Note that based on (P144) is currently used for this, but that property is intended for works.

Layout 2

  • <glyph derived from> (lexeme/form/item), qualifier <mode of derivation> (item) and <glyph interpretation>

Etymology

As Statement of Lexeme, or of Form

Etymology is preferably stored in lexemes, except for suppletions (e.g. went). Only the direct etymology is needed. It is better to create separate lexemes for different etymologies than to use a qualifier to indicate this.

Similarly, there are two possible layouts for etymology.

Layout 1

Layout 2

This is the current layout at Wikidata:Wiktionary/Data model examples/hard (adjective, English).

  • <derived from> (form/item), qualifier <mode of derivation> (item)

Note that this layout has several flaws:

  1. It may result in many duplicated qualifiers.
  2. It's not easy to handle multiple possible etymologies.
  3. Onomatopoeia cannot be handled.

Common property

  • <compound of> (phrase) e.g. "seeing is believing" -> "seeing", "is", "believing"
  • <akin to> (lexeme)

"phrase" may be a new datatype which is an ordered list of lexemes/senses. It could instead be modeled as a list of values ordered by series ordinal (P1545), but this is not useful for translation (as a term can be translated to different phrases in different languages).
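
The series-ordinal alternative can be sketched as follows: each value of <compound of> carries a series ordinal (P1545) qualifier, and the phrase is recovered by sorting on it. The (lexeme, ordinal) pair layout and the function name ordered_components are assumptions made for illustration only.

```python
def ordered_components(values):
    """Return component lexemes sorted by their series ordinal (P1545) qualifier.

    `values` is a hypothetical list of (lexeme, ordinal) pairs. The ordinal is
    held as a string here, so it is converted to int to avoid "10" sorting
    before "2".
    """
    return [lexeme for lexeme, ordinal in sorted(values, key=lambda pair: int(pair[1]))]


# Statement values for "seeing is believing", stored in arbitrary order.
values = [("believing", "3"), ("seeing", "1"), ("is", "2")]
print(ordered_components(values))
```

This recovers the word order within one language, but it does not address the translation problem noted above.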

Pronunciation

As Statement of Form

See IPA transcription (P898) and pronunciation audio (P443). Sometimes <refers to sense> should be used as a qualifier.

Regional or accent-specific pronunciation may be indicated by the qualifier valid in place (P3005) and by the new properties <accent> (item), <standard in place> (item) and <variant in place> (item).

As rhyme is a transitive relation, we use a single property <rhyme> (item) to express it, rather than listing all words that rhyme with a specific word (as is done in Wikidata:Wiktionary/Data model examples/hard (adjective, English)).

  • <syllabification> (multilingual text?)
  • <word with diacritical signs> (monolingual text?) e.g. līber, كِتَاب
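
The single <rhyme> property described above can be sketched as a grouping step: forms that share the same rhyme value rhyme with each other, so no pairwise list is needed. The (representation, rhyme) pair layout, the function name rhyme_groups and the plain strings standing in for rhyme items are assumptions for this example.

```python
from collections import defaultdict


def rhyme_groups(forms):
    """Group forms by the value of their proposed <rhyme> statement.

    `forms` is a hypothetical list of (representation, rhyme_value) pairs;
    the rhyme values are placeholder strings standing in for items.
    """
    groups = defaultdict(list)
    for representation, rhyme_value in forms:
        groups[rhyme_value].append(representation)
    return dict(groups)


forms = [("hard", "-ɑːd"), ("card", "-ɑːd"), ("guard", "-ɑːd"), ("word", "-ɜːd")]
groups = rhyme_groups(forms)

# Everything that rhymes with "hard" shares its rhyme value.
print(groups["-ɑːd"])
```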


Morphology

As Statement of Lexeme

Separate lexemes(?) may be created for different morphology features.

  • <type of declension> (item)
  • <noun class> (item)
  • <grammatical gender> (item)?

Inflection

As Form of Lexeme

Each form defines how a lexeme changes based on a specific syntactic role or mode it may take in a sentence. A form includes a list of grammatical features that define the syntactic role to which the form applies. These are given as references to concrete Items.
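
A minimal sketch of this structure, assuming a form is a representation plus a set of grammatical features. The Form class, the find_forms helper and the plain-text feature labels (used here in place of Item references) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Form:
    """Hypothetical form record: one representation and its grammatical features."""
    representation: str
    grammatical_features: frozenset  # would be Item references; labels are used here


def find_forms(forms, *features):
    """Return representations of forms that carry all of the requested features."""
    wanted = set(features)
    return [f.representation for f in forms if wanted <= f.grammatical_features]


go_forms = [
    Form("go", frozenset({"infinitive"})),
    Form("goes", frozenset({"present tense", "third person", "singular"})),
    Form("went", frozenset({"past tense"})),
    Form("gone", frozenset({"past participle"})),
]

print(find_forms(go_forms, "past tense"))
print(find_forms(go_forms, "third person", "singular"))
```

Using a set of features rather than a single label lets one form be selected by any combination of the features it carries.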

Definitions

As Sense of Lexeme

Each definition is a sense. The sense can be qualified by statements:

  • <grammatical property> (item): countable, uncountable, etc.
  • valid in place (P3005)
  • <apply to variant> (item)
  • <register> (item)
  • <subject field> (item) (reuse existing field of work (P101)?)
  • <connotation> (item)? (pejorative, etc.)
  • <grammatical frame> (string)
  • <classifier> (lexeme)
  • <attested since> (time)
  • <attested until> (time)

Usage notes

As Statement of Sense

<usage note> (multilingual text)

Example sentences and quotations

As Statement of Sense

  • <example sentence> (multilingual text)
  • <quotation> (multilingual text)

Semantic relations

As Statement of Sense

  • <synonym> (sense)
  • <antonym> (sense)
  • <hypernym> (sense)
  • <hyponym> (sense)
  • <meronym> (sense)
  • <holonym> (sense)
  • <troponym> (sense)
  • <coordinate term> (sense)?

Derived terms

As Statement of Sense

<derived term> (lexeme)

Related terms

As Statement of Lexeme

<related terms> (lexeme)?

Translations and denotements

As Statement of Sense

Translations can be either done from sense to sense, or by a sense referencing a common Wikidata item. If the latter is done, the translations will be automatically displayed and kept up to date. This is only possible when the translation is symmetric and transitive, which is often not the case — but frequently enough to merit a specific implementation.
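
The item-based approach can be sketched as follows: senses that reference the same item are treated as translations of one another, so adding a sense in a new language updates every other language at once. The (language, lemma, item) triple layout, the function name translations_via_item and the placeholder identifier "item:water" are assumptions for this example.

```python
from collections import defaultdict


def translations_via_item(senses):
    """Derive translations for senses that reference the same item.

    `senses` is a hypothetical list of (language, lemma, item_id) triples.
    Returns a dict mapping (language, lemma) to the senses in other languages
    that point at the same item.
    """
    by_item = defaultdict(list)
    for language, lemma, item_id in senses:
        by_item[item_id].append((language, lemma))

    result = {}
    for members in by_item.values():
        for member in members:
            # Senses in the same language are synonyms, not translations.
            result[member] = [other for other in members if other[0] != member[0]]
    return result


senses = [
    ("en", "water", "item:water"),  # placeholder identifier, not a real item ID
    ("de", "Wasser", "item:water"),
    ("fr", "eau", "item:water"),
]
translations = translations_via_item(senses)
print(translations[("en", "water")])
```

This grouping is only valid where the translation relation is symmetric and transitive, as noted above; other cases still need sense-to-sense statements.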

Descendants

As Statement of Lexeme/Sense

<descendant> (lexeme)?

Identifiers and references

As Statement of Lexeme

Anagrams

As Statement of Form

Anagrams can be queried by common alphagram value.

<alphagram> (string)
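
As a sketch of how such a query could work, the alphagram of a form is its letters in sorted order, and two forms are anagrams when their alphagrams are equal. The normalization used here (lowercasing and dropping non-letters) is an assumption; the page does not specify one.

```python
def alphagram(representation):
    """Return the letters of a form in sorted order.

    Lowercasing and ignoring non-letter characters are assumptions of this
    sketch, not rules stated in the layout.
    """
    return "".join(sorted(ch for ch in representation.lower() if ch.isalpha()))


forms = ["listen", "silent", "enlist", "tinsel", "stone"]
target = alphagram("listen")

# Anagrams of "listen" are the other forms with the same alphagram value.
anagrams = [f for f in forms if f != "listen" and alphagram(f) == target]
print(anagrams)
```

Storing the alphagram as a string value means anagrams can be found with a simple equality lookup instead of comparing every pair of forms.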