User:Rua/Wikidata for Wiktionarians

Welcome to Wikidata! Wikidata is a collaborative project to collect data about all kinds of things in a structured way. Since 2018, this has included lexicographical data: data about words and phrases. This is very similar to what is found in Wiktionary, but the way the data is entered, stored and presented is very different. This page is a guide for editors who are familiar with the English Wiktionary, and explains how Wiktionary concepts and workflows map to their Wikidata counterparts.

Structure of lexicographical data

The top-level construct is the lexeme, which is analogous to what is categorised as a "lemma" in Wiktionary. A lexeme has a lexical category ("part of speech" in Wiktionary jargon) and a language. It can also have properties assigned to it. These encompass anything that could apply to the entire lexeme, as opposed to its senses or forms. This includes etymology, grammatical gender, inflectional class and so on.
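
As a rough mental model, the pieces of a lexeme can be sketched as a small Python data class. This is only an illustration, not the actual Wikibase storage format: the class name `Lexeme`, its field names, and the placeholders `Q_NOUN` and `L_SOURCE` are assumptions made for the sketch. English (Q1860) and derived from lexeme (P5191) are identifiers that appear elsewhere on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Lexeme:
    """Illustrative model of a lexeme; not the real Wikibase data format."""

    language: str          # item for the language, e.g. "Q1860" (English)
    lexical_category: str  # item for the part of speech (placeholder below)
    lemmas: Dict[str, str] = field(default_factory=dict)        # language code -> lemma
    claims: Dict[str, List[str]] = field(default_factory=dict)  # property -> values

    def add_claim(self, prop: str, value: str) -> None:
        """Attach a lexeme-level property value, such as etymology or gender."""
        self.claims.setdefault(prop, []).append(value)


# "Q_NOUN" and "L_SOURCE" are hypothetical placeholders, not real identifiers.
colour = Lexeme(language="Q1860", lexical_category="Q_NOUN", lemmas={"en": "colour"})
colour.add_claim("P5191", "L_SOURCE")  # derived from lexeme (P5191)
print(colour.claims)
```

Anything that applies to the whole lexeme goes into `claims`; senses and forms would be separate collections hanging off the same object.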

A new lexeme is created with Special:NewLexeme, which creates a lexeme with just the basic data. More information can be added after you create it. Be sure to check whether the lexeme already exists before you create it. If you accidentally create a duplicate lexeme, you can use Special:MergeLexemes to merge the two.

Lemmas

Whereas Wiktionary's "lemma" concept corresponds to the Wikidata "lexeme", Wikidata's own term "lemma" is what one might call the "lemma form" or "dictionary form". It is a textual representation of the entire lexeme, and is generally just the same as the word or one particular inflected form of the word. In case of multiple distinct lemma forms, such as the English colour/color (L1347) with its two spelling variants, each lemma is given a special language tag to indicate which variety of the language the lemma belongs to. Wikidata does not create separate lexemes for separate variants of a word; they are all subsumed under the same lexeme because, in a sense, they are the same "thing".
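
One way to picture multiple lemmas on a single lexeme is as a mapping from language tag to spelling. The sketch below is a minimal illustration under assumptions: the tags `en-gb` and `en-us` and the helper `pick_lemma` are made up for this example, so check the actual lexeme to see which tags are in use.

```python
from typing import Dict, Sequence

# One lexeme, two lemma spellings. The tags are assumed for illustration.
lemmas: Dict[str, str] = {"en-gb": "colour", "en-us": "color"}


def pick_lemma(lemmas: Dict[str, str], preferred: Sequence[str]) -> str:
    """Return the lemma for the first preferred tag that is present.

    Falls back to the first lemma entered if none of the tags match.
    """
    for tag in preferred:
        if tag in lemmas:
            return lemmas[tag]
    return next(iter(lemmas.values()))


print(pick_lemma(lemmas, ["en-us"]))  # color
print(pick_lemma(lemmas, ["en-ca"]))  # colour (fallback)
```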

Etymology

English Wiktionary groups one or more lemmas (lexemes) under an etymology. Wikidata does it the other way around: etymology is indicated using properties attached to lexemes.

The derived from lexeme (P5191) property indicates a generic derivation from another lexeme, and is like Wiktionary's {{derived}} template. Like its Wiktionary counterpart, it is unspecific about the kind of derivation. Wikidata does not have redlinks, which means that to indicate a derivation from another lexeme, an entry for that lexeme has to already exist. However, you are not required to specify anything about that lexeme other than a language and part of speech, so you can create a "stub" lexeme for you or others to fill in later.
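
The "no redlinks" rule can be illustrated with a toy in-memory store. This is a sketch under stated assumptions, not how Wikidata is actually edited: the functions `create_stub` and `add_derived_from`, the dictionary layout, and every identifier starting with `L-demo` or `Q_` are hypothetical.

```python
from typing import Dict

# Toy in-memory store: lexeme ID -> lexeme data. All IDs here are made up.
store: Dict[str, dict] = {}


def create_stub(store, lexeme_id, language, lexical_category):
    """Create a lexeme with only the required basic data."""
    store[lexeme_id] = {
        "language": language,
        "lexical_category": lexical_category,
        "claims": {},
    }


def add_derived_from(store, lexeme_id, source_id):
    """Record 'derived from lexeme' (P5191); the source must already exist."""
    if source_id not in store:
        raise KeyError(f"{source_id} does not exist yet; create a stub first")
    store[lexeme_id]["claims"].setdefault("P5191", []).append(source_id)


create_stub(store, "L-demo-word", "Q1860", "Q_NOUN")
create_stub(store, "L-demo-source", "Q_SOURCE_LANGUAGE", "Q_NOUN")
add_derived_from(store, "L-demo-word", "L-demo-source")
```

Trying to link to a lexeme that has not been created raises an error here, which mirrors the need to create at least a stub first.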

If you want to further specify the type of derivation, you can use the mode of derivation (P5886) property. This is a qualifier property, which is a property that's attached to another property, in this case to the generic "derived from" property. You give it a Wikidata item which represents the kind of derivation. Some Wiktionary templates, with the equivalent Wikidata item:

Straightforward combinations of morphemes, like {{compound}}, {{affix}}, {{prefix}}, {{suffix}} and other templates, are all handled with the combines lexemes (P5238) property. You can provide multiple values for this property, each representing one of the lexemes from which this one was created. When a property has multiple values, these values are considered to have no particular order by default, so you also need to add a qualifier property series ordinal (P1545) to each of the parts, to indicate the order in which the parts of the word were combined. You can see this on long-legged (L23252).
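
Because values of a property have no inherent order, anything that reads the parts of a compound has to sort them by the series ordinal (P1545) qualifier. The sketch below assumes a simplified statement layout and uses placeholder values (`L_LONG`, `L_LEG`, `L_ED`) for the component lexemes; the three-part split is an assumption for illustration and may differ from the actual entry. The ordinals are stored as strings here and converted to integers before sorting.

```python
P_COMBINES = "P5238"        # combines lexemes
P_SERIES_ORDINAL = "P1545"  # series ordinal

# Statements in arbitrary order; the values are hypothetical placeholders.
statements = [
    {"property": P_COMBINES, "value": "L_LEG", "qualifiers": {P_SERIES_ORDINAL: "2"}},
    {"property": P_COMBINES, "value": "L_ED", "qualifiers": {P_SERIES_ORDINAL: "3"}},
    {"property": P_COMBINES, "value": "L_LONG", "qualifiers": {P_SERIES_ORDINAL: "1"}},
]


def ordered_parts(statements):
    """Return the combined lexemes sorted by their series ordinal."""
    parts = [s for s in statements if s["property"] == P_COMBINES]
    parts.sort(key=lambda s: int(s["qualifiers"][P_SERIES_ORDINAL]))
    return [s["value"] for s in parts]


print(ordered_parts(statements))  # ['L_LONG', 'L_LEG', 'L_ED']
```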

Gender and number

Grammatical gender is entered with another property on the lexeme, grammatical gender (P5185). As the value, you give it an item representing the gender. Some common ones are:

Gender and number are not combined on Wikidata, but are considered entirely separate. To indicate that a lexeme behaves unusually with respect to number, you can use instance of (P31) with one of these items as the value:

Senses

Wikidata also allows senses to be added to lexemes. Each sense is a free-text entry in a given language.

Inflection

Inflection on Wikidata is split into two different concepts.

On the one hand, lexemes can be assigned a particular class. In Wiktionary terms, this can be thought of as choosing a particular inflection-table template and supplying it with arguments. The class is specified using a property on the lexeme, and can be (property to be created) (for any lexical category) or conjugation class (P5186) (for verbs specifically). The value of this property can be any item that conceptually identifies the type of inflection/conjugation. For example, a German verb might use Germanic weak conjugation (Q56651357), while a Latin verb might use first conjugation impersonal verb (Q53768605). Your language may need (or already have) its own specific inflection/conjugation class items, or it could make use of more generic ones that can apply to many languages at once. To specify further details of the inflectional pattern, such as whether it has any irregularities or a particular grammatical quirk, you can add qualifiers to the inflection/conjugation class property, or you can create a separate item for it.

The other side of inflection is the inflected forms themselves. In Wiktionary, these are displayed in an inflection table or in the headword line, and then separate pages are created for each of the forms, which link them back to the lemma. On Wikidata, there are no separate pages for forms; instead, they are attached to a lexeme. A lexeme's form has one or more representations, which show how the form is attested in a particular language variety or instance. These work pretty much the same as lemmas do. A form also has a set of grammatical features, which indicate (using items) what the form actually represents grammatically, such as plural (Q146786), neuter (Q1775461), third person (Q51929074), imperative (Q22716), present tense (Q192613) and so on. Each grammatical facet of a form can be specified with an individual item. Finally, forms can have their own properties, which can indicate various aspects of that specific form.

Some more points to mention about forms:

  • Every lexeme should have at least one form, representing the lemma form. If the lexeme only has that one form, like an English preposition for example, then the form should have no grammatical features.
  • Grammatical features are a logical conjunction, connected with AND: they all apply at the same time. One should never have to choose between alternative grammatical features. If you find that is the case (that is, there is syncretism), then they should be entered as separate forms, each with its own set of grammatical features. For example, although the English word walked is both the past tense form and the past participle form, it is entered on Wikidata as two separate forms.
  • Multiple forms can have the same set of grammatical features. This indicates that the forms are alternatives, and can be used interchangeably.
  • Forms are numbered sequentially in the order of creation. There is no significance to these numbers, so don't worry about them. At the same time, don't assume that a particular form has a particular number; the lemma form may not always be F1!
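
The points above can be sketched with a small model of forms, where grammatical features are a set that must all apply at once. Everything here is an assumption for illustration: the `Form` class, the helper `forms_with`, the form numbers, and the feature labels `INFINITIVE`, `PAST_TENSE` and `PAST_PARTICIPLE`, which stand in for the real items.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass
class Form:
    form_id: str                      # assigned in order of creation
    representations: Dict[str, str]   # language code -> spelling
    features: FrozenSet[str] = frozenset()  # all features apply together (AND)


# The lemma form is deliberately not F1, and "walked" is entered twice.
forms = [
    Form("F1", {"en": "walked"}, frozenset({"PAST_TENSE"})),
    Form("F2", {"en": "walk"}, frozenset({"INFINITIVE"})),
    Form("F3", {"en": "walked"}, frozenset({"PAST_PARTICIPLE"})),
]


def forms_with(forms: Iterable[Form], wanted: Iterable[str]) -> List[Form]:
    """Return the forms that carry every one of the wanted features."""
    wanted = frozenset(wanted)
    return [f for f in forms if wanted <= f.features]


print([f.form_id for f in forms_with(forms, {"PAST_TENSE"})])  # ['F1']
print(forms_with(forms, {"PAST_TENSE", "PAST_PARTICIPLE"}))    # []
```

The second query returns nothing because no single form is both past tense and past participle; the syncretic spelling is stored as two separate forms.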

Pronunciation

On Wiktionary, pronunciation is specified separately from the lemma, in its own section, and multiple words sharing the same pronunciation are often grouped under a common pronunciation section. In Wikidata, the approach is rather different, but more practical: pronunciation is attached only to forms, not to lemmas. It is indicated using a property on the particular form it applies to. For pronunciation in the International Phonetic Alphabet, the property IPA transcription (P898) is used; it is equivalent to Wiktionary's widely used {{IPA}} template. Properties for other transcription schemes may also exist, or may need to be created first.

Because pronunciation is attached only to forms, not to lexemes, every lexeme must have at least one form. This form represents the lemma form, and must also hold the pronunciation of that form.
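
To find "the pronunciation of a word" in this model, you look up the form whose representation matches the lemma and read IPA transcription (P898) from it. The sketch below is one possible way to do that, under assumptions: the dictionary layout and the helper `lemma_pronunciations` are hypothetical, and the transcriptions are illustrative values rather than data copied from Wikidata.

```python
P_IPA = "P898"  # IPA transcription

# Simplified forms of a lexeme; the layout and values are illustrative only.
forms = [
    {"id": "F1", "representations": {"en": "walk"}, "claims": {P_IPA: ["wɔːk"]}},
    {"id": "F2", "representations": {"en": "walked"}, "claims": {P_IPA: ["wɔːkt"]}},
]


def lemma_pronunciations(lemmas, forms):
    """Collect IPA values from every form spelled like one of the lemmas."""
    lemma_texts = set(lemmas.values())
    result = []
    for form in forms:
        if lemma_texts & set(form["representations"].values()):
            result.extend(form["claims"].get(P_IPA, []))
    return result


print(lemma_pronunciations({"en": "walk"}, forms))  # ['wɔːk']
```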

"Display forms" and dictionary-only diacritics[edit]

Wiktionary has the concept of a "display form" of a term: a form that includes special diacritics and other marks that are not used in the normal written form, and therefore not in the names of pages, but are a useful indication of pronunciation. This form may be used in the headwords or with a linking template, and is automatically converted by linking templates to the plain written form. This concept also exists on Wikidata, although it is still infrequently used.

You can enter it by creating a second representation of a form (not a lemma) and attaching the special code -x-Q7249970 (referring to pronunciation respelling (Q7249970)) to the language code. Then fill in the representation with the dictionary-only marks. An example can be seen at dictionarium (L24786).
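
The language code for such a representation is just the ordinary code with the suffix appended. The helpers below are a small sketch of building and splitting these codes; the function names are made up, and the Latin code `la` and the macron spelling are illustrative assumptions rather than values copied from the lexeme.

```python
RESPELLING_SUFFIX = "-x-Q7249970"  # pronunciation respelling (Q7249970)


def respelling_code(base_code: str) -> str:
    """Build the language code for a dictionary-only representation."""
    return base_code + RESPELLING_SUFFIX


def split_code(code: str):
    """Split a code into (base code, private-use part or None)."""
    base, _, private = code.partition("-x-")
    return base, (private or None)


# A form with a plain representation and a display-form representation.
representations = {
    "la": "dictionarium",
    respelling_code("la"): "dictiōnārium",
}

print(sorted(representations))      # ['la', 'la-x-Q7249970']
print(split_code("la-x-Q7249970"))  # ('la', 'Q7249970')
```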

Language codes and language treatment

Wiktionary makes much more use of language codes than Wikidata does. The primary means of identifying a language is through an item that represents that language, such as English (Q1860). The item is used to specify the language of a lexeme. This means that any language is possible, as long as it has an item. There are no specific rules about language treatment, and whether something is a language or subsumed as a dialect under another language. This means that both e.g. Croatian (Q6654) and Serbo-Croatian (Q9301) can be used as languages. It is expected that there will be future efforts to coordinate this, to avoid redundancy.

The language of a lemma or representation (for a form) is indicated using a language code, as on Wiktionary, but the number of language codes on Wikidata is rather restricted. The majority of ISO 639 codes are not available. If the language code for your language is not recognised, you should use the catch-all code mis (for "miscellaneous").
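
The fallback rule amounts to a simple lookup. In the sketch below, the set `SUPPORTED_CODES` is a hypothetical stand-in for whatever codes Wikidata currently recognises, and `lemma_language_code` is a made-up helper name.

```python
# Hypothetical subset of recognised codes; the real list is maintained on Wikidata.
SUPPORTED_CODES = {"en", "de", "la"}


def lemma_language_code(code: str, supported=SUPPORTED_CODES) -> str:
    """Return the code if it is recognised, otherwise the catch-all 'mis'."""
    return code if code in supported else "mis"


print(lemma_language_code("de"))   # de
print(lemma_language_code("xyz"))  # mis
```

The language item on the lexeme still identifies the language precisely; only the code on the lemma or representation falls back to the catch-all.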

Criteria for inclusion

Wikidata does not yet have an official policy regarding inclusion criteria, analogous to wikt:WT:CFI. There is, however, a draft document at Wikidata:Lexicographical data/Notability. It uses "notability" as a criterion for inclusion, a term which Wiktionary does not use and which may give the wrong idea. This is really just terminology: "notable" is Wikidata's jargon for "includable". In practice, the criteria are similar to what is found on English Wiktionary and probably do not form a barrier to entry.

Language-specific considerations

Wiktionary describes guidelines and consensus regarding individual languages at pages named "About (language)", like wikt:Wiktionary:About Spanish. Wikidata's lexicographical project has equivalent pages, listed at Wikidata:Lexicographical data/Documentation/Languages. Be sure to have a look there before you start editing in a given language.