Wikidata talk:Lexicographical data/Archive/2013/11

This page is an archive. Please do not modify it. Use the current page, even to continue an old discussion.

Support for Hartz's proposal

 Support I think it’s a reasonable way to interlink the different Wiktionary projects through one and only one Wikidata database. It would surely be very difficult to change the current structure of the Wiktionary project databases on the fly.


Perhaps my support is based more on hope than on knowledge of what will actually be done. But the proposed way seems reasonable. My hopes are shown in the following (simplified) image:
All the different Wiktionary projects should be interlinked using WIKIDATA as one common linkage database between the projects. The links should not be established on a WIKI / page base but on WORD SENSE base. The intersection data should contain “good” example phrases (probably not only one as shown in my image), taken from the one or other side, translated into the “other” language.

I’m aware that this may be difficult because the word senses are contained in one single text block, but perhaps this information could be extracted from the tags. In the long run all redundancies should be removed from the single project databases as described in [Wiktionary future], chap “2.3 Proposal of an improved Wiktionary data model”.

Another decisive point, it seems to me, is that all these different Wiktionary projects and Wikidata should be covered by one common worldwide user interface that hides these technically necessary details. The user, whether English, French, German or Chinese, should have the feeling of working with one and only one database.
NoX (talk) 19:38, 1 October 2013 (UTC)

You're comparing a West-Germanic language (German) and a creole (English) of a West-Germanic language and other languages (Norse and French), and even in your example you show the creole to be leading. I'm afraid this will give the same results as interwiki links on wikipedia: first "bot"s will add "equivalent" links, then those links are moved to wikidata, after that "helpful" outsiders will remove the original interwikis, and at last those correct interwikis can't be re-added because they don't fit the (brave) new worldview. When comparing languages, many words have almost the same meaning, but not completely the same meaning; grammar might look the same, but often differs in details. For interwikis, that's a (minor) problem; but translations will just be wrong. --80.114.178.7 20:27, 24 November 2013 (UTC)

Doubts/questions

Hello, I have some questions/doubts, and it is not clear to me whether these have already been discussed:

  • In the example on the page, a translation of a sense seems to be connected to a lexeme. IMHO this is not completely correct, because a translation should be connected to the lexeme-sense pair. This is one of the big gaps we have today in the wiktionaries: we link to a lemma in another language but we are not able to specify the sense. This concept also applies to synonyms and other relations.
  • As I understand it, in this proposal each lexeme will have its own senses, that is:
  • 1 lexeme <-> n senses ('n' is a number >=1)
  • 1 sense <-> 1 lexeme
I also understand every sense should be translated into every possible language (i.e. have glosses in all the languages).
Now let's consider the simple sense "inhabitant of Paris". Almost every language has a specific lemma for it, like English Parisian, French Parisien, Italian parigino and so on. This means we will have a different sense (S1120, S2234, ...) for each lexeme (language), and each one will repeat the same glosses. Isn't it better to have the gloss only in the lexeme's language and then, through the translation property, have a link to glosses in other languages? This is also how wiktionaries and paper dictionaries work.
Doing a practical example I'm suggesting something like the following:
  • (lexeme) W123
  • (lemma) apple
  • (lexical category) noun (i.e. Q1084)
  • (language) English (i.e. Q1860)
  • (sense) S1989
    • (gloss) (en) fruit of the apple tree
    • (translation) (de) Apfel (i.e. <W1578;S2589>)
  • (lexeme) W1578 (won't be displayed)
  • (lemma) Apfel
  • (lexical category) noun (i.e. Q1084)
  • (language) German (i.e. Q188)
  • (sense) S2589
    • (gloss) (de) Frucht des Apfelbaumes
    • (translation) (en) apple (i.e. <W123;S1989>, the language 'en' and the lemma 'apple' are inherited by the connected lexeme)
  • I have some doubts about the lexical properties suggested for the forms. How can a simple field manage the complexity of forms like, for example, conjugations? Just looking at the Spanish verb dejar, we can see that a verb form has properties like the person (e.g. 'first person'), the number (e.g. 'singular'), the tense (e.g. 'future') and the mood (e.g. 'indicative'). Furthermore, I guess these properties are not the same in every language.
  • What about the words that are not simply a 'sum of parts', like the English verb give up or the Spanish adverb sin embargo. Are they to be considered new lexemes?
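The apple/Apfel example above, where a translation points to a lexeme-sense pair rather than to a lexeme alone, can be sketched in Python as follows. This is only an illustrative sketch: the class names, the `resolve` helper and the W…/S… identifiers are hypothetical (the identifiers are copied from the example) and are not part of any real Wikidata data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A translation target is a (lexeme ID, sense ID) pair, not just a lexeme.
# The W.../S... identifiers are the hypothetical ones from the example above.
SenseRef = Tuple[str, str]


@dataclass
class Sense:
    sense_id: str
    gloss: str  # gloss stored only in the lexeme's own language
    translations: List[SenseRef] = field(default_factory=list)


@dataclass
class Lexeme:
    lexeme_id: str
    lemma: str
    language: str
    category: str
    senses: Dict[str, Sense] = field(default_factory=dict)


def resolve(ref: SenseRef, lexemes: Dict[str, Lexeme]) -> Tuple[str, str, str]:
    """Follow a (lexeme, sense) reference and return (language, lemma, gloss)."""
    lexeme_id, sense_id = ref
    lexeme = lexemes[lexeme_id]
    sense = lexeme.senses[sense_id]
    # Language and lemma are inherited from the connected lexeme.
    return lexeme.language, lexeme.lemma, sense.gloss


apple = Lexeme("W123", "apple", "en", "noun", {
    "S1989": Sense("S1989", "fruit of the apple tree", [("W1578", "S2589")]),
})
apfel = Lexeme("W1578", "Apfel", "de", "noun", {
    "S2589": Sense("S2589", "Frucht des Apfelbaumes", [("W123", "S1989")]),
})
lexemes = {lex.lexeme_id: lex for lex in (apple, apfel)}

for ref in apple.senses["S1989"].translations:
    print(resolve(ref, lexemes))
```

With this layout each gloss is written once, in the language of its lexeme, and a reader in another language reaches it through the translation link.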

I strongly believe Wikidata can boost and benefit all the wiktionaries.

Thanks,--Diuturno (talk) 14:48, 29 October 2013 (UTC)

Hello Diuturno! I share some doubts:
  • as I wrote above I have a proposal for defining senses, I also propose to use the already defined senses at OmegaWiki (example)
  • I am strictly against linking each inflected form; this would be a mess (think about Finnish inflection!). I would rather define the inflection pattern (e.g. de-verb-weak)
  • I propose to treat e.g. separable phrasal verbs like give up regularly, as if they were one term.
--Bigbossfarin (talk) 18:07, 2 November 2013 (UTC)
Not having separate entries for each inflected form is completely unfeasible, as it would exclude an enormous amount of potential data regarding pronunciation, rhymes, homonyms, usage statistics, (infrequently) etymological relations, and so on, of each inflected form. --Yair rand (talk) 21:49, 3 November 2013 (UTC)
Also, inflected forms can be highly irregular, and some words can be inflected in different ways (try French verbs fr:être, fr:ouïr or nouns fr:œil). But they can still be easily described with appropriate fields (those may be specific to each language).
Phrases can be considered lexemes if they are not the sum of their parts (there are some subtleties of course). This is definitely not an issue.
Finally: the predefined senses of OmegaWiki are one of its main flaws. They rest on the simplistic hypothesis that a given sense can be found exactly in every language, which is only true of rigorously defined terminologies (e.g. en:homoscedasticity) or extremely simple words (e.g. apple). The same goes for synonyms: in that case they would have to be exactly interchangeable in every context (they usually are not). Darkdadaah (talk) 10:27, 5 November 2013 (UTC)
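The question raised in this thread, how a form can carry several grammatical properties that differ from language to language, can be illustrated with a small Python sketch. It assumes each form stores an open-ended map of feature names to values; the `Form` class and `find_forms` helper are hypothetical names chosen for illustration, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Form:
    representation: str
    # Feature name -> value. Which feature names exist depends on the
    # language, so this is a free-form map rather than fixed fields.
    features: Dict[str, str]


def find_forms(forms: List[Form], **wanted: str) -> List[str]:
    """Return the representations of all forms matching every wanted feature."""
    return [
        f.representation
        for f in forms
        if all(f.features.get(name) == value for name, value in wanted.items())
    ]


# A few forms of the Spanish verb dejar, each with four features.
dejar_forms = [
    Form("dejaré", {"person": "1", "number": "singular",
                    "tense": "future", "mood": "indicative"}),
    Form("dejarás", {"person": "2", "number": "singular",
                     "tense": "future", "mood": "indicative"}),
    Form("dejé", {"person": "1", "number": "singular",
                  "tense": "preterite", "mood": "indicative"}),
]

# An English noun form needs only one feature.
apple_forms = [Form("apples", {"number": "plural"})]

print(find_forms(dejar_forms, person="1", tense="future"))
print(find_forms(apple_forms, number="plural"))
```

Keeping the features open-ended lets each language define its own set, while every inflected form remains a separate record that could later carry its own pronunciation or usage data.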

My current thoughts

I like how the page has taken shape over the months, divided into phase 1 and phase 2. Good development. Phase 1 is about interwiki links and phase 2 is about all the "fancy stuff". The interwiki links are the first priority, the biggest advantage Wikidata can bring to Wiktionary quickly. Phase 2 is something that can be implemented in years to come. I'd like to see progress on the interwiki front, as we could get rid of interwiki bots in Wiktionary, pages would get shorter (fewer bytes) and interwiki links would be up to date in every version. So let's get started! I don't know how essential it is to develop a new data model -- the current one could probably be used -- but yes, it's probably a better idea not to list all the identical page names, so developing a better data model is justified. However, this requires arbitrary access to any Wikidata item from any page, as said on the page. So, who is working on these Wiktionary-related things? Is it a lot of work? When can we expect phase 1 to take place? Let's think about phase 2 after phase 1 is implemented, to keep us focused (it's a good idea to make phase 2 compatible with phase 1, though). So, all in all, a good proposal, and the to-be-developed data model makes the implementation smarter. It's great that the idea of making Wikidata the 100% data repository of Wiktionary has been ditched -- yes, it's better for each version to take things from Wikidata as needed than to "replace Wiktionary with Wikidata at once". --Hartz (talk) 17:15, 6 November 2013 (UTC)

Only commenting on the timeline point: Wiktionary is the most complicated project and is currently last on my list of projects to add. I know this sucks for you but that's how things will unfortunately have to go. --Lydia Pintscher (WMDE) (talk) 17:21, 6 November 2013 (UTC)
That's expected: even interwiki (language) links are not that trivial in Wiktionaries (see previous discussions above). The other projects will probably be easier to deal with, although each will probably have its own issues (see Wikisource, which seems straightforward at first glance...). I'd rather wait for a well-thought-out plan than rush things. Darkdadaah (talk) 12:51, 7 November 2013 (UTC)

Why don't we take inspiration from how the wiktionaries were built

Dear all, I can see there's no clear consensus on how to proceed. To be honest, I don't get why we are not taking inspiration from how the wiktionaries are built and work today. That is: we have an entry for every sequence of letters I can type on a keyboard (in every possible alphabet) that has a meaning.

These entries are structured like the examples below, and experience tells us it's a consistent structure.

  • (lexeme) W123
  • (lemma) work
  • (language entry) L1829
    • (language) English (i.e. Q1860)
    • (etymology) From Old English weorc
    • (pronunciation) IPA: /wɜːk/
    • (lexical entry) LE2312
      • (lexical category) noun (i.e. Q1084)
        • (sense) S1989
          • (gloss) (en) Labour, employment, occupation, job
          • (translation) (de) Arbeit
    • (lexical entry) LE3891
      • (lexical category) verb (i.e. Q1089)
        • (sense) S1990
          • (gloss) (en) To do a specific task by employing physical or mental powers
          • (translation) (de) arbeiten
  • (lexeme) W12389
  • (lemma) works
  • (language entry) L12131
    • (language) English (i.e. Q1860)
    • (pronunciation) IPA: /wɜːks/
    • (lexical entry) LE2082
      • (lexical category) noun (i.e. Q1084)
      • (form of) work (LE2312)
      • (form type) plural
    • (lexical entry) LE2083
      • (lexical category) verb (i.e. Q1089)
      • (form of) work (LE3891)
      • (form type) third-person singular simple present indicative

The above is not a proposal because I don't have enough knowledge of Wikidata, but I guess it can make sense if we follow this logical structure:

  • A “typeable” can have a meaning in one or more languages;
    • One language (of a typeable) has one or more lexical categories;
      • One lexical category (of a language of a typeable) has one or more meanings.
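The three-level structure above (typeable, then language, then lexical category, then meanings) can be sketched as nested mappings in Python. This is a minimal sketch under the assumption that plain dictionaries are enough to show the shape; the names `entries`, `senses` and `check_structure` are hypothetical, and the glosses are taken from the "work" example.

```python
from typing import Dict, List

# typeable -> language -> lexical category -> list of glosses
Entries = Dict[str, Dict[str, Dict[str, List[str]]]]

entries: Entries = {
    "work": {
        "English": {
            "noun": ["Labour, employment, occupation, job"],
            "verb": ["To do a specific task by employing physical or mental powers"],
        }
    }
}


def senses(entries: Entries, typeable: str, language: str, category: str) -> List[str]:
    """Look up the glosses for one typeable, language and lexical category."""
    return entries.get(typeable, {}).get(language, {}).get(category, [])


def check_structure(entries: Entries) -> bool:
    """Check the 'one or more' rule at every level below the typeable."""
    return all(
        bool(languages) and all(
            bool(categories) and all(bool(glosses) for glosses in categories.values())
            for categories in languages.values()
        )
        for languages in entries.values()
    )


print(senses(entries, "work", "English", "verb"))
print(check_structure(entries))
```

The `check_structure` helper only enforces the cardinalities stated in the list: a typeable with no language, a language with no lexical category, or a category with no meaning is rejected.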

Agreeing on this will also let us start with the famous phase 1 of importing the interwiki links.

Hoping to keep the constructive discussion alive --95.23.148.44 17:43, 6 November 2013 (UTC)

The above comment was written by me while not logged in. --Diuturno (talk) 17:44, 6 November 2013 (UTC)

language family categories

How about linking Wiktionary's language family categories and Wikidata's items about language family categories, like wikt:en:Category:Germanic languages, wikt:nl:Categorie:Germaanse talen and Category:Germanic languages (Q8490990)? Visite fortuitement prolongée (talk) 21:24, 25 November 2013 (UTC)