Wikidata:Property proposal/Unicode character (item)

Unicode character (item)

Originally proposed at Wikidata:Property proposal/Generic

   Not done
Description: Unicode character representing the item
Data type: Item
Domain: instance of (P31)/subclass of (P279)* → letter (Q9788)
Example 1: A (Q9659) → see Q9659#P487; each value will be a new item
Example 2: B (Q9705) → see Q9705#P487; each value will be a new item
Example 3: C (Q9820) → see Q9820#P487; each value will be a new item
Planned use: New items will be created for the current values of Unicode character (P487) on items with instance of (P31) letter (Q9788). The current values of Unicode character (P487) and Unicode code point (P4213) will be moved to these new items.
See also: code (P3295), Unicode character (P487) and Unicode code point (P4213)

Motivation

Currently Wikidata does not distinguish between abstract symbols such as A (Q9659) and the specific characters representing those symbols. So it may be meaningful to create new items for these Unicode characters. The values of Unicode character (P487) and Unicode code point (P4213) will be moved to these new items, and a single-value constraint will then be set on both properties.

Note that this does not affect any item about a single character, such as (Q3595028). GZWDer (talk) 22:02, 11 March 2020 (UTC)
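As a rough illustration of the planned use, the following Python sketch splits the several character values of one letter item into one record per character. The record layout, the name `split_characters`, the sample input and the `U+XXXX` formatting are all assumptions made for illustration; this is not a Wikidata API, and it does not claim to match the stored format of Unicode code point (P4213).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterRecord:
    """Hypothetical stand-in for one new per-character item."""
    letter_item: str  # item of the abstract letter, e.g. "Q9659"
    character: str    # would become the single Unicode character (P487) value
    code_point: str   # would become the single Unicode code point (P4213) value


def split_characters(letter_item: str, characters: list[str]) -> list[CharacterRecord]:
    """Create one record per character currently listed on a letter item."""
    return [
        CharacterRecord(letter_item, ch, f"U+{ord(ch):04X}")
        for ch in characters
    ]


# Illustrative input only: upper- and lower-case Latin A.
records = split_characters("Q9659", ["A", "a"])
for record in records:
    print(record)
```

Because each record holds exactly one character and one code point, a single-value constraint on both properties would be satisfiable for the new items.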

Discussion

  •  Comment I could never really figure out the purpose of Unicode character (P487). It's being used in at least four or five different ways. The above would fix that and it could become an external-id property. --- Jura 21:30, 12 March 2020 (UTC)
  •  Strong support I never understood why Unicode characters were mixed with the glyphs and concepts they represented. --Tinker Bell 06:16, 15 March 2020 (UTC)
    • @Tinker Bell: It's a little 'meta' I think, but I feel like I don't understand what is the actual subject of an item that is "about" a Unicode character. GZWDer's proposal is, I think, only to use this property where a current item has more than one Unicode character value. So for example for Chinese characters, there is only 1 Unicode character, so the item and the Unicode character are equivalent. Does that mean the "concept" of that character and the Unicode character are the same, or distinct? For the letter 'A' example, Unicode differentiates upper- and lower-case, and also those other special conditions that are sort of the letter 'A' in other contexts. So in each case where a new item would be created, that item would be "about" the conceptual context of the use of that letter, not specifically or exclusively about it as a Unicode character. Right? Or is that not the point here? ArthurPSmith (talk) 18:37, 16 March 2020 (UTC)
  •  Question can we change this to lexeme datatype per suggestion above? --- Jura 02:06, 28 March 2020 (UTC)
    • I don't think so - a lexeme may still cover multiple characters or sequences of characters. For example ? have seven characters; but they should be in one (translingual) lexeme unless they are semantically different.--GZWDer (talk) 09:13, 29 March 2020 (UTC)
      • Well, the idea is to use lexemes like invalid ID (L61046) mentioned above. They would exactly be that. --- Jura 09:15, 29 March 2020 (UTC)
        • In addition, lexemes cannot handle characters that are not in Unicode normalized form, like 著 (U+FA5F) (Q55726748) and 著 (U+2F99F) (Q55738328). I don't think we should have lexemes for them as they have no independent meaning.--GZWDer (talk) 09:22, 29 March 2020 (UTC)
          • It's possible that initially not all can be included. The namespace is still under development and eventually a way can be found. We didn't use items either when no lexemes were available. As each character has a definition, this can be included as S1. The problem with using items is that they require needless repetition of labels and descriptions. Lexemes have all that already included. --- Jura 09:28, 29 March 2020 (UTC)
            • However, some lexemes for symbols, such as X (L19342), may still cover multiple characters. I don't see the point of creating additional lexemes for individual characters with no additional meaning.--GZWDer (talk) 09:43, 29 March 2020 (UTC)
              • I don't think existing entities in some languages should be replaced. They can use the proposed property to point to entities like invalid ID (L61046) as well. I don't think the question whether or not to create these is much different from the question of creating them as items. If you don't see the point of one, it's unclear why you would want to create the others. Given the 5 or so ways Unicode character (P487) is used, people clearly have problems with the current structure and the more formal approach of the L-namespace could help. --- Jura 09:52, 29 March 2020 (UTC)
                • An item may easily be tied to a specific Abstract Character, while a lexeme is a unit of lexical meaning, comprising a set of Abstract Characters with the same semantic meaning. I don't think we should have lexemes for characters with no independent semantic meaning. For CJKV characters, I do not favor creating translingual lexemes for them - English Wiktionary deprecated translingual definitions long ago. --GZWDer (talk) 10:35, 29 March 2020 (UTC)
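
The point above about characters outside Unicode normalized form can be checked with Python's standard `unicodedata` module. This is only a sketch: it assumes the relevant form is NFC (the thread does not name one), and the helper name `nfc` is made up for illustration.

```python
import unicodedata


def nfc(ch: str) -> str:
    """Return the NFC-normalized form of a string."""
    return unicodedata.normalize("NFC", ch)


# The two compatibility ideographs named in the discussion.
compat_chars = ["\uFA5F", "\U0002F99F"]

for ch in compat_chars:
    normalized = nfc(ch)
    status = "changed" if normalized != ch else "unchanged"
    print(f"U+{ord(ch):04X} -> U+{ord(normalized):04X} ({status})")
```

If both inputs are reported as changed and map to the same code point, any store that normalizes its strings cannot keep the two characters apart as text, whereas separate items can still record each code point.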

────────────────────────────────────────────────────────────────────────────────────────────────────

  • Can you explain what you think would be duplicated? How could users be confused? (L291359) explains clearly what it's about. For (Q87524936) users would have to find the right language to read the alias to understand what it's about. Seems much more confusing to me. --- Jura 04:43, 31 March 2020 (UTC)

I have the following additional reasons:

  1. You cannot add sitelinks to lexemes, so items like 😂 (Q33836537) and (Q3595028) will still exist.
  2. Some characters have Unicode aliases (see [1], p. 924). Aliases cannot be added to lexemes either.
  3. We will have lexemes anyway for symbols like ( ), which is a matching pair, and for the individual characters ( and ). As a symbol, ( corresponds to multiple code points. Users may confuse the symbol with individual Unicode characters if both have lexemes.
  4. Not every Unicode character has a meaning, and Unicode names are only names: they do not always tell the meaning of a character (like 𗊓 (Q87589786)) and are sometimes even unrelated to it. They only exist as aliases, not as definitions.

--GZWDer (talk) 19:19, 31 March 2020 (UTC)
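
One possible reading of reason 3 is that several distinct code points are compatibility variants of the same left-parenthesis symbol. The sketch below lists a few of them with their Unicode names and NFKC forms, using only Python's standard `unicodedata` module; the choice of code points and the helper name `describe` are assumptions for illustration, not an exhaustive list.

```python
import unicodedata

# A few code points that NFKC folds to the ASCII left parenthesis.
variants = ["\u0028", "\uFF08", "\uFE59", "\u207D", "\u208D"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, NFKC form) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )


for ch in variants:
    print(describe(ch))
```

The names printed here are the kind of label reason 4 refers to: they identify a code point but are not definitions of what the character means.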

Each item is about an "Abstract Character", which may be encoded in multiple codesets (including Unicode). For example, "A" is a Unicode character that is (equal to) an "Abstract Character" encoded in Unicode, and the same "Abstract Character" may also be encoded elsewhere. Most emojis are also "Abstract Characters"; some are encoded in Unicode, some are not. There will be only one item for each "Abstract Character", wherever it is encoded. I think this property should be limited to "Abstract Characters" encoded in Unicode (as unencoded "Abstract Characters" are potentially infinite; this is why we have private use area (Q11152836)).--GZWDer (talk) 20:53, 5 April 2020 (UTC)
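
The idea that one "Abstract Character" can be encoded in several codesets can be sketched with the codecs that ship with Python. The codec selection and the helper name `encodings` are assumptions for illustration; Python strings are themselves Unicode, so this only shows how the same character maps to different byte values elsewhere.

```python
# Encode the same abstract character under several codecs shipped with Python.
ABSTRACT_CHARACTER = "A"
CODECS = ["ascii", "utf-8", "utf-16-be", "cp037"]  # cp037 is an EBCDIC code page


def encodings(ch: str, codecs: list[str]) -> dict[str, str]:
    """Map each codec name to the hex bytes it uses for the character."""
    return {codec: ch.encode(codec).hex() for codec in codecs}


for codec, hex_bytes in encodings(ABSTRACT_CHARACTER, CODECS).items():
    print(f"{codec:10} {hex_bytes}")
```

Under this view a single item would describe the character itself, and each encoding would be one more statement on that item rather than a separate item.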