Wikidata:Property proposal/has code


has code

Originally proposed at Wikidata:Property proposal/Generic

Not done
Description: code for
Data type: Number (not available yet)
Domain: character encoding (Q184759)
Allowed units: none
Example:
⟨ ISO/IEC 8859-1 (Q935289) ⟩ has code ⟨ 0xE9 ⟩
    Unicode character (P487): ⟨ é ⟩
⟨ ISO/IEC 8859-1 (Q935289) ⟩ has code ⟨ 0xE9 ⟩
    codes for: ⟨ É (Q9995) ⟩
See also: ASCII code (for example: check the other properties for consistency, collect data, automate an external link, etc.)

code for: Support, as this is more generic. CC0 (talk) 18:51, 30 October 2016 (UTC)

 Not done No support.--Micru (talk) 07:41, 24 April 2017 (UTC)

encodes

Originally proposed at Wikidata:Property proposal/Generic

Not done
Description: qualifier to link a code to the item of the character it encodes, if relevant.
Data type: Number (not available yet)
Domain: character encoding (Q184759)
Allowed units: none
Example: see above example
Motivation

Compared to the other attempt, this property is restricted in domain and works in the reverse direction: from the character set encoding to the character, rather than from the character to its code. This means a letter can have as many codes as we want without its item being polluted by as many statements as there are character sets. There is also a qualifier to link to the item of the character, if relevant. author TomT0m / talk page 20:26, 12 October 2016 (UTC)

Discussion

Comment I imagine that the character set properties would get quite polluted themselves, especially for a set like GB 2312 (Q1421973), Big5 (Q858372), Shift JIS (Q286345) or KS X 1001 (Q489423). As for pollution of the items for the characters themselves: if anything, numerous codepages may end up sharing a codepoint for a single character, so the main risk is just a lot of qualifiers on a single code. When Wikidata-Wiktionary integration goes live, it may become much more beneficial to add character codes to items (especially Chinese characters) rather than to character sets. Mahir256 (talk) 01:57, 15 October 2016 (UTC)
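
The shared-codepoint situation Mahir256 describes is easy to illustrate concretely. The following is a minimal sketch using Python's built-in codecs; the codec names are Python's, not anything defined by this proposal:

    # Several ISO 8859 parts assign the same byte to the same character,
    # so one code statement could end up carrying several character-set
    # qualifiers. Quick illustration with Python's built-in codecs:
    for codec in ("latin-1", "iso8859-9", "iso8859-15"):
        print(codec, "é ->", "é".encode(codec).hex().upper())
    # All three print E9: one codepoint, shared by multiple codepages.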

Good point. Maybe another relationship should be used to minimize redundancy, like "shares/duplicates codes with/of", qualified by intervals of shared codes. Many European ISO standard character sets share a lot with ASCII, so that would avoid a lot of claims if we avoid duplicating all the codes shared with ASCII. Do you know if there is a similar situation for East Asian character sets? Arabic and Cyrillic character sets should be in a similar situation, as there are about as many letters in the Arabic alphabet as in the Latin one. Japanese and Chinese share a lot of characters... author TomT0m / talk page 08:28, 15 October 2016 (UTC)
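
The ASCII overlap mentioned here checks out over the whole 7-bit range, which shows how a single interval claim could stand in for 128 per-code statements. A hedged illustration (plain Python codecs, not a Wikidata mechanism):

    # ISO/IEC 8859-1 agrees with ASCII over the entire interval 0x00-0x7F,
    # so one "shares codes with ASCII in [0x00, 0x7F]" claim could replace
    # 128 individual code statements.
    assert all(
        chr(i).encode("latin-1") == chr(i).encode("ascii")
        for i in range(0x80)
    )
    print("Latin-1 and ASCII agree on 0x00-0x7F")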

 Support because it will be easy to search for the corresponding code point in different charsets. CC0 (talk) 18:57, 30 October 2016 (UTC)

  1. I prefer the other way round, i.e. having a property to link characters with character encodings. Some character encodings can have a great many codes, e.g. UTF-8 encodes every character you can think of.
  2. The Number datatype only supports numbers in decimal format, but many codes are written down in hexadecimal. It might be easier to use hex here too, i.e. using the string datatype. --Pasleim (talk) 11:43, 26 December 2016 (UTC)
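
Pasleim's datatype point can be made concrete: the same code is unrecognizable in decimal but conventional in hex, which is why a string value with a hex constraint fits better than a number. A small illustration (plain Python, nothing Wikidata-specific):

    # The Latin-1 code for é: a Number datatype would display decimal 233,
    # but references and standards always write it in hex as E9 (0xE9).
    code = "é".encode("latin-1")[0]
    print(code)             # 233  -- what a Number datatype would show
    print(f"0x{code:02X}")  # 0xE9 -- the conventional hexadecimal form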

Oppose I'm not sure this is necessarily a good fit for Wikidata. Unicode character (P487) already handles the mapping from characters to Unicode codepoints, although it would be far better if that were done as a number instead (e.g. a "has Unicode codepoint" property) – P487 is very user-unfriendly when it comes to non-printing characters like zero-width non-joiner (Q863569). But as far as character encodings in general go – some of them are massive (with many thousands of code points) and can have quite complex encoding rules. How would you use this property with something like UTF-EBCDIC (Q718092)? It has over 100,000 code points (the same as Unicode) but encodes them via complex rules as multi-byte strings. And does Wikidata really need to store this info? If we store the mapping of items to Unicode code points, then there are numerous software libraries (e.g. International Components for Unicode (Q823839)) which know how to map umpteen legacy character sets to/from Unicode. Stuffing that knowledge into Wikidata might work for simple 8-bit character sets like ISO/IEC 8859-1 (Q935289), but it won't work for complex multibyte character sets, which are described as much by algorithms as by tables – and it isn't Wikidata's job to store algorithms. SJK (talk) 14:20, 10 January 2017 (UTC)
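
SJK's derive-don't-store argument can be sketched in a few lines. Python's built-in codecs stand in here for a library like ICU (an assumption made for illustration, not SJK's exact tooling):

    # Given only the Unicode code point (which Wikidata would store),
    # a library can derive the byte value in any legacy character set;
    # the legacy codes themselves never need to live in Wikidata.
    codepoint = 0x00E9  # é, what a "has Unicode codepoint" property would hold
    ch = chr(codepoint)
    for charset in ("latin-1", "cp1252", "mac-roman"):
        print(charset, "->", ch.encode(charset).hex().upper())
    # latin-1 -> E9, cp1252 -> E9, mac-roman -> 8E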

@SJK: I think there is a misunderstanding here. I thought the existing property linked the concept to its character, not to the codepoint. What did I miss? Plus, there is not only Unicode at stake, and the mere existence of a property does not imply we use it exhaustively rather than case by case. This blanket opposition will mean we have nothing at all for non-Unicode charsets, hence we won't be able to store anything - other proposals have already been rejected. author TomT0m / talk page 19:36, 10 January 2017 (UTC)
@TomT0m: You want Wikidata to say that "character encoding ISO/IEC 8859-1 (Q935289) (Latin1) encodes character É (Q9995) using the single byte E9". My point is that for a simple 8-bit encoding like Latin1, this is basically an arbitrary mapping from characters to single bytes which is easily expressed in a table, but for more complex encodings, e.g. multibyte encodings, it is much harder. Do you want to say that in UTF-8 (Q193537) that character is encoded by the two bytes C3 89? Then you can't really use "Number" for that property; you need something like a hex string. But why store that, when it is easy to derive programmatically from the Unicode code point number using readily available libraries? Someone could write an external app which pulls the Unicode code point from Wikidata and then displays the encoding in various character encodings. Also, there are two other complexities to consider: (a) UTF-8 is a stateless encoding, so the same code point always encodes to the same byte sequence, but other multibyte encodings are stateful, so which byte sequence a code point encodes to depends on the current encoding state; (b) what is considered a single character from a human perspective may actually be expressed in Unicode by multiple combining characters, so how will you model that? Basically, this is a very complex area, and this proposal fails to properly take into account the complexities involved. SJK (talk) 20:39, 10 January 2017 (UTC)
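
Both complexities SJK raises can be shown in a few lines; this is a plain Python illustration of the facts cited above, not a proposed Wikidata model:

    import unicodedata

    # Multibyte encoding: É (U+00C9) becomes two bytes in UTF-8.
    print("É".encode("utf-8").hex().upper())  # C389 -- not a single "Number"

    # Combining characters: the "same" É can also be two code points,
    # E (U+0045) followed by a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", "É")
    print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0045', 'U+0301']
    print(decomposed.encode("utf-8").hex().upper())  # 45CC81 -- three bytes
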
@SJK: Well, this works for a lot of use cases; let's think later about other ways to model more complex situations like Unicode. You'll notice that Unicode has its own dedicated properties, so I don't think this is a valid argument against doing this. Let's first focus on the situations where this works - there are a lot of use cases and less complex character sets that can be modeled this way. I think we all know that Unicode is complex. So? ;) author TomT0m / talk page 08:03, 11 January 2017 (UTC)
@TomT0m: If this is going to be done, then at a minimum (1) the "has encoding" property should be on the character, not on the character set; (2) it should have type String with a regex constraint of [0-9A-F]{2}+, not type Number; (3) it should have a compulsory qualifier "in character set" of type Item pointing to the character set in which the character has this encoding. The reasons are: (1) a single encoding may contain many thousands of characters, whereas there aren't many thousands of encodings, so better one "has encoding" statement on each of a thousand characters than a thousand statements on the encoding item; (2) for proper generality, the encoding should be a byte string, not a number. For example, the Greek letter Ω (Q9890) has UTF-8 encoding CEA9 (two bytes). While you could strictly speaking display that as the number 0xCEA9 or 52905 decimal, absolutely nobody does that. So I'd suggest you might want to consider reformulating your proposal along the lines I suggest. SJK (talk) 03:18, 15 January 2017 (UTC)
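
A hedged sketch of what SJK's reformulated model could look like as a consistency check. The statement structure, field names, and the check itself are illustrative assumptions, not an actual Wikidata API; Python's codecs supply the reference encodings:

    import re

    # Hypothetical "has encoding" statements in SJK's proposed shape:
    # value = hex byte string, qualifier = the character set item.
    statements = [
        {"character": "Ω", "encoding": "CEA9", "in_character_set": "UTF-8 (Q193537)"},
        {"character": "É", "encoding": "E9",   "in_character_set": "ISO/IEC 8859-1 (Q935289)"},
    ]

    HEX_BYTES = re.compile(r"^(?:[0-9A-F]{2})+$")  # anchored form of SJK's constraint
    CODECS = {"UTF-8 (Q193537)": "utf-8", "ISO/IEC 8859-1 (Q935289)": "latin-1"}

    for s in statements:
        assert HEX_BYTES.match(s["encoding"]), "constraint violation"
        actual = s["character"].encode(CODECS[s["in_character_set"]]).hex().upper()
        assert actual == s["encoding"], f"{s['character']}: expected {actual}"
    print("all statements consistent")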


 Not done No support.--Micru (talk) 07:44, 24 April 2017 (UTC)