Wikidata:Property proposal/has code
has code
Originally proposed at Wikidata:Property proposal/Generic
Description | code for |
---|---|
Data type | Number (not available yet) |
Domain | character encoding (Q184759) |
Allowed units | none |
Example | ; |
See also | Ascii code (For example: check the other properties in order to be consistent, collect data, automate an external link, etc.) |
Support as this is more generic. CC0 (talk) 18:51, 30 October 2016 (UTC)
- Comment My gut feeling is that this would be better modelled as a claim on the item about the Unicode character, e.g. . Thryduulf (talk) 21:18, 16 November 2016 (UTC)
- Oppose A Unicode character encoding like UTF-8, UTF-16, UTF-32 has over 100,000 assigned characters. Adding that many claims to an item is nonsensical (and is probably going to break the Wikidata software.) This proposal only really makes sense when you consider small character encodings like Latin1, but if the approach doesn't work for Unicode I don't think it is worth proceeding with. SJK (talk) 14:04, 10 January 2017 (UTC)
- Oppose in the current suggested form as code (P3295)/encoding (P3294) and Unicode character (P487) are already being used on items which are instance of (P31) letter (Q9788) and this appears to be the saner approach as per @SJK:'s comment. Dhx1 (talk) 13:05, 8 February 2017 (UTC)
Not done No support.--Micru (talk) 07:41, 24 April 2017 (UTC)
encodes
Originally proposed at Wikidata:Property proposal/Generic
Description | qualifier to link a code to the item of the character it encodes, if relevant. |
---|---|
Data type | Number (not available yet) |
Domain | character encoding (Q184759) |
Allowed units | none |
Example | see above example |
- Motivation
Compared to the other attempt, this property is restricted in domain and works in the reverse sense: from the character set encoding to the character, not from the character to the code. This means a letter can have as many codes as we want without being polluted by as many statements as there are character sets. There is also a qualifier to link to the item of the character, if relevant. author TomT0m / talk page 20:26, 12 October 2016 (UTC)
- Discussion
Comment I imagine that the character set properties would get quite polluted themselves, most especially for a set like GB 2312 (Q1421973), Big5 (Q858372), Shift JIS (Q286345) or KS X 1001 (Q489423). As for pollution of the items for the characters themselves--if anything, numerous codepages may end up sharing a codepoint for a single character, so the main risk is just a lot of qualifiers on a single code. When Wikidata-Wiktionary integration goes live, it may become much more beneficial to add character codes to items (especially Chinese characters) rather than to character sets. Mahir256 (talk) 01:57, 15 October 2016 (UTC)
- Good point. Maybe another relationship should be used to minimize redundancy, like "shares/duplicates codes with/of", qualified by intervals of shared codes. Many European ISO standard character sets share a lot with ASCII; that would avoid a lot of claims if we avoid duplicating all the codes shared with ASCII. Do you know if there is a similar situation for East Asian character sets? Arabic and Cyrillic character sets should be in a similar situation, as there are about as many letters in the Arabic alphabet as in the Latin one. Japan and China share a lot of characters ... author TomT0m / talk page 08:28, 15 October 2016 (UTC)
Support because it will be easy to search for the corresponding code point in different charsets. CC0 (talk) 18:57, 30 October 2016 (UTC)
- I prefer the other way round, i.e. having a property to link characters with character encodings. Some character encoding sets can have many codes, e.g. UTF-8 encodes for all characters you can think of.
- The Number datatype only supports numbers in decimal format; however, many codes are conventionally written in hexadecimal. It might be easier to use hex here too, i.e. using the string datatype. --Pasleim (talk) 11:43, 26 December 2016 (UTC)
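Pasleim's decimal-versus-hexadecimal point can be illustrated with a short Python sketch (an editorial aside, not part of the original discussion): the same code point is a plain integer internally, but is conventionally written in the hexadecimal U+XXXX form.

```python
# A code point is stored as a plain (decimal) integer, but character
# codes are conventionally written in hexadecimal, e.g. U+03A9.
cp = ord("Ω")            # the code point of Greek capital omega, 937 in decimal
as_hex = f"U+{cp:04X}"   # the conventional hexadecimal notation

print(cp)       # 937
print(as_hex)   # U+03A9
```

A Number-typed property would only ever show the 937 form, which is why the string datatype was suggested.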
Oppose I'm not sure this is necessarily a good fit for Wikidata. Unicode character (P487) already handles the mapping from characters to Unicode codepoints, although it would be far better if that were done as a number instead (e.g. a "has Unicode codepoint" property) – P487 is very user-unfriendly when it comes to non-printing characters like zero-width non-joiner (Q863569). But, as far as character encodings in general go – some of them are massive (with many thousands of code points) and can have quite complex encoding rules. How would you use this property with something like UTF-EBCDIC (Q718092)? It has over 100,000 code points (the same as Unicode), but it encodes them via complex rules as multi-byte strings. And does Wikidata really need to store this info? If we store the mapping of items to Unicode code points, then there are numerous software libraries (e.g. International Components for Unicode (Q823839)) which know how to map umpteen legacy character sets to/from Unicode. Stuffing that knowledge into Wikidata might work for simple 8-bit character sets like ISO/IEC 8859-1 (Q935289), but it won't work with complex multibyte character sets which are described as much by algorithms as by tables – it isn't Wikidata's job to store algorithms. SJK (talk) 14:20, 10 January 2017 (UTC)
- @SJK: I think there is a misunderstanding here. I thought the existing property was to link the concept to its character, not to the code point. What did I miss? Plus, there is not only Unicode at stake, and the mere existence of a property does not imply we use it exhaustively rather than case by case. This blanket opposition will mean we have nothing at all for non-Unicode charsets, hence we won't be able to store anything - other proposals have already been rejected. author TomT0m / talk page 19:36, 10 January 2017 (UTC)
- @TomT0m: You want Wikidata to say that "character encoding ISO/IEC 8859-1 (Q935289) (Latin1) encodes character É (Q9995) using the single byte C9". My point is that for simple 8-bit encodings like Latin1, this is basically an arbitrary mapping from characters to single bytes, which is easily expressed in a table. But for more complex encodings, e.g. multibyte encodings, it is much more complicated. Do you want to say that in UTF-8 (Q193537) that character is encoded by the two bytes C3 89? Well, then you can't really use "Number" for that property; you need something like a hex string. But why store that, when it is easy to programmatically derive it from the Unicode code point number using readily available libraries? Someone could write an external app which pulls the Unicode code point from Wikidata and then displays the encoding in various character encodings. Also, there are two other complexities to consider: (a) UTF-8 is a stateless encoding, so the same code point always encodes to the same byte sequence; but other multibyte encodings are stateful, so which byte sequence a code point encodes to depends on the current encoding state; (b) what is considered a single character from a human perspective may actually be expressed in Unicode by multiple combining characters, so how will you model that? Basically, I think this is a very complex area, and this proposal fails to properly take into account the complexities involved. SJK (talk) 20:39, 10 January 2017 (UTC)
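SJK's argument that per-charset byte sequences are derivable from the Unicode code point can be sketched with Python's built-in codecs (an editorial illustration, not part of the original thread; the discussion mentions ICU, but the standard library makes the same point):

```python
# Given only the Unicode character (equivalently, its code point),
# each legacy encoding's byte sequence can be computed on demand,
# so it arguably need not be stored as Wikidata statements.
char = "\u00c9"  # É, Unicode code point U+00C9

print(char.encode("latin-1").hex().upper())  # C9   -- single byte in Latin1
print(char.encode("utf-8").hex().upper())    # C389 -- two bytes in UTF-8
```

The same one-liner works for any of the dozens of legacy charsets Python's codec registry knows about, which is the crux of the "don't store what you can derive" position.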
- @SJK: Well, this works for a lot of use cases; let's think of other ways to model more complex situations like Unicode later. You'll notice that Unicode has its own dedicated properties, so I don't think this is a valid argument against the proposal. Let's first focus on the situations where this works - there are a lot of use cases and less complex character sets that can be modeled this way. I think we all know that Unicode is complex. So? ;) author TomT0m / talk page 08:03, 11 January 2017 (UTC)
- @TomT0m: If this is going to be done, I think at a minimum (1) the "has encoding" property should be on the character, not on the character set; (2) it should have type String with a regex constraint of ([0-9A-F]{2})+, not type Number; (3) it should have a compulsory qualifier "in character set" of type Item pointing to the character set in which this character has this encoding. The reasons are: (1) a single encoding may contain many thousands of characters, whereas there aren't many thousands of encodings, so better one "has encoding" property on each of a thousand characters than 1000 properties on the encoding item; (2) for proper generality, the encoding should be a byte string, not a number. For example, the Greek letter Ω (Q9890) has UTF-8 encoding CEA9 (two bytes). While you could strictly speaking display that as the number 0xCEA9 or 52905 decimal, absolutely nobody does that. So, I'd suggest you might want to consider reformulating your proposal along the lines I suggest. SJK (talk) 03:18, 15 January 2017 (UTC)
- Oppose with the suggested domain of character encoding (Q184759). If however the domain of this property proposal is changed to natural number (Q21199) then I think a case may exist for this property proposal (perhaps with another name?). Dhx1 (talk) 13:07, 8 February 2017 (UTC)
- @Dhx1: I am unconvinced that natural number (Q21199) is the right domain. Take for example em dash (Q10941604), which has Unicode codepoint U+2014 and UTF-8 encoding E28094. Are you saying that the item for 8212 (which is decimal for hex 2014) should have a property pointing to em dash (Q10941604)? Or, similarly, a new item to represent the natural number 14844052 (which is decimal for hex E28094) with a property pointing to em dash (Q10941604)? I think the domain should be character (Q32483) or character (Q3241972), not natural number. Also, the property data type should be string rather than number or item. This is because, while it might be strictly speaking mathematically correct that E28094 is equivalent to 14844052, no one ever writes UTF-8 that way. E28094 is not a single number; it is actually a sequence of three 8-bit numbers. Even though mathematically any sequence of integers is equivalent to one single integer, humans don't usually think that way outside of certain limited mathematical contexts (such as Gödel numbering (Q1451046)). SJK (talk) 21:11, 8 February 2017 (UTC)
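SJK's distinction between a byte sequence and the single integer it could mathematically be read as can be checked in a few lines of Python (an editorial sketch, not part of the original discussion):

```python
# The em dash (U+2014) encodes in UTF-8 as three separate 8-bit values,
# E2 80 94, not as the single integer those bytes happen to spell.
seq = "\u2014".encode("utf-8")

print(seq.hex().upper())            # E28094 -- a sequence of three bytes
print(list(seq))                    # [226, 128, 148] -- the three 8-bit numbers
print(int.from_bytes(seq, "big"))   # 14844052 -- the equivalent single integer
```

The last value is well defined but meaningless as an identifier, which is the argument for a string datatype over number or item.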
Not done No support.--Micru (talk) 07:44, 24 April 2017 (UTC)