Wikidata:Property proposal/magic numbers
File format magic numbers
Description | magic numbers used to incorporate file format metadata, in the form of a string-coded hexadecimal number (usual encoding, "0" = 0 and "F" = 15, spaces ignored). Qualifiers can specify an offset and a padding value for this number. |
---|---|
Data type | String |
Template parameter | Template:Infobox file format (Q10986167) magic number parameter |
Domain | file format (Q235557) |
Example | GIF (Q2192) -> 47 49 46 38 39 61 |
Source | Gary Kessler's File Signatures Table |
Planned use | I plan to add magic numbers to Wikidata items for the corresponding file formats. |
- Motivation
Magic numbers are constant numerical or text values used to identify file formats. Having this data in Wikidata will help make file format information more complete. Magic numbers are part of how we verify file signatures and are used in forensic computing. This is also a parameter of the Infobox:File format. It will be possible to transfer all of the magic numbers stored in infoboxes to Wikidata if we create this property. There is also this list [List of file signatures] that we could transfer to Wikidata. YULdigitalpreservation (talk) 15:43, 17 October 2016 (UTC)
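As a minimal sketch of how a value in the proposed format could be consumed, the following Python snippet converts a space-separated hexadecimal string into bytes and checks whether some data starts with it. The function names `parse_magic` and `starts_with_magic` are hypothetical and not part of the proposal; the only input taken from the proposal is the GIF example value.

```python
def parse_magic(hex_string: str) -> bytes:
    """Convert a string-coded hexadecimal value such as
    '47 49 46 38 39 61' into raw bytes. Spaces are ignored,
    as the property description states."""
    return bytes.fromhex(hex_string.replace(" ", ""))


def starts_with_magic(data: bytes, hex_string: str) -> bool:
    """Return True if the data begins with the given magic number."""
    return data.startswith(parse_magic(hex_string))


# Example value from the proposal: GIF (Q2192)
GIF_MAGIC = "47 49 46 38 39 61"
print(parse_magic(GIF_MAGIC))  # b'GIF89a'
```

The same helper also covers signatures whose bytes happen to be printable text, since the hexadecimal form and the ASCII form decode to the same bytes.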
Qualifier
(Additions to the proposal by TomT0m)
Offset
Description | qualifier of "magic number" giving the number of bytes that precede the magic number in a file |
---|---|
Data type | Number (not available yet) |
Example | Modelling the format "RVT": hex 00 00 00 00 00 00 00 00 at [512 (0x200) byte offset], ASCII ........ at [512 (0x200) byte offset]; RVT, Revit Project File subheader |
Source | Gary Kessler's File Signatures Table |
Planned use | qualifier for the property above |
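A rough sketch of how the offset qualifier might be applied when checking a file, assuming the offset counts bytes from the start of the file (zero-based). The function name `matches_at_offset` and the synthetic test data are hypothetical; only the eight zero bytes at offset 512 come from the RVT example above.

```python
def matches_at_offset(data: bytes, hex_string: str, offset: int = 0) -> bool:
    """Return True if the magic number appears exactly `offset`
    bytes into the data. Data that is too short never matches."""
    magic = bytes.fromhex(hex_string.replace(" ", ""))
    return data[offset:offset + len(magic)] == magic


# Synthetic data for illustration only: 512 filler bytes, then the
# eight zero bytes listed for the RVT subheader.
RVT_MAGIC = "00 00 00 00 00 00 00 00"
fake_file = b"\xff" * 512 + b"\x00" * 8 + b"rest of file"

print(matches_at_offset(fake_file, RVT_MAGIC, offset=512))  # True
print(matches_at_offset(fake_file, RVT_MAGIC, offset=0))    # False
```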
Talk
- Support Wait. author TomT0m / talk page 17:31, 17 October 2016 (UTC)
- Support but there are several other kinds of "magic numbers" so I think the name needs to be more descriptive - maybe "file format magic numbers"? Also, with string value isn't there some room for ambiguity in how the numbers are to be represented here? ArthurPSmith (talk) 18:18, 17 October 2016 (UTC)
- Good point. If the numbers are string-coded hexadecimal, this should be made explicit. The spaces also seem totally irrelevant and add a burden to parsing. I can also see that the spec specifies offsets: [11 byte offset] and [512 (0x200) byte offset]. This could be handled better than with an unspecified string format in a structured data project. Also consider whether the string could encode the non-hex version directly, for example in
46 41 58 43 4F 56 45 52 FAXCOVER 2D 56 45 52 -VER
- it should be possible to store "FAXCOVER-VER" directly, which is more efficient, maybe with an offset as a qualifier,
and maybe a "padding value" also as a qualifier, something like that. author TomT0m / talk page 18:32, 17 October 2016 (UTC)
- Comment Thanks for this feedback. I revised the label for the property proposal. It looks like we will need a hexadecimal option as well as an ASCII option. I welcome suggestions on how to further refine the proposal. YULdigitalpreservation (talk) 13:45, 18 October 2016 (UTC)
- Support It will be very useful for data regarding file type identification. CC0 (talk) 11:45, 28 October 2016 (UTC)
- Comment Could this property be specified to contain values which are Perl Compatible Regular Expressions (PCRE), allowing for more advanced signatures to be specified if desired? For example, "\x89PNG\x0D\x0A\x1A\x0A" for the PNG family, "\x00\x01\x00\x00Standard Jet DB" for Microsoft Access MDB, "GIF8[79]a" for the GIF family, etc. The advantages are: for ASCII-only-signatures (GIF), it's human-readable. For signatures containing binary/non-ASCII data (PNG), it's in a readily usable format (C/C++ strings for example) and for optionally complex signatures, it's in a format ready to use with a PCRE compliant parser. Pixeldomain (talk) 02:44, 17 November 2016 (UTC)
- Comment The offset could be identified in the PCRE expression, as an example: "(?s)^\x00\x01\x02.{38}ANSWERTOEVERYTHING" would look from the start of the file for \x00\x01\x02 then skip 38 bytes to offset 42 in the file where it would look for "ANSWERTOEVERYTHING". More advanced expressions could look at bytes from the end of the file (ZIP archives have a central directory tacked on the end of the file), perform negative look-aheads, etc. Whilst there is extra complexity with PCRE, it does not have to be used, and the fall-back is a simple C/C++ string representing binary data. Pixeldomain (talk) 03:09, 17 November 2016 (UTC)
- Comment Also worth taking a look at is how the magic file of the "file" command stores file type signatures: https://github.com/file/file/tree/master/magic/Magdir Pixeldomain (talk) 03:32, 17 November 2016 (UTC)
- Comment Also take a look at the FIDO PRONOM database at https://raw.githubusercontent.com/openpreserve/fido/af3fc47791855ad7b955eb4272411113bfcff54d/fido/conf/formats-v88.xml which uses PCRE to define signatures for each file type. Pixeldomain (talk) 04:04, 17 November 2016 (UTC)
- @Pixeldmain, cc0, YULdigitalpreservation, TomT0m, ArthurPSmith: what is the status of this proposal? Thryduulf (talk) 16:32, 22 April 2017 (UTC)
- @Pixeldomain, CC0: (fixing pings) - obviously there was some debate here about the string format for this property. Of the proposals for format, I think the PCRE idea has a lot of merit. But I'd be ok with the original space-separated hexadecimal pairs too. No strong preference. ArthurPSmith (talk) 13:24, 24 April 2017 (UTC)
- Comment @ArthurPSmith: My current view is that magic numbers or patterns are not a good property for a file format. See use of described at URL (P973) on GIF (Q2192) for an example of an alternative approach I prefer for the identification and description of file formats. Pixeldomain (talk) 01:31, 26 April 2017 (UTC)
- that means relying on a third party to provide the actual details of the format, but in some cases that may be all we have, so it's at least a good option to use. I still think a wikidata property specifically for something like this is useful though. ArthurPSmith (talk) 13:37, 26 April 2017 (UTC)
- An alternative I have also considered is detailing the structure of file formats on Wikidata by creating items for each data structure and field within each format. This moves the data from external sources into Wikidata, whilst allowing external references and sources (typically international standards, RFCs, etc.) to be used to describe each new item. Do you have any thoughts on this possible approach? Pixeldomain (talk) 00:58, 27 April 2017 (UTC)
- I'd probably want to see an example? Would there be several additional properties needed to do that, or can you make do with existing properties? ArthurPSmith (talk) 19:25, 27 April 2017 (UTC)
WikiProject Informatics has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead. Are there additional opinions about whether we should implement this property? ChristianKl (talk) 20:32, 24 May 2017 (UTC)
- Support This property is really needed for file formats. Looking back at the discussion, the best approach seems to be using PCRE. One last aspect may be adding a qualifier to attach a weight or a probability to the proposed PCRE. This is a common practice in implementing format identification from magic numbers. Toto256 (talk) 20:51, 24 May 2017 (UTC)
- Support I agree, this is extremely helpful. Using PCRE seems to be a helpful solution (the "offset" property would not be necessary, in this case) but should be explained to contributors who could be surprised by strings like "^PDF". --Dipsode87 (talk) 12:52, 31 July 2017 (UTC)
- Support Mahir256 (talk) 16:16, 2 August 2017 (UTC)
- @Mahir256, Toto256, ArthurPSmith, Pixeldomain, YULdigitalpreservation, TomT0m: Done Created as file format identification pattern (P4152) and offset (P4153). ChristianKl (talk) 17:40, 6 August 2017 (UTC)
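The regular-expression signatures quoted in the discussion above can be tried out with a short sketch like the one below. It uses Python's built-in `re` module on bytes, which is not a PCRE engine; the assumption is that these particular patterns use only syntax the two share. The names `PATTERNS` and `identify` are hypothetical, and anchoring every pattern at the start of the data via `match` is also an assumption, since the discussion only writes `^` explicitly in one example.

```python
import re
from typing import List

# Patterns copied from the discussion; the labels are illustrative only.
PATTERNS = {
    "PNG family": rb"\x89PNG\x0D\x0A\x1A\x0A",
    "Microsoft Access MDB": rb"\x00\x01\x00\x00Standard Jet DB",
    "GIF family": rb"GIF8[79]a",
    # Three literal bytes, then any 38 bytes, then the literal text.
    # 3 + 38 = 41, so the text starts at zero-based offset 41
    # (the 42nd byte). (?s) lets "." match newline bytes too.
    "offset example": rb"(?s)^\x00\x01\x02.{38}ANSWERTOEVERYTHING",
}

COMPILED = {name: re.compile(pattern) for name, pattern in PATTERNS.items()}


def identify(data: bytes) -> List[str]:
    """Return the labels of all patterns matching at the start of the data."""
    return [name for name, regex in COMPILED.items() if regex.match(data)]


print(identify(b"GIF89a\x00\x00"))  # ['GIF family']
print(identify(b"GIF88a"))          # []
```

One design consequence visible here is that a single pattern such as `GIF8[79]a` covers both GIF87a and GIF89a, where the plain hexadecimal form would need two separate values.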