Wikidata:Property proposal/ontology prefix


identifiers.org prefix

Originally proposed at Wikidata:Property proposal/Property metadata

Description: code used in front of an identifier, as supported by identifiers.org and n2t.net
Represents: Identifiers.org (Q16335166)
Data type: String
Domain: Wikidata property for authority control (Q18614948) and unique identifier (Q6545185)
Allowed values: [A-Za-z][_A-Za-z0-9]+
Example:
Planned use: add for the identifiers supported by identifiers.org which have Wikidata properties and/or items
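As a rough illustration of the allowed-values pattern above, the following Python sketch checks candidate prefixes against it. The helper name `is_valid_prefix` and the sample strings are only illustrative assumptions; the pattern itself is copied from the proposal.

```python
import re

# Pattern copied from the "Allowed values" field of the proposal.
ALLOWED_PREFIX = re.compile(r"[A-Za-z][_A-Za-z0-9]+")


def is_valid_prefix(value: str) -> bool:
    """Return True if the whole string matches the allowed-values pattern."""
    return ALLOWED_PREFIX.fullmatch(value) is not None


if __name__ == "__main__":
    # Sample strings chosen for illustration only.
    for candidate in ["FMA", "UBERON", "viaf", "F", "9abc", "FMA:"]:
        print(candidate, is_valid_prefix(candidate))
```

Note that the pattern requires at least two characters (a letter followed by one or more letters, digits or underscores), so a single-letter prefix would be rejected by a format constraint built from it.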
Motivation

Our default way to store identifier information in Wikidata is to omit the ontology prefix in our data while sometimes elsewhere the prefix is used inside the identifier. It would be valuable for us to store the prefix. This would allow building a search engine where entering "FMA50801" would allow the user to find brain (Q1073).

The name of this property corresponds to the usage on http://www.ontobee.org/ ChristianKl (talk) 14:03, 19 December 2017 (UTC)

Discussion

WikiProject Properties has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead. ChristianKl (talk) 14:03, 19 December 2017 (UTC)

  •  Support as a property for properties (not a qualifier) as proposed. ArthurPSmith (talk) 20:51, 19 December 2017 (UTC)
    •  Comment This may be something specific to ontobee.org - "prefix" has a specific meaning within SPARQL, and I think what they are doing there is defining a list of labels that will work within a SPARQL query to define a URI as, for example, "UBERON:<ID>". If you look at the UBERON formatter URL, the actual prefix for the term in their URIs is "UBERON_", i.e. it looks like "http..../UBERON_<ID>". But that's definitely not consistent across ontologies; the formatter URL for FMA, for instance, looks quite different. So I'm wondering how you envision this actually working, either within Wikidata or outside it... Yes, this prefix value could be used to query things on ontobee.org, but is it actually useful elsewhere? ArthurPSmith (talk) 20:58, 19 December 2017 (UTC)
  • On the BFO website there's such a prefix listed for most controlled ontologies in brackets after the name. I think it's a meaningful element of a lot of controlled vocabularies. In different instances there's a "_", ":", " " or nothing between the prefix and the ID, but that's not a problem. A search engine could allow all ways of searching. ChristianKl (talk) 15:24, 21 December 2017 (UTC)
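A minimal Python sketch of the separator-tolerant search input described in this comment, assuming the set of known prefixes is available to the search tool. `KNOWN_PREFIXES` and `split_prefixed_id` are hypothetical names; in practice the prefix table would presumably be read from the proposed property rather than hardcoded.

```python
import re
from typing import Optional, Tuple

# Hypothetical prefix table, hardcoded here only for illustration.
KNOWN_PREFIXES = {"FMA", "UBERON"}


def split_prefixed_id(text: str) -> Optional[Tuple[str, str]]:
    """Split 'FMA50801', 'FMA:50801', 'FMA_50801' or 'FMA 50801'
    into (prefix, local ID); return None if no known prefix matches."""
    cleaned = text.strip()
    # Try longer prefixes first so a short prefix cannot shadow a longer one.
    for prefix in sorted(KNOWN_PREFIXES, key=len, reverse=True):
        pattern = re.escape(prefix) + r"[_: ]?(\S+)"
        match = re.fullmatch(pattern, cleaned, flags=re.IGNORECASE)
        if match:
            return prefix, match.group(1)
    return None


if __name__ == "__main__":
    for query in ["FMA50801", "FMA:50801", "FMA_50801", "FMA 50801", "50801"]:
        print(repr(query), split_prefixed_id(query))
```

The local ID returned here could then be matched against the values stored for the corresponding external-ID property. The sketch only works for prefixes it already knows about, and a string that merely happens to start with a known prefix would be split as well.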
Ok, but then I don't see how a search engine would know to turn "FMA50801" into "FMA:50801" or whatever. I think this property might be useful if it informed OUR SPARQL engine (WDQS), allowing these prefixes to be used along with our existing standard prefixes; however, even then I'm not sure how it would work or if it would actually be useful. I think we need some input from our search engine folks here: @Matěj Suchánek, Smalyshev (WMF), Lucas Werkmeister (WMDE): do you have any opinions on this proposal? ArthurPSmith (talk) 16:42, 21 December 2017 (UTC)
This is unrelated to WDQS, but personally, I’m deeply sceptical of this ontology prefix idea. Let’s take our Douglas Adams (Q42) as an example – the full entity URI, or concept URI, is http://www.wikidata.org/entity/Q42, which is composed of the prefix (http://www.wikidata.org/entity/, abbreviated wd: in WDQS and in our RDF exports) and the entity ID (Q42). Further splitting this up into an “ontology prefix” of just “Q” and a numeric ID of 42 would be incorrect: the Wikidata entity ID is Q42, not 42. (We also have entities whose IDs don’t start with “Q” – properties, and soon lexemes.) Going back to the two examples given here – what is the actual FMA ID of brain (Q1073), “50801” or “FMA50801”? If it’s “FMA50801”, then we should be storing that ID, even though it’s partly redundant. If it’s “50801”, then what is “FMA50801”? Is it just a loose convention that FMA IDs, when used in an ambiguous context, are written like that to make it clear that they’re FMA IDs? --Lucas Werkmeister (WMDE) (talk) 18:24, 21 December 2017 (UTC)
Not a search engine guy, hence no opinion. Matěj Suchánek (talk) 19:33, 21 December 2017 (UTC)
I am neutral on this. I do not see it to be very useful for WDQS and other search purposes, for the same reason Lucas already stated above - the ID is Q42 or tt1954347 or whatever it is, and from whichever source you'd be getting it, better chances are you will be getting the whole ID, not just the numeric part of it. Also, since WDQS stores the whole string, that would be the efficient way to search it, and search by substring would be much less efficient. Same would (and I think, should) happen with any other index - it would store whatever the Wikidata database stores, not split in parts.
That said, if the common usage for stating those IDs is "FMA50801" - say, in the literature or other databases - while brain (Q1073) has just "50801" as Foundational Model of Anatomy ID (P1402), then it may indeed be useful to record that common prefix, and I see no issue with it. I also agree with Lucas that we should follow the nomenclature of the issuing authority and common usage - so if the proper FMA ID is "FMA50801", that's what we should have been using in the first place. That of course should be decided by the domain experts. Smalyshev (WMF) (talk) 22:40, 21 December 2017 (UTC)
Regardless of whether the literature or other databases refer to the entity as FMA50801 or FMA:50801, using that as the value for the property would break the formatter URL. From Googling, it seems that referring to it from external databases is most commonly done with FMA:50801. Given that `:` is already syntax to search specifically the Q and P namespaces, it would be consistent to expand it to other namespaces, like the namespaces implied by external-id properties, as well. ChristianKl (talk) 01:39, 25 December 2017 (UTC)
  •  Oppose I've struck out my supporting comment above - I think a lot more research is needed on how to do this properly, with some real examples of how it would help. I don't think this is at all ready at this point to have a property. What would the property be for an identifier not listed on http://www.ontobee.org/? What would it be for an identifier that's listed there but where we've included some part of the ID in the Wikidata entry for the value? Or other variants along those lines? We just don't know enough about this now to see if it would be helpful at all. ArthurPSmith (talk) 19:38, 22 December 2017 (UTC)
  • I have found myself in situations where I wanted to search for items that have particular external ID in the past. That happened with GND, VIAF and also FMA.
I don't think it's something particular to Ontobee to focus on those letters as being a prefix. The same letters are listed for the same databases on http://ifomis.uni-saarland.de/bfo/users . ChristianKl (talk) 01:39, 25 December 2017 (UTC)
@ChristianKl: You were hoping for this to work like the P: search does now in the Wikidata search tools? I think we need some indication from Smalyshev that they could do that. Meanwhile, are you aware of the Wikidata Resolver? I think that's from Magnus - if you're thinking we could enhance that instead (so "VIAF:xxx" isn't one of the few supported prefixes) we should probably consult him. In either case, this would end up being a Wikidata-internal prefix for most properties (not listed on the sites you mention) but I suppose if those are the actual use cases that could work. ArthurPSmith (talk) 14:47, 26 December 2017 (UTC)
Magnus's resolver could also benefit from the data being available. Currently, VIAF etc. have to be hardcoded in the tool. I think we shouldn't just give all external-ID properties this property, but only those where there's something like an official prefix. Most databases that are intended to be controlled vocabularies have that, but when we just use random numbers in a website URL it's a different matter.
In addition to the data being used automatically, I think it's also data that's generally valuable to store. ChristianKl (talk) 17:27, 26 December 2017 (UTC)
 Oppose "commonly used" depends on context. If you want the same prefixes as listed at Ontobee, I'd use this approach: https://www.wikidata.org/w/index.php?title=Q7876491&type=revision&diff=616633535&oldid=613107280 with catalog code (P528). -- JakobVoss (talk) 20:53, 4 January 2018 (UTC)
@JakobVoss: I like that use of catalog code (P528), I think that is a good way to document these abbreviations without assuming they are a standard. ArthurPSmith (talk) 16:52, 5 January 2018 (UTC)

prefix.cc

How does this relate to the prefixes listed at http://prefix.cc? I'd expect a prefix to be an abbreviation of the URI namespace given with formatter URI for RDF resource (P1921). For instance, VIAF ID (P214) has formatter URI for RDF resource (P1921) "http://viaf.org/viaf/$1", so the namespace is "http://viaf.org/viaf/", which has the prefix "viaf" as listed at http://prefix.cc/viaf. Please note that prefixes are not unique! -- JakobVoss (talk) 20:41, 4 January 2018 (UTC)
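The relationship between a formatter URI, its namespace and a prefix can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the local ID "12345" is a placeholder rather than a real VIAF record.

```python
def namespace_from_formatter(formatter_uri: str) -> str:
    """Return the namespace of a formatter URI that ends with the '$1' placeholder."""
    if not formatter_uri.endswith("$1"):
        raise ValueError("formatter URI does not end with $1: " + formatter_uri)
    return formatter_uri[: -len("$1")]


def expand(compact_id: str, namespaces: dict) -> str:
    """Expand 'prefix:local' into a full URI using a prefix-to-namespace table."""
    prefix, separator, local_id = compact_id.partition(":")
    if not separator or prefix not in namespaces:
        raise KeyError("unknown or missing prefix in " + repr(compact_id))
    return namespaces[prefix] + local_id


# Formatter URI taken from the comment above; the table itself is hypothetical.
NAMESPACES = {"viaf": namespace_from_formatter("http://viaf.org/viaf/$1")}

if __name__ == "__main__":
    # "12345" is a placeholder local ID, not a real record.
    print(expand("viaf:12345", NAMESPACES))
```

Because prefixes are not unique, a table like `NAMESPACES` is only meaningful relative to one registry or context.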

  •  Support Ok, back to supporting this. See also Compact URI Syntax from W3C although that is a general format while this is an approach for standardizing prefixes. @JakobVoss: I think this addresses your concern above by being very specific about the context (i.e. what is supported by identifiers.org and n2t.net - I've updated the description to be explicit about this also). ArthurPSmith (talk) 19:00, 25 January 2018 (UTC)

I am still not convinced; the current name and description do not document the intended usage. If this property is going to be used on properties only, it should have a name that does not look like it can be used for any kind of ontology. Furthermore, it is not clear whether values should be unique and/or repeatable: what are the property constraints? As far as I understand, this property is a counterpart of formatter URL (P1630), isn't it? If a formatter URL ends with $1, the prefix stands for the formatter URL without $1, so maybe some name like "URI prefix"? More examples might also help. -- JakobVoss (talk) 10:21, 26 January 2018 (UTC)

@JakobVoss: I tweaked the documentation a bit. I don't think "URI prefix" is appropriate as a name - these are not URIs (even though they correspond to URIs) - the W3C reference I quoted above says "CURIEs and SafeCURIEs map to IRIs, but neither a CURIE nor a Safe_CURIE is an IRI or URI." I feel this property could be applied either to properties (P namespace) or to items (Q namespace) that are specifically about the identifier. We probably want single-value and uniqueness constraints as well as the format constraint based on the allowed values; there isn't a place in the property documentation template for anything except the allowed values constraint. ArthurPSmith (talk) 17:13, 26 January 2018 (UTC)
If this property is also to be applied to items, then the domain is Wikidata property for authority control (Q18614948) and unique identifier (Q6545185); I changed the proposal and added an example accordingly. However, I object to the single-value and uniqueness constraints because prefixes depend on context. Such constraints can only be established with a given authority that decides on official prefixes. If identifiers.org and n2t.net are going to be the authority, name it "n2t identifier prefix"; otherwise prefixes may be defined differently outside of n2t. -- JakobVoss (talk) 19:24, 28 January 2018 (UTC)
@JakobVoss: That was my point, that we target this property for those specific authorities. Since isni is not on their list I replaced your example with another that is from the list (also which we don't have a property for yet). The phrasing n2t.net uses to describe the prefixes is as follows:
These are examples of "compact identifiers", a term that arose from a cooperative agreement between the Identifiers.org resolver and N2T.net to serve a common set of over 600 identifier schemes (or prefixes).
I don't think "n2t identifier prefix" is the right way to label this, as it's based on cooperation between both sites, and also n2t does a lot of other things besides resolving these prefixed identifiers. "identifiers.org prefix" might be ok as that site is focused on just those IDs, but it doesn't reflect that the consensus is broader. Maybe "common identifier prefix"? ArthurPSmith (talk) 19:55, 29 January 2018 (UTC)
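For illustration, a compact identifier of this kind can be turned into resolver URLs roughly as follows. This Python sketch assumes that both resolvers accept the form `<base URL><prefix>:<local ID>`; the function names are hypothetical, and the prefix "taxonomy" with local ID "9606" is an example chosen here, not one taken from this discussion.

```python
from typing import List

# Assumed resolver base URLs; both sites are named in the proposal,
# but the exact URL form is an assumption of this sketch.
RESOLVERS = ("https://identifiers.org/", "https://n2t.net/")


def compact_identifier(prefix: str, local_id: str) -> str:
    """Join a prefix and a local ID into a 'prefix:local' compact identifier."""
    return f"{prefix}:{local_id}"


def resolver_urls(prefix: str, local_id: str) -> List[str]:
    """Build one candidate URL per resolver for the given compact identifier."""
    compact = compact_identifier(prefix, local_id)
    return [base + compact for base in RESOLVERS]


if __name__ == "__main__":
    # Example values chosen for illustration only.
    for url in resolver_urls("taxonomy", "9606"):
        print(url)
```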
@ArthurPSmith: now I got the point! Could you please create a "subject item" for this property with statements about the registry? I don't think we need the property but if we have it, it should be clear what prefix registry it is about ("identifiers.org prefix" looks good). -- JakobVoss (talk) 21:14, 29 January 2018 (UTC)
Turns out we already had a subject item (but it was under the old label of MIRIAM Registry). I've updated the label and documentation here, thanks for the suggestions! ArthurPSmith (talk) 21:47, 29 January 2018 (UTC)