Wikidata:Property proposal/broader concept/archive

Archived discussion of the proposal in its original form (i.e. when it was being proposed as a top-line property, rather than a qualifier).

Discussion

  • Proposed. Jheald (talk) 14:22, 2 February 2018 (UTC)[reply]
  •  Support I thought this had come up before as a proposal, but I couldn't find a link. I've previously been wary of such properties as I think we cover it reasonably well with P279, P31, P361, etc. but based on your argument that this is a way to better represent a statement in a reliable source that isn't more specific on the type of relation, I think there is a real use for this. ArthurPSmith (talk) 16:35, 2 February 2018 (UTC)[reply]
  •  Oppose I'm not seeing an example where the existing properties do not suffice. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:35, 3 February 2018 (UTC)[reply]
    • @Pigsonthewing: I'm sorry Andy, but how would you do this given existing properties? Jheald (talk) 01:16, 4 February 2018 (UTC)[reply]
      • folk tale (Q1221280) - "traditional story that is passed down orally" - is clearly a subclass of oral literature (Q986539). Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 18:40, 4 February 2018 (UTC)[reply]
        • @Pigsonthewing: But, as noted above, not every "broader term" relationship will be translated into subclass of (P279). Some may be P31, some P1269, some P361, some may not have any direct link at all. Now suppose somebody wants all the Q-numbers for all the entries in a particular part of the thesaurus. With this property, it's easy, one has the information about the thesaurus structure in WDQS, one can extract it in a couple of lines of SPARQL. Without it that's not possible.
        Equally, with this property one can indicate whether a relationship between two properties is supported by the thesaurus. Without it, how would one do that? Would one reference the P279 to the thesaurus? That's not quite honest, because that isn't the relation the thesaurus necessarily indicates. But otherwise how would one indicate the link? And what happens if somebody inserts a new class in between them? Then how does one indicate there was a relationship in the thesaurus?
        In practical terms, this property makes upload, verification, and re-verification much easier. It gives one a way to record the fact that the thesaurus relates the two items as soon as they have been identified to Q-numbers; that information is then easy to verify for existence and completeness, both then and at any later time; and, once the thesaurus relation is stored in WDQS, it is easy to query which of these thesaurus relations correspond to which kind of Wikidata relation, or to none at all. It makes it possible to break the upload/matching into different stages. First, identify item A. Then, at some later time, identify item B, and note that the thesaurus relates the two. Then, as part of a still later phase, perhaps have a Listeria query identify the simplest n-step relationship between the two using the standard Wikidata relationships (or whether none exists at all), allowing one to consider whether that seems appropriate, or whether one should create some new more direct statements. This is made hugely simpler if the information is in WDQS and accessible.
        As I wrote above, this is not intended to be a substitute for P279 etc, it is intended to be additional to them: a 'both/and' relationship, because it performs a different task. P279 etc show how we model the relationship; this use of this property is to record how it exists and is represented by others. Jheald (talk) 19:44, 4 February 2018 (UTC)[reply]
  •  Support. @Jheald: is right that "broader" maps to various WD props. The newest ISO standard has 3 "broader" sub-props (generic, partitive, instantial) that are used in the Getty vocabs, and this paper analyzes which of them can be composed: On the composition of ISO 25964 hierarchical relations (BTG, BTP, BTI). Alexiev, V.; Lindenthal, J.; and Isaac, A. International Journal on Digital Libraries, 1-10. August 2015. DOI 10.1007/s00799-015-0162-2. @Peter F. Patel-Schneider: is right that skos:broader is "squishy" (left this way on purpose in the SKOS spec), but is wrong in saying "let's just get better sources": LOTS of KOS are published as SKOS, and there's no viable replacement for these sources (I think thesauri specialists like eg @Jneubert: will emphatically agree here). skos:broader can at least be used to check and build up the established WD hierarchies. --Vladimir Alexiev (talk) 15:18, 21 February 2018 (UTC)[reply]
  • I  Support having basic SKOS classes like this in Wikidata. ChristianKl 10:48, 5 February 2018 (UTC)[reply]
  • Maybe some more voices here would be helpful? ArthurPSmith (talk) 16:45, 9 February 2018 (UTC) WikiProject Ontology has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead.[reply]
  • tend to  Strong oppose as I think Wikidata should adopt a realistic point of view. In that view, Wikidata is not about terms, Wikidata is not a thesaurus, it’s a description of the real world. If we do that, it will be unclear whether we are talking of Wikidata items themselves as « terms », or whether Wikidata items represent external concepts that themselves represent the world. On the other hand, SKOS specifically talks about « https://www.w3.org/2009/08/skos-reference/skos.html concepts » that can be broader than each other … The difference is explained in the SKOS description. Thesauri belong to structured Wiktionary. To take their example about love, I think Wikidata should be about love itself, and not about the concept of love as described in a thesaurus. author  TomT0m / talk page 17:22, 9 February 2018 (UTC)[reply]
    • @TomT0m: I've tried to mitigate that in two ways. First, by deliberately naming the property "broader concept", rather than "broader term" per SKOS -- because, indeed, Wikidata items are about concepts, not terms. Secondly, by emphasising above that this property is not intended to be a substitute for our established properties like P279, P31, P1269, P361 (and more) that do try to capture facts about the way the world is. The present property is indeed a different kind of thing. But there is a value in being able to benchmark ourselves against projects which are not so ontologically ambitious. If they say there is some kind of broader/narrower relationship between two items, that in itself is worth recording, it is something that is easy to import, and it is something that we can then query to see whether we do have facts in place via our factual properties to account for this stated relationship. As the piece you cited says, they have their own value as informal, lightweight, convenient structures that can be used as-is, while still pursuing the 'intellectually demanding and time consuming, and therefore costly work' of building a representation of how things actually are. I think the piece highlights quite well how the broader/narrower property is of a different kind to our existing properties, and trying to do a different thing. We can (and in my view should) have both. Jheald (talk) 18:38, 9 February 2018 (UTC)[reply]
    A further point, is that if we are going to record that our item X corresponds to a term A in a thesaurus (as our external IDs do), that the thesaurus considers equivalent to a term B, then it is valuable to record the context of that equivalence in the thesaurus -- ie how that A=B term appears in its hierarchy; as a warning, that in a richer hierarchy one might not consider A=B, and might differentiate between them. Jheald (talk) 18:45, 9 February 2018 (UTC)[reply]
    How about this: we have a « term » entity in structured Wiktionary. On one side, a property « external thesaurus equivalent term » on our term entity points to the thesaurus; on the other, a « meaning » property points to the Wikidata item. Term entities are then clearly identified with terms and not mixed ambiguously with their meanings. We use SKOS properties on term entities only. author  TomT0m / talk page 19:15, 9 February 2018 (UTC)[reply]
    @TomT0m: No. Because people who use thesauruses want to be able to resolve those terms to Wikidata items, to Wikipedia articles, to Wikidata's ontology and onwards to other resources. Because it's of value for them to be able to extract all the Wikidata items that correspond to whole parts of their thesaurus. Because it's of value to us, to help identify (i) concepts and (ii) relationships between items we may not be representing. And, fundamentally, because these thesauruses are not about words. That is not what they are used for. They are about representations of concepts, albeit in quite a weak ontology of relationships. They do not align to Wiktionary, but Wikidata. Jheald (talk) 19:37, 9 February 2018 (UTC)[reply]
    I don’t see how all this won’t be possible with my model. The lexical entities will probably be available through the query service, and the link term <-> item is maintained. author  TomT0m / talk page 20:06, 9 February 2018 (UTC)[reply]
  •  Oppose This appears to be some sort of squishy relationship, either subclass, or instance, or maybe something else. The only reason to include it is to copy over a squishy relationship from a sloppy external source. But instead of carrying over a squishy relationship why not just source a crisp relationship from a better source? It's not as if Wikidata has to be able to reflect all bad information from elsewhere. Peter F. Patel-Schneider (talk) 18:38, 9 February 2018 (UTC)[reply]
    @Peter F. Patel-Schneider: I fear your comment may be a classic instance of setting up "the best as the enemy of the good" -- with the usual consequence that as a result one doesn't get either. Per the piece TomT0m cited above, a relationship doesn't have to be perfect to be useful. Yes, it's a "squishy" relationship. That's why it's worth creating a separate property for it, so it doesn't infect our existing "crisp" ones. But it's worth recording so that (a) if it's a thesaurus somebody else is using, then they can match straight to it; (b) it can be a useful part of the workflow in matching our items to external ones. I have real work I think this property could be very useful in helping. At a time when it's apparently no problem to magic another 10 million new items into being for scientific papers, I would be really pissed off if that workflow was blocked just because somebody felt these well-documented external statements were too "squishy" to be worth recording. If you don't have a use for them yourself, then fine. But why go further than that and prevent people who do have a use for them? Jheald (talk) 19:08, 9 February 2018 (UTC)[reply]
    @Jheald: I view this as more of "the bad is the enemy of the good". Mass-importing bad information not only ends up with stuff that is useless (or nearly useless) for Wikidata, but can interfere with getting useful information into Wikidata. After all, why did Wikidata go to the effort of having a reasonable set of ontological primitives? (And I'm certainly not arguing that Wikidata is in the best-of-class category, just that it should not slip further down, into just being a random set of unfathomable links.) Peter F. Patel-Schneider (talk) 19:56, 9 February 2018 (UTC)[reply]
    @Peter F. Patel-Schneider: Part of Wikidata's mission is to represent multiple versions of "truth", not just our own. Being able to faithfully represent external thesauruses, including their ontology even if weaker than ours, seems a valuable part of that. We represent facts like dates of birth, death, etc, even if they are not our preferred values, if they are in use in the wild on reputable sites. Writing off heavily used structures as 'bad', just because they may not be as complete as ours, seems to me to be throwing the baby out with the bathwater. I don't see why you think adding this property would be so tainted. No, it would not express what our existing properties can express; but it would help us to document what else is out there; assist people in matching their data and their structures and using our data; and assist us in identifying gaps and unmatched relationships. As I keep stressing, it is not intended to be a substitute for our own "reasonable set of ontological primitives", it is additional, specifically for the purpose of documenting relationships as presented by others. Why do you find that so objectionable that you want to forbid it? Jheald (talk) 21:01, 9 February 2018 (UTC)[reply]
@Jheald: Yes, Wikidata has to be able to represent claims from external sources. (In fact, I am of the opinion that more of the information in Wikidata should be sourced.) However, the need to represent claims from external sources means that Wikidata has to be careful in how it does its representation. A relationship like broader-than is very quickly going to be overused and thus meaningless as it is used for different squishy-more-general-than relationships from different sources. So even in the best case Wikidata would end up with appliance broader than kitchen stove, kitchen broader than kitchen stove, cooking broader than kitchen stove, heating broader than kitchen stove, kitchen stove broader than electric stove, kitchen stove broader than burner, and kitchen stove broader than oven, which ends up as just a mess. But maybe the intent of the proposal is not to have a single broader-than relationship but instead to have multiple such relationships, i.e., to determine the meaning of a broader-than claim one has to use the meaning of the relation in the source of the claim. This makes more sense but I strongly oppose it as undermining the basis of Wikidata. So having a single squishy broader-than in Wikidata ends up polluting Wikidata and having a relationship in Wikidata that can only be interpreted by looking at its meaning in the source ends up undermining the whole premise of Wikidata. Peter F. Patel-Schneider (talk) 01:06, 10 February 2018 (UTC)[reply]
@Peter F. Patel-Schneider: So on kitchen stove (Q182995), somewhere down near the bottom of the item, we would end up with four values for "broader concept", each referenced to the different sources that gave it. (Incidentally, also making it possible for different thesauruses to map how closely or differently other thesauruses see terms to them, in the same way that our multiple external ids have turned into a Rosetta Stone for mapping different identifiers to each other). I don't see why you see a problem in this, or why you think it would be "undermining the whole premise of Wikidata". On the contrary, having information recording the item's meaning (or more specifically place-in-hierarchy) in the source can actually be quite a useful check -- for example, identifying ethnic terms in the Art & Architecture Thesaurus (Q611299) that have been matched to an ethnic group, when checking the hierarchy reveals that the thesaurus term is actually an identifier for the material culture of that group; and so has been mis-matched here (unfortunate, but there's more of it about than one would like). So I don't see including some information about the source context as undermining Wikidata at all -- on the contrary, it helps us check and validate the matches we've made in assignments of properties like Art & Architecture Thesaurus ID (P1014). Jheald (talk) 22:05, 10 February 2018 (UTC)[reply]
@Jheald: See my response below.
  •  Comment I'm ambivalent. I've done a lot of work with the Europeana Fashion Thesaurus which uses <broader concept>, and in almost all cases I have used it to mean <subclass of> ("sailor collar" <subclass> of "collar"). In a few places, I have used <facet of>. But in some cases, their "broader concept" makes no sense in Wikidata's wider context (EFV "showroom" <broader concept> "fashion event"). I don't see that adding a property <broader concept> to "showroom" and referencing EFV would add anything meaningful to Wikidata. [On a related topic, I am interested in a concept like <shares features with>, but that would be another discussion.] - PKM (talk) 19:44, 9 February 2018 (UTC)[reply]
    @PKM: Yet, if somebody wants to extract all items in the EFT tree, or a part of it, or statements referencing items in a part of it, they cannot do that in a query unless we have the connections that define that tree in the system.
I'd also comment that sometimes when <broader concept> doesn't correspond to subclass of (P279), that can still be indicating a meaningful connection that we *should* make sure we have some representation -- for example, in the UK Parliament thesaurus, I'm fairly sure that Department for Environment, Food and Rural Affairs (Q3044721) has <broader concept> fishery (Q180538). That's a pair we too would want to connect, but perhaps through field of work (P101) or perhaps interested in (P2650). (Aside -- perhaps not an obvious property to include, if one was trying to reconstruct a query for things relevant to fishing according to the thesaurus, if one didn't have <broader concept>). On the other hand, in Wikidata, is the proper target fishery (Q180538), fishing industry (Q635139), or fishing (Q14373) ? Not obvious, and getting it right may, as TomT0m's source put it, be "demanding and time consuming". But at least with <broader concept> one can flag that some sort of relationship exists, which we ought in some way to be representing with our more concrete ontological properties. Jheald (talk) 20:17, 9 February 2018 (UTC)[reply]
But why should Wikidata even be interested in extracting all the items in the EFT tree and their EFT tree relationships? That doesn't seem to be even close to meeting the Wikidata notability requirement. Peter F. Patel-Schneider (talk) 21:19, 9 February 2018 (UTC)[reply]
@Peter F. Patel-Schneider: But it's not just the EFT tree, is it? There are at least two big trees from the Library of Congress that I want to work on, the AAT and various cultural trees including from the British Museum that User:Vladimir Alexiev and others have been working on, the thesauruses used by various agencies in the UK for describing historical sites and artefacts, the UK Parliament in-house team who are very interested in aligning their thesaurus with Wikidata, and more projects beyond that. There are some serious concept gaps in Wikidata, particularly higher up in the tree where Wikipedia articles are often dabs. (For example, I found a few days ago we don't even have an item for the general concept of "identification"). We don't even know how well we can match the concept tree of sister projects like Commons. These are worth being able to compare ourselves with, and trying to benchmark ourselves against. Jheald (talk) 21:36, 9 February 2018 (UTC)[reply]
@Jheald: Sure, there are problems with the Wikidata representation. The solution is to fix these problems, not make them worse by adding squishy relationships. Yes, this is more work than just pulling in whatever is elsewhere, but the work is needed to support the goals of Wikidata. Peter F. Patel-Schneider (talk) 01:10, 10 February 2018 (UTC)[reply]
@Peter F. Patel-Schneider: So you think this property would make Wikidata's representation worse? Why? How? It would be a completely separate property, it shouldn't impact on how we represent things using subclass of (P279) / instance of (P31) / etc. at all. But the one thing it might do, by encouraging us to systematically mine the "broader term" links would be to help identify items we currently don't have -- at all. Jheald (talk) 22:10, 10 February 2018 (UTC)[reply]
@Jheald: What is going to happen if a "broader than" property is added to Wikidata? The property starts out with a vague and squishy meaning and so it is an easy target for mapping, both from sources that have a vague property with a name like "broader than" and from sources that have properties that are crisp but for which mapping to the other Wikidata ontological properties requires some thought or effort. (I think that will even happen for claims from sources where subclass of (P279) or instance of (P31) should be used.) This will on occasion push the boundaries of the Wikidata property, for example, to add thematic broadness, which only increases its ease of use. So Wikidata ends up with this heavily used ontological property that has a vague and expanding meaning. But there was a heavy price to pay for this "ease" - more specific ontological relationships have been replaced by this vague relationship, reducing the ability to make useful crisp inferences from combinations of ontological claims in Wikidata. (Of course, if every use of the "broader than" property was accompanied by a crisp ontological relationship then there is no problem here, but then why not just use the crisp property?) The alternative to this creeping vagueness would be to say that the meaning of a "broader than" claim in Wikidata is the meaning of the claim in its source material. This at least makes it possible to make inferences from multiple Wikidata "broader than" claims that are no vaguer than the inferences that can be made in the source materials. However then Wikidata doesn't provide its own meaning at all for "broader than", undermining what I think is one of the central underpinnings of Wikidata. Peter F. Patel-Schneider (talk) 22:02, 12 February 2018 (UTC)[reply]
@Peter F. Patel-Schneider: No. "more specific ontological relationships have been replaced by this vague relationship": I don't believe this. As I have been at pains to point out, this property is intended to be present in addition to our normal ontological relationships on an item, not instead of them. This is true both at a broad level and at a specific. According to their talk pages, we currently have 39,882,357 uses of instance of (P31), and 1,349,866 uses of subclass of (P279). Compared to that the AAT, which I believe is the largest thesaurus we currently match to, has about 47,000 entries. So the existing properties will continue to be overwhelmingly more common and dominant.
We currently have 15706 matches to the AAT. (See User:Jheald/aat/full for content and current matches, hierarchically). Of those, 2516 currently don't have either a P31, P279 or P1269. (tinyurl.com/y8udwerr). I don't believe there's any foundation for thinking, "broader concept" would make it less likely they would acquire one. I actually think the reverse is more likely to be true: "broader concept" identifies a potential target, and then one can go through and ask whether any of the above properties is suitable.
"if every use of the "broader than" property was accompanied by a crisp ontological relationship... then why not just use the crisp property?" For the reasons already spelt out several times. It is useful to have this property recording the source relationship in addition to statements using the crisp properties. Why? Because this property is referenceable to a source, and verifiable that what is in Wikidata indeed matches what the source says. Because this property is a useful stepping stone in the import process towards getting statements using our own primitives (as just discussed). Because this property is a useful check on the identity matching: does the item here in fact correspond to the entry in the external source -- if the ontology here does not match the relations revealed by "broader concept", then there may have been a mismatch (eg 'Alhambra' vs 'Alhambra ware'; 'New Zealand' as a country vs 'New Zealand' as a family of ceramic styles). Because this property helps one extract the entries in a sector of the external hierarchy in a query, something which may not be straightforward with our own properties. Because it may throw up cases where there is a relationship, but one not easily expressed by our existing properties, or only via multiple hops. Etc., etc.
"the meaning of a "broader than" claim in Wikidata is the meaning of the claim in its source material." Yes, this is exactly what is proposed. The property is intended to be a record of a relationship in the source material. No more, no less. But that is still something that, for import, for completeness verification, for quality control, for sourcing, for referencing, for comparison, for extraction -- for all of these, I believe is significantly valuable.
"then Wikidata doesn't provide its own meaning at all for "broader than", undermining what I think is one of the central underpinnings of Wikidata." What central underpinning? Fundamentally, like all WMF projects, WD exists to work with existing material, collate it, present it, and make it accessible -- not to engage in original research. This makes it sound as if your objection to this property -- the reason why you seek to forbid it, regardless of the uses others may want it for -- is specifically that it is exactly the former, viz. recording and presenting existing material, that it is trying to do. Jheald (talk) 23:01, 12 February 2018 (UTC)[reply]
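
The checking workflow argued for in this thread (record the thesaurus link first, then ask whether any chain of "crisp" statements accounts for it) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the sample pairs reuse examples mentioned in the discussion, but the crisp statements, including the intermediate class "tale", are hypothetical data, and labels stand in for Q-numbers. In practice this would more likely be a WDQS or Listeria query than a local script.

```python
from collections import deque

# (narrower, broader) pairs as an external thesaurus might state them.
# Labels stand in for Q-numbers; the data set is illustrative only.
broader_pairs = [
    ("folk tale", "oral literature"),
    ("showroom", "fashion event"),
    ("sailor collar", "collar"),
]

# Hypothetical "crisp" statements (subject, property, object).
# The intermediate class "tale" is invented to show a two-hop case.
crisp_statements = [
    ("folk tale", "P279", "tale"),
    ("tale", "P279", "oral literature"),
    ("sailor collar", "P279", "collar"),
]


def shortest_crisp_path(start, goal, statements, max_hops=3):
    """Breadth-first search for the shortest chain of crisp statements
    leading from start to goal. Returns a list of (property, object)
    hops, or None if no chain of at most max_hops exists."""
    graph = {}
    for subject, prop, obj in statements:
        graph.setdefault(subject, []).append((prop, obj))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) == max_hops:
            continue  # do not expand beyond the hop limit
        for prop, obj in graph.get(node, []):
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, path + [(prop, obj)]))
    return None


def audit(pairs, statements, max_hops=3):
    """Map each thesaurus pair to its shortest crisp path, or None."""
    return {
        pair: shortest_crisp_path(pair[0], pair[1], statements, max_hops)
        for pair in pairs
    }


report = audit(broader_pairs, crisp_statements)
for pair, path in report.items():
    print(pair, "->", path if path is not None else "no crisp path found")
```

Pairs that map to `None` are the ones a reviewer would look at: either a crisp statement is missing, or the thesaurus relation has no natural counterpart among the existing properties.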

I've pinged a few more people who appear to be interested in thesauruses, eg editors who have been most active in mix'n'matching the Art & Architecture Thesaurus (Q611299), so that they know this discussion is going on. Jheald (talk) 19:19, 10 February 2018 (UTC) [reply]
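
The extraction use case raised several times above ("a couple of lines of SPARQL") might look roughly like the query built below. This is a sketch, not an agreed design: it assumes the property is used as a top-line statement, as in this original form of the proposal, and `PLACEHOLDER_BROADER_PID` is a made-up identifier, since no property number is given in the discussion. The root item oral literature (Q986539) is taken from the thread. The function only builds the query text; it does not contact WDQS.

```python
# Placeholder only: NOT a real property number. The discussion predates
# any assignment of an identifier to "broader concept".
PLACEHOLDER_BROADER_PID = "P99999"


def _check_id(value, prefix):
    """Accept only identifiers of the form Q123 / P123."""
    if not (value.startswith(prefix) and value[1:].isdigit()):
        raise ValueError(f"expected an id like {prefix}123, got {value!r}")
    return value


def build_subtree_query(root_qid, broader_pid=PLACEHOLDER_BROADER_PID):
    """Return SPARQL text selecting every item that reaches root_qid
    through zero or more 'broader concept' statements."""
    root = _check_id(root_qid, "Q")
    pid = _check_id(broader_pid, "P")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:{pid}* wd:{root} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


# Everything the thesaurus places under oral literature (Q986539).
query = build_subtree_query("Q986539")
print(query)
```

The `*` property path follows the thesaurus links transitively, which is what lets a whole branch be pulled out regardless of whether each link was modelled here as P279, P31, P361 or not at all.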

frwiki article section about relationship between OWL and RDF

« Le document SKOS Reference définit la classe skos:Concept comme une classe OWL (skos:Concept rdf:type owl:Class). OWL apparait donc comme le méta-modèle dans lequel sont définies les classes et propriétés du langage SKOS, et une instance de skos:Concept est, au sens de OWL, un « Individual ». C'est une distinction essentielle entre une structure de concepts et une ontologie. La structure est destinée avant tout à faciliter une circulation cohérente dans un domaine et ses dimensions, alors que l'ontologie inventorie les types d'éléments (classes) qui peuvent y être rencontrés en fournissant de surcroît des informations sur les éléments individuels possibles (instances). Aussi, compte tenu de la proximité des moyens mis en œuvre (triplets RDF ; termes identiques ; hiérarchies homologues ; graphes ; etc.) dans les deux cas, il est important de limiter la confusion entre les deux modèles de données, chacun pouvant être légitimement exploité pour ses caractéristiques propres au sein d'une application mixte. »

Translation in English : « The SKOS Reference document defines the skos:Concept class as an OWL class (skos:Concept rdf:type owl:Class). OWL thus appears as the meta-model in which the classes and the properties of the SKOS language are defined, and an instance of skos:Concept is, in the OWL sense, an « Individual ». It’s an essential distinction between a concept structure and an ontology. The structure is designed first to ensure a coherent circulation in a domain and its dimensions, while the ontology catalogs the element types (classes) that can be encountered […]. As a consequence, and as the techniques used for implementation are so close (RDF triples, identical terms, hierarchies, graphs …) in both cases, it’s important to limit the confusion between the two data models, each being legitimately exploited for its own qualities in a mixed app. » This echoes my concern about keeping the concepts (through the terms) distinct from Wikidata items. In the OWL implementation of SKOS, in particular, no concept is a class, they are all individuals. If we then use SKOS properties on Wikidata classes, we break this conceptual barrier … (EDIT although to be fair, with punning it is not a problem : https://www.w3.org/TR/owl2-new-features/#F12:_Punning ) author  TomT0m / talk page 20:39, 9 February 2018 (UTC)[reply]

@TomT0m: If I have understood what you have written correctly, I think you're raising an issue that doesn't exist here (as I think you recognise in the last sentence). Essentially (if I understand you correctly) your point is that skos:broader relates instances; whereas many of the things here that we match to thesauruses are classes.
But this is not a problem for us, because we're quite happy for love (Q316) to be subclass of (P279) emotion (Q9415) as well as to be instance of (P31) concept (Q151885). Okay, the latter isn't particularly useful, so is often left only implicit; and the former is often currently not correctly represented, especially for abstract things (indeed love (Q316) is arguably currently wrong). But the point is we don't require distinct items for the two.
So it doesn't really apply, that skos:broader relates instances whereas we would be applying it to classes, because under the rules that we have chosen classes can be instances as well. (And the rules we have chosen are ultimately what matter, because at the end of the day we are not SKOS or OWL or anything else, we are Wikidata).
Have I understood you correctly? Jheald (talk) 22:07, 9 February 2018 (UTC)[reply]
I think you did. Although the fact that we are Wikidata of course lets us make our own choices and adds some flexibility, that could make things utterly confusing. I tried to read up on thesauri while preparing for this discussion, and their definition varies greatly depending on the document you read. See, for example, http://www.dictionary.com/browse/thesaurus . A constant in that definition, though, is that they seem to be depicted as a list of terms, usually, each with an associated definition (which depicts a concept). The main difference is that in some definitions thesauri are mostly term-centric. One problem with that approach is that terms are tied to a language, while our items are not. Our items are not about a term. Another definition is given by https://en.wikipedia.org/wiki/Thesaurus_(information_retrieval) which is more « controlled vocabulary » and meaning-centric. However, it’s designed to model topics of documents, and some concepts are cumbersome in Wikidata contexts, like a « synonym »: would you create an item for each synonym? Would you store synonyms on Wikidata? Wikidata is not designed to do this. This is a job for a (structured) dictionary. They are designed to describe how topics of terms relate to each other, not to describe real world objects … There is already great confusion in Wikidata between items describing terms and items describing the external objects we use some terms to name, and I fear that using SKOS would add confusion instead of clarifying, sorry. A related notion is « https://en.wikipedia.org/wiki/Controlled_vocabulary ». I think to view Wikidata as a thesaurus would be to see Qids as terms in a controlled vocabulary. The whole point is that Wikidata does not talk about its items, nor is it using them to index documents. If you view each item as a dataset, a kind of document, then you could index them with a controlled vocabulary. author  TomT0m / talk page 10:44, 10 February 2018 (UTC)[reply]
@TomT0m: So to avoid any unclarity, the term "thesaurus" on this page is definitely being used in the sense of thesaurus (Q17152639) = en:Thesaurus_(information_retrieval), "a form of controlled vocabulary with inner structure" as that item's @en description puts it. As you say, meaning-centric. Not a book of words.
How does one record synonyms? Currently we do that with the aliases for items. We don't record every synonym -- we aren't a thesaurus in the sense of a word book; but we do record some, which help with search, and also help clarify the extent of the intended scope of the item. Thesauruses in the information retrieval sense do likewise. They don't record every synonym, but they generally do give some, again to help with search, and also help clarify the extent of the intended scope of the item. Their scope may not be quite the same as ours, so the property Wikidata:Property proposal/alternate names is proposed, that could be used as a qualifier for properties like Art & Architecture Thesaurus ID (P1014) -- or equally on properties like Union List of Artist Names ID (P245). It also helps act as confirmatory sourcing for the aliases we do use.
As for "controlled vocabulary", we actually do use Wikidata exactly as a controlled vocabulary, in that so many of our properties are item-valued, rather than free text. That is effectively what a controlled vocabulary is -- a defined set of possible values for statements. As for the use you put a controlled vocabulary to, again like Wikidata, that just depends on what statements you want to be able to make. These ones are used for describing heritage sites: [1]. Wikidata can be used to index documents, using properties like main subject (P921) or depicts (P180) or shown with features (P1354). But the main point about a controlled vocabulary is that it controls and standardises the values that statements can have. Making out that that is fundamentally different to Wikidata, and not part of what can be done with Wikidata, is just building a wall that doesn't really exist. Jheald (talk) 22:41, 10 February 2018 (UTC)[reply]
@Jheald: at that point I think we have both made our cases, so I’ll try to shed light on another perspective. You did not (really) comment on the structured Wiktionary case. How do you picture it in the whole equation? Do you think it’s suited to indexing thesauri, and if not, why? author  TomT0m / talk page 10:29, 11 February 2018 (UTC)[reply]
@TomT0m: I haven't been following WD:wiktionary very closely, but as I understand it, the primary kind of item there will be the lexeme -- as an example the page considers a lexeme for the German word Leiter. It seems that Wiktionary will be structured around language-specific words, not around meanings. So the meaning "leader" and the meaning "electrical conductor" will both be contained as notes in the lexeme Leiter, while on the other hand the English word leader will be a quite different lexeme.
So on the one hand Wikidata is primarily about meanings, but may contain some aliases; whereas Wiktionary is primarily about words, but may include some information about 'senses' of the word, to distinguish meanings. However, these senses would appear to be very much secondary, and may well not link together, or transcend different lexemes. It's not an attempt (I think) to build a hierarchical structure of things or an ontology -- that's Wikidata's job; Wiktionary will be doing something else. Therefore, to me, the controlled vocabularies seem much closer to Wikidata in spirit, than to Wiktionary. Jheald (talk) 15:06, 11 February 2018 (UTC)[reply]


 Comment: @Jheald: has brought up an important subject that I think could be one of the fundamental functions of Wikidata in the future. Finding a way to record different models of relationship between the same concepts in different external databases is extremely important.

I have no idea what the best way of doing this is but I have some thoughts on where to start researching.

Are there any other databases that record this kind of variation in relationship models that we could explore? Hopefully some very clever people have thought about this before and come up with a solution we can copy :) My best guess where to find these would be in biology, linguistics and library classification systems (@Astinson (WMF): as a start).

If this doesn't exist perhaps another way to explore this subject is by looking at some external databases that hold similar data but understand the relationships between the concepts differently. To start from concrete examples and work our way backwards to see what kinds of variations appear 'in the wild'. Does anyone have any suggestions for this?

Perhaps we can start a discussion on the project chat as a start?

Thanks

--John Cummings (talk) 00:26, 11 February 2018 (UTC)[reply]

I tend to agree. Several points:
@TomT0m: I may need some time to get back to you on that in detail. But a couple of preliminary observations:
(i) I'm proposing broader concept now because I can see an immediate use for it. I think it would help with importing from sources with hierarchical structures; would help identify matches to those sources which were not correct; help identify relationships we may be missing; and help by giving a hierarchical statement (even if weaker than our regular ones) that was referenced and could be verified.
(ii) I don't see the proposal as necessarily a 1:1 match for skos:broader. In particular, I think the property could be used for a wider range of sources, not limited just to those in linked data form. For example, I think it could be usefully used in connection with a broadly hierarchical structure like the en:Library of Congress Classification, even though the LCC does not attempt to be a strict ontological hierarchy.
(iii) As a consequence of (ii) I would not see this property as making an ontological assertion -- we have our own mainline properties for that. Rather, I see it principally as a record of how a topic is arranged in an external source. That record may often be interesting to compare with our own ontological structure; but the degree to which the external hierarchy is ontological will vary from source to source, and also often within external sources, from part to part. One should therefore be very cautious before using this property as a basis for ontological reasoning -- it is not what it is being proposed for. Sanity checking, yes (with an expectation of exceptions); reliable reasoning, no.
(iv) I don't think splitting concepts and meanings would be helpful. Better for people to be aware that Wikidata items can exhibit a degree of overloading -- just as already a class item can be subclass of (P279) some class, but also instance of (P31) some metaclass. I think given the noisiness and roughness of where we start from, that degree of informality probably fits us quite well. Jheald (talk) 17:29, 13 February 2018 (UTC)[reply]
@TomT0m: On the subject of other skos relations, since yesterday I have been exploring the British Museum linked data system, which contains in-depth linked-data descriptions of about 2 million collections items, using classes and properties from the CIDOC model (overview User:Jheald/cidoc), including about 168,000 entries for "concepts", mostly arranged in about 20 thesauruses that use the skos: predicates. (Overview at User:Jheald/bm). Unfortunately, the site seems to be down today ("502 bad gateway"), but I hope it will be up again soon. User:Jheald/bmt now has a hierarchical listing of the main thesauruses up, similar to User:Jheald/aat and User:Jheald/lcgft.
The skos relations that the BM uses are: skos:inScheme, skos:prefLabel, skos:altLabel, skos:broader, skos:related, skos:definition, skos:scopeNote, and skos:example. These would seem to cover a lot of other thesauruses too.
skos:prefLabel we can record with subject named as (P1810); skos:altLabel is what Wikidata:Property proposal/alternate names is proposed to cover. skos:broader is the subject of the present proposal. skos:narrower I suggest we don't need if we are recording skos:broader, but narrower external class (P3950) exists for entries apparently not in Wikidata. skos:related might make sense to record. (I think partially coincident with (P1382) isn't quite right). The free-text fields skos:scopeNote, skos:definition, and skos:example all contain useful information, but it would seem sufficient to be able to read those on the original site.
That just leaves skos:inScheme. To some extent we cover that by giving different schemes different external-id properties; but not always -- for example, British Museum thesaurus ID (P3632) covers several different schemes, as will LoC and MARC vocabularies ID (P4801). Sometimes different schemes can be inferred from the first part of the ID, eg "technique/...", but not always: the BM identifiers for several schemes are of the form "x12345". And besides, having to filter using string-functions is slow and inefficient; also, it's a rather opaque hack, not exactly self-documenting. One work-around might be to use object has role (P3831) or part of (P361) to indicate the thesaurus in question.
One other thing that thesauruses sometimes do is indicate a "root term" or "top term" in a scheme. Again, we probably could do that with a qualifier. So those are the main relationships in thesauruses that are recorded using skos. Most of them we can represent, but "broader concept" I do think would be very useful. Jheald (talk) 12:28, 14 February 2018 (UTC)[reply]
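To illustrate the skos:inScheme point above, here is a minimal Python sketch. It is only an illustration under stated assumptions: the identifier "technique/hypothetical-1", the scheme name "hypothetical-scheme" and both helper functions are invented for the example, and only the "x12345" form comes from the discussion. It contrasts guessing a scheme from the ID prefix with reading an explicitly recorded scheme.

```python
# Hypothetical records: each external ID carries an explicitly recorded scheme,
# much as a qualifier on the identifier statement might.
RECORDS = [
    {"id": "technique/hypothetical-1", "scheme": "technique"},
    {"id": "x12345", "scheme": "hypothetical-scheme"},
]


def scheme_from_id(identifier):
    """Guess the scheme from the part of the ID before the first '/'.

    Returns None when the ID has no such prefix, which is the failure
    case for opaque identifiers of the form "x12345".
    """
    prefix, separator, _ = identifier.partition("/")
    return prefix if separator else None


def ids_in_scheme(records, scheme):
    """Select IDs by the explicitly recorded scheme, with no string parsing."""
    return [record["id"] for record in records if record["scheme"] == scheme]


if __name__ == "__main__":
    for record in RECORDS:
        print(record["id"], "->", scheme_from_id(record["id"]))
    print(ids_in_scheme(RECORDS, "hypothetical-scheme"))
```

The prefix guess works only where the identifier happens to encode the scheme; the explicit field works for both records, which is the argument for recording the scheme directly.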
On the other point you raised, there is an attraction in separating things from types, eg "a flock of sheep" (an actual flock) from "a flock of sheep" (type of thing). In the CIDOC-CRM setups, the latter all have rdf:type E55_type (a subclass of concept), whereas the former probably have rdf:type Biological Object. This probably makes sense in a museum, where one has a strong distinction between concrete objects in the collection, and actual individuals who may be related to them, from conceptual types. But it is not I think the direction Wikidata has gone. Jheald (talk) 13:04, 14 February 2018 (UTC)[reply]

 Comment The motivation for the property given here represents a valid use case: Exploiting the well-thought-out hierarchies in existing knowledge organization systems (KOS) makes sense for Wikidata. These hierarchies do not try to achieve ontological purity, but normally serve the purpose of indexing, classifying or retrieving information in a certain domain and context. They can help complement Wikidata, as detailed above by Jheald.

So while I absolutely support the purposes of the property, I'm not sure if these hierarchies should be stored as part of Wikidata: There are many KOS which could be used with the property, with broader or quite narrow domains. That could result in a normally sparse patchwork of relations, with lumps on the upper level, which may be contradictory or even include loops. By the use of qualifiers for provenance, these networks can be separated; however, I'm not sure the resulting structures will be helpful to users outside systematic approaches as described above.

Therefore, I'd suggest trying to use Linked Data for building virtual overlays for Wikidata items, organizing them by external hierarchies through SPARQL queries, which link different data sources. (Some time ago, I implemented that as a proof of concept for descriptors from AGROVOC (Q292649), grouped by the upper level categories of STW Thesaurus for Economics (Q26903352), based on a mapping between the two thesauri, here.)

Federated SPARQL queries on multiple endpoints, as required by this approach, are often tricky. I add here two examples which "overlay" Wikidata items with the skos:broader hierarchy defined by STW.

  1. Query against the Wikidata endpoint. This only works when the external endpoint is listed at wd:SPARQL_federation_input.
  2. Query against the public STW endpoint, reaching out to WDQS (query takes some time).

The overall limitation is that the external data source must be, or must be made, available as a public SPARQL endpoint. Two notes on this particular example: It ignores for now the mapping relation type qualifier P4390. And, since currently only the geographic part of STW is mapped to Wikidata, the imported hierarchy will not provide much additional information. But I hope it helps to demonstrate the idea. Jneubert (talk) 08:22, 18 February 2018 (UTC)[reply]
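As a rough illustration of the overlay idea described above, the following Python sketch assembles a federated query of that general shape. It is not the query linked in the comment: the function name, the endpoint URL, the URI prefix and the use of P3911 as the STW identifier property are all assumptions that should be checked before running anything against a real service.

```python
from textwrap import dedent


def build_overlay_query(endpoint_url, id_property, uri_prefix):
    """Build a federated SPARQL query (as a string) that pairs Wikidata items
    whose external concepts are linked by skos:broader on a remote endpoint.

    All three arguments are supplied by the caller; nothing here is verified
    against a live service.
    """
    return dedent(f"""\
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        PREFIX wdt: <http://www.wikidata.org/prop/direct/>

        SELECT ?item ?broaderItem WHERE {{
          ?item wdt:{id_property} ?id .
          BIND(IRI(CONCAT("{uri_prefix}", ?id)) AS ?concept)
          SERVICE <{endpoint_url}> {{
            ?concept skos:broader ?broaderConcept .
          }}
          BIND(STRAFTER(STR(?broaderConcept), "{uri_prefix}") AS ?broaderId)
          ?broaderItem wdt:{id_property} ?broaderId .
        }}
        """)


if __name__ == "__main__":
    # Placeholder values: the endpoint URL is invented, and P3911 and the
    # URI prefix are assumptions to verify.
    print(build_overlay_query(
        endpoint_url="https://example.org/sparql",
        id_property="P3911",
        uri_prefix="http://zbw.eu/stw/descriptor/",
    ))
```

The query text would be submitted to the Wikidata endpoint, which, as noted above, only federates with endpoints it has been configured to allow.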

@Jneubert: Very very valuable. Always interesting to see an example of a federated query, because there are so few around. And good to be reminded that this exists as a different way to do things. The limitation of course is that there needs to be a SPARQL service, it needs to have been approved for WDQS, and it needs to be working. But STW seems impressively quick.
I still think there is a case for a local copy of the information here -- perhaps (I'm now thinking) as qualifiers on the identifier statements, since (strangely?) that may actually be a better structure for path queries. But it's certainly interesting to see this; and it might be useful to consider this as a 'standard query', for documenting federation with external servers. Thank you! Jheald (talk) 20:32, 19 February 2018 (UTC)[reply]

Thesauri specialists (thesaurists?) have long known that the real world is squishy, doesn't always map into the neat subclass/instance categories (BTG, BTI; BTP and its various subdivisions).

@Jheald: wrote "(WD should) work with existing material, collate it, present it, and make it accessible". Hear, hear! WD's ambition to make the world better organized is admirable, but denying datasets (or their features) on the grounds that they are "squishy" is just not a sustainable policy.

For reference, the AAT hierarchy is 12 levels deep, a concept has 2(!) parents on average (but a single Preferred Parent), and is commended as a good thesaural structure. This prop would allow the AAT structure to be imported, so that WD hierarchies can then be compared to it, and the AAT structure can be leveraged to further build up the WD structures. --Vladimir Alexiev (talk) 15:37, 21 February 2018 (UTC)[reply]
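To make the polyhierarchy point concrete, here is a small Python sketch of how "broader concept" links with several parents, one of them preferred, might be traversed. The data is an invented toy example, not AAT content, and the convention that the first listed parent is the preferred one is an assumption made for the illustration.

```python
from collections import deque

# Hypothetical toy hierarchy: concept -> broader concepts, preferred parent first.
BROADER = {
    "folk tale": ["oral literature", "narrative"],
    "oral literature": ["literature"],
    "narrative": ["literature"],
    "literature": [],
}


def ancestors(concept, broader):
    """Return every concept reachable through broader links (breadth-first)."""
    seen = set()
    queue = deque(broader.get(concept, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against loops
        seen.add(current)
        queue.extend(broader.get(current, []))
    return seen


def preferred_path(concept, broader):
    """Follow the first-listed (preferred) parent upward until a root or a loop."""
    path = [concept]
    visited = {concept}
    while broader.get(path[-1]):
        parent = broader[path[-1]][0]
        if parent in visited:
            break  # loop guard: external hierarchies may contain cycles
        path.append(parent)
        visited.add(parent)
    return path


def mean_parent_count(broader):
    """Average number of parents over concepts that have at least one."""
    parent_lists = [parents for parents in broader.values() if parents]
    return sum(len(parents) for parents in parent_lists) / len(parent_lists)


if __name__ == "__main__":
    print(sorted(ancestors("folk tale", BROADER)))
    print(preferred_path("folk tale", BROADER))
    print(mean_parent_count(BROADER))
```

Comparing the output of `ancestors` for an imported hierarchy with the P279 chain of the matched items is one way the sanity checking mentioned earlier in the discussion could be approached.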

@Vladimir Alexiev: For reference: first four levels of AAT enumerated here: User:Jheald/aat. Whole tree at User:Jheald/aat/full :-) Jheald (talk) 15:48, 21 February 2018 (UTC)[reply]
@Jheald: Great stuff! But there's some breakage after "small arms" at the "full page": double bullet, and the rest is in small-size font. Also, I'd change the URLs from ../aat/.. to ../page/aat/.. to lead to the human-readable pages. --Vladimir Alexiev (talk)