User:Strobilomyces/IFWDFAL general

Index Fungorum Wikidata Fungus Author Loader (IFWDFAL) - general

On this page I document a Wikidata data loading exercise in which, from December 2018 to May 2019, I tried to load author citation information for fungi from the Index Fungorum database. Before starting to load I asked on the Taxonomy Project talk page (now archived) whether this would be a useful project, and the feedback I received was positive. My main reason for undertaking the exercise was to understand the practical details of this sort of work, and in particular the use and feasibility of the taxonomy data model in Wikidata. The way that the author information needs to be held, with items for every author, name and basionym, is very complicated. The whole idea behind Wikidata is that the information should be usable by software, and therefore the rules must be enforced rigorously. Is it really practical to have such complex rules in a project which anybody can edit?

The author citation string identifies the exact meaning of the scientific name of an organism. For fungi and similar groups the convention is explained on the relevant Wikipedia page and in the International Code of Nomenclature for algae, fungi, and plants. For example, the author citation string for Crepidotus subsphaerosporus (Q10461437) is "(J.E.Lange) Kühner & Romagn. ex Hesler & A.H.Sm."

This task is particularly appropriate in the case of fungi because the nomenclatural information in Index Fungorum is widely accepted. Note that I am not referring to the "taxonomy" information which determines which names are synonyms and which names at each rank should be regarded as the current correct ones; that is much more contentious. The same site also includes a separate set of data called "Species Fungorum" which is built on top of Index Fungorum and which tries to give a guide to synonymy and current names. The Species Fungorum part is not used at all in the determination of the author citation string.

I developed Python software to read relevant data through the Index Fungorum API, check the content (taking into account relevant items in Wikidata), and generate QuickStatements commands to execute the necessary changes, using the V1 version of that tool. The commands are then copied and pasted to QuickStatements and executed in the foreground. So the process runs under human supervision and does not constitute a "bot". If fungus authors are missing in Wikidata but present in the International Plant Names Index (IPNI), items for them are automatically generated, also through QuickStatements. The software performs extensive checking and where there are errors or warnings it is possible to intervene "manually" in various ways.
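As an illustration of the kind of output involved, here is a minimal sketch (not the IFWDFAL code itself) of building one QuickStatements V1 line that puts the author and year qualifiers on a taxon name (P225) claim. The function names and the choice of stated in (P248) as the reference property are assumptions made for this example; V1 lines are tab-separated, string values are double-quoted, and reference properties are written with an S prefix instead of P.

```python
def qs_quote(text):
    """Wrap a string value in double quotes for QuickStatements V1.

    This sketch does not attempt any escaping, so it rejects values
    that contain a double quote or a tab.
    """
    if '"' in text or "\t" in text:
        raise ValueError(f"cannot quote safely: {text!r}")
    return f'"{text}"'


def qs_year(year):
    """Format a year as a time value with precision 9 (year)."""
    return f"+{year:04d}-01-01T00:00:00Z/9"


def taxon_name_command(item, name, authors=(), ex_authors=(),
                       year=None, source_item=None):
    """Build one V1 line: a P225 claim plus its qualifiers and reference.

    authors / ex_authors are lists of author item ids (hypothetical here).
    source_item is the item used with 'stated in' (S248); which item or
    property the real loader used for references is not specified here.
    """
    fields = [item, "P225", qs_quote(name)]
    for author in authors:
        fields += ["P405", author]       # taxon author
    for author in ex_authors:
        fields += ["P697", author]       # ex taxon author
    if year is not None:
        fields += ["P574", qs_year(year)]
    if source_item is not None:
        fields += ["S248", source_item]  # reference: stated in
    return "\t".join(fields)


if __name__ == "__main__":
    # All Q-ids below are placeholders, not real Wikidata items.
    print(taxon_name_command("Q1", "Aus bus", authors=["Q2", "Q3"],
                             year=1900, source_item="Q4"))
```

The resulting text can be pasted into QuickStatements as it stands, which is what keeps the process under human supervision.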

The software has to assume some WD data structure specifying the items, properties and qualifiers to be used; this is based on the Taxonomy Project tutorial. Often one actual species has numerous names (old or new) and a WD item is needed for each fungus name, whether the name is in current use, whether it is a recent synonym which someone has seen fit to include, or whether it is a basionym (corresponding to the part of the author citation in parentheses, if any). All of the authors need WD items too. The software makes the following data model assumptions, which seemed to hold while the exercise was in progress.

  1. Each fungus name must have the instance of (P31) property with a value of taxon (Q16521).
  2. The fungus name item must have property taxon name (P225) with value equal to the taxon name.
  3. The direct author information (if present) will be defined through qualifiers taxon author (P405) and ex taxon author (P697); also year of publication of scientific name for taxon (P574) will be included.
  4. Each name item will be linked to its parent through parent taxon (P171), so each name item must belong to the taxonomic tree. If a new item is created, there must be a link to its parent in the taxonomic ranking, and if the parent is not already in WD it will need to be created, and so on.
  5. The basionym, if applicable, will be linked through property basionym (P566); there should also be a reverse link from the basionym item through subject has role (P2868) set to basionym (Q810198) with qualifier of (P642).
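The assumptions above lend themselves to a mechanical check. The following is a minimal sketch, assuming a deliberately simplified in-memory view of an item (a dict from property id to a list of values); the function name and that representation are invented for this example and are not how the real software reads Wikidata.

```python
TAXON = "Q16521"  # taxon


def check_name_item(claims, expected_name, has_basionym=False):
    """Return a list of problems found against the data model assumptions.

    claims: dict such as {"P31": ["Q16521"], "P225": ["Aus bus"], ...}
    expected_name: the taxon name the item is supposed to represent
    has_basionym: True if the author citation has a parenthesised part
    """
    problems = []
    if TAXON not in claims.get("P31", []):
        problems.append("instance of (P31) is not taxon (Q16521)")
    if expected_name not in claims.get("P225", []):
        problems.append("taxon name (P225) is missing or different")
    if not claims.get("P171"):
        problems.append("no parent taxon (P171)")
    if has_basionym and not claims.get("P566"):
        problems.append("no basionym (P566) link")
    return problems


if __name__ == "__main__":
    # Placeholder item: the P225 claim has been removed.
    item = {"P31": ["Q16521"], "P171": ["Q1"]}
    for problem in check_name_item(item, "Aus bus"):
        print("WARNING:", problem)
```

An item without its P225 claim fails the second check, so software written on these assumptions cannot tell which name the item stands for.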

A comment to the tutorial shows my proposal for how to regenerate the author citation string from the WD data structure in the tutorial.
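To give an idea of what such a regeneration involves, here is a minimal sketch. It is not the proposal from that comment, which is not reproduced on this page. It assumes that the author abbreviations (P428) have already been looked up, that the ex taxon author (P697) values are the names written before "ex", and that authors are joined with commas and a final "&". Sanctioning authors and other special cases are ignored.

```python
def join_authors(abbrevs):
    """Join author abbreviations: 'A', 'A & B', 'A, B & C'."""
    abbrevs = list(abbrevs)
    if len(abbrevs) <= 1:
        return "".join(abbrevs)
    return ", ".join(abbrevs[:-1]) + " & " + abbrevs[-1]


def direct_citation(authors, ex_authors=()):
    """Citation for one name item from its P405 / P697 qualifiers."""
    main = join_authors(authors)
    if ex_authors:
        return f"{join_authors(ex_authors)} ex {main}"
    return main


def full_citation(authors, ex_authors=(), basionym=None):
    """Full citation string for a name.

    basionym: None, or a pair (authors, ex_authors) taken from the item
    reached through basionym (P566); its citation goes in parentheses.
    """
    text = direct_citation(authors, ex_authors)
    if basionym is not None:
        basionym_authors, basionym_ex_authors = basionym
        inner = direct_citation(basionym_authors, basionym_ex_authors)
        return f"({inner}) {text}"
    return text


if __name__ == "__main__":
    # Crepidotus subsphaerosporus, using the string quoted earlier.
    print(full_citation(
        authors=["Hesler", "A.H.Sm."],
        ex_authors=["Kühner", "Romagn."],
        basionym=(["J.E.Lange"], []),
    ))
```

Even in this simplified form the string depends on three kinds of item (the name, its basionym and every author), which is where the complexity discussed on this page comes from.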

This data structure implies the creation of many WD items for obsolete names, which is unfortunate since there is no clear way to distinguish them from current names. Still, in the current situation (with the current agreed data structure) if the citation strings are to be added I think there is no choice but to create those items following the above rules.

The diagram shows the main elements of the IFWDFAL procedure to add author citation information to WD. The main species names (read from the input file) must already have items in WD.

Here is a list of the types of information that were added in the IFWDFAL exercise.

  1. If an author (identified through the IPNI abbreviations and through botanist author abbreviation (P428) in WD) is not present in WD, an attempt is made to add the author item automatically.
  2. In the case of a basionym, if the basionym item (identified through taxon name (P225)) does not already exist in WD, it will be created.
  3. In the case of a replaced synonym (see replaced synonym (for nom. nov.) (P694)), if the replaced synonym item (identified through taxon name (P225)) does not already exist in WD, it will be created. I decided to include replaced synonyms because, although they are not needed for the author citations, in the data model they are similar to basionyms and so I thought they would be easy. I now think that was a mistake, since they cause a lot of problems and need much manual intervention.
  4. If the parent of a new basionym item or of a new parent item (identified through taxon name (P225)) does not already exist in WD, it will be created, at least as a skeleton. This is a recursive process and it may require manual intervention to provide some of the information. A new parent may itself have a basionym, which will be created accordingly.
  5. The direct author and publication date information (the taxon author (P405), ex taxon author (P697) & year of publication of scientific name for taxon (P574) qualifiers) will be included in new taxon items, and in the case of main species items, existing basionym items, and existing replaced synonym items it will be checked and added if necessary. A warning will be given if there is a discrepancy with existing WD data. In each case a reference to Index Fungorum will be added if one does not already exist.
  6. In the case of a basionym, the links (basionym (P566) on the main item & subject has role (P2868)/of (P642) on the basionym item) will be checked if existing and added if appropriate. A reference to Index Fungorum will be added on the P566 claim but not on the P2868 one due to a Wikidata/QuickStatements problem. One item can be the basionym of several other name items, and because of the way QS merges the claims in this case, the correspondence between the references and the "basionym of" values is not preserved. I pursued this on Magnus Manske's talk page (see https://www.wikidata.org/wiki/Topic:Utkuvwspdciq725k), and a solution was proposed, but I think it is too complicated to be worthwhile. The case of a replaced synonym is handled in a similar way to that of a basionym.
  7. When new items are created, the English label is set to the taxon name and the English description is set to a relevant string like "synonym of fungus (basionym)". The latter is the only thing which distinguishes a basionym from a current name. In the case of a new basionym item, the alias is set to the "basionym of" name. For new items only the following properties are added: instance of (P31) (= taxon (Q16521)), taxon rank (P105), parent taxon (P171), Index Fungorum ID (P1391), MycoBank taxon name ID (P962).
  8. On 2019-02-19, editors approved property taxon author citation (P6507), which allows the author citation information to be stored simply as a string. The Index Fungorum Wikidata Fungus Author Loader adds this property for the main item, basionym and replaced synonym items, and newly created taxon items.
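The creation of a new name item (items 2, 7 and 8 above) can be sketched as a block of QuickStatements V1 commands using CREATE and LAST. This is an illustrative sketch only: the function name and parameters are invented, the identifiers passed in are placeholders, and including taxon name (P225) in the same block is an assumption made here because the author qualifiers have to attach to that claim.

```python
TAXON = "Q16521"  # taxon


def quoted(text):
    """Double-quote a string value (no escaping is attempted)."""
    return f'"{text}"'


def new_name_item_commands(name, description, rank_item, parent_item,
                           if_id=None, mycobank_id=None,
                           alias=None, citation=None):
    """Return V1 command lines that create one skeleton name item."""
    rows = [
        ["CREATE"],
        ["LAST", "Len", quoted(name)],         # English label = taxon name
        ["LAST", "Den", quoted(description)],  # English description
    ]
    if alias:
        rows.append(["LAST", "Aen", quoted(alias)])  # "basionym of" name
    rows += [
        ["LAST", "P31", TAXON],
        ["LAST", "P225", quoted(name)],
        ["LAST", "P105", rank_item],
        ["LAST", "P171", parent_item],
    ]
    if if_id:
        rows.append(["LAST", "P1391", quoted(if_id)])      # Index Fungorum ID
    if mycobank_id:
        rows.append(["LAST", "P962", quoted(mycobank_id)])  # MycoBank ID
    if citation:
        rows.append(["LAST", "P6507", quoted(citation)])   # citation string
    return ["\t".join(row) for row in rows]


if __name__ == "__main__":
    # "Q_PARENT" is a placeholder for the parent's item id.
    for line in new_name_item_commands(
            "Agaricus crociphyllus", "synonym of fungus (basionym)",
            rank_item="Q7432", parent_item="Q_PARENT",
            citation="Cooke & Massee"):
        print(line)
```

If the parent does not exist yet, a block like this has to be generated and run for the parent first, so that its new item id is available for the P171 value of the child.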

The loading process is described in more detail here.

I kept a LibreOffice Calc file which records the status of the species which were processed. First I loaded the citation information for the Marasmiaceae and then I started to load it for all the Agaricales in alphabetical order of genus. By 6th May 2019 I had loaded the citation information for 4822 fungus items.

Then user:Brya deliberately deleted the taxon name (P225) claim for Agaricus crociphyllus Cooke & Massee (Q63459690), which I had created. My author citation system assumes that the P225 claim will be present for all taxon names and there is no point continuing with this exercise if these claims may be deleted. So I have stopped loading the author citation data and will not continue until this issue is resolved.

Brya and I discussed the problem on the taxonomy project talk page. Brya wants the taxon name (P225) property not to be used for incorrect taxon names. But this property is really important for any automatic processing of the items (the property instance of (P31) could better be used to indicate that names are not "real"). It is the taxon name (P225) property which identifies what the item is. It is just as important to set it correctly for obsolete or illegitimate names as for current ones; the item is of no use without it, since then it would not be possible systematically to identify what name it refers to. It is not satisfactory to rely on the language label to know the name - that would be contrary to the WD philosophy and would require many new rules to be agreed. It would be necessary to decide which language label(s) would contain the name. Sometimes the language label has to have extra text (the same name may have different meanings) and a specification of how to handle all possible cases would be needed. Neither Brya nor anyone else has proposed a detailed alternative method which could be used - and detailed rigorous rules are absolutely essential. Brya's changes not only break the IFWDFAL system, but would also cause similar problems for any other software using the data.

My main conclusion from this exercise is that the current method of holding the author citation information is too complicated for a project like Wikidata which does not have a central authority to decide the data structure. The objective of Wikidata is to hold knowledge in a form which can be processed automatically, implying that rigorous rules are needed. At present no-one specifies the rules clearly and authoritatively and if there are various editors with their own ideas and assumptions, they are likely to conflict.

The method of property taxon author citation (P6507) (which simply stores this information in a string) is workable in practice and so it is actually a preferable approach.