Wikidata:Tools/Author Disambiguator


The Author Disambiguator is a tool for editing the authors of works recorded in Wikidata. It was developed as part of the WikiCite initiative and is partially coordinated with the Scholia project, which provides visual representations of the scholarly literature based on information found in Wikidata. As of October 2020, Scholia's statistics showed that Wikidata held data on more than 36 million scientific articles, in which authors were represented by plain strings (the author name string (P2093) property) about 19 million times. Creating relationships that use items for authors allows richer analysis and tracking of the relationships among researchers, their works, their institutions, and so on. The purpose of this tool is to help convert these strings into items representing the authors as efficiently and simply as possible.

Main features

Find and group works by (approximate) author name

Form for entering the name of the author to disambiguate

The main field of the form is the author's name; this string is used both to find works by this author and to search for a Wikidata item that corresponds to the author. Enter the name in its usual order (given name followed by family name for Western authors, for example). You can also copy and paste the exact name string from the work's item. The name entered is then parsed into components (separated by spaces or hyphens), which are used to generate other potential forms of the name that may have been used on works. The options selected determine exactly how the name is used for the search:

  • Fuzzy matching: this option applies the most aggressive form of automatic name parsing, looking for initials of the first and middle names, upper-case versions of names, the "family name, initial" format (Smith J), and so on. In most cases, except for very common family names, this is probably the most useful option if you are trying to find the widest possible selection of works to compare. Note that this option only finds strings that match approximately in this sense. For example, a search for "Jim Smith" returns results for "J Smith" and "Smith J", but not for "Jimmy Smith".
  • Wikibase search: by default the service only uses exact string matches to the generated variations on the name. With this option the search is also extended to effectively use the Wikidata search box for the name (in particular it will ignore all accents and case variations). The search term is treated as quoted, so "James Baker" will match "Peter James Baker" and "James Baker-Jarvis", but not "James F. Baker" or "James Kenneth Baker".
Example of the "Specify name strings" box with variations on the provided name.
  • Specify name strings: check this and immediately hit the "Look for author" button, and a text box containing the possible name variants appears, looking something like the example to the right. By default this box shows the name variants generated automatically from the supplied name - you may notice in this case there are versions with and without accents, and with initials for middle names or no middle names at all, as well as the full supplied name. The text box then allows you to remove names from the list or to add variants that were not auto-generated. Enter one name string per line. These then allow for more precise specification of the name strings used for searching works and author items. In the example given here, even with fuzzy matching the auto-generated names did not include the common variation "J. Benlloch", so adding that variation was useful.
  • Additional SPARQL filters: this is mostly useful if you are seeing far too many matching works (more than the 500 limit for example!) or if you otherwise want to filter the works you are matching on. The filters will be applied to the associated works, so any property of a work could be used. The example suggestion uses main subject (P921), but you may also be interested in filtering on author name string (P2093) (a co-author name string), author (P50) (a particular identified co-author), published in (P1433), etc.

Filter potential authors as well?: this applies the SPARQL filter to the works on which each candidate person is already listed as author (P50), so only authors with matching works are listed.
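To illustrate how an additional filter narrows the set of works, here is a minimal Python sketch that assembles a SPARQL query from a list of name strings and an optional filter clause. The query shape, the `?work` and `?name` variable names, the `build_works_query` helper, and the Q-ID in the usage example are all assumptions made for illustration; the query actually issued by the tool may differ.

```python
def build_works_query(name_strings, extra_filter="", limit=500):
    """Build a SPARQL query for works carrying any of the given name strings.

    Hypothetical helper: the variable names (?work, ?name) and the overall
    query shape are assumptions, not the tool's actual query.
    """
    # Escape backslashes and double quotes so each name is a valid literal.
    literals = " ".join(
        '"' + n.replace("\\", "\\\\").replace('"', '\\"') + '"'
        for n in name_strings
    )
    return (
        "SELECT DISTINCT ?work WHERE {\n"
        f"  VALUES ?name {{ {literals} }}\n"
        "  ?work wdt:P2093 ?name .\n"   # author name string (P2093)
        f"  {extra_filter}\n"           # user-supplied filter on ?work
        f"}} LIMIT {limit}"
    )


# Usage: restrict matches to works with a given main subject (P921).
# The Q-ID below is only a placeholder.
query = build_works_query(
    ["J Smith", "Smith J"],
    extra_filter="?work wdt:P921 wd:Q4115189 .",
)
print(query)
```

Because the filter is an ordinary graph pattern on the work, the same slot could hold a pattern on published in (P1433) or on a co-author's author (P50) value instead.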

The search for author items also looks at the object named as (P1932) value often used as a qualifier on author (P50) statements, as well as the labels and aliases on the author items themselves. If you are surprised by an author item shown in the resulting list, it may be because of an unexpected (or erroneous) alias or object named as (P1932) value somewhere.
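The name parsing described for the search options can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the tool's actual algorithm: the `name_variants` and `strip_accents` helpers are hypothetical, and the set of generated forms is only a plausible subset.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. 'José' becomes 'Jose'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_variants(full_name):
    """Return plausible name-string variants for 'First [Middle ...] Last'.

    Hypothetical sketch: the real tool may generate a different set.
    """
    # Components are separated by spaces or hyphens.
    parts = full_name.replace("-", " ").split()
    if len(parts) < 2:
        return [full_name]
    first, middles, last = parts[0], parts[1:-1], parts[-1]
    initials = first[0] + "".join(m[0] for m in middles)

    candidates = [
        full_name,
        f"{first} {last}",                                   # no middle names
        " ".join([first] + [m[0] for m in middles] + [last]),  # middle initials
        f"{first[0]} {last}",                                # "J Smith"
        f"{last} {first[0]}",                                # "Smith J"
        f"{last} {initials}",                                # "Smith JA"
        f"{last.upper()} {first}",                           # upper-case surname
    ]
    candidates += [strip_accents(c) for c in candidates]

    # De-duplicate while keeping the original order.
    seen, result = set(), []
    for c in candidates:
        if c not in seen:
            seen.add(c)
            result.append(c)
    return result


for variant in name_variants("Jim Smith"):
    print(variant)
```

A list like this is what the "Specify name strings" box lets you prune or extend by hand before the search runs.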

Once works have been found to match the author name string search, a clustering algorithm is used to display them in groups. The groupings are based on several criteria, including the names or identifiers for co-authors, any listed topics, or journal of publication. An alternative clustering algorithm based strictly on the name string format of the given author and the preceding (if any) and succeeding (if any) author names or name strings is also available via a link at the top of the groups. The groups are roughly ordered by size, with the larger groups first, and within groups the works are ordered by (descending) publication date, if any. Works with no publication date found in Wikidata are listed at the end of each group. All works that could not be clustered with any other are placed in a group called "Misc" at the bottom, which is otherwise similarly ordered. The clustering is intended to group works by different authors into different groups, so it should usually be reasonable to select all the works in a given group (except for the "Misc" one) to match to the associated author item.
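The grouping and ordering rules above can be sketched in Python as follows. This is only an illustration under strong assumptions: works are merged whenever they share a single feature (a co-author, topic, or journal), which is probably more aggressive than the tool's real criteria, and the `Work` class, `cluster_works` function, and group labels are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Work:
    qid: str
    date: Optional[str] = None   # ISO date string, None if unknown
    features: set = field(default_factory=set)  # co-authors, topics, journal


def cluster_works(works):
    """Group works that share any feature; return [(label, works), ...]."""
    parent = {w.qid: w.qid for w in works}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_with = {}  # feature -> first work seen with that feature
    for w in works:
        for feat in w.features:
            if feat in first_with:
                parent[find(w.qid)] = find(first_with[feat])
            else:
                first_with[feat] = w.qid

    groups = defaultdict(list)
    for w in works:
        groups[find(w.qid)].append(w)

    def by_date(group):
        # Newest first; works without a date go to the end.
        dated = sorted((w for w in group if w.date),
                       key=lambda w: w.date, reverse=True)
        return dated + [w for w in group if not w.date]

    # Larger groups first; unclustered works are collected in "Misc".
    clusters = sorted((g for g in groups.values() if len(g) > 1),
                      key=len, reverse=True)
    misc = [w for g in groups.values() if len(g) == 1 for w in g]
    result = [(f"Group {i}", by_date(g)) for i, g in enumerate(clusters, 1)]
    if misc:
        result.append(("Misc", by_date(misc)))
    return result
```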

Start of the "Potential Publications" list, with the first grouping of works.

For each work the title is displayed, linked to the work page within the tool. The author list follows, with already matched author items shown in green (linked to their author page within the tool) and unmatched authors in blue (linked to the associated name search page). The author name that matches the search criteria is shown in black, with a checkbox to tick if that author name string should be replaced with the selected author item. Other links in the table go either to the associated Wikidata item or to the external website (for DOI or other identifiers). Publications and topics (and for author items, institutions) also link out to the Scholia "missing" page associated with them, which provides a list of associated but still-unmatched author name strings.

If the clustering criteria (co-authors, publications, topics) match one of the author items found, the right-most column of the table shows the matching author (or authors if there are more than one that match), also linked to its author page within this tool.

Matching work for an author, showing the author name string amidst the author list, expected match on the right side.

Note that if there are a large number of authors on a work, the author list is abbreviated to only show the first ten, and then up to five surrounding the matched author name string. If more than one author name string matches, all matching authors will be shown with their associated checkboxes, so the correct one can be selected.
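The abbreviation rule can be sketched as below. The text does not say exactly how the "up to five surrounding" authors are chosen, so this sketch assumes a window of five entries centred on each match (the match plus two on each side); the `abbreviate_authors` function and the ellipsis marker are hypothetical.

```python
def abbreviate_authors(authors, matched, head=10, window=5):
    """Keep the first `head` authors plus a window around each match.

    authors: list of display names in series order.
    matched: set of 0-based positions whose name string matched the search.
    Assumption: `window` counts the match itself plus neighbours on each side.
    """
    keep = set(range(min(head, len(authors))))
    half = window // 2
    for m in matched:
        keep.update(range(max(0, m - half), min(len(authors), m + half + 1)))

    out, prev = [], -1
    for i in sorted(keep):
        if i != prev + 1:
            out.append("…")      # mark a gap of omitted authors
        out.append(authors[i])
        prev = i
    if prev != len(authors) - 1:
        out.append("…")          # authors omitted at the end
    return out


# Usage: 30 authors, with the 20th one matching the search.
names = [f"A{i}" for i in range(1, 31)]
print(abbreviate_authors(names, {19}))
```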

Below the groups of works is the list of potentially matching authors. Only one author may be selected: either one of those listed, or an unlisted author item entered through the "Other Q number for this author" option. There is also a form for creating a new author item within Wikidata if necessary.

Potential authors listing, with button to start linking process

Clicking the "Link selected works to author" button starts a batch process that, for each listed work, replaces the selected author name string (P2093) statement with an author (P50) statement for the chosen author item, keeping the same qualifiers and references and adding an object named as (P1932) qualifier that holds the original name string value.
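The replacement performed for each work can be pictured as a transformation of one statement in the Wikibase JSON format. The sketch below only builds the new statement as a Python dictionary and does not call the Wikidata API; the `name_string_to_author_claim` function is hypothetical, and the tool's own implementation may handle details such as statement IDs differently.

```python
import copy


def name_string_to_author_claim(claim, author_qid):
    """Build an author (P50) statement from an author name string (P2093) one.

    Qualifiers and references are copied over, and the original string is
    preserved in an object named as (P1932) qualifier.
    """
    if claim["mainsnak"]["property"] != "P2093":
        raise ValueError("expected an author name string (P2093) statement")
    name = claim["mainsnak"]["datavalue"]["value"]

    new_claim = {
        "type": "statement",
        "rank": claim.get("rank", "normal"),
        "mainsnak": {
            "snaktype": "value",
            "property": "P50",
            "datavalue": {
                "type": "wikibase-entityid",
                "value": {
                    "entity-type": "item",
                    "id": author_qid,
                    "numeric-id": int(author_qid[1:]),
                },
            },
        },
        # Deep copies so the original statement is left untouched.
        "qualifiers": copy.deepcopy(claim.get("qualifiers", {})),
        "references": copy.deepcopy(claim.get("references", [])),
    }
    new_claim["qualifiers"].setdefault("P1932", []).append({
        "snaktype": "value",
        "property": "P1932",
        "datavalue": {"type": "string", "value": name},
    })
    return new_claim
```

In an actual edit the old P2093 statement would be removed in the same step, so that the series ordinal (P1545) is not left with two entries.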

Find works by author item

Form field for the Wikidata item of an author

This page (found from the "Authors" link in the top right navigation bar, or via author item links on other pages in the tool) shows all works having a given author (P50) value. Similar to the name search page, an additional SPARQL filter can be used to limit the resulting works list based on topic, publication venue, coauthors, etc. The resulting list of works is again ordered chronologically in reverse by publication date, with the same links shown as works listed in the name search page. If some works have been assigned to the wrong author item they can be moved to the correct one via the form at the bottom of the works list, where the Wikidata ID of the correct author item may be entered.

The "Find duplicates to merge" checkbox searches for works linked to this author that have more than one author or author name string associated with the same series ordinal (P1545) value. This is often due to duplication, or to neglecting to remove the author name string (P2093) value when an author (P50) value was added. If the names match (based on name parsing criteria similar to those used for the main author name matching), a checkbox is shown next to the work, allowing those values to be merged (that is, the author name string (P2093) and duplicate author (P50) values are removed, qualifiers and references are merged, and so on). Cases where the names do not match show a "mismatch" indicator; these should probably be examined individually to address the problem.
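The duplicate check can be sketched as grouping a work's author entries by series ordinal and comparing the names within each group. The comparison here is deliberately crude (same last word and same first initial) and stands in for the tool's real name parsing; the `find_ordinal_conflicts` function and its input format are hypothetical.

```python
from collections import defaultdict


def _name_key(name):
    """Crude comparison key: (last word, first initial), case-insensitive."""
    parts = name.replace(".", " ").lower().split()
    return (parts[-1], parts[0][0]) if parts else ("", "")


def find_ordinal_conflicts(entries):
    """Report ordinals that carry more than one author entry.

    entries: list of (ordinal, prop, name) tuples, where prop is "P50" or
    "P2093". Returns {ordinal: "mergeable" | "mismatch"}.
    """
    by_ordinal = defaultdict(list)
    for ordinal, prop, name in entries:
        if ordinal is not None:
            by_ordinal[ordinal].append((prop, name))

    report = {}
    for ordinal, group in by_ordinal.items():
        if len(group) < 2:
            continue  # a single entry per ordinal is the normal case
        keys = {_name_key(name) for _, name in group}
        report[ordinal] = "mergeable" if len(keys) == 1 else "mismatch"
    return report


entries = [
    ("1", "P50", "Jim Smith"),
    ("1", "P2093", "J. Smith"),   # same person, leftover name string
    ("2", "P50", "Ann Lee"),
    ("2", "P2093", "Bo Chen"),    # different names on one ordinal
    ("3", "P2093", "C Diaz"),
]
print(find_ordinal_conflicts(entries))
```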

View and edit the authors of a work

Form field for the Wikidata item of a work

This page is reached via the Works link in the top-right navigation bar, or from a link on one of the other pages. Depending on the checkboxes selected, the page has several different modes for viewing or editing the author list for a work. In all modes the main table shows the authors, listed sequentially based on their series ordinal (P1545) value. Authors with no series ordinal (P1545) are listed at the bottom. As for the name search page, author entries which are just strings (author name string (P2093)) are shown in blue, linked to the associated name search page, and author items (author (P50)) are shown in green, linked to the associated author page in this tool.
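The ordering of the main table can be sketched as a sort on the series ordinal, with un-numbered entries moved to the end. Series ordinal (P1545) values are stored as strings, so the sketch converts them to integers to keep "10" after "2". Treating non-numeric ordinals as un-numbered is an assumption, and the `sort_by_ordinal` function and entry format are hypothetical.

```python
def sort_by_ordinal(entries):
    """Sort author entries by numeric series ordinal; un-numbered ones last.

    entries: list of dicts with a "name" and an optional "ordinal" string.
    """
    def key(entry):
        ordinal = entry.get("ordinal")
        if ordinal is None or not str(ordinal).isdigit():
            return (1, 0)            # no usable ordinal: goes to the bottom
        return (0, int(ordinal))     # numeric, not alphabetical, order
    # sorted() is stable, so un-numbered entries keep their original order.
    return sorted(entries, key=key)


authors = [
    {"name": "C", "ordinal": "10"},
    {"name": "X"},
    {"name": "B", "ordinal": "2"},
    {"name": "A", "ordinal": "1"},
]
print([a["name"] for a in sort_by_ordinal(authors)])
```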

In default mode (no checkboxes selected in the top form), the work item page allows removal of un-numbered authors, or merging multiple author/author name string values associated with the same number. If none of these changes are possible, no action button is displayed at the bottom of the page.

In "renumber" mode (check "Renumber authors?") the series ordinal values for any of the author names or items can be modified. This works only up to a maximum of 5000 authors on a given work. Note that in this and in other modes for a work item, when the edit is made it is done in a single edit to the Wikidata item - this reduces the load on associated updates on the query service. Authors with no change in series ordinal value will not be affected by such an edit.

In "match" mode (check "Suggest matches?") a list of potential matching author items is used to try to find items to replace as many as possible of the author name string values remaining. By default this list comes from all items that are coauthors (on other works) of author items already identified on this work. However, other lists of authors may be used for matching by selecting a different choice from the "Author List" drop-down - see the "managing lists of author items" section below. Selecting the 'Use "stated as" names' checkbox uses the full matching algorithm with object named as (P1932) values from other works by that author, making it more likely an author item will match one of the author name strings on the work; however for authors with many works this query will take additional time, so could be avoided if not necessary.

Managing lists of author items to use for matching

This feature is still in development. The page is reached through the "Lists" link in the top-right navigation bar. It allows creation and management of lists of Wikidata author items - a large collaboration, other coauthors, or just a limited topical selection list. The lists can be selected on the work-item page for the purpose of matching authors.

Ordering in these author lists doesn't currently matter; authors are displayed in the order they were added. Authors can be added individually or as all identified authors on a given work or works. Author lists can be compared with one another, and also with the authors on a particular work item, to identify common and differing elements.

Monitoring, stopping, or restarting batches of edits

Edits to work items made with the Author Disambiguator tool are all done in a background batch mode. Each batch consists of one or more edits associated with your activities on a given author or work item. All your batches can be found through the "Batches" link in the menu bar. Batches are listed in reverse chronological order (based on last modified date, not creation date). Each batch is also associated with an "edit group", which can be reviewed with the Edit Groups tool.

For each user (identified through OAuth) only one batch is allowed to run at a time, and within that batch only one edit can be done at a time - that edit is shown as in "Running" state. Other edits that are waiting show as "Ready". A successfully completed edit shows as "Done". If there was any problem completing an edit it will indicate an "Error" state, with an associated message visible on the page for that particular batch. This should be a useful message indicating what the problem was, for example "duplicate ordinal '129'" indicates that two or more distinct author items were matched to the author name at series ordinal 129. If the error message indicates a temporary problem (for example a "failed to save" message from the Wikidata API) then the "Reset errors" link can be used either on the individual batch or batch listing page, and the batch can then be restarted to retry that particular edit. Batches can also be stopped and restarted from the listing page.
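The edit states described above behave like a small state machine. The following Python sketch models it under stated assumptions: the `EditState` enum, `run_batch`, and `reset_errors` are hypothetical names, edits are processed strictly one at a time, and the `apply_edit` callback stands in for the call to the Wikidata API.

```python
from enum import Enum


class EditState(Enum):
    READY = "Ready"
    RUNNING = "Running"
    DONE = "Done"
    ERROR = "Error"


def run_batch(edits, apply_edit):
    """Process the Ready edits of a batch one at a time.

    edits: list of dicts with "state" and "message" keys.
    apply_edit: callable that performs one edit and raises on failure.
    """
    for edit in edits:
        if edit["state"] is not EditState.READY:
            continue                      # skip Done and Error edits
        edit["state"] = EditState.RUNNING
        try:
            apply_edit(edit)
            edit["state"] = EditState.DONE
        except Exception as exc:
            edit["state"] = EditState.ERROR
            edit["message"] = str(exc)    # shown on the batch page


def reset_errors(edits):
    """Put failed edits back to Ready so a restarted batch retries them."""
    for edit in edits:
        if edit["state"] is EditState.ERROR:
            edit["state"] = EditState.READY
            edit["message"] = ""
```

Resetting is only worthwhile for temporary failures; an error such as a duplicate ordinal would simply recur until the underlying data is fixed.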

Note that there may be times when the Wikidata servers are busy and a particular edit may appear to be in "Running" state for a long time (an hour or more). Check the dispatch lag/maxlag statistics on Grafana to verify that this is what is happening. If that doesn't appear to be the problem, try stopping and restarting the batch.

Deleting completed (or erroneous) batches is recommended; this has no effect on the "Edit Groups" functionality or on any of the completed edits, and leaves the database a little cleaner.

Source code, change requests, etc.

The Author Disambiguator tool is hosted on Toolforge, and its code is maintained in a GitHub repository. Please use the GitHub issue tracker to suggest changes or submit other kinds of requests.