Author Name Disambiguation (AND) is the task
of clustering unique author names from publication records
in scholarly or related databases. Although AND has been
extensively studied and has served as an important preprocessing
step for several tasks (e.g. calculating bibliometrics and scientometrics
for authors), there are few publicly available tools for
disambiguation in large-scale scholarly databases. Furthermore,
most of the disambiguated data is embedded within the search
engines of the scholarly databases, and existing application
programming interfaces (APIs) have limited features and are
often unavailable for users for various reasons. This makes it
difficult for researchers and developers to use the data for various
applications (e.g. author search) or research. Here, we design
a novel, web-based, RESTful API for searching disambiguated
authors, using the PubMed database as a sample application. We
offer two types of queries, attribute-based queries and record-based
queries, which serve different purposes. Attribute-based
queries retrieve authors with the attributes available in the
database. We study different search engines to find the most
appropriate one for processing attribute-based queries. Record-based
queries retrieve authors that are most likely to have
written a query publication provided by a user. To accelerate
record-based queries, we develop a novel algorithm that has a
fast record-to-cluster match. We show that our algorithm can
accelerate the query by a factor of 4.01 compared to a baseline
naive approach.
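The abstract does not describe the record-to-cluster algorithm itself, so the following is only a minimal sketch of the general idea under stated assumptions: a naive baseline scores the query publication against every record of every author cluster, whereas a record-to-cluster match scores it once per cluster against a precomputed summary. The `Cluster` class, the token-set representation of a record, and both scoring functions are hypothetical illustrations, not the authors' method.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One disambiguated author: an ID plus the records assigned to it.

    Each record is modeled as a set of tokens (coauthor names, keywords);
    this representation is an assumption made for the sketch.
    """
    author_id: str
    records: list[set[str]]
    profile: Counter = field(init=False)  # token counts, built once per cluster

    def __post_init__(self) -> None:
        self.profile = Counter(tok for rec in self.records for tok in rec)


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_match(query: set[str], clusters: list[Cluster]) -> list[tuple[float, str]]:
    """Baseline: compare the query with every record of every cluster."""
    scored = [(max(jaccard(query, rec) for rec in c.records), c.author_id)
              for c in clusters]
    return sorted(scored, reverse=True)


def profile_match(query: set[str], clusters: list[Cluster]) -> list[tuple[float, str]]:
    """Record-to-cluster: one comparison per cluster against its profile."""
    scored = []
    for c in clusters:
        total = sum(c.profile.values())
        overlap = sum(c.profile[tok] for tok in query)  # missing tokens count 0
        scored.append((overlap / total if total else 0.0, c.author_id))
    return sorted(scored, reverse=True)


# Hypothetical data: two authors who share the surname token "kim".
clusters = [
    Cluster("A1", [{"kim", "lee", "genomics"}, {"kim", "park", "genomics"}]),
    Cluster("A2", [{"kim", "smith", "optics"}]),
]
query = {"kim", "lee", "genomics", "rna"}

best_naive = naive_match(query, clusters)[0][1]
best_profile = profile_match(query, clusters)[0][1]
```

In this sketch the naive cost grows with the total number of records, while the profile-based cost grows with the number of clusters. Whether that is the source of the reported speed-up is not stated in the abstract.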
A Multi-match Approach to the Author Uncertainty Problem
Abstract Purpose The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed on large databases. Variations in name spelling of individual scholars further complicate matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors sharing the same surname and first initial of their first name. We illustrate our approach using three case studies. Design/methodology/approach The approach we advance in this study is based on commonalities among fielded data in search results. We cast a broad initial net, i.e., a Web of Science (WOS) search for a given author’s last name, followed by a comma, followed by the first initial of his or her first name (e.g., a search for ‘John Doe’ would assume the form: ‘Doe, J’). Results for this search typically contain all of the scholarship legitimately belonging to this author in the given database (i.e., all of his or her true positives), along with a large amount of noise, or scholarship not belonging to this author (i.e., a large number of false positives). From this corpus we proceed to iteratively weed out false positives and retain true positives.
Author identifiers provide a good starting point: e.g., if ‘Doe, J’ and ‘Doe, John’ share the same author identifier, this would be sufficient for us to conclude these are one and the same individual. We find email addresses similarly adequate: e.g., if two author names which share the same surname and same first initial have an email address in common, we conclude these authors are the same person. Author identifier and email address data are not always available, however. When this occurs, other fields are used to address the author uncertainty problem. Commonalities among author data other than unique identifiers and email addresses are less conclusive for name consolidation purposes. For example, if ‘Doe, John’ and ‘Doe, J’ have an affiliation in common, do we conclude that these names belong to the same person? They may or may not; affiliations have employed two or more faculty members sharing the same last name and first initial. Similarly, it’s conceivable that two individuals with the same last name and first initial publish in the same journal, publish with the same co-authors, and/or cite the same references. Should we then ignore commonalities among these fields and conclude they’re too imprecise for name consolidation purposes? It is our position that such commonalities are indeed valuable for addressing the author uncertainty problem, but more so when used in combination. Our approach makes use of automation as well as manual inspection, relying initially on author identifiers, then commonalities among fielded data other than author identifiers, and finally manual verification. To achieve name consolidation independent of author identifier matches, we have developed a procedure that is used with bibliometric software called VantagePoint (see www.thevantagepoint.com). While the application of our technique does not exclusively depend on VantagePoint, it is the software we find most efficient in this study.
The script we developed implements our name disambiguation procedure in a way that significantly reduces manual effort on the user’s part. Those who seek to replicate our procedure independent of VantagePoint can do so by manually following the method we outline, but we note that the manual application of our procedure takes a significant amount of time and effort, especially when working with larger datasets. Our script begins by prompting the user for a surname and a first initial (for any author of interest). It then prompts the user to select a WOS field on which to consolidate author names. After this the user is prompted to point to the name of the authors field, and finally asked to identify a specific author name (referred to by the script as the primary author) within this field whom the user knows to be a true positive (a suggested approach is to point to an author name associated with one of the records that has the author’s ORCID iD or email address attached to it). The script proceeds to identify and combine all author names sharing the primary author’s surname and first initial of his or her first name who share commonalities in the WOS field on which the user was prompted to consolidate author names. This typically results in significant reduction in the initial dataset size. After the procedure completes the user is usually left with a much smaller (and more manageable) dataset to manually inspect (and/or apply additional name disambiguation techniques to). Research limitations Match field coverage can be an issue. When field coverage is paltry, dataset reduction is not as significant, which results in more manual inspection on the user’s part. Our procedure doesn’t lend itself to scholars who have had a legal family name change (after marriage, for example).
Moreover, the technique we advance is (sometimes, but not always) likely to have a difficult time dealing with scholars who have changed careers or fields dramatically, as well as scholars whose work is highly interdisciplinary. Practical implications The procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research, especially when the name under consideration is a more common family name. It is more effective when match field coverage is high and a number of match fields exist. Originality/value Once again, the procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research. It combines preexisting with more recent approaches, harnessing the benefits of both. Findings Our study applies the name disambiguation procedure we advance to three case studies. Ideal match fields are not the same for each of our case studies. We find that match field effectiveness is in large part a function of field coverage. The original dataset size and the timeframe analyzed are not the same for each case study, nor are the subject areas in which the authors publish. Our procedure is more effective when applied to our third case study, both in terms of list reduction and 100% retention of true positives. We attribute this to excellent match field coverage, especially in more specific match fields, as well as to a more modest/manageable number of publications. While machine learning is considered authoritative by many, we do not see it as practical or replicable. The procedure advanced herein is practical, replicable, and relatively user friendly. It might be categorized into a space between ORCID and machine learning. Machine learning approaches typically look for commonalities among citation data, which is not always available, structured or easy to work with.
The procedure we advance is intended to be applied across numerous fields in a dataset of interest (e.g. emails, coauthors, affiliations, etc.), resulting in multiple rounds of reduction. Results indicate that effective match fields include author identifiers, emails, source titles, co-authors and ISSNs. While the script we present is not likely to result in a dataset consisting solely of true positives (at least for more common surnames), it does significantly reduce manual effort on the user’s part. Dataset reduction (after our procedure is applied) is in large part a function of (a) field availability and (b) field coverage.
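The consolidation procedure described in this abstract can be sketched as follows. The authors' VantagePoint script is not reproduced in the abstract, so this is only an illustration under stated assumptions: the record layout (a dict with an `author` string and list-valued fields such as `email` and `coauthors`), the function names, and the choice to run one round per match field are all hypothetical.

```python
from collections import defaultdict


def same_surname_initial(name: str, surname: str, initial: str) -> bool:
    """True for 'Doe, J' or 'Doe, John' given surname='Doe', initial='J'."""
    last, _, first = name.partition(",")
    return (last.strip().lower() == surname.lower()
            and first.strip().lower().startswith(initial.lower()))


def consolidate(records, surname, initial, primary, match_fields):
    """Return the author names merged with `primary`.

    One round is run per match field: any candidate name that shares at
    least one value in that field with the names merged so far is added.
    """
    # Broad initial net: every name with the same surname and first initial.
    candidates = [r for r in records
                  if same_surname_initial(r["author"], surname, initial)]
    merged = {primary}
    for field_name in match_fields:
        # Collect all values of this field for each name variant.
        values_by_name = defaultdict(set)
        for r in candidates:
            values_by_name[r["author"]].update(r.get(field_name, ()))
        # Values already attributed to the primary author.
        known = set().union(*(values_by_name[n] for n in merged))
        for name, values in values_by_name.items():
            if values & known:
                merged.add(name)
    return merged


# Hypothetical records; all names and addresses are invented.
records = [
    {"author": "Doe, John", "email": ["jdoe@example.edu"], "coauthors": ["Roe, R"]},
    {"author": "Doe, J", "email": ["jdoe@example.edu"], "coauthors": ["Poe, E"]},
    {"author": "Doe, J.A.", "email": [], "coauthors": ["Poe, E"]},
    {"author": "Doe, Jane", "email": ["jane@example.org"], "coauthors": ["Lee, K"]},
    {"author": "Dow, J", "email": ["jdoe@example.edu"], "coauthors": []},
]

merged_names = consolidate(records, "Doe", "J", "Doe, John", ["email", "coauthors"])
```

Here the email round merges ‘Doe, J’ with the primary author, and the co-author round then pulls in ‘Doe, J.A.’ through the co-author shared with ‘Doe, J’. ‘Doe, Jane’ stays unmerged and would be left for manual inspection, consistent with the abstract's point that the procedure reduces rather than eliminates manual verification.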
- Award ID(s):
- 1645237
- NSF-PAR ID:
- 10174065
- Date Published:
- Journal Name:
- Journal of Data and Information Science
- Volume:
- 4
- Issue:
- 2
- ISSN:
- 2543-683X
- Page Range / eLocation ID:
- 1 to 18
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
“What is crucial for your ability to communicate with me… pivots on the recipient’s capacity to interpret—to make good inferential sense of the meanings that the declarer is able to send” (Rescher 2000, p148). Conventional approaches to reconciling taxonomic information in biodiversity databases have been based on string matching for unique taxonomic name combinations (Kindt 2020, Norman et al. 2020). However, in their original context, these names pertain to specific usages or taxonomic concepts, which can subsequently vary for the same name as applied by different authors. Name-based synonym matching is a helpful first step (Guala 2016, Correia et al. 2018), but may still leave considerable ambiguity regarding proper usage (Fig. 1). Therefore, developing "taxonomic intelligence" is the bioinformatic challenge to adequately represent, and subsequently propagate, this complex name/usage interaction across trusted biodiversity data networks. How do we ensure that senders and recipients of biodiversity data not only can share messages but do so with “good inferential sense” of their respective meanings? Key obstacles have involved dealing with the complexity of taxonomic name/usage modifications through time, both in terms of accounting for and digitally representing the long histories of taxonomic change in most lineages. An important critique of proposals to use name-to-usage relationships for data aggregation has been the difficulty of scaling them up to reach comprehensive coverage, in contrast to name-based global taxonomic hierarchies (Bisby 2011). The Linnaean system of nomenclature has some unfortunate design limitations in this regard, in that taxonomic names are not unique identifiers, their meanings may change over time, and the names as a string of characters do not encode their proper usage, i.e., the name “Genus species” does not specify a source defining how to use the name correctly (Remsen 2016, Sterner and Franz 2017).
In practice, many people provide taxonomic names in their datasets or publications but not a source specifying a usage. The information needed to map the relationships between names and usages in taxonomic monographs or revisions is typically not presented in a machine-readable format. New approaches are making progress on these obstacles. Theoretical advances in the representation of taxonomic intelligence have made it increasingly possible to implement efficient querying and reasoning methods on name-usage relationships (Chen et al. 2014, Chawuthai et al. 2016, Franz et al. 2015). Perhaps most importantly, growing efforts to produce name-usage mappings on a medium scale by data providers and taxonomic authorities suggest an all-or-nothing approach is not required. Multiple high-profile biodiversity databases have implemented internal tools for explicitly tracking conflicting or dynamic taxonomic classifications, including eBird using concept relationships from AviBase (Lepage et al. 2014); NatureServe in its Biotics database; iNaturalist using its taxon framework (Loarie 2020); and the UNITE database for fungi (Nilsson et al. 2019). Other ongoing projects incorporating taxonomic intelligence include the Flora of Alaska (Flora of Alaska 2020), the Mammal Diversity Database (Mammal Diversity Database 2020) and PollardBase for butterfly population monitoring (Campbell et al. 2020).
-
Chinese author names are known to be more difficult to disambiguate than other ethnic names because they tend to share surnames and forenames, thus creating many homonyms. In this study, we demonstrate how using Chinese characters can affect machine learning for author name disambiguation. For analysis, 15K author names recorded in Chinese are transliterated into English and simplified by initialising their forenames to create counterfactual scenarios, reflecting real-world indexing practices in which Chinese characters are usually unavailable. The results show that Chinese author names that are highly ambiguous in English or with initialised forenames tend to become less confusing if their Chinese characters are included in the processing. Our findings indicate that recording Chinese author names in native script can help researchers and digital libraries enhance authority control of Chinese author names that continue to increase in size in bibliographic data.
-
Abstract How can we evaluate the performance of a disambiguation method implemented on big bibliographic data? This study suggests that the open researcher profile system, ORCID, can be used as an authority source to label name instances at scale. This study demonstrates the potential by evaluating the disambiguation performance of Author-ity2009 (which algorithmically disambiguates author names in MEDLINE) using 3 million name instances that are automatically labeled through linkage to 5 million ORCID researcher profiles. Results show that although ORCID-linked labeled data do not effectively represent the population of name instances in Author-ity2009, they do effectively capture the ‘high precision over high recall’ performance of Author-ity2009. In addition, ORCID-linked labeled data can provide nuanced details about Author-ity2009’s performance when name instances are evaluated within and across ethnicity categories. As ORCID continues to be expanded to include more researchers, labeled data via ORCID-linkage can be improved in representing the full population of disambiguated data and updated on a regular basis. This can benefit author name disambiguation researchers and practitioners who need large-scale labeled data but lack resources for manual labeling or access to other authority sources for linkage-based labeling. The ORCID-linked labeled data for Author-ity2009 are publicly available for validation and reuse.
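The abstract above does not say which metric underlies ‘high precision over high recall’. As an illustration only, the sketch below uses pairwise precision and recall, a common choice for evaluating clustering-style disambiguation against labeled data. It assumes that the predicted clusters and the gold labels (for example, ORCID iDs) are each given as a mapping from name instance to identifier; the function names and sample data are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(assignments: dict[str, str]) -> set[frozenset[str]]:
    """All unordered pairs of name instances assigned to the same cluster."""
    by_cluster: dict[str, list[str]] = {}
    for instance, cluster in assignments.items():
        by_cluster.setdefault(cluster, []).append(instance)
    return {frozenset(pair)
            for members in by_cluster.values()
            for pair in combinations(members, 2)}


def pairwise_precision_recall(predicted: dict[str, str],
                              gold: dict[str, str]) -> tuple[float, float]:
    """Pairwise precision and recall of predicted clusters against gold labels."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    true_positives = len(pred_pairs & gold_pairs)
    # With no pairs to judge, the score is defined as 1.0 by convention here.
    precision = true_positives / len(pred_pairs) if pred_pairs else 1.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall


# Hypothetical example: the method splits one true author into two clusters.
gold = {"i1": "orcidA", "i2": "orcidA", "i3": "orcidA", "i4": "orcidB"}
predicted = {"i1": "c1", "i2": "c1", "i3": "c2", "i4": "c3"}

precision, recall = pairwise_precision_recall(predicted, gold)
```

In this toy case every predicted pair is correct, but two of the three true pairs are missed, which is the pattern a method favoring precision over recall would show.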
-
Romanach, Stephanie S. (Ed.) Massive biological databases of species occurrences, or georeferenced locations where a species has been observed, are essential inputs for modeling present and future species distributions. Location accuracy is often assessed by determining whether the observation geocoordinates fall within the boundaries of the declared political divisions. This otherwise simple validation is complicated by the difficulty of matching political division names to the correct geospatial object. Spelling errors, abbreviations, alternative codes, and synonyms in multiple languages present daunting name disambiguation challenges. The inability to resolve political division names reduces usable data, and analysis of erroneous observations can lead to flawed results. Here, we present the Geographic Name Resolution Service (GNRS), an application for correcting, standardizing, and indexing world political division names. The GNRS resolves political division names against a reference database that combines names and codes from GeoNames with geospatial object identifiers from the Global Administrative Areas Database (GADM). In a trial resolution of political division names extracted from >270 million species occurrences, only 1.9%, representing just 6% of occurrences, matched exactly to GADM political divisions in their original form. The GNRS was able to resolve, completely or in part, 92% of the remaining 378,568 political division names, or 86% of the full biodiversity occurrence dataset. In assessing geocoordinate accuracy for >239 million species occurrences, resolution of political divisions by the GNRS enabled the detection of an order of magnitude more errors and an order of magnitude more error-free occurrences. By providing a novel solution to a significant data quality impediment, the GNRS liberates a tremendous amount of biodiversity data for quantitative biodiversity research.
The GNRS runs as a web service and is accessible via an API, an R package, and a web-based graphical user interface. Its modular architecture is easily integrated into existing data validation workflows.