Title: Building a community-based taxonomic resource for digitization of parasites and their hosts
Abstract: Classification of the biological diversity on Earth is foundational to all areas of research within the natural sciences. Reliable biological nomenclatural and taxonomic systems facilitate efficient access to information about organisms and their names over time. However, broadly sharing, accessing, delivering, and updating these resources remains a persistent problem. This barrier has been acknowledged by the biodiversity data sharing community, yet concrete efforts to standardize and continually update taxonomic names in a sustainable way remain limited. High diversity groups such as arthropods are especially challenging as available specimen data per number of species is substantially lower than for vertebrate or plant groups. The Terrestrial Parasite Tracker Thematic Collections Network project developed a workflow for gathering expert-verified taxonomic names across all available sources, aligning those sources, and publishing a single resource that provides a model for future endeavors to standardize digital specimen identification data. The process involved gathering expert-verified nomenclature lists representing the full taxonomic scope of terrestrial arthropod parasites, documenting issues experienced, and finding potential solutions for reconciliation of taxonomic resources against large data publishers. Although discordance between our expert resources and the Global Biodiversity Information Facility is relatively low, the impact across all taxa affects thousands of names that correspond to hundreds of thousands of specimen records. Here, we demonstrate a mechanism for the delivery and continued maintenance of these taxonomic resources, while highlighting the current state of taxon name curation for biodiversity data sharing.
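The reconciliation step described in the abstract can be illustrated with a minimal sketch. It is not the project's actual workflow: the function names, the backbone structure (a dict of normalized name to status and accepted name), and all taxon names are hypothetical placeholders chosen only to show how an expert-verified list might be compared against an aggregator's backbone and how discordant names could be counted.

```python
from collections import Counter


def normalize(name: str) -> str:
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join(name.split()).lower()


def compare_to_backbone(expert_names, backbone):
    """Classify each expert-verified name by how the backbone treats it.

    expert_names: iterable of names the experts consider valid.
    backbone: dict mapping a normalized name to (status, accepted_name).
    Both structures are assumptions made for this sketch.
    """
    report = {}
    for name in expert_names:
        entry = backbone.get(normalize(name))
        if entry is None:
            report[name] = "missing"      # backbone does not know the name
        elif entry[0] == "accepted":
            report[name] = "concordant"   # both sources treat it as valid
        else:
            report[name] = "discordant"   # backbone treats it as something else
    return report


# Hypothetical placeholder names, not real taxa.
expert_list = ["Genus alpha", "Genus  beta", "Genus gamma"]
backbone = {
    "genus alpha": ("accepted", "Genus alpha"),
    "genus beta": ("synonym", "Genus alpha"),
}

report = compare_to_backbone(expert_list, backbone)
summary = Counter(report.values())
print(summary)
```

Keeping the per-name report alongside the summary counts matters here because, as the abstract notes, a small share of discordant names can still correspond to a large number of specimen records.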
Award ID(s):
1902031 2328119 1901923
PAR ID:
10475129
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Insect Systematics and Diversity
Volume:
7
Issue:
6
ISSN:
2399-3421
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Taxonomy is foundational to all biological sciences. Names allow us to organize and communicate information about biological groups. This process is critical for understanding and preserving the biodiversity of our planet. There are an estimated 8.7 million extant eukaryotes (Mora et al. 2011) and possibly as many as 1 trillion microbial species (Locey and Lennon 2016), with untold numbers of extinct taxa yet to be discovered in the fossil record. Accounting for all these taxa and maintaining their nomenclatural resources is one of the great challenges in biology. A major hurdle in overcoming this challenge is the inability to find, share, and update taxonomic resources efficiently in real time. Efforts to standardize and continually update taxonomic names in a sustainable way have been limited. The problem is complex, and solutions must deal with the large backlog of names, a constant stream of new names, the confusing merging and splitting of taxonomic synonyms, the subjective nature of taxonomic concepts, and the fundamental limitations on available expertise and curators' time to prepare and maintain such resources. Hyperdiverse groups such as arthropods are especially challenging as there are relatively few experts on any given lineage and changes in taxonomy can be rapid as new species are continually being discovered and described. After struggling to wrangle taxonomic resources in support of specimen digitization efforts, I began development of TaxoTracker as a proof-of-concept, web-based platform for facilitating expert curation and dissemination of biological taxonomies. TaxoTracker is still in development, but its current and planned functionalities will be shown through a combination of demonstration and discussion. 
TaxoTracker identifies and implements features that attempt to simplify the production and maintenance of expert-curated resources, while also limiting the responsibilities that are placed on individual experts who are often already overburdened and underfunded. These features include: centralized, searchable, and easily obtained resources in useful formats; community-driven, citation-based curatorial suggestions; expert-reviewed curatorial recommendations; consensus-driven curatorial decisions; and effort tracking and credit for suggestions and reviews. 
  2. All life on earth is linked by a shared evolutionary history. Even before Darwin developed the theory of evolution, Linnaeus categorized types of organisms based on their shared traits. We now know these traits derived from these species’ shared ancestry. This evolutionary history provides a natural framework to harness the enormous quantities of biological data being generated today. The Open Tree of Life project is a collaboration developing tools to curate and share evolutionary estimates (phylogenies) covering the entire tree of life (Hinchliff et al. 2015, McTavish et al. 2017). The tree is viewable at https://tree.opentreeoflife.org, and the data is all freely available online. The taxon identifiers used in the Open Tree unified taxonomy (Rees and Cranston 2017) are mapped to identifiers across biological informatics databases, including the Global Biodiversity Information Facility (GBIF), NCBI, and others. Linking these identifiers allows researchers to easily unify data from across these different resources (Fig. 1). Leveraging a unified evolutionary framework across the diversity of life provides new avenues for integrative, wide-scale research. Downstream tools, such as R packages developed by the R OpenSci foundation (rotl, rgbif) (Michonneau et al. 2016, Chamberlain 2017) and other tools (Revell 2012), make accessing and combining this information straightforward for students as well as researchers (e.g. https://mctavishlab.github.io/BIO144/labs/rotl-rgbif.html). Figure 1. Example linking phylogenetic relationships accessed from the Open Tree of Life with specimen location data from Global Biodiversity Information Facility. For example, a recent publication by Santorelli et al. 2018 linked evolutionary information from Open Tree with species locality data gathered from a local field study as well as GBIF species location records to test a river-barrier hypothesis in the Amazon. 
By combining these data, the authors were able to test a widely held biogeographic hypothesis across 1952 species in 14 taxonomic groups, and found that a river that had been postulated to drive endemism was in fact not a barrier to gene flow. However, data provenance and taxonomic name reconciliation remain key hurdles to applying data from these large digital biodiversity and evolution community resources to answering biological questions. In the Amazonian river analysis, while the authors used GBIF records as a secondary check on their species records, they relied on an intensive local field study for their major conclusions, and preferred taxon-specific phylogenetic resources over Open Tree where they were available (Santorelli et al. 2018). When Li et al. 2018 assessed large scale phylogenetic approaches, including Open Tree, for measuring community diversity, they found that synthesis phylogenies were less resolved than purpose-built phylogenies, but also found that these synthetic phylogenies were sufficient for community level phylogenetic diversity analyses. Nonetheless, data quality concerns have limited adoption of analyses that use data from centralized resources (McTavish et al. 2017). Taxonomic name recognition and reconciliation across databases also remains a hurdle for large scale analyses, despite several ongoing efforts to improve taxonomic interoperability and unify taxonomies, such as Catalogue of Life + (Bánki et al. 2018). In order to support innovative science, large scale digital data resources need to facilitate data linkage between resources, and address researchers' data quality and provenance concerns. I will present the model that the Open Tree of Life is using to provide evolutionary data at the scale of the entire tree of life, while maintaining traceable provenance to the publications and taxonomies these evolutionary relationships are inferred from. 
I will discuss the hurdles to adoption of these large scale resources by researchers, as well as the opportunities for new research avenues provided by the connections between evolutionary inferences and biodiversity digital databases. 
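The identifier linking described above can be sketched in a few lines. This is an illustration only and does not call the Open Tree or GBIF services: the cross-reference table, the identifier strings, the coordinates, and the function name are all hypothetical placeholders standing in for a real mapping between Open Tree taxonomy identifiers and those of other databases.

```python
# Hypothetical cross-reference table; every identifier below is a placeholder.
xref = {
    "ott:1": {"gbif": "gbif:101", "ncbi": "ncbi:9001"},
    "ott:2": {"gbif": "gbif:102", "ncbi": "ncbi:9002"},
    "ott:3": {"gbif": None, "ncbi": "ncbi:9003"},  # no GBIF mapping available
}

# Tips of a phylogeny, labeled with tree-side identifiers.
tree_tips = ["ott:1", "ott:2", "ott:3"]

# Occurrence records keyed by the aggregator-side identifier (made-up values).
occurrences = [
    {"taxon": "gbif:101", "lat": -3.1, "lon": -60.0},
    {"taxon": "gbif:101", "lat": -3.4, "lon": -60.2},
    {"taxon": "gbif:102", "lat": -2.9, "lon": -59.8},
]


def link_tips_to_occurrences(tips, xref, occurrences):
    """Attach occurrence records to each tree tip through the identifier mapping."""
    by_taxon = {}
    for rec in occurrences:
        by_taxon.setdefault(rec["taxon"], []).append(rec)

    linked = {}
    for tip in tips:
        gbif_id = xref.get(tip, {}).get("gbif")
        # Tips without a mapped identifier get an empty list rather than an error,
        # so unmapped taxa stay visible in the result.
        linked[tip] = by_taxon.get(gbif_id, []) if gbif_id else []
    return linked


linked = link_tips_to_occurrences(tree_tips, xref, occurrences)
for tip, records in linked.items():
    print(tip, len(records))
```

Returning an empty list for unmapped tips keeps the gaps explicit, which is one simple way to surface the name reconciliation problems the abstract identifies.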
  3. Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse— Peromyscus maniculatus (Wagner 1845)—which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019). 
That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; GBIF.org 2021) and ⅓ of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts (e.g., range maps, reference sequences, and diagnostic traits) that are FAIR in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artificial intelligence applications (Sterner and Franz 2017). 
We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020). 
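To make the deer mouse case concrete, the following sketch shows how a machine-readable concept representation might drive record review. It is a simplification under stated assumptions: a single longitude threshold stands in for a real range map, the threshold value and the record coordinates are hypothetical placeholders, and the function name is invented for illustration. The narrower forms created by the split are deliberately not named here.

```python
SPLIT_NAME = "Peromyscus maniculatus"

# Hypothetical, crude stand-in for a range map: a single longitude meant to
# approximate the Mississippi River. A real workflow would use actual geometry.
BOUNDARY_LON = -90.0


def review_record(record):
    """Classify an occurrence record against the narrowed concept of SPLIT_NAME."""
    if record["name"] != SPLIT_NAME:
        return "unaffected"
    lon = record.get("lon")
    if lon is None:
        # Without coordinates the name alone cannot tell the concepts apart.
        return "ambiguous"
    # East of the boundary (larger longitude) the name still applies.
    return "name retained" if lon > BOUNDARY_LON else "needs reassignment"


# Made-up records for illustration only.
records = [
    {"name": "Peromyscus maniculatus", "lon": -75.0},
    {"name": "Peromyscus maniculatus", "lon": -110.0},
    {"name": "Peromyscus maniculatus", "lon": None},
    {"name": "Peromyscus leucopus", "lon": -80.0},
]

statuses = [review_record(r) for r in records]
print(statuses)
```

The point of the sketch is that the decision depends on the concept representation being available in a form code can read, which is what the proposed FAIR guidelines are meant to encourage.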
  4. Abstract Biodiversity research has advanced by testing expectations of ecological and evolutionary hypotheses through the linking of large-scale genetic, distributional, and trait datasets. The rise of molecular systematics over the past 30 years has resulted in a wealth of DNA sequences from around the globe. Yet, advances in molecular systematics also have created taxonomic instability, as new estimates of evolutionary relationships and interpretations of species limits have required widespread scientific name changes. Taxonomic instability, colloquially “splits, lumps, and shuffles,” presents logistical challenges to large-scale biodiversity research because (1) the same species or sets of populations may be listed under different names in different data sources, or (2) the same name may apply to different sets of populations representing different taxonomic concepts. Consequently, distributional and trait data are often difficult to link directly to primary DNA sequence data without extensive and time-consuming curation. Here, we present RANT: Reconciliation of Avian NCBI Taxonomy. RANT applies taxonomic reconciliation to standardize avian taxon names in use in NCBI GenBank, a primary source of genetic data, to a widely used and regularly updated avian taxonomy: eBird/Clements. Of 14,341 avian species/subspecies names in GenBank, 11,031 directly matched an eBird/Clements name; these link to more than 6 million nucleotide sequences. For the remaining unmatched avian names in GenBank, we used Avibase’s system of taxonomic concepts, taxonomic descriptions in Cornell’s Birds of the World, and DNA sequence metadata to identify corresponding eBird/Clements names. Reconciled names linked to more than 600,000 nucleotide sequences, ~9% of all avian sequences on GenBank. Nearly 10% of eBird/Clements names had nucleotide sequences listed under 2 or more GenBank names. 
Our taxonomic reconciliation is a first step towards rigorous and open-source curation of avian GenBank sequences and is available at GitHub, where it can be updated to correspond to future annual eBird/Clements taxonomic updates. 
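The two-stage reconciliation described above (direct matches first, then curated lookups for the remainder) can be sketched as follows. This is not the RANT code: the function names, the shape of the curated mapping, and every taxon name are hypothetical placeholders used only to show the pattern, including how target names that collect two or more source names might be detected.

```python
from collections import defaultdict


def reconcile(source_names, target_names, concept_map):
    """Map source names to a target taxonomy.

    Direct matches are taken first; remaining names are looked up in a
    curated source-to-target mapping (an assumed structure for this sketch).
    Returns the matched mapping and the list of still-unresolved names.
    """
    matched, unresolved = {}, []
    for name in source_names:
        if name in target_names:
            matched[name] = name
        elif name in concept_map:
            matched[name] = concept_map[name]
        else:
            unresolved.append(name)
    return matched, unresolved


def targets_with_multiple_sources(matched):
    """Find target names whose data are spread across two or more source names."""
    grouped = defaultdict(set)
    for source, target in matched.items():
        grouped[target].add(source)
    return {t: sorted(s) for t, s in grouped.items() if len(s) >= 2}


# Hypothetical placeholder names, not real taxa.
source = ["Avis prima", "Avis secunda", "Avis vetus", "Avis ignota"]
target = {"Avis prima", "Avis secunda"}
curated = {"Avis vetus": "Avis prima"}  # e.g. an older name for the same concept

matched, unresolved = reconcile(source, target, curated)
multi = targets_with_multiple_sources(matched)
print(matched)
print(unresolved)
print(multi)
```

Keeping the unresolved list separate from the matched mapping leaves a clear queue for manual curation when the target taxonomy is next updated.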
  5. The National Ecological Observatory Network (NEON) is gathering select ecological and taxonomic data across 81 sites in the United States and Puerto Rico. Lichens are one of the organismal groups that NEON has not yet assessed across these sites. Here we sampled lichens at Ordway-Swisher Biological Station (OSBS), a NEON site in north central Florida, to provide a baseline survey of the commonly encountered macrolichens (foliose, fruticose, and squamulose lichens). Macrolichens represent a subset of observable lichens and are more commonly surveyed than crustose lichens. Seventy-four species of macrolichens were collected, including 25 occurrences that constitute new records for Putnam County, Florida. The lichen diversity at OSBS comprised approximately 30% of the macrolichen diversity known from the entire state of Florida. Fifty-four taxa are common in the state of Florida, 12 are infrequent across the state, and eight are considered rare. Macrolichens were the seventh most species-rich taxonomic group at OSBS and more diverse than the NEON focal groups of mammals and fish. Lastly, we suggest a theoretical roadmap for how lichenologists could work together with NEON to include lichens in future datasets. We hope that biologists focused on other key organismal groups will sample in NEON sites so that NEON data can be leveraged appropriately in future cross-taxon studies of biodiversity at the continental scale. 