
This content will become publicly available on August 23, 2023

Title: TaxonWorks as a Tool for Managing Large Biodiversity Projects
Large systematic revisionary projects incorporating data for hundreds or thousands of taxa require an integrative approach, with a strong biodiversity-informatics core for efficient data management to facilitate research on the group. Our original biodiversity informatics platform, 3i (Internet-accessible Interactive Identification), combined a customized MS Access database backend with ASP-based web interfaces to support revisionary syntheses of several large genera of leafhoppers (Hemiptera: Auchenorrhyncha: Cicadellidae). More recently, for our National Science Foundation sponsored project, “GoLife: Collaborative Research: Integrative genealogy, ecology and phenomics of deltocephaline leafhoppers (Hemiptera: Cicadellidae), and their microbial associates”, we selected the new open-source platform TaxonWorks as the cyberinfrastructure. In the scope of the project, the original “3i World Auchenorrhyncha Database” was imported into TaxonWorks. At the present time, TaxonWorks has many tools to automatically import nomenclature, citations, and specimen-based collection data. At the time of the initial migration of the 3i database, many of those tools were still under development, and the complexity of the data in the database required a custom migration script, which is probably still the most efficient solution for importing datasets with a long development history. At the moment, the World Auchenorrhyncha Database comprehensively covers nomenclature of the group and includes data on 70 valid families, 6,816 valid genera, and 47,064 valid species, as well as synonymy and subsequent combinations (Fig. 1). In addition, many taxon records include the original citation, bibliography, type information, etymology, etc. The bibliography of the group includes 37,579 sources, about 1/3 of which are associated with PDF files.
Species have distribution records, either derived from individual specimens or given as country- and state-level asserted distributions, as well as biological associations indicating host plants, predators, and parasitoids. Observation matrices in TaxonWorks are designed to handle morphological data associated with taxa or specimens. The matrices may be used to automatically generate interactive identification keys and taxon descriptions. They can also be downloaded to be imported, for example, into Lucid builder, or to perform phylogenetic analysis using an external application. At the moment there are 36 matrices associated with the project. The observation matrix from the GoLife project covers 798 taxa by 210 descriptors (most of which are qualitative multi-state morphological descriptors) (Fig. 2). Illustrations are provided for 9,886 taxa and organized in a specialized image matrix, which can be used as a pictorial key for determination of species and taxa of higher rank. For the phylogenetic analysis, a dataset was constructed for 730 terminal taxa and >160,000 nucleotide positions obtained using anchored hybrid enrichment of genomic DNA for a sample of leafhoppers from the subfamily Deltocephalinae and outgroups. The probe kit targets leafhopper genes, as well as some bacterial genes (endosymbionts and plant pathogens transmitted by leafhoppers). The maximum likelihood analyses of concatenated nucleotide and amino acid sequences, as well as coalescent gene tree analysis, yielded well-resolved phylogenetic trees (Cao et al. 2022). Raw sequence data have been uploaded to the Sequence Read Archive on GenBank. Occurrence and morphological data, as well as diagnostic images, for voucher specimens have been incorporated into TaxonWorks. Data in TaxonWorks can be exported in raw format, accessed via an Application Programming Interface (API), or shared with external data aggregators such as Catalogue of Life, GBIF, and iDigBio.
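As a rough illustration of how an observation matrix can drive an interactive identification key, the following minimal Python sketch filters a toy taxon-by-descriptor matrix by the states a user selects. It is not TaxonWorks code: the function name `remaining_taxa`, the taxon names, and the descriptor names are all hypothetical, and the rule of keeping taxa with missing observations is an assumption made for the example.

```python
from typing import Dict, List, Set

# Hypothetical toy matrix: rows are taxa, columns are qualitative descriptors,
# and each cell holds the set of states observed for that taxon.
Matrix = Dict[str, Dict[str, Set[str]]]


def remaining_taxa(matrix: Matrix, chosen: Dict[str, str]) -> List[str]:
    """Return the taxa compatible with every chosen descriptor state.

    Assumption for this sketch: a taxon with no observation for a descriptor
    is kept, because missing data cannot rule it out.
    """
    kept = []
    for taxon, cells in matrix.items():
        compatible = all(
            descriptor not in cells or state in cells[descriptor]
            for descriptor, state in chosen.items()
        )
        if compatible:
            kept.append(taxon)
    return sorted(kept)


# Invented example data, not taken from the World Auchenorrhyncha Database.
matrix: Matrix = {
    "Taxon A": {"forewing colour": {"green"}, "crown shape": {"rounded"}},
    "Taxon B": {"forewing colour": {"green", "brown"}, "crown shape": {"pointed"}},
    "Taxon C": {"forewing colour": {"brown"}},
}

print(remaining_taxa(matrix, {"forewing colour": "green"}))
print(remaining_taxa(matrix, {"forewing colour": "brown", "crown shape": "pointed"}))
```

Each additional selected state narrows the candidate list, which is the basic behaviour of a multi-access key; a real key would also need to handle quantitative descriptors and rank the most useful remaining descriptors.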
Authors:
; ;
Award ID(s):
1639601
Publication Date:
NSF-PAR ID:
10383100
Journal Name:
Biodiversity Information Science and Standards
Volume:
6
ISSN:
2535-0897
Sponsoring Org:
National Science Foundation
More Like this
  1. TaxonWorks (http://taxonworks.org) is an integrated workbench for taxonomists and biodiversity scientists. It is designed to capture, organize, and enrich data, share and refine it with collaborators, and package it for analysis and publication. It is based on PostgreSQL (database) and the Ruby-on-Rails programming language and framework for developing web applications (https://github.com/SpeciesFileGroup/taxonworks). The TaxonWorks community is built around an open software ecosystem that facilitates participation at many levels. TaxonWorks is designed to serve both researchers who create and curate the data and technical users, such as programmers and informatics specialists, who act as data consumers. TaxonWorks provides researchers with robust, user-friendly interfaces based on well-thought-out customized workflows for efficient and validated data entry. It provides technical users with database access through an application programming interface (API) that serves data in JSON format. The data model includes coverage for nearly all classes of data recorded in modern taxonomic treatments and primary studies of biodiversity, including nomenclature, bibliography, specimens and collecting events, phylogenetic matrices and species descriptions, etc. The nomenclatural classes are based on the NOMEN ontology (https://github.com/SpeciesFileGroup/nomen).
  2. TaxonWorks is an integrated web-based application for practicing taxonomists and biodiversity specialists. It is focused on promoting collaboration between researchers and developers. TaxonWorks has a modular structure that enables various components of the application to target specific needs and requirements of different groups of users. Specific areas of interest may include nomenclature-related tasks (Yoder and Dmitriev 2021) designed to help assemble and validate scientific name checklists of a target group of organisms; and collection management tasks, including interfaces to create, filter, and edit collecting events, collection objects, and loans. This presentation focuses on matrix-related tools integrated into TaxonWorks. A matrix, which could either be used for phylogenetic analysis or to build an identification key, is structured as a table where columns represent the characters that could be used to describe a set of entities, taxa or specimens (presented as rows of the table). Each cell of the table may contain observations for a specific character/entity combination. TaxonWorks does not generate a table for each particular matrix; all observations are stored as graphs. This structure allows building a matrix of unlimited size as well as reuse of individual observations in multiple matrices. For matrix columns, TaxonWorks supports a variety of different kinds of characters or descriptors: qualitative, presence/absence, quantitative, sample, gene, free text, and media. Each character may have specific properties; for example, a qualitative descriptor may have numerous character states, and a quantitative descriptor may have a measurement unit defined. For an entity in a matrix row, TaxonWorks supports either collection objects (specimens) or taxa as Operational Taxonomic Units (OTUs). OTUs could either be linked to nomenclature or be standalone entities (e.g., representing undescribed species).
The matrix, once built, could serve several purposes. A matrix based on qualitative and quantitative characters could be used to build an interactive key (Fig. 1), construct standardized natural language descriptions for each entity, and determine a diagnosis (a minimal set of characters that separate one entity from all others). It could also be exported and used for phylogenetic analysis or to build an interactive key in an external application. TaxonWorks supports export files in several formats, including Nexus, TNT, NeXML. Application Programming Interfaces (API) are also available. A matrix based on media descriptors could be used as a pictorial identification tool (Fig. 2).
  3. The 3i World Auchenorrhyncha database (http://dmitriev.speciesfile.org) is being migrated into TaxonWorks (http://taxonworks.org) and comprises nomenclatural data for all known Auchenorrhyncha taxa (leafhoppers, planthoppers, treehoppers, cicadas, spittle bugs). Of all those scientific names, 8,700 are unique genus-group names (which include valid genera and subgenera as well as their synonyms). According to the Rules of Zoological Nomenclature, a properly formed species-group name when combined with a genus-group name must agree with the latter in gender if the species-group name is or ends with a Latin or Latinized adjective or participle. This provides a double challenge for researchers describing new or citing existing taxa. For each species, the knowledge about the part of speech is essential information (nouns do not change their form when associated with different generic names). For the genus, the knowledge of the gender is essential information. Every time the species is transferred from one genus to another, its ending may need to be transformed to make a proper new scientific name (a binominal name). In modern-day practice, it is important, when establishing a new name, to provide information about the etymology of this name and the ways it should be used in future publications: the grammatical gender for a genus, and the part of speech for a species. The older names often do not provide enough information about their etymology to allow proper construction of scientific names. That is why in the literature, we can find numerous cases where a scientific name is not formed in conformity with the Rules of Nomenclature. An attempt was made to resolve the etymology of the generic names in Auchenorrhyncha to unify and clarify nomenclatural issues in this group of insects. In TaxonWorks, the rules of nomenclature are defined using the NOMEN ontology (https://github.com/SpeciesFileGroup/nomen).
  4. Abstract

    Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the cause(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted.
One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini is reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]

  5. All life on earth is linked by a shared evolutionary history. Even before Darwin developed the theory of evolution, Linnaeus categorized types of organisms based on their shared traits. We now know these traits derived from these species’ shared ancestry. This evolutionary history provides a natural framework to harness the enormous quantities of biological data being generated today. The Open Tree of Life project is a collaboration developing tools to curate and share evolutionary estimates (phylogenies) covering the entire tree of life (Hinchliff et al. 2015, McTavish et al. 2017). The tree is viewable at https://tree.opentreeoflife.org, and the data is all freely available online. The taxon identifiers used in the Open Tree unified taxonomy (Rees and Cranston 2017) are mapped to identifiers across biological informatics databases, including the Global Biodiversity Information Facility (GBIF), NCBI, and others. Linking these identifiers allows researchers to easily unify data from across these different resources (Fig. 1). Leveraging a unified evolutionary framework across the diversity of life provides new avenues for integrative wide scale research. Downstream tools, such as R packages developed by the R OpenSci foundation (rotl, rgbif) (Michonneau et al. 2016, Chamberlain 2017) and other tools (Revell 2012), make accessing and combining this information straightforward for students as well as researchers (e.g. https://mctavishlab.github.io/BIO144/labs/rotl-rgbif.html). Figure 1. Example linking phylogenetic relationships accessed from the Open Tree of Life with specimen location data from Global Biodiversity Information Facility. For example, a recent publication by Santorelli et al. 2018 linked evolutionary information from Open Tree with species locality data gathered from a local field study as well as GBIF species location records to test a river-barrier hypothesis in the Amazon.
By combining these data, the authors were able to test a widely held biogeographic hypothesis across 1952 species in 14 taxonomic groups, and found that a river that had been postulated to drive endemism was in fact not a barrier to gene flow. However, data provenance and taxonomic name reconciliation remain key hurdles to applying data from these large digital biodiversity and evolution community resources to answering biological questions. In the Amazonian river analysis, while the authors used GBIF records as a secondary check on their species records, they relied on an intensive local field study for their major conclusions, and preferred taxon-specific phylogenetic resources over Open Tree where they were available (Santorelli et al. 2018). When Li et al. 2018 assessed large scale phylogenetic approaches, including Open Tree, for measuring community diversity, they found that synthesis phylogenies were less resolved than purpose-built phylogenies, but also found that these synthetic phylogenies were sufficient for community level phylogenetic diversity analyses. Nonetheless, data quality concerns have limited adoption of analyses using data from centralized resources (McTavish et al. 2017). Taxonomic name recognition and reconciliation across databases also remains a hurdle for large scale analyses, despite several ongoing efforts to improve taxonomic interoperability and unify taxonomies, such as Catalogue of Life + (Bánki et al. 2018). In order to support innovative science, large scale digital data resources need to facilitate data linkage between resources, and address researchers' data quality and provenance concerns. I will present the model that the Open Tree of Life is using to provide evolutionary data at the scale of the entire tree of life, while maintaining traceable provenance to the publications and taxonomies these evolutionary relationships are inferred from.
I will discuss the hurdles to adoption of these large scale resources by researchers, as well as the opportunities for new research avenues provided by the connections between evolutionary inferences and biodiversity digital databases.