Title: LINgroups as a principled approach to compare and integrate multiple bacterial taxonomies
Traditional taxonomy provides a hierarchical organization of bacteria and archaea across taxonomic ranks from kingdom to subspecies. More recently, bacterial taxonomy has been more robustly quantified using comparisons of sequenced genomes, as in the Genome Taxonomy Database (GTDB), resolving down to genera and species. Such taxonomies have proven useful in many contexts, yet lack the flexibility and resolution of a more fine-grained approach. We apply our Life Identification Number (LIN) approach as a common, quantitative framework to tie existing (and future) bacterial taxonomies together, increase the resolution of genome-based discrimination of taxa, and extend taxonomic identification below the species level in a principled way. We utilize our existing concept of a LINgroup as an organizational concept for microorganisms that are closely related by overall genomic similarity, to help resolve some of the confusions and unforeseen negative effects of nomenclature changes of microbes due to genome-based reclassification. Our results obtained from experimentation demonstrate the value of LINs and LINgroups in mapping between taxonomies, translating between different nomenclatures, and integrating them into a single taxonomic framework.
Journal Name:
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics
Page Range / eLocation ID:
1 to 7
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. ABSTRACT Genus assignment is fundamental in the characterization of microbes, yet there is currently no unambiguous way to demarcate genera solely using standard genomic relatedness indices. Here, we propose an approach to demarcate genera that relies on the combined use of the average nucleotide identity, genome alignment fraction, and the distinction between type- and non-type species. More than 3,500 genomes representing type strains of species from >850 genera of either bacterial or archaeal lineages were tested. Over 140 genera were analyzed in detail within the taxonomic context of order/family. Significant genomic differences between members of a genus and type species of other genera in the same order/family were conserved in 94% of the cases. Nearly 90% (92% if polyphyletic genera are excluded) of the type strains were classified in agreement with current taxonomy. The 448 type strains that need reclassification directly impact 33% of the genera analyzed in detail. The results provide a first line of evidence that the combination of genomic indices provides added resolution to effectively demarcate genera within the taxonomic framework that is currently based on the 16S rRNA gene. We also identify the emergence of natural breakpoints at the genome level that can further help in the circumscription of taxa, increasing the proportion of directly impacted genera to at least 43% and pointing at inaccuracies in the use of the 16S rRNA gene as a taxonomic marker, despite its precision. Altogether, these results suggest that genomic coherence is an emergent property of genera in Bacteria and Archaea. IMPORTANCE In recent decades, the taxonomy of Bacteria and Archaea, and therefore genus designation, has been largely based on the use of a single ribosomal gene, the 16S rRNA gene, as a taxonomic marker. 
We propose an approach to delineate genera that excludes the direct use of the 16S rRNA gene and focuses on a standard genome relatedness index, the average nucleotide identity. Our findings are of importance to the microbiology community because the emergent properties of Bacteria and Archaea that are identified in this study will help assign genera with higher taxonomic resolution. 
  2. Taxonomies are fundamental to many real-world applications in various domains, serving as structural representations of knowledge. To deal with the increasing volume of new concepts that need to be organized as taxonomies, researchers turn to automatic completion of an existing taxonomy with new concepts. In this paper, we propose TaxoEnrich, a new taxonomy completion framework, which effectively leverages both semantic features and structural information in the existing taxonomy and offers a better representation of candidate positions to boost the performance of taxonomy completion. Specifically, TaxoEnrich consists of four components: (1) a taxonomy-contextualized embedding which incorporates both semantic meanings of concepts and taxonomic relations based on powerful pretrained language models; (2) a taxonomy-aware sequential encoder which learns candidate position representations by encoding the structural information of the taxonomy; (3) a query-aware sibling encoder which adaptively aggregates candidate siblings to augment candidate position representations based on their importance to the query-position matching; (4) a query-position matching model which extends existing work with our new candidate position representations. Extensive experiments on four large real-world datasets from different domains show that TaxoEnrich achieves the best performance across all evaluation metrics and outperforms previous state-of-the-art methods by a large margin. 
  3. All life on earth is linked by a shared evolutionary history. Even before Darwin developed the theory of evolution, Linnaeus categorized types of organisms based on their shared traits. We now know these traits derived from these species’ shared ancestry. This evolutionary history provides a natural framework to harness the enormous quantities of biological data being generated today. The Open Tree of Life project is a collaboration developing tools to curate and share evolutionary estimates (phylogenies) covering the entire tree of life (Hinchliff et al. 2015, McTavish et al. 2017). The tree is viewable at, and the data is all freely available online. The taxon identifiers used in the Open Tree unified taxonomy (Rees and Cranston 2017) are mapped to identifiers across biological informatics databases, including the Global Biodiversity Information Facility (GBIF), NCBI, and others. Linking these identifiers allows researchers to easily unify data from across these different resources (Fig. 1). Leveraging a unified evolutionary framework across the diversity of life provides new avenues for integrative wide scale research. Downstream tools, such as R packages developed by the R OpenSci foundation (rotl, rgbif) (Michonneau et al. 2016, Chamberlain 2017) and other tools (Revell 2012), make accessing and combining this information straightforward for students as well as researchers (e.g. Figure 1: an example linking phylogenetic relationships accessed from the Open Tree of Life with specimen location data from the Global Biodiversity Information Facility). For example, a recent publication by Santorelli et al. 2018 linked evolutionary information from Open Tree with species locality data gathered from a local field study as well as GBIF species location records to test a river-barrier hypothesis in the Amazon. 
By combining these data, the authors were able to test a widely held biogeographic hypothesis across 1952 species in 14 taxonomic groups, and found that a river that had been postulated to drive endemism was in fact not a barrier to gene flow. However, data provenance and taxonomic name reconciliation remain key hurdles to applying data from these large digital biodiversity and evolution community resources to answering biological questions. In the Amazonian river analysis, while they used GBIF records as a secondary check on their species records, they relied on an intensive local field study for their major conclusions, and preferred taxon-specific phylogenetic resources over Open Tree where they were available (Santorelli et al. 2018). When Li et al. 2018 assessed large scale phylogenetic approaches, including Open Tree, for measuring community diversity, they found that synthesis phylogenies were less resolved than purpose-built phylogenies, but also found that these synthetic phylogenies were sufficient for community level phylogenetic diversity analyses. Nonetheless, data quality concerns have limited adoption of analyses using data from centralized resources (McTavish et al. 2017). Taxonomic name recognition and reconciliation across databases also remain a hurdle for large scale analyses, despite several ongoing efforts to improve taxonomic interoperability and unify taxonomies, such as Catalogue of Life + (Bánki et al. 2018). In order to support innovative science, large scale digital data resources need to facilitate data linkage between resources, and address researchers' data quality and provenance concerns. I will present the model that the Open Tree of Life is using to provide evolutionary data at the scale of the entire tree of life, while maintaining traceable provenance to the publications and taxonomies these evolutionary relationships are inferred from. 
I will discuss the hurdles to adoption of these large scale resources by researchers, as well as the opportunities for new research avenues provided by the connections between evolutionary inferences and biodiversity digital databases. 
  4. Genomics has put prokaryotic rank-based taxonomy on a solid phylogenetic foundation. However, most taxonomic ranks were set long before the advent of DNA sequencing and genomics. In this concept paper, we thus ask the following question: should prokaryotic classification schemes besides the current phylum-to-species ranks be explored, developed, and incorporated into scientific discourse? Could such alternative schemes provide better solutions to the basic need of science and society for which taxonomy was developed, namely, precise and meaningful identification? A neutral genome-similarity based framework is then described that could allow alternative classification schemes to be explored, compared, and translated into each other without having to choose only one as the gold standard. Classification schemes could thus continue to evolve and be selected according to their benefits and based on how well they fulfill the need for prokaryotic identification. 
  5. Taxonomic classification of archaeal and bacterial viruses is challenging, yet also fundamental for developing a predictive understanding of microbial ecosystems. Recent identification of hundreds of thousands of new viral genomes and genome fragments, whose hosts remain unknown, requires a paradigm shift away from traditional classification approaches and towards the use of genomes for taxonomy. Here we revisited the use of genomes and their protein content as a means for developing a viral taxonomy for bacterial and archaeal viruses. A network-based analytic was evaluated and benchmarked against authority-accepted taxonomic assignments and found to be largely concordant. Exceptions were manually examined and found to represent areas of viral genome ‘sequence space’ that are under-sampled or prone to excessive genetic exchange. While both cases are poorly resolved by genome-based taxonomic approaches, the former will improve as viral sequence space is better sampled and the latter are uncommon. Finally, given the largely robust taxonomic capabilities of this approach, we sought to enable researchers to easily and systematically classify new viruses. Thus, we established a tool, vConTACT, as an app at iVirus, where it operates as a fast, highly scalable, user-friendly app within the free and powerful CyVerse cyberinfrastructure.
