Title: Annotation-free prediction of microbial dioxygen utilization
ABSTRACT Aerobes require dioxygen (O2) to grow; anaerobes do not. However, nearly all microbes (aerobes, anaerobes, and facultative organisms alike) express enzymes whose substrates include O2, if only for detoxification. This presents a challenge when trying to assess which organisms are aerobic from genomic data alone. This challenge can be overcome by noting that O2 utilization has wide-ranging effects on microbes: aerobes typically have larger genomes encoding distinctive O2-utilizing enzymes, for example. These effects permit high-quality prediction of O2 utilization from annotated genome sequences, with several models displaying ≈80% accuracy on a ternary classification task for which blind guessing is only 33% accurate. Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well. We discovered that simple and efficient models based entirely on genomic sequence content (e.g., triplets of amino acids) perform as well as intensive annotation-based classifiers, enabling rapid processing of genomes. We further show that amino acid trimers are useful because they encode information about protein composition and phylogeny. To showcase the utility of rapid prediction, we estimated the prevalence of aerobes and anaerobes in diverse natural environments cataloged in the Earth Microbiome Project. Focusing on a well-studied O2 gradient in the Black Sea, we found quantitative correspondence between local chemistry (O2:sulfide concentration ratio) and the composition of microbial communities. We, therefore, suggest that statistical methods like ours might be used to estimate, or “sense,” pivotal features of the chemical environment using DNA sequencing data.
IMPORTANCE We now have access to sequence data from a wide variety of natural environments. These data document a bewildering diversity of microbes, many known only from their genomes. Physiology, an organism’s capacity to engage metabolically with its environment, may provide a more useful lens than taxonomy for understanding microbial communities. As an example of this broader principle, we developed algorithms that accurately predict microbial dioxygen utilization directly from genome sequences without annotating genes, e.g., by considering only the amino acids in protein sequences. Annotation-free algorithms enable rapid characterization of natural samples, highlighting quantitative correspondence between sequences and local O2 levels in a data set from the Black Sea. This example suggests that DNA sequencing might be repurposed as a multi-pronged chemical sensor, estimating concentrations of O2 and other key facets of complex natural settings.
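The annotation-free features mentioned in the abstract (triplets of amino acids) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the restriction to the 20 standard amino acids, and the normalization to relative frequencies are all assumptions made for illustration. The resulting 8000-dimensional vector is the kind of input a generic classifier could be trained on.

```python
from collections import Counter
from itertools import product

# Assumption: only the 20 standard amino acids are counted; trimers that
# contain any other symbol (e.g. X or *) are skipped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def trimer_frequencies(proteins):
    """Return relative frequencies of overlapping amino acid trimers."""
    counts = Counter()
    for seq in proteins:
        seq = seq.upper()
        for i in range(len(seq) - 2):
            trimer = seq[i:i + 3]
            if all(aa in AMINO_ACIDS for aa in trimer):
                counts[trimer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trimer: n / total for trimer, n in counts.items()}


def feature_vector(proteins):
    """Fixed-length (20**3 = 8000) vector in a stable trimer order."""
    freqs = trimer_frequencies(proteins)
    return [freqs.get("".join(t), 0.0) for t in product(AMINO_ACIDS, repeat=3)]
```

For example, trimer_frequencies(["MKVMKV"]) counts the overlapping trimers MKV, KVM, VMK, and MKV, so MKV receives half of the total frequency.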
Award ID(s):
2127442 2127445
PAR ID:
10563745
Editor(s):
Greening, Chris
Publisher / Repository:
ASM
Date Published:
Journal Name:
mSystems
Volume:
9
Issue:
10
ISSN:
2379-5077
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Biddle, Jennifer F (Ed.)
    ABSTRACT Heterotrophic marine bacteria utilize and recycle dissolved organic matter (DOM), impacting biogeochemical cycles. It is currently unclear to what extent distinct DOM components can be used by different heterotrophic clades. Here, we ask how a natural microbial community from the Eastern Mediterranean Sea (EMS) responds to different molecular classes of DOM (peptides, amino acids, amino sugars, disaccharides, monosaccharides, and organic acids) comprising much of the biomass of living organisms. Bulk bacterial activity increased after 24 h for all treatments relative to the control, while glucose and ATP uptake decreased or remained unchanged. Moreover, while the per-cell uptake rate of glucose and ATP decreased, that of leucine significantly increased for amino acids, reflecting their importance as common metabolic currencies in the marine environment. Pseudoalteromonadaceae dominated the peptides treatment, while different Vibrionaceae strains became dominant in response to amino acids and amino sugars. Marinomonadaceae grew well on organic acids, and Alteromonadaceae on disaccharides. A comparison with a recent laboratory-based study reveals similar peptide preferences for Pseudoalteromonadaceae, while Alteromonadaceae, for example, grew well in the lab on many substrates but dominated in seawater samples only when disaccharides were added. We further demonstrate a potential correlation between the genetic capacity for degrading amino sugars and the dominance of specific clades in these treatments. These results highlight the diversity in DOM utilization among heterotrophic bacteria and complexities in the response of natural communities.
    IMPORTANCE A major goal of microbial ecology is to predict the dynamics of natural communities based on the identity of the organisms, their physiological traits, and their genomes. Our results show that several clades of heterotrophic bacteria each grow in response to one or more specific classes of organic matter. For some clades, but not others, growth in a complex community is similar to that of isolated strains in laboratory monoculture. Additionally, by measuring how the entire community responds to various classes of organic matter, we show that these results are ecologically relevant, and propose that some of these resources are utilized through common uptake pathways. Tracing the path between different resources to the specific microbes that utilize them, and identifying commonalities and differences between different natural communities and between them and lab cultures, is an important step toward understanding microbial community dynamics and predicting how communities will respond to perturbations.
  2. Abstract The rapid growth of uncharacterized enzymes and their functional diversity call for accurate and trustworthy computational functional annotation tools. However, current state-of-the-art models lack trustworthiness on the prediction of the multilabel classification problem with thousands of classes. Here, we demonstrate that a novel evidential deep learning model (named ECPICK) makes trustworthy predictions of enzyme commission (EC) numbers with data-driven domain-relevant evidence, which results in significantly enhanced predictive power and the capability to discover potential new motif sites. ECPICK learns complex sequential patterns of amino acids and their hierarchical structures from 20 million enzyme data. ECPICK identifies significant amino acids that contribute to the prediction without multiple sequence alignment. Our intensive assessment showed not only outstanding enhancement of predictive performance on the largest databases of UniProt, Protein Data Bank (PDB) and Kyoto Encyclopedia of Genes and Genomes (KEGG), but also a capability to discover new motif sites in microorganisms. ECPICK is a reliable EC number prediction tool to identify protein functions of an increasing number of uncharacterized enzymes.
  3. Abstract Anaerobes thrive in the absence of oxygen and are an untapped reservoir of biotechnological potential. Therefore, bioprospecting efforts focused on anaerobic microbial diversity could rapidly uncover new enzymes, pathways, and chassis organisms to drive biotechnology innovation. Despite their potential utility, anaerobic fermenters are viewed as inefficient from a biochemical perspective because their metabolisms produce fewer ATP (~2) per molecule of glucose processed than heterotrophic respirers (~32–38 ATP). While aerobes excel at ATP generation, they are often less efficient than anaerobes at processes that compete with ATP generation for cellular resources. This perspective highlights how anaerobic adaptations are advantageous for synthetic biology and biomanufacturing applications through the engineering of microbial cell factories. We further highlight emerging applications of anaerobic bioprocessing, including the use of anaerobic metabolisms for lignocellulosic bioprocessing, human and environmental health, and value‐added bioproduction. 
  4. Dinoflagellates of the family Symbiodiniaceae are predominantly essential symbionts of corals and other marine organisms. Recent research reveals extensive genome sequence divergence among Symbiodiniaceae taxa and high phylogenetic diversity hidden behind subtly different cell morphologies. Using an alignment-free phylogenetic approach based on sub-sequences of fixed length k (i.e. k-mers), we assessed the phylogenetic signal among whole-genome sequences from 16 Symbiodiniaceae taxa (including the genera of Symbiodinium, Breviolum, Cladocopium, Durusdinium and Fugacium) and two strains of Polarella glacialis as outgroup. Based on phylogenetic trees inferred from k-mers in distinct genomic regions (i.e. repeat-masked genome sequences, protein-coding sequences, introns and repeats) and in protein sequences, the phylogenetic signal associated with protein-coding DNA and the encoded amino acids is largely consistent with the Symbiodiniaceae phylogeny based on established markers, such as large subunit rRNA. The other genome sequences (introns and repeats) exhibit distinct phylogenetic signals, supporting the expected differential evolutionary pressure acting on these regions. Our analysis of conserved core k-mers revealed the prevalence of conserved k-mers (>95% core 23-mers among all 18 genomes) in annotated repeats and non-genic regions of the genomes. We observed 180 distinct repeat types that are significantly enriched in genomes of the symbiotic versus free-living Symbiodinium taxa, suggesting an enhanced activity of transposable elements linked to the symbiotic lifestyle. We provide evidence that representation of alignment-free phylogenies as dynamic networks enhances the ability to generate new hypotheses about genome evolution in Symbiodiniaceae. These results demonstrate the potential of alignment-free phylogenetic methods as a scalable approach for inferring comprehensive, unbiased whole-genome phylogenies of dinoflagellates and more broadly of microbial eukaryotes.
  5. Abstract Background: Advances in microbiome science are being driven in large part by our ability to study and infer microbial ecology from genomes reconstructed from mixed microbial communities using metagenomics and single-cell genomics. Such omics-based techniques allow us to read genomic blueprints of microorganisms, decipher their functional capacities and activities, and reconstruct their roles in biogeochemical processes. Currently available tools for analyses of genomic data can annotate and depict metabolic functions to some extent; however, no standardized approaches are currently available for the comprehensive characterization of metabolic predictions, metabolite exchanges, microbial interactions, and microbial contributions to biogeochemical cycling. Results: We present METABOLIC (METabolic And BiogeOchemistry anaLyses In miCrobes), a scalable software to advance microbial ecology and biogeochemistry studies using genomes at the resolution of individual organisms and/or microbial communities. The genome-scale workflow includes annotation of microbial genomes, motif validation of biochemically validated conserved protein residues, metabolic pathway analyses, and calculation of contributions to individual biogeochemical transformations and cycles. The community-scale workflow supplements genome-scale analyses with determination of genome abundance in the microbiome, potential microbial metabolic handoffs and metabolite exchange, reconstruction of functional networks, and determination of microbial contributions to biogeochemical cycles. METABOLIC can take input genomes from isolates, metagenome-assembled genomes, or single-cell genomes. Results are presented in the form of tables for metabolism and a variety of visualizations including biogeochemical cycling potential, representation of sequential metabolic transformations, community-scale microbial functional networks using a newly defined metric “MW-score” (metabolic weight score), and metabolic Sankey diagrams. METABOLIC takes ~3 h with 40 CPU threads to process ~100 genomes and corresponding metagenomic reads, within which the most compute-demanding part, hmmsearch, takes ~45 min, while it takes ~5 h to complete hmmsearch for ~3600 genomes. Tests of accuracy, robustness, and consistency suggest METABOLIC provides better performance compared to other software and online servers. To highlight the utility and versatility of METABOLIC, we demonstrate its capabilities on diverse metagenomic datasets from the marine subsurface, terrestrial subsurface, meadow soil, deep sea, freshwater lakes, wastewater, and the human gut. Conclusion: METABOLIC enables the consistent and reproducible study of microbial community ecology and biogeochemistry using a foundation of genome-informed microbial metabolism, and will advance the integration of uncultivated organisms into metabolic and biogeochemical models. METABOLIC is written in Perl and R and is freely available under GPLv3 at https://github.com/AnantharamanLab/METABOLIC.
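Item 4 in the list above describes alignment-free comparison of sequences through their k-mers. A minimal sketch of that general idea follows. It uses the Jaccard distance between k-mer sets as a simple stand-in; this measure is an assumption chosen for illustration and is not necessarily the statistic used in that study. The function names are hypothetical, and reverse-complement (canonical) k-mer handling is omitted for brevity.

```python
def kmer_set(seq, k):
    """Set of all overlapping substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(seq_a, seq_b, k=4):
    """1 - |A ∩ B| / |A ∪ B| over the k-mer sets of two sequences.

    Returns 0.0 for identical k-mer sets and 1.0 when no k-mer is shared.
    If both sequences are shorter than k, the distance is defined as 0.0.
    """
    kmers_a = kmer_set(seq_a, k)
    kmers_b = kmer_set(seq_b, k)
    union = kmers_a | kmers_b
    if not union:
        return 0.0
    return 1.0 - len(kmers_a & kmers_b) / len(union)
```

A matrix of such pairwise distances could then be passed to a standard tree-building method; no alignment step is needed at any point, which is what makes this family of approaches scale to whole genomes.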