Title: Statistical prediction of microbial metabolic traits from genomes
The metabolic activity of microbial communities is central to their role in biogeochemical cycles, human health, and biotechnology. Despite the abundance of sequencing data characterizing these consortia, it remains a serious challenge to predict microbial metabolic traits from sequencing data alone. Here we culture 96 bacterial isolates individually and assay their ability to grow on 10 distinct compounds as a sole carbon source. Using these data as well as two existing datasets, we show that statistical approaches can accurately predict bacterial carbon utilization traits from genomes. First, we show that classifiers trained on gene content can accurately predict bacterial carbon utilization phenotypes by encoding phylogenetic information. These models substantially outperform predictions made by constraint-based metabolic models automatically constructed from genomes. This result solidifies our current knowledge about the strong connection between phylogeny and metabolic traits. However, phylogeny-based predictions fail to predict traits for taxa that are phylogenetically distant from any strains in the training set. To overcome this we train improved models on gene presence/absence to predict carbon utilization traits from gene content. We show that models that predict carbon utilization traits from gene presence/absence can generalize to taxa that are phylogenetically distant from the training set either by exploiting biochemical information for feature selection or by having sufficiently large datasets. In the latter case, we provide evidence that a statistical approach can identify putatively mechanistic genes involved in metabolic traits. Our study demonstrates the potential power for predicting microbial phenotypes from genotypes using statistical approaches.
Award ID(s):
2317138 2117477
PAR ID:
10518421
Editor(s):
Ouzounis, Christos A
Publisher / Repository:
Public Library of Science
Journal Name:
PLOS Computational Biology
Volume:
19
Issue:
12
ISSN:
1553-7358
Page Range / eLocation ID:
e1011705
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The reconstruction of complete microbial metabolic pathways using ‘omics data from environmental samples remains challenging. Computational pipelines for pathway reconstruction that utilize machine learning methods to predict the presence or absence of KEGG modules in incomplete genomes are lacking. Here, we present MetaPathPredict, a software tool that incorporates machine learning models to predict the presence of complete KEGG modules within bacterial genomic datasets. Using gene annotation data and information from the KEGG module database, MetaPathPredict employs deep learning models to predict the presence of KEGG modules in a genome. MetaPathPredict can be used as a command line tool or as a Python module, and both options are designed to be run locally or on a compute cluster. Benchmarks show that MetaPathPredict makes robust predictions of KEGG module presence within highly incomplete genomes. 
  2. Tringe, Susannah Green (Ed.)
    ABSTRACT Below-ground carbon transformations that contribute to healthy soils represent a natural climate change mitigation, but newly acquired traits adaptive to climate stress may alter microbial feedback mechanisms. To better define microbial evolutionary responses to long-term climate warming, we study microorganisms from an ongoing in situ soil warming experiment where, for over three decades, temperate forest soils are continuously heated at 5°C above ambient. We hypothesize that across generations of chronic warming, genomic signatures within diverse bacterial lineages reflect adaptations related to growth and carbon utilization. From our bacterial culture collection isolated from experimental heated and control plots, we sequenced genomes representing dominant taxa sensitive to warming, including lineages of Actinobacteria, Alphaproteobacteria, and Betaproteobacteria. We investigated genomic attributes and functional gene content to identify signatures of adaptation. Comparative pangenomics revealed accessory gene clusters related to central metabolism, competition, and carbon substrate degradation, with few functional annotations explicitly associated with long-term warming. Trends in functional gene patterns suggest genomes from heated plots were relatively enriched in central carbohydrate and nitrogen metabolism pathways, while genomes from control plots were relatively enriched in amino acid and fatty acid metabolism pathways. We observed that genomes from heated plots had less codon bias, suggesting potential adaptive traits related to growth or growth efficiency. Codon usage bias varied for organisms with similar 16S rrn operon copy number, suggesting that these organisms experience different selective pressures on growth efficiency. Our work suggests the emergence of lineage-specific trends as well as common ecological-evolutionary microbial responses to climate change. IMPORTANCE Anthropogenic climate change threatens soil ecosystem health in part by altering below-ground carbon cycling carried out by microbes. Microbial evolutionary responses are often overshadowed by community-level ecological responses, but adaptive responses represent potential changes in traits and functional potential that may alter ecosystem function. We predict that microbes are adapting to climate change stressors like soil warming. To test this, we analyzed the genomes of bacteria from a soil warming experiment where soil plots have been experimentally heated 5°C above ambient for over 30 years. While genomic attributes were unchanged by long-term warming, we observed trends in functional gene content related to carbon and nitrogen usage and genomic indicators of growth efficiency. These responses may represent new parameters in how soil ecosystems feedback to the climate system.
  3. Greening, Chris (Ed.)
    ABSTRACT Aerobes require dioxygen (O2) to grow; anaerobes do not. However, nearly all microbes—aerobes, anaerobes, and facultative organisms alike—express enzymes whose substrates include O2, if only for detoxification. This presents a challenge when trying to assess which organisms are aerobic from genomic data alone. This challenge can be overcome by noting that O2 utilization has wide-ranging effects on microbes: aerobes typically have larger genomes encoding distinctive O2-utilizing enzymes, for example. These effects permit high-quality prediction of O2 utilization from annotated genome sequences, with several models displaying ≈80% accuracy on a ternary classification task for which blind guessing is only 33% accurate. Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well. We discovered that simple and efficient models based entirely on genomic sequence content—e.g., triplets of amino acids—perform as well as intensive annotation-based classifiers, enabling rapid processing of genomes. We further show that amino acid trimers are useful because they encode information about protein composition and phylogeny. To showcase the utility of rapid prediction, we estimated the prevalence of aerobes and anaerobes in diverse natural environments cataloged in the Earth Microbiome Project. Focusing on a well-studied O2 gradient in the Black Sea, we found quantitative correspondence between local chemistry (O2:sulfide concentration ratio) and the composition of microbial communities. We, therefore, suggest that statistical methods like ours might be used to estimate, or “sense,” pivotal features of the chemical environment using DNA sequencing data. IMPORTANCE We now have access to sequence data from a wide variety of natural environments. These data document a bewildering diversity of microbes, many known only from their genomes. Physiology—an organism’s capacity to engage metabolically with its environment—may provide a more useful lens than taxonomy for understanding microbial communities. As an example of this broader principle, we developed algorithms that accurately predict microbial dioxygen utilization directly from genome sequences without annotating genes, e.g., by considering only the amino acids in protein sequences. Annotation-free algorithms enable rapid characterization of natural samples, highlighting quantitative correspondence between sequences and local O2 levels in a data set from the Black Sea. This example suggests that DNA sequencing might be repurposed as a multi-pronged chemical sensor, estimating concentrations of O2 and other key facets of complex natural settings.
  4. Abstract Background: Seagrasses are globally distributed marine flowering plants that play foundational roles in coastal environments as ecosystem engineers. While research efforts have explored various aspects of seagrass-associated microbial communities, including describing the diversity of bacteria, fungi and microbial eukaryotes, little is known about viral diversity in these communities. Results: To begin to address this, we leveraged metagenomic sequencing data to generate a catalog of bacterial metagenome-assembled genomes (MAGs) and phage genomes from the leaves of the seagrass, Zostera marina. We expanded the robustness of this viral catalog by incorporating publicly available metagenomic data from seagrass ecosystems. The final MAG set represents 85 high-quality draft and 62 medium-quality draft bacterial genomes. While the viral catalog represents 354 medium-quality, high-quality, and complete viral genomes. Predicted auxiliary metabolic genes in the final viral catalog had putative annotations largely related to carbon utilization, suggesting a possible role for phage in carbon cycling in seagrass ecosystems. Conclusions: These genomic resources provide initial insight into bacterial-viral interactions in seagrass meadows and are a foundation on which to further explore these critical interkingdom interactions. These catalogs highlight a possible role for viruses in carbon cycling in seagrass beds which may have important implications for blue carbon management and climate change mitigation.
  5. Kent, Angela D. (Ed.)
    ABSTRACT Methylmercury is a potent bioaccumulating neurotoxin that is produced by specific microorganisms that methylate inorganic mercury. Methylmercury production in diverse anaerobic bacteria and archaea was recently linked to the hgcAB genes. However, the full phylogenetic and metabolic diversity of mercury-methylating microorganisms has not been fully unraveled due to the limited number of cultured experimentally verified methylators and the limitations of primer-based molecular methods. Here, we describe the phylogenetic diversity and metabolic flexibility of putative mercury-methylating microorganisms by hgcAB identification in publicly available isolate genomes and metagenome-assembled genomes (MAGs) as well as novel freshwater MAGs. We demonstrate that putative mercury methylators are much more phylogenetically diverse than previously known and that hgcAB distribution among genomes is most likely due to several independent horizontal gene transfer events. The microorganisms we identified possess diverse metabolic capabilities spanning carbon fixation, sulfate reduction, nitrogen fixation, and metal resistance pathways. We identified 111 putative mercury methylators in a set of previously published permafrost metatranscriptomes and demonstrated that different methylating taxa may contribute to hgcA expression at different depths. Overall, we provide a framework for illuminating the microbial basis of mercury methylation using genome-resolved metagenomics and metatranscriptomics to identify putative methylators based upon hgcAB presence and describe their putative functions in the environment. IMPORTANCE Accurately assessing the production of bioaccumulative neurotoxic methylmercury by characterizing the phylogenetic diversity, metabolic functions, and activity of methylators in the environment is crucial for understanding constraints on the mercury cycle. Much of our understanding of methylmercury production is based on cultured anaerobic microorganisms within the Deltaproteobacteria, Firmicutes, and Euryarchaeota. Advances in next-generation sequencing technologies have enabled large-scale cultivation-independent surveys of diverse and poorly characterized microorganisms from numerous ecosystems. We used genome-resolved metagenomics and metatranscriptomics to highlight the vast phylogenetic and metabolic diversity of putative mercury methylators and their depth-discrete activities in thawing permafrost. This work underscores the importance of using genome-resolved metagenomics to survey specific putative methylating populations of a given mercury-impacted ecosystem.