Title: Building a genome-based understanding of bacterial pH preferences
The environmental preferences of many microbes remain undetermined. This is the case for bacterial pH preferences, which can be difficult to predict a priori despite the importance of pH as a factor structuring bacterial communities in many systems. We compiled data on bacterial distributions from five datasets spanning pH gradients in soil and freshwater systems (1470 samples), quantified the pH preferences of bacterial taxa across these datasets, and compiled genomic data from representative bacterial taxa. While taxonomic and phylogenetic information were generally poor predictors of bacterial pH preferences, we identified genes consistently associated with pH preference across environments. We then developed and validated a machine learning model to estimate bacterial pH preferences from genomic information alone, a model that could aid in the selection of microbial inoculants, improve species distribution models, or help design effective cultivation strategies. More generally, we demonstrate the value of combining biogeographic and genomic data to infer and predict the environmental preferences of diverse bacterial taxa.
Award ID(s):
2133684
PAR ID:
10500205
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Science Advances
Date Published:
Journal Name:
Science Advances
Volume:
9
Issue:
17
ISSN:
2375-2548
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Genomic information is now available for a broad diversity of bacteria, including uncultivated taxa. However, we have corresponding knowledge on environmental preferences (i.e. bacterial growth responses across gradients in oxygen, pH, temperature, salinity, and other environmental conditions) for a relatively narrow swath of bacterial diversity. These limits to our understanding of bacterial ecologies constrain our ability to predict how assemblages will shift in response to global change factors, design effective probiotics, or guide cultivation efforts. We need innovative approaches that take advantage of expanding genome databases to accurately infer the environmental preferences of bacteria and validate the accuracy of these inferences. By doing so, we can broaden our quantitative understanding of the environmental preferences of the majority of bacterial taxa that remain uncharacterized. With this perspective, we highlight why it is important to infer environmental preferences from genomic information and discuss the range of potential strategies for doing so. In particular, we highlight concrete examples of how both cultivation-independent and cultivation-dependent approaches can be integrated with genomic data to develop predictive models. We also emphasize the limitations and pitfalls of these approaches and the specific knowledge gaps that need to be addressed to successfully expand our understanding of the environmental preferences of bacteria. 
  2. Abstract Abundance-weighted averaging is a simple and common method for estimating taxon preferences (optima) for phosphorus (P) and other environmental drivers of freshwater-ecosystem health. These optima can then be used to develop transfer functions to infer current and/or past environmental conditions of aquatic ecosystems in water-quality assessments and/or paleolimnological studies. However, estimates of species’ environmental preferences are influenced by the sample distribution and length of environmental gradients, which can differ between datasets used to develop and apply a transfer function. Here, we introduce a subsampling method to ensure a uniform and comparable distribution of samples along a P gradient in two similar ecosystems: the Everglades Protection Areas (EPA) and Big Cypress National Preserve (BICY) in South Florida, USA. Diatom optima were estimated for both wetlands using weighted averaging of untransformed and log-transformed periphyton mat total phosphorus (mat TP) values from the original datasets. We compared these estimates to those derived from random subsets of the original datasets. These subsets, referred to as “SUD” datasets, were created to ensure a uniform distribution of mat TP values along the gradient (both untransformed and log-transformed). We found that diatom assemblages in BICY and EPA were similar, dominated by taxa indicating oligotrophic conditions, and strongly influenced by P gradients. However, the original BICY datasets contained more samples with elevated mat TP concentrations than the EPA datasets, introducing a mathematical bias and resulting in a higher abundance of taxa with high mat TP optima in BICY. The weighted averaged mat TP optima of BICY and EPA taxa were positively correlated across all four dataset types, with taxa optima of SUD datasets exhibiting higher correlations than in the original datasets. 
Equalizing the mat TP sample distribution in the two datasets confirmed consistent mat TP estimates for diatom taxa between the two wetland complexes and improved transfer-function performance. Our findings suggest that diatom environmental preferences may be more reliable across regional scales than previously suggested and support the application of models developed in one region to another nearby region if environmental gradient lengths are equalized and data distribution along gradients is uniform. 
  3. The reconstruction of complete microbial metabolic pathways using ‘omics data from environmental samples remains challenging. Computational pipelines for pathway reconstruction that utilize machine learning methods to predict the presence or absence of KEGG modules in incomplete genomes are lacking. Here, we present MetaPathPredict, a software tool that incorporates machine learning models to predict the presence of complete KEGG modules within bacterial genomic datasets. Using gene annotation data and information from the KEGG module database, MetaPathPredict employs deep learning models to predict the presence of KEGG modules in a genome. MetaPathPredict can be used as a command line tool or as a Python module, and both options are designed to be run locally or on a compute cluster. Benchmarks show that MetaPathPredict makes robust predictions of KEGG module presence within highly incomplete genomes. 
  4. Abstract Global change is impacting biodiversity across all habitats on earth. New selection pressures from changing climatic conditions and other anthropogenic activities are creating heterogeneous ecological and evolutionary responses across many species' geographic ranges. Yet we currently lack standardised and reproducible tools to effectively predict the resulting patterns in species vulnerability to declines or range changes. We developed an informatic toolbox that integrates ecological, environmental and genomic data and analyses (environmental dissimilarity, species distribution models, landscape connectivity, neutral and adaptive genetic diversity, genotype‐environment associations and genomic offset) to estimate population vulnerability. In our toolbox, functions and data structures are coded in a standardised way so that it is applicable to any species or geographic region where appropriate data are available, for example individual or population sampling and genomic datasets (e.g. RAD‐seq, ddRAD‐seq, whole genome sequencing data) representing environmental variation across the species geographic range. To demonstrate multi‐species applicability, we apply our toolbox to three georeferenced genomic datasets for co‐occurring East African spiny reed frogs (Afrixalus fornasini, A. delicatus and A. sylvaticus) to predict their population vulnerability, as well as demonstrating that range loss projections based on adaptive variation can be accurately reproduced from a previous study using data for two European bat species (Myotis escalerai and M. crypticus). Our framework sets the stage for large scale, multi‐species genomic datasets to be leveraged in a novel climate change vulnerability framework to quantify intraspecific differences in genetic diversity, local adaptation, range shifts and population vulnerability based on exposure, sensitivity and landscape barriers. 
  5. Ouzounis, Christos A (Ed.)
    The metabolic activity of microbial communities is central to their role in biogeochemical cycles, human health, and biotechnology. Despite the abundance of sequencing data characterizing these consortia, it remains a serious challenge to predict microbial metabolic traits from sequencing data alone. Here we culture 96 bacterial isolates individually and assay their ability to grow on 10 distinct compounds as a sole carbon source. Using these data as well as two existing datasets, we show that statistical approaches can accurately predict bacterial carbon utilization traits from genomes. First, we show that classifiers trained on gene content can accurately predict bacterial carbon utilization phenotypes by encoding phylogenetic information. These models substantially outperform predictions made by constraint-based metabolic models automatically constructed from genomes. This result solidifies our current knowledge about the strong connection between phylogeny and metabolic traits. However, phylogeny-based predictions fail to predict traits for taxa that are phylogenetically distant from any strains in the training set. To overcome this we train improved models on gene presence/absence to predict carbon utilization traits from gene content. We show that models that predict carbon utilization traits from gene presence/absence can generalize to taxa that are phylogenetically distant from the training set either by exploiting biochemical information for feature selection or by having sufficiently large datasets. In the latter case, we provide evidence that a statistical approach can identify putatively mechanistic genes involved in metabolic traits. Our study demonstrates the potential power for predicting microbial phenotypes from genotypes using statistical approaches. 
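For readers unfamiliar with the abundance-weighted averaging mentioned in item 2 of the list above, a minimal sketch follows. It assumes the standard definition, in which a taxon's optimum is the mean of an environmental variable across samples, weighted by the taxon's abundance in each sample. The function name and the example values are hypothetical and are not taken from any of the listed studies.

```python
from typing import Sequence


def weighted_average_optimum(
    abundances: Sequence[float], env_values: Sequence[float]
) -> float:
    """Estimate a taxon's environmental optimum by abundance-weighted averaging.

    abundances: abundance of one taxon in each sample (hypothetical input).
    env_values: the environmental variable (e.g. pH or total phosphorus)
                measured in the same samples, in the same order.
    """
    if len(abundances) != len(env_values):
        raise ValueError("abundances and env_values must have the same length")
    total = sum(abundances)
    if total <= 0:
        raise ValueError("taxon must have positive total abundance")
    # Each sample's environmental value contributes in proportion to
    # how abundant the taxon is in that sample.
    return sum(a * x for a, x in zip(abundances, env_values)) / total


# Illustrative values only: a taxon that is three times as abundant
# at pH 8 as at pH 4.
optimum = weighted_average_optimum([1.0, 3.0], [4.0, 8.0])
```

Because each sample acts as a weight, an uneven spread of samples along the gradient pulls the estimate toward the oversampled end. That is the bias item 2 addresses by subsampling to a uniform distribution.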