Title: Evolution-Informed Neural Networks for Microbiome Data Analysis
Advances in metagenomic sequencing have provided an unprecedented view of the microbial world, but untangling the web of microbe interdependencies and the complex relationship between microbiome and host is a major challenge in biology. New statistical methods are needed to analyze metagenomic data and infer these relationships. Focusing on amplicon sequencing data, we present methods for leveraging phylogenetic information in deep neural network models and for transfer learning from large data repositories. This approach is demonstrated in experiments using data from the Earth Microbiome Project (EMP) and a dataset of 1500 samples from Waimea Valley on the island of Oahu, Hawaii.
Award ID(s):
2124922
PAR ID:
10553813
Publisher / Repository:
IEEE
ISBN:
978-1-6654-0126-5
Page Range / eLocation ID:
3386 to 3391
Format(s):
Medium: X
Location:
Houston, TX, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. ABSTRACT Advances in next‐generation sequencing technology have enabled the high‐throughput profiling of metagenomes and accelerated microbiome studies. Recently, there has been a rise in quantitative studies that aim to decipher the microbiome co‐occurrence network and its underlying community structure based on metagenomic sequence data. Uncovering the complex microbiome community structure is essential to understanding the role of the microbiome in disease progression and susceptibility. Taxonomic abundance data generated from metagenomic sequencing technologies are high‐dimensional and compositional, suffering from uneven sampling depth, over‐dispersion, and zero‐inflation. These characteristics often challenge the reliability of the current methods for microbiome community detection. To study the microbiome co‐occurrence network and perform community detection, we propose a generalized Bayesian stochastic block model that is tailored for microbiome data analysis where the data are transformed using the recently developed modified centered‐log ratio transformation. Our model also allows us to leverage taxonomic tree information using a Markov random field prior. The model parameters are jointly inferred by using Markov chain Monte Carlo sampling techniques. Our simulation study showed that the proposed approach performs better than competing methods even when taxonomic tree information is non‐informative. We applied our approach to a real urinary microbiome dataset from postmenopausal women. To the best of our knowledge, this is the first time the urinary microbiome co‐occurrence network structure in postmenopausal women has been studied. In summary, this statistical methodology provides a new tool for facilitating advanced microbiome studies. 
  2. Viruses play crucial roles in the ecology of microbial communities, yet they remain relatively understudied in their native environments. Despite many advancements in high-throughput whole-genome sequencing (WGS), sequence assembly, and annotation of viruses, the reconstruction of full-length viral genomes directly from metagenomic sequencing is possible only for the most abundant phages and requires long-read sequencing technologies. Additionally, the prediction of their cellular hosts remains difficult from conventional metagenomic sequencing alone. To address these gaps in the field and to accelerate the study of viruses directly in their native microbiomes, we developed an end-to-end bioinformatics platform for viral genome reconstruction and host attribution from metagenomic data using proximity-ligation sequencing (i.e., Hi-C). We demonstrate the capabilities of the platform by recovering and characterizing the metavirome of a variety of metagenomes, including a fecal microbiome that has also been sequenced with accurate long reads, allowing for the assessment and benchmarking of the new methods. The platform can accurately extract numerous near-complete viral genomes even from highly fragmented short-read assemblies and can reliably predict their cellular hosts with minimal false positives. To our knowledge, this is the first software for performing these tasks. Being significantly cheaper than long-read sequencing of comparable depth, the incorporation of proximity-ligation sequencing in microbiome research shows promise to greatly accelerate future advancements in the field. 
  3. Segata, Nicola (Ed.)
    The ability to predict human phenotypes and identify biomarkers of disease from metagenomic data is crucial for the development of therapeutics for microbiome-associated diseases. However, metagenomic data is commonly affected by technical variables unrelated to the phenotype of interest, such as sequencing protocol, which can make it difficult to predict phenotype and find biomarkers of disease. Supervised methods to correct for background noise, originally designed for gene expression and RNA-seq data, are commonly applied to microbiome data but may be limited because they cannot account for unmeasured sources of variation. Unsupervised approaches address this issue, but current methods are limited because they are ill-equipped to deal with the unique aspects of microbiome data, which is compositional, highly skewed, and sparse. We perform a comparative analysis of the ability of different denoising transformations in combination with supervised correction methods as well as an unsupervised principal component correction approach that is presently used in other domains but has not been applied to microbiome data to date. We find that the unsupervised principal component correction approach is comparable to the supervised approaches in reducing false discovery of biomarkers, with the added benefit of not needing to know the sources of variation a priori. However, in prediction tasks, it appears to only improve prediction when technical variables contribute to the majority of variance in the data. As new and larger metagenomic datasets become increasingly available, background noise correction will become essential for generating reproducible microbiome analyses.
  4. Campbell, Barbara J. (Ed.)
    ABSTRACT Host-associated microbiomes can be critical for the health and proper development of animals and plants. The answers to many fundamental questions regarding the modes of acquisition and microevolution of microbiome communities remain to be established. Deciphering strain-level dynamics is essential to fully understand how microbial communities evolve, but the forces shaping the strain-level dynamics of microbial communities remain largely unexplored, mostly because of methodological issues and cost. Here, we used targeted strain-level deep sequencing to uncover the strain dynamics within a host-associated microbial community using the honey bee gut microbiome as a model system. Our results revealed that amplicon sequencing of conserved protein-coding gene regions using species-specific primers is a cost-effective and accurate method for exploring strain-level diversity. In fact, using this method we were able to confirm strain-level results that have been obtained from whole-genome shotgun sequencing of the honey bee gut microbiome but with a much higher resolution. Importantly, our deep sequencing approach allowed us to explore the impact of low-frequency strains (i.e., cryptic strains) on microbiome dynamics. Results show that cryptic strain diversity is not responsible for the observed variations in microbiome composition across bees. Altogether, the findings revealed new fundamental insights regarding strain dynamics of host-associated microbiomes.
    IMPORTANCE The factors driving fine-scale composition and dynamics of gut microbial communities are poorly understood. In this study, we used metagenomic amplicon deep sequencing to decipher the strain dynamics of two key members of the honey bee gut microbiome. Using this high-throughput and cost-effective approach, we were able to confirm results from previous large-scale whole-genome shotgun (WGS) metagenomic sequencing studies while also gaining additional insights into the community dynamics of two core members of the honey bee gut microbiome. Moreover, we were able to show that cryptic strains are not responsible for the observed variations in microbiome composition across bees.
  5. We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies.
    Importance Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies.