Title: A Generalized Bayesian Stochastic Block Model for Microbiome Community Detection
Abstract: Advances in next‐generation sequencing technology have enabled the high‐throughput profiling of metagenomes and accelerated microbiome studies. Recently, there has been a rise in quantitative studies that aim to decipher the microbiome co‐occurrence network and its underlying community structure based on metagenomic sequence data. Uncovering the complex microbiome community structure is essential to understanding the role of the microbiome in disease progression and susceptibility. Taxonomic abundance data generated from metagenomic sequencing technologies are high‐dimensional and compositional, suffering from uneven sampling depth, over‐dispersion, and zero‐inflation. These characteristics often challenge the reliability of the current methods for microbiome community detection. To study the microbiome co‐occurrence network and perform community detection, we propose a generalized Bayesian stochastic block model that is tailored for microbiome data analysis where the data are transformed using the recently developed modified centered log‐ratio transformation. Our model also allows us to leverage taxonomic tree information using a Markov random field prior. The model parameters are jointly inferred by using Markov chain Monte Carlo sampling techniques. Our simulation study showed that the proposed approach performs better than competing methods even when taxonomic tree information is non‐informative. We applied our approach to a real urinary microbiome dataset from postmenopausal women. To the best of our knowledge, this is the first time the urinary microbiome co‐occurrence network structure in postmenopausal women has been studied. In summary, this statistical methodology provides a new tool for facilitating advanced microbiome studies.
Award ID(s):
2210912
PAR ID:
10580530
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Statistics in Medicine
Volume:
44
Issue:
3-4
ISSN:
0277-6715
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project dataset, and more accurate prediction of human age by the gut microbiomes in the Finnish population. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate OGU adoption in future metagenomics studies.
     Importance: Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. However, current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution compared to 16S rRNA amplicon sequence variant analysis. To solve these challenges, we introduce Operational Genomic Units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition while (ii) permitting use of phylogeny-aware tools. Our analysis of real-world datasets shows several advantages over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGU as standard practice in metagenomic studies.
  2. Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while simultaneously integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.
  3. Polyomaviruses are abundant in the human body. The polyomaviruses JC virus (JCPyV) and BK virus (BKPyV) are common viruses in the human urinary tract. Prior studies have estimated that JCPyV infects between 20 and 80% of adults and that BKPyV infects between 65 and 90% of individuals by age 10. However, these two viruses encode the same six genes and share 75% nucleotide sequence identity across their genomes. While prior urinary virome studies have repeatedly reported the presence of JCPyV, we were interested in seeing how JCPyV prevalence compares to BKPyV. We retrieved all publicly available shotgun metagenomic sequencing reads from urinary microbiome and virome studies (n = 165). While one third of the data sets produced hits to JCPyV, upon further investigation we were able to determine that the majority of these were in fact BKPyV. This distinction was made by specifically mining for JCPyV and BKPyV and considering uniform coverage across the genome. This approach provides confidence in taxon calls, even between closely related viruses with significant sequence similarity.
  4. Advances in metagenomic sequencing have provided an unprecedented view of the microbial world, but untangling the web of microbe interdependencies and the complex relationship between microbiome and host is a major challenge in biology. New statistical methods are needed to analyze metagenomic data and infer these relationships. Focusing on amplicon sequencing data, we present methods for leveraging phylogenetic information in deep neural network models and for transfer learning from large data repositories. This approach is demonstrated in experiments using data from the Earth Microbiome Project (EMP) and a dataset of 1500 samples from Waimea Valley on the island of Oahu, Hawaii. 
  5. Switchgrass (Panicum virgatum L.) remains the preeminent American perennial (C4) bioenergy crop for cellulosic ethanol, which could help displace over a quarter of current US petroleum consumption. Intriguingly, there is often little response to nitrogen fertilizer once stands are established. The rhizosphere microbiome plays a critical role in nitrogen cycling and overall plant nutrient uptake. We used high-throughput metagenomic sequencing to characterize the switchgrass rhizosphere microbial community before and after a nitrogen fertilization event for established stands on marginal land. We examined community structure and bulk metabolic potential, and resolved 29 individual bacterial genomes via metagenomic de novo assembly. Community structure and diversity were not significantly different before and after fertilization; however, the bulk metabolic potential of carbohydrate-active enzymes was depleted after fertilization. The 29 metagenome-assembled genomes include some from the ‘most wanted’ soil taxa such as Verrucomicrobia, Candidate phyla UBA10199, Acidobacteria (rare subgroup 23), Dormibacterota, and the very rare Candidatus Eisenbacteria. The Dormibacterota (formerly candidate division AD3) we identified have the potential for autotrophic CO utilization, which may impact carbon partitioning and storage. Our study also suggests that the rhizosphere microbiome may be involved in providing associative nitrogen fixation (ANF) via the novel diazotroph Janthinobacterium to switchgrass.