Title: Multisample estimation of bacterial composition matrices in metagenomics data
Summary: Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multisample approach to estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real datasets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. Theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback–Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset.
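Two building blocks named in the summary can be illustrated with a minimal sketch: naive count normalization, which turns every zero count into a zero proportion, and Euclidean projection of a vector onto the probability simplex. The function names `naive_composition` and `project_onto_simplex` are hypothetical, and the sort-based projection below is a standard textbook routine chosen for illustration; it is not claimed to be the authors' implementation. The nuclear norm penalty and the accelerated proximal gradient loop are not shown.

```python
from typing import List


def naive_composition(counts: List[int]) -> List[float]:
    """Normalize read counts to proportions (illustrative helper).

    Any taxon with a zero count gets an estimated proportion of exactly zero.
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive entry")
    return [c / total for c in counts]


def project_onto_simplex(v: List[float], total: float = 1.0) -> List[float]:
    """Euclidean projection of v onto {x : x_i >= 0, sum(x) = total}.

    Sort-based method: find the shift theta such that clipping v - theta
    at zero yields entries that sum to `total`.
    """
    sorted_desc = sorted(v, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for k, value in enumerate(sorted_desc, start=1):
        running_sum += value
        candidate = (running_sum - total) / k
        # Keep the shift from the last k for which the k-th largest
        # entry stays strictly positive after shifting.
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


if __name__ == "__main__":
    # A sample with one unobserved taxon: its naive proportion is zero.
    print(naive_composition([8, 0, 2]))
    # An intermediate iterate that has left the simplex is projected back.
    print(project_onto_simplex([0.7, 0.5, -0.1]))
```

In a projected-gradient scheme of the kind the summary describes, a projection like this would be applied to each sample's row of the composition matrix after a gradient step, so that every row remains a valid composition.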
Award ID(s):
1811868
PAR ID:
10166376
Journal Name:
Biometrika
Volume:
107
Issue:
1
ISSN:
0006-3444
Page Range / eLocation ID:
75 to 92
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary: In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. We introduce a surprisingly simple, interpretable and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method provides corrections on sequencing data with possible overdispersion and simultaneously avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
  2. Host-associated microbiomes play important roles in host health and pathogen defense. In amphibians, the skin-associated microbiota can contribute to innate immunity with potential implications for disease management. Few studies have examined season-long temporal variation in the amphibian skin-associated microbiome, and the interactions between bacteria and fungi on amphibian skin remain poorly understood. We characterize season-long temporal variation in the skin-associated microbiome of the western tiger salamander (Ambystoma mavortium) for both bacteria and fungi between sites and across salamander life stages. Two hundred seven skin-associated microbiome samples were collected from salamanders at two Rocky Mountain lakes throughout the summer and fall of 2018, and 127 additional microbiome samples were collected from lake water and lake substrate. We used 16S rRNA and ITS amplicon sequencing with Bayesian Dirichlet-multinomial regression to estimate the relative abundances of bacterial and fungal taxa, test for differential abundance, examine microbial selection, and derive alpha diversity. We predicted the ability of bacterial communities to inhibit the amphibian chytrid fungus Batrachochytrium dendrobatidis (Bd), a cutaneous fungal pathogen, using stochastic character mapping and a database of Bd-inhibitory bacterial isolates. For both bacteria and fungi, we observed variation in community composition through time, between sites, and with salamander age and life stage. We further found that temporal trends in community composition were specific to each combination of salamander age, life stage, and lake. We found salamander skin to be selective for microbes, with many taxa disproportionately represented relative to the environment. Salamander skin appeared to select for predicted Bd-inhibitory bacteria, and we found a negative relationship between the relative abundances of predicted Bd-inhibitory bacteria and Bd. We hope these findings will assist in the conservation of amphibian species threatened by chytridiomycosis and other emerging diseases.
  3. Abstract: The study of microbiomes across organisms and environments has become a prominent focus in molecular ecology. This perspective article explores common challenges, methodological advancements, and future directions in the field. Key research areas include understanding the drivers of microbiome community assembly, linking microbiome composition to host genetics, exploring microbial functions, transience and spatial partitioning, and disentangling non‐bacterial components of the microbiome. Methodological advancements, such as quantifying absolute abundances, sequencing complete genomes, and utilizing novel statistical approaches, are also useful tools for understanding complex microbial diversity patterns. Our aims are to encourage robust practices in microbiome studies and inspire researchers to explore the next frontier of this rapidly changing field.
  4. Li, Yue (Ed.)
    Microbiome sequencing data normalization is crucial for eliminating technical bias and ensuring accurate downstream analysis. However, this process can be challenging due to the high frequency of zero counts in microbiome data. We propose a novel reference-based normalization method called normalization via rank similarity (RSim) that corrects sample-specific biases, even in the presence of many zero counts. Unlike other normalization methods, RSim does not require additional assumptions or treatments for the high prevalence of zero counts. This makes it robust and minimizes potential bias resulting from procedures that address zero counts, such as pseudo-counts. Our numerical experiments demonstrate that RSim reduces false discoveries, improves detection power, and reveals true biological signals in downstream tasks such as PCoA plotting, association analysis, and differential abundance analysis. 
  5. An inherent issue in high-throughput rRNA gene tag sequencing microbiome surveys is that they provide compositional data in relative abundances. This often leads to spurious correlations, making the interpretation of relationships to biogeochemical rates challenging. To overcome this issue, we quantitatively estimated the abundance of microorganisms by spiking in known amounts of internal DNA standards. Using a 3-year sample set of diverse microbial communities from the Western Antarctica Peninsula, we demonstrated that the internal standard method yielded community profiles and taxon cooccurrence patterns substantially different from those derived using relative abundances. We found that the method provided results consistent with the traditional CHEMTAX analysis of pigments and total bacterial counts by flow cytometry. Using the internal standard method, we also showed that chloroplast 16S rRNA gene data in microbial surveys can be used to estimate abundances of certain eukaryotic phototrophs such as cryptophytes and diatoms. In Phaeocystis, scatter in the 16S/18S rRNA gene ratio may be explained by physiological adaptation to environmental conditions. We conclude that the internal standard method, when applied to rRNA gene microbial community profiling, is quantitative and that its application will substantially improve our understanding of microbial ecosystems. 