Title: Direct covariance matrix estimation with compositional data
Compositional data arise in many areas of research in the natural and biomedical sciences. One prominent example is in the study of the human gut microbiome, where one can measure the relative abundance of many distinct microorganisms in a subject’s gut. Often, practitioners are interested in learning how the dependencies between microbes vary across distinct populations or experimental conditions. In statistical terms, the goal is to estimate a covariance matrix for the (latent) log-abundances of the microbes in each of the populations. However, the compositional nature of the data prevents the use of standard estimators for these covariance matrices. In this article, we propose an estimator of multiple covariance matrices which allows for information sharing across distinct populations of samples. Compared to some existing estimators, which estimate the covariance matrices of interest indirectly, our estimator is direct, ensures positive definiteness, and is the solution to a convex optimization problem. We compute our estimator using a proximal-proximal gradient descent algorithm. Asymptotic properties of our estimator reveal that it can perform well in high-dimensional settings. We show that our method provides more reliable estimates than competitors in an analysis of microbiome data from subjects with myalgic encephalomyelitis/chronic fatigue syndrome and through simulation studies.
Award ID(s):
2113589 2415067
PAR ID:
10533314
Author(s) / Creator(s):
Publisher / Repository:
Institute of Mathematical Statistics
Date Published:
Journal Name:
Electronic Journal of Statistics
Volume:
18
Issue:
1
ISSN:
1935-7524
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multisample approach to estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real datasets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. Theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback–Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset. 
  2. Summary High-throughput sequencing technology provides unprecedented opportunities to quantitatively explore human gut microbiome and its relation to diseases. Microbiome data are compositional, sparse, noisy, and heterogeneous, which pose serious challenges for statistical modeling. We propose an identifiable Bayesian multinomial matrix factorization model to infer overlapping clusters on both microbes and hosts. The proposed method represents the observed over-dispersed zero-inflated count matrix as Dirichlet-multinomial mixtures on which latent cluster structures are built hierarchically. Under the Bayesian framework, the number of clusters is automatically determined and available information from a taxonomic rank tree of microbes is naturally incorporated, which greatly improves the interpretability of our findings. We demonstrate the utility of the proposed approach by comparing to alternative methods in simulations. An application to a human gut microbiome data set involving patients with inflammatory bowel disease reveals interesting clusters, which contain bacteria families Bacteroidaceae, Bifidobacteriaceae, Enterobacteriaceae, Fusobacteriaceae, Lachnospiraceae, Ruminococcaceae, Pasteurellaceae, and Porphyromonadaceae that are known to be related to the inflammatory bowel disease and its subtypes according to biological literature. Our findings can help generate potential hypotheses for future investigation of the heterogeneity of the human gut microbiome.
  3. Compositional data sets are ubiquitous in science, including geology, ecology, and microbiology. In microbiome research, compositional data primarily arise from high-throughput sequence-based profiling experiments. These data comprise microbial compositions in their natural habitat and are often paired with covariate measurements that characterize physicochemical habitat properties or the physiology of the host. Inferring parsimonious statistical associations between microbial compositions and habitat- or host-specific covariate data is an important step in exploratory data analysis. A standard statistical model linking compositional covariates to continuous outcomes is the linear log-contrast model. This model describes the response as a linear combination of log-ratios of the original compositions and has been extended to the high-dimensional setting via regularization. In this contribution, we propose a general convex optimization model for linear log-contrast regression which includes many previous proposals as special cases. We introduce a proximal algorithm that solves the resulting constrained optimization problem exactly with rigorous convergence guarantees. We illustrate the versatility of our approach by investigating the performance of several model instances on soil and gut microbiome data analysis tasks. 
  4. Neotropical wood‐eating catfishes (family Loricariidae) can occur in diverse assemblages with multiple genera and species feeding on the same woody detritus. As such, they present an intriguing system in which to examine the influence of host species identity on the vertebrate gut microbiome as well as to determine the potential role of gut bacteria in wood digestion. We characterized the gut microbiome of two co‐occurring catfish genera and four species: Panaqolus albomaculatus, Panaqolus gnomus, Panaqolus nocturnus, and Panaque bathyphilus, as well as that of submerged wood on which they feed. The gut bacterial community did not significantly vary across three gut regions (proximal, mid, distal) for any catfish species, although interspecific variation in the gut microbiome was significant, with magnitude of interspecific difference generally reflecting host phylogenetic proximity. Further, the gut microbiome of each species was significantly different from that present on the submerged wood. Inferring the genomic potential of the gut microbiome revealed that the majority of wood digesting pathways were at best equivalent to and more often depleted or nonexistent within the catfish gut compared to the submerged wood, suggesting a minimal role for the gut microbiome in wood digestion. Rather, these fishes are more likely reliant on fiber degradation performed by microbes in the environment, with their gut microbiome determined more by host identity and phylogenetic history.
  5. The mammary microbiome is a newly characterized bacterial niche that might offer biological insight into the development of breast cancer. Together with in-depth analysis of the gut microbiome in breast cancer, current evidence using next-generation sequencing and metabolic profiling suggests compositional and functional shifts in microbial consortia are associated with breast cancer. In this review, we discuss the fundamental studies that have progressed this important area of research, focusing on the roles of both the mammary tissue microbiome and the gut microbiome. From the literature, we identified the following major conclusions: (I) there are unique breast and gut microbial signatures (both compositional and functional) that are associated with breast cancer, (II) breast and gut microbiome compositional and breast functional dysbiosis represent potential early events of breast tumor development, (III) specific breast and gut microbes confer host immune responses that can combat breast tumor development and progression, and (IV) chemotherapies alter the microbiome and thus maintenance of a eubiotic microbiome may be key in breast cancer treatment. As the field advances, it is necessary for the role of the microbiome to continue to be elucidated using multi-omic approaches and translational animal models in order to improve predictive, preventive, and therapeutic strategies for breast cancer.