Title: Direct covariance matrix estimation with compositional data
Compositional data arise in many areas of research in the natural and biomedical sciences. One prominent example is in the study of the human gut microbiome, where one can measure the relative abundance of many distinct microorganisms in a subject’s gut. Often, practitioners are interested in learning how the dependencies between microbes vary across distinct populations or experimental conditions. In statistical terms, the goal is to estimate a covariance matrix for the (latent) log-abundances of the microbes in each of the populations. However, the compositional nature of the data prevents the use of standard estimators for these covariance matrices. In this article, we propose an estimator of multiple covariance matrices which allows for information sharing across distinct populations of samples. Compared to some existing estimators, which estimate the covariance matrices of interest indirectly, our estimator is direct, ensures positive definiteness, and is the solution to a convex optimization problem. We compute our estimator using a proximal-proximal gradient descent algorithm. Asymptotic properties of our estimator reveal that it can perform well in high-dimensional settings. We show that our method provides more reliable estimates than competitors in an analysis of microbiome data from subjects with myalgic encephalomyelitis/chronic fatigue syndrome and through simulation studies.
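To make concrete why the compositional nature of the data blocks standard covariance estimators, the following is a minimal sketch under stated assumptions; it is not the estimator proposed in the article. It simulates hypothetical latent log-abundances with a made-up covariance `omega`, keeps only the relative abundances, and checks the standard identity that the sample covariance of centered log-ratio (clr) transformed compositions equals `G S G`, where `S` is the sample covariance of the latent log-abundances and `G = I - (1/p) 1 1^T`. All names (`clr`, `omega`, `log_w`, `g`) and the simulation settings are assumptions for illustration.

```python
import numpy as np


def clr(x):
    """Centered log-ratio transform applied to each row of a composition matrix."""
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n, p = 200, 5

# Hypothetical latent log-abundance covariance (illustrative, not from the article).
omega = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))
log_w = rng.multivariate_normal(np.zeros(p), omega, size=n)

# Only relative abundances are observed: each row sums to one.
w = np.exp(log_w)
x = w / w.sum(axis=1, keepdims=True)

# Sample covariance of the (unobservable) log-abundances and of the clr data.
s_latent = np.cov(log_w, rowvar=False)
s_clr = np.cov(clr(x), rowvar=False)

# Centering matrix G = I - (1/p) 1 1^T; clr(x) equals G applied to log(w).
g = np.eye(p) - np.ones((p, p)) / p
identity_holds = np.allclose(s_clr, g @ s_latent @ g)

# Rows of the clr covariance sum to zero, so it is singular.
row_sums = s_clr.sum(axis=1)
print(identity_holds, np.abs(row_sums).max())
```

Because `G` annihilates the all-ones vector, the clr covariance is always singular, and different latent covariances can map to the same `G S G`. This is one way to see why the latent covariance cannot simply be read off from a standard sample covariance of the observed compositions.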
Award ID(s):
2113589 2415067
PAR ID:
10533314
Author(s) / Creator(s):
; ;
Publisher / Repository:
Institute of Mathematical Statistics
Date Published:
Journal Name:
Electronic Journal of Statistics
Volume:
18
Issue:
1
ISSN:
1935-7524
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multisample approach to estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real datasets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. Theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback–Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset. 
  2. Summary High-throughput sequencing technology provides unprecedented opportunities to quantitatively explore the human gut microbiome and its relation to diseases. Microbiome data are compositional, sparse, noisy, and heterogeneous, which pose serious challenges for statistical modeling. We propose an identifiable Bayesian multinomial matrix factorization model to infer overlapping clusters on both microbes and hosts. The proposed method represents the observed over-dispersed zero-inflated count matrix as Dirichlet-multinomial mixtures on which latent cluster structures are built hierarchically. Under the Bayesian framework, the number of clusters is automatically determined and available information from a taxonomic rank tree of microbes is naturally incorporated, which greatly improves the interpretability of our findings. We demonstrate the utility of the proposed approach by comparing to alternative methods in simulations. An application to a human gut microbiome data set involving patients with inflammatory bowel disease reveals interesting clusters, which contain bacterial families Bacteroidaceae, Bifidobacteriaceae, Enterobacteriaceae, Fusobacteriaceae, Lachnospiraceae, Ruminococcaceae, Pasteurellaceae, and Porphyromonadaceae that are known to be related to inflammatory bowel disease and its subtypes according to biological literature. Our findings can help generate potential hypotheses for future investigation of the heterogeneity of the human gut microbiome.
  3. Compositional data sets are ubiquitous in science, including geology, ecology, and microbiology. In microbiome research, compositional data primarily arise from high-throughput sequence-based profiling experiments. These data comprise microbial compositions in their natural habitat and are often paired with covariate measurements that characterize physicochemical habitat properties or the physiology of the host. Inferring parsimonious statistical associations between microbial compositions and habitat- or host-specific covariate data is an important step in exploratory data analysis. A standard statistical model linking compositional covariates to continuous outcomes is the linear log-contrast model. This model describes the response as a linear combination of log-ratios of the original compositions and has been extended to the high-dimensional setting via regularization. In this contribution, we propose a general convex optimization model for linear log-contrast regression which includes many previous proposals as special cases. We introduce a proximal algorithm that solves the resulting constrained optimization problem exactly with rigorous convergence guarantees. We illustrate the versatility of our approach by investigating the performance of several model instances on soil and gut microbiome data analysis tasks. 
  4. Neotropical wood‐eating catfishes (family Loricariidae) can occur in diverse assemblages with multiple genera and species feeding on the same woody detritus. As such, they present an intriguing system in which to examine the influence of host species identity on the vertebrate gut microbiome as well as to determine the potential role of gut bacteria in wood digestion. We characterized the gut microbiome of two co‐occurring catfish genera and four species: Panaqolus albomaculatus, Panaqolus gnomus, Panaqolus nocturnus, and Panaque bathyphilus, as well as that of submerged wood on which they feed. The gut bacterial community did not significantly vary across three gut regions (proximal, mid, distal) for any catfish species, although interspecific variation in the gut microbiome was significant, with magnitude of interspecific difference generally reflecting host phylogenetic proximity. Further, the gut microbiome of each species was significantly different to that present on the submerged wood. Inferring the genomic potential of the gut microbiome revealed that the majority of wood digesting pathways were at best equivalent to and more often depleted or nonexistent within the catfish gut compared to the submerged wood, suggesting a minimal role for the gut microbiome in wood digestion. Rather, these fishes are more likely reliant on fiber degradation performed by microbes in the environment, with their gut microbiome determined more by host identity and phylogenetic history.
  5. Rawls, John F (Ed.)
    ABSTRACT Intestinal helminth parasite (IHP) infection induces alterations in the composition of microbial communities across vertebrates, although how gut microbiota may facilitate or hinder parasite infection remains poorly defined. In this work, we utilized a zebrafish model to investigate the relationship between gut microbiota, gut metabolites, and IHP infection. We found that extreme disparity in zebrafish parasite infection burden is linked to the composition of the gut microbiome and that changes in the gut microbiome are associated with variation in a class of endogenously produced signaling compounds, N-acylethanolamines, that are known to be involved in parasite infection. Using a statistical mediation analysis, we uncovered a set of gut microbes whose relative abundance explains the association between gut metabolites and infection outcomes. Experimental investigation of one of the compounds in this analysis reveals salicylaldehyde, which is putatively produced by the gut microbe Pelomonas, as a potent anthelmintic with activity against Pseudocapillaria tomentosa egg hatching, both in vitro and in vivo. Collectively, our findings underscore the importance of the gut microbiome as a mediating agent in parasitic infection and highlight specific gut metabolites as tools for the advancement of novel therapeutic interventions against IHP infection. IMPORTANCE Intestinal helminth parasites (IHPs) impact human health globally and interfere with animal health and agricultural productivity. While anthelmintics are critical to controlling parasite infections, their efficacy is increasingly compromised by drug resistance. Recent investigations suggest the gut microbiome might mediate helminth infection dynamics. So, identifying how gut microbes interact with parasites could yield new therapeutic targets for infection prevention and management.
We conducted a study using a zebrafish model of parasitic infection to identify routes by which gut microbes might impact helminth infection outcomes. Our research linked the gut microbiome to both parasite infection and to metabolites in the gut to understand how microbes could alter parasite infection. We identified a metabolite in the gut, salicylaldehyde, that is putatively produced by a gut microbe and that inhibits parasitic egg growth. Our results also point to a class of compounds, N-acyl-ethanolamines, which are affected by changes in the gut microbiome and are linked to parasite infection. Collectively, our results indicate the gut microbiome may be a source of novel anthelmintics that can be harnessed to control IHPs. 
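The first related record above mentions Euclidean projection onto the simplex as a step in its optimization algorithm. As a rough illustration of that building block only, here is a minimal sketch of the widely used sort-based projection of a vector onto the probability simplex. It is not taken from that paper, and the function name `project_to_simplex` is an assumption for illustration.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of a 1-D vector onto {x : x >= 0, sum(x) = 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - 1.0                  # cumulative sums minus the target total
    k = np.arange(1, v.size + 1)
    # Largest index at which the shifted entry stays strictly positive.
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)              # common shift applied to all entries
    return np.maximum(v - theta, 0.0)         # shift, then clip negatives to zero


projected = project_to_simplex([0.5, 0.2, 0.9])
unchanged = project_to_simplex([0.2, 0.3, 0.5])
print(projected, unchanged)
```

A vector that already lies on the simplex is returned unchanged, while other vectors are shifted by a common amount and clipped at zero so that the result is nonnegative and sums to one.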