


Title: A Novel Sparse Compositional Technique Reveals Microbial Perturbations
ABSTRACT The central aims of many host or environmental microbiome studies are to elucidate factors associated with microbial community compositions and to relate microbial features to outcomes. However, these aims are often complicated by difficulties stemming from the high dimensionality, non-normality, sparsity, and compositional nature of microbiome data sets. A key tool in microbiome analysis is beta diversity, defined by the distances between microbial samples. Many different distance metrics have been proposed, all with varying discriminatory power on data with differing characteristics. Here, we propose a compositional beta diversity metric rooted in a centered log-ratio transformation and matrix completion called robust Aitchison PCA. We demonstrate the benefits of compositional transformations upstream of beta diversity calculations through simulations. Additionally, we demonstrate improved effect size, classification accuracy, and robustness to sequencing depth over the current methods on several decreased sample subsets of real microbiome data sets. Finally, we highlight the ability of this new beta diversity metric to retain the feature loadings linked to sample ordinations, revealing salient intercommunity niche feature importance. IMPORTANCE By accounting for the sparse compositional nature of microbiome data sets, robust Aitchison PCA can yield high discriminatory power and salient feature ranking between microbial niches. The software to perform this analysis is available under an open-source license and can be obtained at ; additionally, a QIIME 2 plugin is provided to perform this analysis at .
Award ID(s):
1804187 1804733 1804671
Medium: X
Sponsoring Org:
National Science Foundation
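The abstract above describes robust Aitchison PCA only at a high level, as a centered log-ratio transformation followed by matrix completion. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that zeros are treated as missing values, that the log-ratio is centered per sample over the observed features only, and it uses an iterated rank-truncated SVD as a simple stand-in for the matrix completion step. The function names `robust_clr` and `complete_low_rank`, the rank, and the toy count table are all hypothetical.

```python
import numpy as np


def robust_clr(counts):
    """Centered log-ratio over observed (nonzero) entries; zeros become NaN."""
    counts = np.asarray(counts, dtype=float)
    observed = counts > 0
    logs = np.full(counts.shape, np.nan)
    logs[observed] = np.log(counts[observed])
    # Center each sample (row) by the mean log of its observed features.
    row_means = np.nanmean(logs, axis=1, keepdims=True)
    return logs - row_means


def complete_low_rank(matrix, rank=2, n_iter=200):
    """Fill NaN entries by iterating a rank-truncated SVD.

    Assumption: this is a simple stand-in for matrix completion,
    not the solver used by the published software.
    """
    missing = np.isnan(matrix)
    filled = np.where(missing, 0.0, matrix)
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Keep observed values fixed; update only the missing entries.
        filled = np.where(missing, approx, matrix)
    return filled


# Hypothetical toy table: rows are samples, columns are microbial features.
counts = np.array([
    [10, 0, 5, 1],
    [8, 2, 0, 1],
    [0, 9, 7, 3],
    [1, 7, 9, 0],
])

transformed = robust_clr(counts)
completed = complete_low_rank(transformed, rank=2)

# Sample ordination from the completed matrix (principal axes via SVD).
centered = completed - completed.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
sample_scores = u[:, :2] * s[:2]
feature_loadings = vt[:2].T
```

Because the ordination comes from a single factorization, each sample axis has a matching vector of feature loadings, which is the property the abstract highlights for ranking features.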
More Like this
  1. Abstract Background

    Understanding the factors that influence microbes’ environmental distributions is important for determining drivers of microbial community composition. These include environmental variables like temperature and pH, and higher-dimensional variables like geographic distance and host species phylogeny. In microbial ecology, “specificity” is often described in the context of symbiotic or host-parasite interactions, but specificity can be more broadly used to describe the extent to which a species occupies a narrower range of an environmental variable than expected by chance. Using a standardization we describe here, Rao’s Quadratic Entropy (Theor Popul Biol, 1982; Sankhya A, 2010) can be conveniently applied to calculate specificity of a feature, such as a species, to many different environmental variables.


    We present our R package specificity for performing the above analyses, and apply it to four real-life microbial data sets to demonstrate its application. We found that many fungi within the leaves of native Hawaiian plants had strong specificity to rainfall and elevation, even though these variables showed minimal importance in a previous analysis of fungal beta-diversity. In Antarctic cryoconite holes, our tool revealed that many bacteria have specificity to co-occurring algal community composition. Similarly, in the human gut microbiome, many bacteria showed specificity to the composition of bile acids. Finally, our analysis of the Earth Microbiome Project data set showed that most bacteria show strong ontological specificity to sample type. Our software performed as expected on synthetic data as well.


    specificity is well-suited to analysis of microbiome data, both in synthetic test cases and across multiple environment types and experimental designs. The analysis and software we present here can reveal patterns in microbial taxa that may not be evident from a community-level perspective. These insights can also be visualized and interactively shared among researchers using specificity’s companion package, specificity.shiny.
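The specificity abstract builds on Rao's Quadratic Entropy, which can be written as Q = Σᵢ Σⱼ dᵢⱼ pᵢ pⱼ, where pᵢ is a feature's relative weight in sample i and dᵢⱼ is the dissimilarity between samples i and j for some environmental variable. The sketch below is written in Python rather than R and does not use the package's API. It computes the raw quantity with the full double sum, then compares it with a simple permutation null. That null only illustrates "narrower than expected by chance" and is an assumption, not the standardization the authors describe. The elevation values and abundances are invented.

```python
import numpy as np


def rao_quadratic_entropy(weights, distances):
    """Rao's Q = sum_i sum_j d_ij * p_i * p_j, with p normalized to sum to 1."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    return float(p @ np.asarray(distances, dtype=float) @ p)


# Hypothetical environmental variable: elevation (m) of five samples.
elevation = np.array([100.0, 150.0, 900.0, 1500.0, 2000.0])
distances = np.abs(elevation[:, None] - elevation[None, :])

# Hypothetical abundances of two species across the same five samples.
specialist = np.array([40, 60, 0, 0, 0])    # confined to low elevations
generalist = np.array([20, 20, 20, 20, 20])  # spread evenly

q_specialist = rao_quadratic_entropy(specialist, distances)
q_generalist = rao_quadratic_entropy(generalist, distances)

# Illustrative null: shuffle the specialist's abundances among samples.
rng = np.random.default_rng(0)
null_q = np.array([
    rao_quadratic_entropy(rng.permutation(specialist), distances)
    for _ in range(999)
])
null_mean = float(null_q.mean())
```

A species whose Q falls well below the values obtained under permutation occupies a narrower range of the variable than chance would suggest.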

  2. Abstract

    We have previously shown that the time of Chlamydia infection was crucial in determining chlamydial infectivity and pathogenesis. This study aims to determine whether the time of Chlamydia infection affects the genital tract microbiome. We analyzed the vaginal, uterine, and ovary/oviduct microbiomes of mice with and without Chlamydia infection. The mice were infected with Chlamydia at either 10:00 am (ZT3) or 10:00 pm (ZT15). The results showed that mice infected at ZT3 had higher Chlamydia infectivity than those infected at ZT15. There was more variation in the compositional complexity of the vaginal microbiome (alpha diversity) of mice infected at ZT3 than of those infected at ZT15 throughout the infection within each treatment group, with both Shannon and Simpson diversity index values decreasing over time. The analysis of samples collected four weeks post-infection showed that there were significant taxonomical differences (beta diversity) between different parts of the genital tract (vagina, uterus, and ovary/oviduct), and this difference was associated with the time of infection. Firmicutes and Proteobacteria were the most abundant phyla within the microbiome in all three genital tract regions for all the samples collected during this experiment. Additionally, Firmicutes was the dominant phylum in the uterine microbiome of ZT3 Chlamydia-infected mice. The results show that the time of infection is associated with the microbial dynamics in the genital tract, and this association is more robust in the upper genital tract than in the vagina. This result implies that more emphasis should be placed on understanding the changes in the microbial dynamics of the upper genital tract over the course of infection.

  3. Abstract Motivation

    Differential abundance analysis is an essential and commonly used tool to characterize the difference between microbial communities. However, identifying differentially abundant microbes remains a challenging problem because the observed microbiome data are inherently compositional, excessively sparse, and distorted by experimental bias. Besides these major challenges, the results of differential abundance analysis also depend largely on the choice of analysis unit, adding another practical complexity to this already complicated problem.


    In this work, we introduce a new differential abundance test called the MsRDB test, which embeds the sequences into a metric space and integrates a multiscale adaptive strategy for utilizing spatial structure to identify differentially abundant microbes. Compared with existing methods, the MsRDB test can detect differentially abundant microbes at the finest resolution offered by data and provide adequate detection power while being robust to zero counts, compositional effect, and experimental bias in the microbial compositional dataset. Applications to both simulated and real microbial compositional datasets demonstrate the usefulness of the MsRDB test.

    Availability and implementation

    All analyses can be found under

  4. Large-scale microbiome studies investigating disease-inducing microbial roles base their findings on differences between microbial count data in contrasting environments (e.g., stool samples between cases and controls). These microbiome survey studies are often impeded by small sample sizes and database bias. Combining data from multiple survey studies often results in obvious batch effects, even when DNA preparation and sequencing methods are identical. Relatedly, predictive models trained on one microbial DNA dataset often do not generalize to outside datasets. In this study, we address these limitations by applying word embedding algorithms (GloVe) and PCA transformation to ASV data from the American Gut Project and generating translation matrices that can be applied to any 16S rRNA V4 region gut microbiome sequencing study. Because these approaches contextualize microbial occurrences in a larger dataset while reducing dimensionality of the feature space, they can improve generalization of predictive models that predict host phenotype from stool-associated gut microbiota. The GMEmbeddings R package contains GloVe and PCA embedding transformation matrices at 50, 100, and 250 dimensions, each learned using ∼15,000 samples from the American Gut Project. It currently supports the alignment, matching, and matrix multiplication needed to allow users to transform their V4 16S rRNA data into these embedding spaces. We show how to correlate the properties in the new embedding space to KEGG functional pathways for biological interpretation of results. Lastly, we provide benchmarking on six gut microbiome datasets describing three phenotypes to demonstrate the ability of embedding-based microbiome classifiers to generalize to independent datasets. Future iterations of GMEmbeddings will include embedding transformation matrices for other biological systems. Available at: .
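The transformation step described in the GMEmbeddings abstract (match a study's ASVs to those in the embedding matrix, then multiply) can be sketched as follows. This is a Python illustration under stated assumptions, not the R package's API: matching is done here by exact identifier rather than by sequence alignment, the embedding matrix is random placeholder data with 3 dimensions instead of 50, 100, or 250, no normalization is applied because the abstract does not specify one, and the function name `embed_samples` and all identifiers are hypothetical.

```python
import numpy as np

# Placeholder embedding matrix: one row per reference ASV, one column per
# embedding dimension (random values stand in for learned GloVe/PCA weights).
embedding_ids = ["asv_a", "asv_b", "asv_c", "asv_d"]
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(embedding_ids), 3))

# Hypothetical study table: rows are samples, columns are the study's ASVs.
# "asv_x" has no match in the embedding matrix and is dropped.
sample_asv_ids = ["asv_c", "asv_x", "asv_a"]
counts = np.array([
    [5.0, 2.0, 3.0],
    [0.0, 4.0, 6.0],
])


def embed_samples(counts, sample_asv_ids, embedding, embedding_ids):
    """Project samples into the embedding space via matched ASVs."""
    row_of = {asv: i for i, asv in enumerate(embedding_ids)}
    # Pair each matched study column with its embedding row.
    matched = [
        (col, row_of[asv])
        for col, asv in enumerate(sample_asv_ids)
        if asv in row_of
    ]
    cols = [col for col, _ in matched]
    rows = [row for _, row in matched]
    # (samples x matched ASVs) @ (matched ASVs x dimensions)
    return counts[:, cols] @ embedding[rows, :]


embedded = embed_samples(counts, sample_asv_ids, embedding, embedding_ids)
```

Each sample ends up as a fixed-length vector regardless of how many ASVs the original study reported, which is what lets a classifier trained on one study be applied to another.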
  5. Greene, Casey S. (Ed.)
    ABSTRACT UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another (beta diversity). Striped UniFrac recently added the ability to split the problem into many independent subproblems, exhibiting nearly linear scaling but suffering from memory contention. Here, we adapt UniFrac to graphics processing units using OpenACC, enabling greater than 1,000× computational improvement, and apply it to 307,237 samples, the largest 16S rRNA V4 uniformly preprocessed microbiome data set analyzed to date. IMPORTANCE UniFrac is an important tool in microbiome research that is used for phylogenetically comparing microbiome profiles to one another. Here, we adapt UniFrac to operate on graphics processing units, enabling a 1,000× computational improvement. To highlight this advance, we perform what may be the largest microbiome analysis to date, applying UniFrac to 307,237 16S rRNA V4 microbiome samples preprocessed with Deblur. These scaling improvements turn UniFrac into a real-time tool for common data sets and unlock new research questions as more microbiome data are collected. 