Title: D-MANOVA: fast distance-based multivariate analysis of variance for large-scale microbiome association studies
Abstract: Summary: PERMANOVA (permutational multivariate analysis of variance based on distances) has been widely used for testing the association between the microbiome and a covariate of interest. Statistical significance is established by permutation, which is computationally intensive for large sample sizes. As large-scale microbiome studies, such as the American Gut Project (AGP), become increasingly popular, a computationally efficient version of PERMANOVA is much needed. To this end, we derive the asymptotic distribution of the PERMANOVA pseudo-F statistic and provide an analytical P-value calculation based on a chi-square approximation. We show that the asymptotic P-value is close to the PERMANOVA P-value even under a moderate sample size. Moreover, it is more accurate and an order of magnitude faster than the permutation-free method MDMR. We demonstrate the use of our procedure, D-MANOVA, on the AGP dataset. Availability and implementation: D-MANOVA is implemented by the dmanova function in the CRAN package GUniFrac. Supplementary information: Supplementary data are available at Bioinformatics online.
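For orientation, the permutation procedure that the abstract describes as computationally intensive can be sketched as follows. This is a minimal illustration of a one-way PERMANOVA pseudo-F test on a precomputed distance matrix, not the D-MANOVA chi-square approximation and not the GUniFrac code; the function names `pseudo_f` and `permanova` are hypothetical, and a single categorical covariate is assumed.

```python
import numpy as np

def pseudo_f(dist, labels):
    """One-way PERMANOVA pseudo-F from an n x n distance matrix (assumed symmetric)."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = dist.shape[0]
    groups = np.unique(labels)
    a = len(groups)
    sq = dist ** 2
    # Total sum of squares: squared distances over all pairs i < j, divided by n.
    ss_total = sq[np.triu_indices(n, k=1)].sum() / n
    # Within-group sum of squares: the same quantity computed inside each group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = sq[np.ix_(idx, idx)]
        ss_within += sub[np.triu_indices(len(idx), k=1)].sum() / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, labels, n_perm=999, seed=0):
    """Permutation P-value: share of label shuffles with pseudo-F >= observed."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    f_obs = pseudo_f(dist, labels)
    # Each permutation recomputes the statistic, so cost grows with n_perm.
    hits = sum(pseudo_f(dist, rng.permutation(labels)) >= f_obs
               for _ in range(n_perm))
    return f_obs, (hits + 1) / (n_perm + 1)
```

Each permutation repeats the full sum-of-squares computation, which is the cost that an analytical P-value avoids. With Euclidean distances on univariate data, `pseudo_f` reduces to the classical one-way ANOVA F statistic, which gives a simple sanity check.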
Award ID(s): 1830392, 2113360, 2113359
PAR ID: 10282829
Author(s) / Creator(s): ;
Editor(s): Schwartz, Russell
Date Published:
Journal Name: Bioinformatics
ISSN: 1367-4803
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract: Summary: Due to their sparsity and high dimensionality, microbiome data are routinely summarized into pairwise distances capturing the compositional differences. Many biological insights can be gained by analyzing the distance matrix in relation to some covariates. A microbiome sampling method that characterizes the inter-sample relationship more reproducibly is expected to yield higher statistical power. Traditionally, the intraclass correlation coefficient (ICC) has been used to quantify the degree of reproducibility for a univariate measurement using technical replicates. In this work, we extend the traditional ICC to distance measures and propose a distance-based ICC (dICC). We derive the asymptotic distribution of the sample-based dICC to facilitate statistical inference. We illustrate dICC using a real dataset from a metagenomic reproducibility study. Availability and implementation: dICC is implemented in the R CRAN package GUniFrac. Supplementary information: Supplementary data are available at Bioinformatics online.
  2. We are interested in testing general linear hypotheses in a high-dimensional multivariate linear regression model. The framework includes many well-studied problems such as two-sample tests for equality of population means, MANOVA and others as special cases. A family of rotation-invariant tests is proposed that involves a flexible spectral shrinkage scheme applied to the sample error covariance matrix. The asymptotic normality of the test statistic under the null hypothesis is derived in the setting where dimensionality is comparable to sample sizes, assuming the existence of certain moments for the observations. The asymptotic power of the proposed test is studied under various local alternatives. The power characteristics are then utilized to propose a data-driven selection of the spectral shrinkage function. As an illustration of the general theory, we construct a family of tests involving ridge-type regularization and suggest possible extensions to more complex regularizers. A simulation study is carried out to examine the numerical performance of the proposed tests. 
  3. Abstract: Background: Children are less susceptible to SARS-CoV-2 infection and typically have milder illness courses than adults, but the factors underlying these age-associated differences are not well understood. The upper respiratory microbiome undergoes substantial shifts during childhood and is increasingly recognized to influence host defense against respiratory pathogens. Thus, we sought to identify upper respiratory microbiome features associated with SARS-CoV-2 infection susceptibility and illness severity. Methods: We collected clinical data and nasopharyngeal swabs from 285 children, adolescents, and young adults (<21 years) with documented SARS-CoV-2 exposure. We used 16S ribosomal RNA gene sequencing to characterize the nasopharyngeal microbiome and evaluated for age-adjusted associations between microbiome characteristics and SARS-CoV-2 infection status and respiratory symptoms. Results: Nasopharyngeal microbiome composition varied with age (PERMANOVA, P < .001; R² = 0.06) and between SARS-CoV-2–infected individuals with and without respiratory symptoms (PERMANOVA, P = .002; R² = 0.009). SARS-CoV-2–infected participants with Corynebacterium/Dolosigranulum-dominant microbiome profiles were less likely to have respiratory symptoms than infected participants with other nasopharyngeal microbiome profiles (OR: .38; 95% CI: .18–.81). Using generalized joint attributed modeling, we identified 9 bacterial taxa associated with SARS-CoV-2 infection and 6 taxa differentially abundant among SARS-CoV-2–infected participants with respiratory symptoms; the magnitude of these associations was strongly influenced by age. Conclusions: We identified interactive relationships between age and specific nasopharyngeal microbiome features that are associated with SARS-CoV-2 infection susceptibility and symptoms in children, adolescents, and young adults. Our data suggest that the upper respiratory microbiome may be a mechanism by which age influences SARS-CoV-2 susceptibility and illness severity.
  4. Abstract: Motivation: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data are high-dimensional and not abundant, leading to a high-dimensional, low-sample-size, under-determined system. Moreover, microbiome data are often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data, with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from the microbiome. Data augmentation, which consists of building synthetic samples and adding them to the training data, is a technique that has proved helpful for many machine learning tasks. Results: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low sample size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes. Availability and implementation: TADA is available at https://github.com/tada-alg/TADA. Supplementary information: Supplementary data are available at Bioinformatics online.
  5. While the recent literature has seen a surge in the study of constrained bandit problems, all existing methods for these begin by assuming the feasibility of the underlying problem. We initiate the study of testing such feasibility assumptions, and in particular address the problem in the linear bandit setting, thus characterising the costs of feasibility testing for an unknown linear program using bandit feedback. Concretely, we test if $\exists x : Ax \ge 0$ for an unknown $A \in \mathbb{R}^{m \times d}$, by playing a sequence of actions $x_t \in \mathbb{R}^d$, and observing $Ax_t$ + noise in response. By identifying the hypothesis as determining the sign of the value of a minimax game, we construct a novel test based on low-regret algorithms and a nonasymptotic law of iterated logarithms. We prove that this test is reliable, and adapts to the ‘signal level,’ $T$, of any instance, with mean sample costs scaling as $O(d^2/T^2)$. We complement this by a minimax lower bound of $\Omega(d/T^2)$ for sample costs of reliable tests, dominating prior asymptotic lower bounds by capturing the dependence on $d$, and thus elucidating a basic insight missing in the extant literature on such problems.