Abstract A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter that provides effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure that removes insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. A subset of the high-dimensional microbial taxa is thereby selected as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. Numerical simulation studies demonstrate these properties, comparing our method with existing ones and showing a power gain for the new method while the nominal FDR is controlled. The potential usefulness of the proposed method is also illustrated with an application to an inflammatory bowel disease data set, identifying microbial taxa that influence host gene expression.
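The two-step structure above (screen first, then apply a knockoff-type filter) can be illustrated with a toy version of the screening step. The sketch below is a hedged illustration only, not the paper's compositional screening procedure: it applies a centered-log-ratio (clr) transform to the compositional covariates and ranks taxa by marginal correlation with the response. The function name `clr_screen` and all parameters are illustrative assumptions.

```python
import numpy as np

def clr_screen(X_comp, y, d):
    """Toy marginal screening for compositional covariates (illustrative,
    not the paper's exact procedure).  X_comp holds strictly positive
    taxa proportions (rows sum to 1); y is the scalar response; the d
    taxa whose clr-transformed columns correlate most strongly with y
    are retained."""
    Z = np.log(X_comp)
    Z = Z - Z.mean(axis=1, keepdims=True)     # clr: row-center the logs
    Zc = Z - Z.mean(axis=0)                   # column-center for correlation
    yc = y - y.mean()
    corr = np.abs(Zc.T @ yc) / (
        np.linalg.norm(Zc, axis=0) * np.linalg.norm(yc) + 1e-12
    )
    return np.argsort(corr)[::-1][:d]         # indices of the d retained taxa
```

A second-stage selection procedure (e.g., a knockoff filter) would then run on the retained columns only, which is what makes finite-sample FDR control feasible after dimension reduction.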
Power of knockoff: The impact of ranking algorithm, augmented design, and symmetric statistic
The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas for the false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype, a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.
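For concreteness, the data-driven threshold that the knockoff variants share can be sketched as follows. Given symmetric statistics W_j (large positive values suggest signal, and null statistics are symmetric about zero), the knockoff+ filter selects features above the smallest threshold whose estimated false discovery proportion falls at or below the target level q. This is a standard formulation of the knockoff selection step; the function name and variable names are illustrative.

```python
import numpy as np

def knockoff_select(W, q=0.1, offset=1):
    """Standard knockoff(+) threshold: scan candidate thresholds t taken
    from |W|, estimate FDP(t) = (offset + #{W_j <= -t}) / #{W_j >= t},
    and select features above the smallest t with FDP(t) <= q.
    offset=1 gives knockoff+ (exact finite-sample FDR control);
    offset=0 gives the unmodified filter."""
    ts = np.sort(np.abs(W[W != 0]))           # candidate thresholds
    for t in ts:
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.where(W >= t)[0]
    return np.array([], dtype=int)            # no threshold achieves level q
```

The data-driven threshold is exactly the "price" mentioned above: an oracle prototype could reject at the ideal cutoff directly, while this scan must pay for estimating the false discovery proportion from the signs of W.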
- Award ID(s): 1943902
- PAR ID: 10531762
- Publisher / Repository: MIT Press
- Date Published:
- Journal Name: Journal of Machine Learning Research
- Volume: 25
- Issue: 3
- ISSN: 1533-7928
- Page Range / eLocation ID: 1-67
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Controlling the false discovery rate (FDR) is crucial for variable selection, multiple testing, and other signal detection problems. The literature offers no shortage of FDR control strategies for selecting individual features, but work on structural change detection, such as profile analysis for piecewise constant coefficients and integration analysis with multiple data sources, is limited. In this paper, we propose a generalized knockoff procedure (GKnockoff) for FDR control in such problem settings. We prove that GKnockoff possesses pairwise exchangeability and is capable of controlling the exact FDR at finite sample sizes. We further extend GKnockoff to high dimensionality by introducing a new screening method for the potential structural changes: a data-splitting technique first reduces the dimensionality via screening and then conducts GKnockoff on the refined selection set. The power of the proposed methods is also studied systematically. Numerical comparisons with other methods show the superior performance of GKnockoff in terms of both FDR control and power. We also apply the proposed methods to a macroeconomic dataset to detect changes in the driving effects of economic development on the secondary industry.
- Comparing the spatial characteristics of spatiotemporal random fields is often in demand. The comparison can be challenging, however, because of the high dimensionality of and dependency in the data. We develop a new multiple testing approach to detect local differences in the spatial characteristics of two spatiotemporal random fields by taking spatial information into account. Our method adopts a two-component mixture model for location-wise p-values and then derives a new false discovery rate (FDR) control, called the mirror procedure, to determine the optimal rejection region. This procedure is robust to model misspecification and allows for weak dependency among hypotheses. To integrate spatial heterogeneity, we model the mixture probability and study the benefit, if any, of allowing the alternative distribution to be spatially varying. An EM algorithm is developed to estimate the mixture model and implement the FDR procedure. We study the FDR control and the power of our new approach both theoretically and numerically, and apply the approach to compare the mean and teleconnection pattern between two synthetic climate fields. Supplementary materials for this article are available online.
- In many applications, the process of identifying a specific feature of interest involves testing multiple hypotheses for their joint statistical significance. Examples include mediation analysis, which simultaneously examines the existence of the exposure-mediator and the mediator-outcome effects, and replicability analysis, which aims to identify simultaneous signals that exhibit statistical significance across multiple independent studies. In this work, we present a new approach called the joint mirror (JM) procedure that effectively detects such features while maintaining false discovery rate (FDR) control in finite samples. The JM procedure employs an iterative method that gradually shrinks the rejection region based on progressively revealed information until a conservative estimate of the false discovery proportion is below the target FDR level. Additionally, we introduce a more stringent error measure known as the composite FDR (cFDR), which assigns weights to each false discovery based on its number of null components. We use the leave-one-out technique to prove that the JM procedure controls the cFDR in finite samples. To implement the JM procedure, we propose an efficient algorithm that can incorporate partial ordering information. Through extensive simulations, we show that our procedure effectively controls the cFDR and enhances statistical power across various scenarios, including cases where test statistics are dependent across the features. Finally, we showcase the utility of our method by applying it to real-world mediation and replicability analyses.
- We derive new algorithms for online multiple testing that provably control false discovery exceedance (FDX) while achieving orders of magnitude more power than previous methods. This statistical advance is enabled by the development of new algorithmic ideas: earlier algorithms are more "static," while our new ones allow for the dynamical adjustment of testing levels based on the amount of wealth the algorithm has accumulated. We demonstrate that our algorithms achieve higher power in a variety of synthetic experiments. We also prove that SupLORD can provide error control for both FDR and FDX, and controls FDR at stopping times. Stopping times are particularly important as they permit the experimenter to end the experiment arbitrarily early while maintaining desired control of the FDR. SupLORD is the first non-trivial algorithm, to our knowledge, that can control FDR at stopping times in the online setting.