Model-X knockoffs is a flexible wrapper method for high-dimensional regression algorithms, which provides guaranteed control of the false discovery rate (FDR). Due to the randomness inherent to the method, different runs of model-X knockoffs on the same dataset often result in different sets of selected variables, which is undesirable in practice. In this article, we introduce a methodology for derandomising model-X knockoffs with provable FDR control. The key insight of our proposed method lies in the discovery that the knockoffs procedure is in essence an e-BH procedure. We make use of this connection and derandomise model-X knockoffs by aggregating the e-values resulting from multiple knockoff realisations. We prove that the derandomised procedure controls the FDR at the desired level, without any additional conditions (in contrast, previously proposed methods for derandomisation are not able to guarantee FDR control). The proposed method is evaluated with numerical experiments, where we find that the derandomised procedure achieves comparable power and dramatically decreased selection variability when compared with model-X knockoffs.
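The e-BH connection admits a compact sketch. The snippet below is a minimal illustration, not the paper's exact procedure: it takes each knockoff run's statistics W and threshold T as given, converts them to e-values, averages across runs, and applies e-BH; the function names and the precise e-value formula are our assumptions.

```python
import numpy as np

def knockoff_evalues(W, T):
    """Turn one knockoff run's statistics W and threshold T into e-values
    (hypothetical sketch of the e-value view of the knockoff filter)."""
    p = len(W)
    return p * (W >= T) / (1 + np.sum(W <= -T))

def ebh_select(e, alpha):
    """e-BH: select the k hypotheses with the largest e-values, where k is
    the largest index such that the k-th largest e-value is >= p/(alpha*k)."""
    p = len(e)
    order = np.argsort(e)[::-1]                 # indices by decreasing e-value
    thresh = p / (alpha * np.arange(1, p + 1))  # p/(alpha*1), p/(alpha*2), ...
    ok = np.nonzero(e[order] >= thresh)[0]
    return np.sort(order[: ok.max() + 1]) if len(ok) else np.array([], dtype=int)

def derandomised_select(runs, alpha):
    """Derandomisation sketch: average e-values over independent knockoff
    runs (pairs (W, T)), then apply e-BH at the target FDR level."""
    e_avg = np.mean([knockoff_evalues(W, T) for W, T in runs], axis=0)
    return ebh_select(e_avg, alpha)
```

Averaging is valid here because an average of e-values is again an e-value, which is what lets the aggregated procedure retain FDR control.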
Many contemporary large-scale applications involve building interpretable models linking a large set of potential covariates to a response in a non-linear fashion, such as when the response is binary. Although this modelling problem has been extensively studied, it remains unclear how to control the fraction of false discoveries effectively even in high dimensional logistic regression, not to mention general high dimensional non-linear models. To address such a practical problem, we propose a new framework of ‘model-X’ knockoffs, which reads from a different perspective the knockoff procedure that was originally designed for controlling the false discovery rate in linear models. Whereas the knockoffs procedure is constrained to homoscedastic linear models with n⩾p, the key innovation here is that model-X knockoffs provide valid inference from finite samples in settings in which the conditional distribution of the response is arbitrary and completely unknown. Furthermore, this holds no matter the number of covariates. Correct inference in such a broad setting is achieved by constructing knockoff variables probabilistically instead of geometrically. To do this, our approach requires that the covariates are random (independent and identically distributed rows) with a distribution that is known, although we provide preliminary experimental evidence that our procedure is robust to unknown or estimated distributions. To our knowledge, no other procedure solves the controlled variable selection problem in such generality but, in the restricted settings where competitors exist, we demonstrate the superior power of knockoffs through simulations. Finally, we apply our procedure to data from a case–control study of Crohn's disease in the UK, making twice as many discoveries as the original analysis of the same data.
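For Gaussian covariates, the probabilistic construction has a well-known closed form: sample the knockoffs from the conditional Gaussian that makes originals and knockoffs pairwise exchangeable. A minimal sketch, assuming the covariance Sigma and the vector s (e.g. from an equicorrelated or SDP construction, which is outside this snippet) are given:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, s, rng=None):
    """Sample model-X knockoffs for rows X ~ N(0, Sigma):
    X_tilde | X ~ N(X (I - Sigma^{-1} D), 2D - D Sigma^{-1} D), D = diag(s)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    Sigma_inv_D = np.linalg.solve(Sigma, np.diag(s))   # Sigma^{-1} diag(s)
    mu = X - X @ Sigma_inv_D                           # conditional mean
    V = 2 * np.diag(s) - np.diag(s) @ Sigma_inv_D      # conditional covariance
    V = (V + V.T) / 2                                  # symmetrise
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))      # jitter for stability
    return mu + rng.standard_normal((n, p)) @ L.T
```

With Sigma equal to the identity and s = 1, the formula reduces to independent standard normal knockoffs, which gives a quick sanity check of the construction.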
- PAR ID: 10397977
- Publisher / Repository: Oxford University Press
- Journal Name: Journal of the Royal Statistical Society Series B: Statistical Methodology
- Volume: 80
- Issue: 3
- ISSN: 1369-7412
- Pages: 551-577
- Sponsoring Org: National Science Foundation
More Like this
Abstract
Consider a case–control study in which we have a random sample, constructed in such a way that the proportion of cases in our sample is different from that in the general population—for instance, the sample is constructed to achieve a fixed ratio of cases to controls. Imagine that we wish to determine which of the potentially many covariates under study truly influence the response by applying the new model‐X knockoffs approach. This paper demonstrates that it suffices to design knockoff variables using data that may have a different ratio of cases to controls. For example, the knockoff variables can be constructed using the distribution of the original variables under any of the following scenarios: (a) a population of controls only; (b) a population of cases only; and (c) a population of cases and controls mixed in an arbitrary proportion (irrespective of the fraction of cases in the sample at hand). The consequence is that knockoff variables may be constructed using unlabelled data, which are often available more easily than labelled data, while maintaining Type‐I error guarantees.
Abstract A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine‐mapping of the microbiome, we propose a two‐step compositional knockoff filter to provide the effective finite‐sample false discovery rate (FDR) control in high‐dimensional linear log‐contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum‐to‐zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high‐dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two‐step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.
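The sum-to-zero structure of compositional log-contrast models is commonly handled through log-ratio transforms. As a generic illustration only (the centred log-ratio below is a standard compositional-data device, not the paper's specific screening procedure):

```python
import numpy as np

def clr(X, eps=1e-8):
    """Centred log-ratio transform of compositional rows (each row of X
    sums to one). Every transformed row sums to zero, mirroring the
    sum-to-zero coefficient constraint of linear log-contrast models."""
    L = np.log(X + eps)                       # eps guards against zero counts
    return L - L.mean(axis=1, keepdims=True)  # centre each row
```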
Abstract This study introduces a novel p-value-based multiple testing approach tailored for generalized linear models. Despite the crucial role of generalized linear models in statistics, existing methodologies face obstacles arising from the heterogeneous variance of response variables and complex dependencies among estimated parameters. Our aim is to address the challenge of controlling the false discovery rate (FDR) amidst arbitrarily dependent test statistics. Through the development of efficient computational algorithms, we present a versatile statistical framework for multiple testing. The proposed framework accommodates a range of tools developed for constructing a new model matrix in regression-type analysis, including random row permutations and Model-X knockoffs. We devise efficient computing techniques to solve the encountered non-trivial quadratic matrix equations, enabling the construction of paired p-values suitable for the two-step multiple testing procedure proposed by Sarkar and Tang (Biometrika 109(4): 1149–1155, 2022). Theoretical analysis affirms the properties of our approach, demonstrating its capability to control the FDR at a given level. Empirical evaluations further substantiate its promising performance across diverse simulation settings.
The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas of false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype, a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.
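Two of the three components can be made concrete with one common instantiation: a symmetric coefficient-difference statistic computed from coefficients fitted on the augmented design, plus the data-driven knockoff(+) threshold. A sketch under assumed names (the fitting step that produces the coefficients Z and Z_tilde is omitted):

```python
import numpy as np

def coef_diff_stats(Z, Z_tilde):
    """Symmetric statistic: W_j = |Z_j| - |Z_tilde_j|, e.g. from lasso
    coefficients fitted on the augmented design [X, X_tilde]."""
    return np.abs(Z) - np.abs(Z_tilde)

def knockoff_threshold(W, q, offset=1):
    """Data-driven threshold: smallest t whose estimated FDP is <= q.
    offset=1 gives knockoff+ (exact FDR control); offset=0 gives knockoff."""
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

def knockoff_select(W, q):
    """Variables whose statistics clear the data-driven threshold."""
    return np.nonzero(W >= knockoff_threshold(W, q))[0]
```

The "ideal threshold" prototype discussed above corresponds to replacing `knockoff_threshold` with an oracle value of t; the data-driven search is exactly where the power loss quantified in the paper comes from.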