Consider a case–control study in which we have a random sample, constructed in such a way that the proportion of cases in our sample is different from that in the general population—for instance, the sample is constructed to achieve a fixed ratio of cases to controls. Imagine that we wish to determine which of the potentially many covariates under study truly influence the response by applying the new model‐X knockoffs approach. This paper demonstrates that it suffices to design knockoff variables using data that may have a different ratio of cases to controls. For example, the knockoff variables can be constructed using the distribution of the original variables under any of the following scenarios: (a) a population of controls only; (b) a population of cases only; and (c) a population of cases and controls mixed in an arbitrary proportion (irrespective of the fraction of cases in the sample at hand). The consequence is that knockoff variables may be constructed using unlabelled data, which are often available more easily than labelled data, while maintaining Type‐I error guarantees.
more »
« less
This content will become publicly available on April 1, 2026
ARK: Robust knockoffs inference with coupling
We investigate the robustness of the model-X knockoffs framework with respect to the misspecified or estimated feature distribution. We achieve such a goal by theoretically studying the feature selection performance of a practically implemented knockoffs algorithm, which we name as the approximate knockoffs (ARK) procedure, under the measures of the false discovery rate (FDR) and k-familywise error rate (k-FWER). The approximate knockoffs procedure differs from the model-X knockoffs procedure only in that the former uses the misspecified or estimated feature distribution. A key technique in our theoretical analyses is to couple the approximate knockoffs procedure with the model-X knockoffs procedure so that random variables in these two procedures can be close in realizations. We prove that if such coupled model-X knockoffs procedure exists, the approximate knockoffs procedure can achieve the asymptotic FDR or k-FWER control at the target level. We showcase three specific constructions of such coupled model-X knockoff variables, verifying their existence and justifying the robustness of the model-X knockoffs framework. Additionally, we formally connect our concept of knockoff variable coupling to a type of Wasserstein distance.
more »
« less
- Award ID(s):
- 2125142
- PAR ID:
- 10654634
- Publisher / Repository:
- Institute of Mathematical Statistics
- Date Published:
- Journal Name:
- The Annals of Statistics
- Volume:
- 53
- Issue:
- 2
- ISSN:
- 0090-5364
- Page Range / eLocation ID:
- 749–773
- Subject(s) / Keyword(s):
- Statistics/Knockoffs inference, Wasserstein distance, false discovery rate control, familywise error rate control, coupling, robustness, high dimensionality
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract This study introduces a novelp-value-based multiple testing approach tailored for generalized linear models. Despite the crucial role of generalized linear models in statistics, existing methodologies face obstacles arising from the heterogeneous variance of response variables and complex dependencies among estimated parameters. Our aim is to address the challenge of controlling the false discovery rate (FDR) amidst arbitrarily dependent test statistics. Through the development of efficient computational algorithms, we present a versatile statistical framework for multiple testing. The proposed framework accommodates a range of tools developed for constructing a new model matrix in regression-type analysis, including random row permutations and Model-X knockoffs. We devise efficient computing techniques to solve the encountered non-trivial quadratic matrix equations, enabling the construction of pairedp-values suitable for the two-step multiple testing procedure proposed by Sarkar and Tang (Biometrika 109(4): 1149–1155, 2022). Theoretical analysis affirms the properties of our approach, demonstrating its capability to control the FDR at a given level. Empirical evaluations further substantiate its promising performance across diverse simulation settings.more » « less
-
The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coeffi- cients and compare the power of different variants of knockoff by deriving explicit formulas of false positive rate and false negative rate. Our results provide new insights on how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its propotype - a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.more » « less
-
Gao, Xin (Ed.)Abstract MotivationConditional testing via the knockoff framework allows one to identify—among a large number of possible explanatory variables—those that carry unique information about an outcome of interest and also provides a false discovery rate guarantee on the selection. This approach is particularly well suited to the analysis of genome-wide association studies (GWAS), which have the goal of identifying genetic variants that influence traits of medical relevance. ResultsWhile conditional testing can be both more powerful and precise than traditional GWAS analysis methods, its vanilla implementation encounters a difficulty common to all multivariate analysis methods: it is challenging to distinguish among multiple, highly correlated regressors. This impasse can be overcome by shifting the object of inference from single variables to groups of correlated variables. To achieve this, it is necessary to construct “group knockoffs.” While successful examples are already documented in the literature, this paper substantially expands the set of algorithms and software for group knockoffs. We focus in particular on second-order knockoffs, for which we describe correlation matrix approximations that are appropriate for GWAS data and that result in considerable computational savings. We illustrate the effectiveness of the proposed methods with simulations and with the analysis of albuminuria data from the UK Biobank. Availability and implementationThe described algorithms are implemented in an open-source Julia package Knockoffs.jl. R and Python wrappers are available as knockoffsr and knockoffspy packages.more » « less
-
The traditional framework for feature selection treats all features as costing the same amount. However, in reality, a scientist often has considerable discretion regarding which variables to measure, and the decision involves a tradeoff between model accuracy and cost (where cost can refer to money, time, difficulty or intrusiveness). In particular, unnecessarily including an expensive feature in a model is worse than unnecessarily including a cheap feature. We propose a procedure, which we call cheap knockoffs, for performing feature selection in a cost‐conscious manner. The key idea behind our method is to force higher cost features to compete with more knockoffs than cheaper features. We derive an upper bound on the weighted false discovery proportion associated with this procedure, which corresponds to the fraction of the feature cost that is wasted on unimportant features. We prove that this bound holds simultaneously with high probability over a path of selected variable sets of increasing size. A user may thus select a set of features based, for example, on the overall budget, while knowing that no more than a particular fraction of feature cost is wasted. We investigate, through simulation and a biomedical application, the practical importance of incorporating cost considerations into the feature selection process.more » « less
An official website of the United States government
