

Title: On the construction of knockoffs in case–control studies
Consider a case–control study in which we have a random sample, constructed in such a way that the proportion of cases in our sample is different from that in the general population—for instance, the sample is constructed to achieve a fixed ratio of cases to controls. Imagine that we wish to determine which of the potentially many covariates under study truly influence the response by applying the new model‐X knockoffs approach. This paper demonstrates that it suffices to design knockoff variables using data that may have a different ratio of cases to controls. For example, the knockoff variables can be constructed using the distribution of the original variables under any of the following scenarios: (a) a population of controls only; (b) a population of cases only; and (c) a population of cases and controls mixed in an arbitrary proportion (irrespective of the fraction of cases in the sample at hand). The consequence is that knockoff variables may be constructed using unlabelled data, which are often available more easily than labelled data, while maintaining Type‐I error guarantees.  more » « less
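A minimal sketch of the Gaussian ("second-order") version of the model-X construction the abstract refers to, assuming the covariate distribution N(mu, Sigma) is known or has been estimated, e.g. from unlabelled data. The function name and the equicorrelated choice of the s vector are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gaussian_knockoffs(X, mu, Sigma, rng=None):
    """Sample second-order (Gaussian) model-X knockoffs.

    Given rows X ~ N(mu, Sigma), draws X_tilde from the conditional
    Gaussian that makes (X, X_tilde) pairwise exchangeable, using the
    equicorrelated choice s_j = c for all j with c < 2 * lambda_min(Sigma).
    """
    rng = np.random.default_rng(rng)
    p = Sigma.shape[0]
    lam_min = np.linalg.eigvalsh(Sigma).min()
    c = 0.999 * min(2.0 * lam_min, np.diag(Sigma).min())
    D = c * np.eye(p)
    Sinv_D = np.linalg.solve(Sigma, D)           # Sigma^{-1} D
    cond_mean = X - (X - mu) @ Sinv_D            # E[X_tilde | X]
    cond_cov = 2.0 * D - D @ Sinv_D              # Cov[X_tilde | X]
    cond_cov = (cond_cov + cond_cov.T) / 2.0     # symmetrize numerically
    L = np.linalg.cholesky(cond_cov + 1e-12 * np.eye(p))
    return cond_mean + rng.standard_normal(X.shape) @ L.T
```

The paper's point is that the (mu, Sigma) fed into such a sampler may come from cases only, controls only, or any mixture, regardless of the case–control ratio in the labelled sample.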
Award ID(s):
1654076 1712800
PAR ID:
10453822
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Stat
Volume:
8
Issue:
1
ISSN:
2049-1573
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The knockoff filter is a recent false discovery rate (FDR) control method for high-dimensional linear models. We point out that knockoff has three key components: ranking algorithm, augmented design, and symmetric statistic, and each component admits multiple choices. By considering various combinations of the three components, we obtain a collection of variants of knockoff. All these variants guarantee finite-sample FDR control, and our goal is to compare their power. We assume a Rare and Weak signal model on regression coefficients and compare the power of different variants of knockoff by deriving explicit formulas for the false positive rate and false negative rate. Our results provide new insights into how to improve power when controlling FDR at a targeted level. We also compare the power of knockoff with its prototype, a method that uses the same ranking algorithm but has access to an ideal threshold. The comparison reveals the additional price one pays by finding a data-driven threshold to control FDR.
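The data-driven threshold this comparison refers to can be written in a few lines. Below is a sketch of the knockoff/knockoff+ stopping rule applied to symmetric statistics W_j (e.g., differences of absolute coefficient magnitudes between a variable and its knockoff); function names are illustrative.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t with estimated FDP <= q.

    offset=1 gives the knockoff+ rule (finite-sample FDR control);
    offset=0 gives the original knockoff rule (modified FDR control).
    """
    for t in np.sort(np.abs(W[W != 0])):
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:
            return t
    return np.inf  # no feasible threshold: select nothing

def knockoff_select(W, q=0.1):
    """Indices j with W_j at or above the knockoff+ threshold."""
    return np.flatnonzero(W >= knockoff_threshold(W, q))
```

The "ideal threshold" of the prototype in the abstract corresponds to replacing this estimated-FDP search with an oracle cutoff.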
  2. Summary The statistical challenges in using big data for making valid statistical inference in the finite population have been well documented in the literature. These challenges are due primarily to statistical bias arising from under‐coverage in the big data source to represent the population of interest and measurement errors in the variables available in the data set. By stratifying the population into a big data stratum and a missing data stratum, we can estimate the missing data stratum by using a fully responding probability sample and hence the population as a whole by using a data integration estimator. By expressing the data integration estimator as a regression estimator, we can handle measurement errors in the variables in big data and also in the probability sample. We also propose a fully nonparametric classification method for identifying the overlapping units and develop a bias‐corrected data integration estimator under misclassification errors. Finally, we develop a two‐step regression data integration estimator to deal with measurement errors in the probability sample. An advantage of the approach advocated in this paper is that we do not have to make unrealistic missing‐at‐random assumptions for the methods to work. The proposed method is applied to a real data example using 2015–2016 Australian Agricultural Census data.
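The basic stratified estimator described above (big data stratum plus a missing-data stratum estimated from the probability sample) can be sketched as follows; the function name and the Hájek-type domain mean are illustrative assumptions, and the paper's regression and bias-correction refinements are omitted.

```python
import numpy as np

def data_integration_mean(y_big, y_prob, w_prob, overlap, N):
    """Stratified data-integration estimate of a finite-population mean.

    y_big   : study variable for every unit in the big data source
              (the "big data stratum", of size N_b)
    y_prob  : study variable in the probability sample
    w_prob  : design weights of the probability sample
    overlap : boolean mask, True where a probability-sample unit is also
              covered by the big data source (e.g. found by linkage)
    N       : finite population size
    """
    N_b = y_big.size
    # Missing-data stratum mean, estimated from probability-sample units
    # that the big data source does not cover (Hajek-type domain mean).
    y_c, w_c = y_prob[~overlap], w_prob[~overlap]
    ybar_c = np.sum(w_c * y_c) / np.sum(w_c)
    # Combine the two strata in proportion to their population sizes.
    return (N_b * y_big.mean() + (N - N_b) * ybar_c) / N
```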
  3. Abstract Analysis of non-probability survey samples requires auxiliary information at the population level. Such information may also be obtained from an existing probability survey sample from the same finite population. Mass imputation has been used in practice for combining non-probability and probability survey samples and making inferences on the parameters of interest using the information collected only in the non-probability sample for the study variables. Under the assumption that the conditional mean function from the non-probability sample can be transported to the probability sample, we establish the consistency of the mass imputation estimator and derive its asymptotic variance formula. Variance estimators are developed using either linearization or bootstrap. Finite sample performances of the mass imputation estimator are investigated through simulation studies. We also address important practical issues of the method through the analysis of a real-world non-probability survey sample collected by the Pew Research Centre. 
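A minimal sketch of the mass-imputation idea described above: fit an outcome model on the non-probability sample, transport it to the probability sample, and take a design-weighted mean of the predictions. The linear working model and function name are assumptions for illustration; the paper's transportability conditions and variance estimators are not reproduced here.

```python
import numpy as np

def mass_imputation_mean(X_np, y_np, X_prob, w_prob):
    """Mass-imputation estimate of a population mean.

    Fits a working outcome model (ordinary least squares here) on the
    non-probability sample, imputes predictions into the probability
    sample, and returns the design-weighted (Hajek) mean of the
    imputed values.
    """
    # Fit E[Y | X] on the non-probability sample (intercept + slope).
    Xd = np.column_stack([np.ones(len(X_np)), X_np])
    beta, *_ = np.linalg.lstsq(Xd, y_np, rcond=None)
    # Impute into the probability sample and average with design weights.
    Xp = np.column_stack([np.ones(len(X_prob)), X_prob])
    y_hat = Xp @ beta
    return np.sum(w_prob * y_hat) / np.sum(w_prob)
```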
  4.
    Abstract Standard procedures for capture–mark–recapture modelling (CMR) for the study of animal demography include running goodness-of-fit tests on a general starting model. A frequent reason for poor model fit is heterogeneity in local survival among individuals captured for the first time and those already captured or seen on previous occasions. This deviation is technically termed a transience effect. In specific cases, simple, uni-state CMR modelling showing transients may allow researchers to assess the role of these transients on population dynamics. Transient individuals nearly always have a lower local survival probability, which may appear for a number of reasons. In most cases, transients arise due to permanent dispersal, higher mortality, or a combination of both. In the case of higher mortality, transients may be symptomatic of a cost of first reproduction. A few studies working at large spatial scales actually show that transients more often correspond to survival costs of first reproduction rather than to permanent dispersal, bolstering the interpretation of transience as a measure of costs of reproduction, since initial detections are often associated with first breeding attempts. Regardless of their cause, the loss of transients from a local population should lower population growth rate. We review almost 1000 papers using CMR modelling and find that almost 40% of studies meeting the search criteria (N = 115) detected transients. Nevertheless, few researchers have considered the ecological or evolutionary meaning of the transient phenomenon. Only three studies from the reviewed papers considered transients to be a cost of first reproduction. We also analyze a long-term individual monitoring dataset (1988–2012) on a long-lived bird to quantify transients, and we use a life table response experiment (LTRE) to measure the consequences of transients at a population level.
As expected, population growth rate decreased when the environment became harsher while the proportion of transients increased. LTRE analysis showed that population growth can be substantially affected by changes in traits that are variable under environmental stochasticity and deterministic perturbations, such as recruitment, fecundity of experienced individuals, and transient probabilities. This occurred even though sensitivities and elasticities of these parameters were much lower than those for adult survival. The proportion of transients also increased with the strength of density-dependence. These results have implications for ecological and evolutionary studies and may stimulate other researchers to explore the ecological processes behind the occurrence of transients in capture–recapture studies. In population models, the inclusion of a specific state for transients may help to make more reliable predictions for endangered and harvested species. 
  5. Abstract International large-scale assessments (ILSAs) play an important role in educational research and policy making. They collect valuable data on education quality and performance development across many education systems, giving countries the opportunity to share techniques, organisational structures, and policies that have proven efficient and successful. To gain insights from ILSA data, we identify non-cognitive variables associated with students’ academic performance. This problem has three analytical challenges: (a) academic performance is measured by cognitive items under a matrix sampling design; (b) there are many missing values in the non-cognitive variables; and (c) multiple comparisons arise from the large number of non-cognitive variables. We consider an application to the Programme for International Student Assessment, aiming to identify non-cognitive variables associated with students’ performance in science. We formulate it as a variable selection problem under a general latent variable model framework and further propose a knockoff method that conducts variable selection with a controlled error rate for false selections.