skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Searching for robust associations with a multi-environment knockoff filter
Summary In this article we develop a method based on model-X knockoffs to find conditional associations that are consistent across environments, while controlling the false discovery rate. The motivation for this problem is that large datasets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations replicated under different conditions may be more interesting. In fact, sometimes consistency provably leads to valid causal inferences even if conditional associations do not. Although the proposed method is widely applicable, in this paper we highlight its relevance to genome-wide association studies, in which robustness across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to UK Biobank data.  more » « less
Award ID(s):
1934578
PAR ID:
10382351
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Biometrika
Volume:
109
Issue:
3
ISSN:
0006-3444
Page Range / eLocation ID:
611 to 629
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract In the statistical analysis of genome-wide association data, it is challenging to precisely localize the variants that affect complex traits, due to linkage disequilibrium, and to maximize power while limiting spurious findings. Here we report onKnockoffZoom: a flexible method that localizes causal variants at multiple resolutions by testing the conditional associations of genetic segments of decreasing width, while provably controlling the false discovery rate. Our method utilizes artificial genotypes as negative controls and is equally valid for quantitative and binary phenotypes, without requiring any assumptions about their genetic architectures. Instead, we rely on well-established genetic models of linkage disequilibrium. We demonstrate that our method can detect more associations than mixed effects models and achieve fine-mapping precision, at comparable computational cost. Lastly, we applyKnockoffZoomto data from 350k subjects in the UK Biobank and report many new findings. 
    more » « less
  2. Summary Phylogenetic association analysis plays a crucial role in investigating the correlation between microbial compositions and specific outcomes of interest in microbiome studies. However, existing methods for testing such associations have limitations related to the assumption of a linear association in high-dimensional settings and the handling of confounding effects. Hence, there is a need for methods capable of characterizing complex associations, including nonmonotonic relationships. This article introduces a novel phylogenetic association analysis framework and associated tests to address these challenges by employing conditional rank correlation as a measure of association. The proposed tests account for confounders in a fully nonparametric manner, ensuring robustness against outliers and the ability to detect diverse dependencies. The proposed framework aggregates conditional rank correlations for subtrees using weighted sum and maximum approaches to capture both dense and sparse signals. The significance level of the test statistics is determined by calibration through a nearest-neighbour bootstrapping method, which is straightforward to implement and can accommodate additional datasets when these are available. The practical advantages of the proposed framework are demonstrated through numerical experiments using both simulated and real microbiome datasets. 
    more » « less
  3. This study evaluated ChatGPT’s ability to understand causal language in science papers and news by testing its accuracy in a task of labeling the strength of a claim as causal, conditional causal, correlational, or no relationship. The results show that ChatGPT is still behind the existing fine-tuned BERT models by a large margin. ChatGPT also had difficulty understanding conditional causal claims mitigated by hedges. However, its weakness may be utilized to improve the clarity of human annotation guideline. Chain-of-thought prompting was faithful and helpful for improving prompt performance, but finding the optimal prompt is difficult with inconsistent results and the lack of effective method to establish cause-effect between prompts and outcomes, suggesting caution when generalizing prompt engineering results across tasks or models. 
    more » « less
  4. Abstract Food insecurity is rising across sub-Saharan Africa (SSA), where undernourishment continues to affect a large portion of the population, particularly young children. Studies examining the associations between crop diversity and childhood nutrition have recently proliferated but are characterized by inconsistent results and two key limitations. First, many studies focus only on the household level, overlooking the prospect that more diverse crops at village and regional levels may contribute to household food security. Second, many studies pool data from multiple countries, which may obscure important context-specific aspects of nutrition outcomes. Drawing on Demographic and Health Survey (DHS) data from 10 SSA countries, in combination with agricultural production estimates for 112 crop species, this study explores the associations between crop diversity at multiple scales (10-, 25-, and 50-kilometer radii) and children’s dietary diversity (HDDS). In addition to producing overall estimates across our sample, we measure country-specific associations to account for spatial heterogeneity. Results of the overall model show a negative association between crop diversity and dietary diversity. However, the country-specific analyses uncover extensive variability in these associations: in some cases, diversity is highly positively correlated with HDDS, while in others the estimated effect is negative or nonexistent. Our findings suggest that country-level analyses provide important nuance that may be masked in pooled analyses. Moreover, these findings foreground the importance of looking beyond household-level analyses to understand the dynamic role that local crop diversity, and its exchange across space, can play in supporting children’s dietary diversity. 
    more » « less
  5. The conditional average treatment effect (CATE) is the best measure of individual causal effects given baseline covariates. However, the CATE only captures the (conditional) average, and can overlook risks and tail events, which are important to treatment choice. In aggregate analyses, this is usually addressed by measuring the distributional treatment effect (DTE), such as differences in quantiles or tail expectations between treatment groups. Hypothetically, one can similarly fit conditional quantile regressions in each treatment group and take their difference, but this would not be robust to misspecification or provide agnostic best-in-class predictions. We provide a new robust and model-agnostic methodology for learning the conditional DTE (CDTE) for a class of problems that includes conditional quantile treatment effects, conditional super-quantile treatment effects, and conditional treatment effects on coherent risk measures given by f-divergences. Our method is based on constructing a special pseudo-outcome and regressing it on covariates using any regression learner. Our method is model-agnostic in that it can provide the best projection of CDTE onto the regression model class. Our method is robust in that even if we learn these nuisances nonparametrically at very slow rates, we can still learn CDTEs at rates that depend on the class complexity and even conduct inferences on linear projections of CDTEs. We investigate the behavior of our proposal in simulations, as well as in a case study of 401(k) eligibility effects on wealth. 
    more » « less