

Title: Searching for robust associations with a multi-environment knockoff filter
Summary: In this article we develop a method based on model-X knockoffs to find conditional associations that are consistent across environments, while controlling the false discovery rate. The motivation for this problem is that large datasets may contain numerous associations that are statistically significant and yet misleading, as they are induced by confounders or sampling imperfections. However, associations replicated under different conditions may be more interesting. In fact, sometimes consistency provably leads to valid causal inferences even if conditional associations do not. Although the proposed method is widely applicable, in this paper we highlight its relevance to genome-wide association studies, in which robustness across populations with diverse ancestries mitigates confounding due to unmeasured variants. The effectiveness of this approach is demonstrated by simulations and applications to UK Biobank data.
Award ID(s):
1934578
NSF-PAR ID:
10382351
Journal Name:
Biometrika
Volume:
109
Issue:
3
ISSN:
0006-3444
Page Range / eLocation ID:
611 to 629
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Background

    Increased work through electronic health record (EHR) messaging is frequently cited as a factor of physician burnout. However, studies to date have relied on anecdotal or self-reported measures, which limit the ability to match EHR use patterns with continuous stress patterns throughout the day.

    Objective

    The aim of this study is to collect EHR use and physiologic stress data through unobtrusive means that provide objective and continuous measures, cluster distinct patterns of EHR inbox work, identify physicians’ daily physiologic stress patterns, and evaluate the association between EHR inbox work patterns and physician physiologic stress.

    Methods

    Physicians were recruited from 5 medical centers. Participants (N=47) were given wrist-worn devices (Garmin Vivosmart 3) with heart rate sensors to wear for 7 days. The devices measured physiological stress throughout the day based on heart rate variability (HRV). Perceived stress was also measured with self-reports through experience sampling and a one-time survey. From the EHR system logs, the time attributed to different activities was quantified. By using a clustering algorithm, distinct inbox work patterns were identified and their associated stress measures were compared. The effects of EHR use on physician stress were examined using a generalized linear mixed effects model.

    Results

    Physicians spent an average of 1.08 hours doing EHR inbox work out of an average total EHR time of 3.5 hours. Patient messages accounted for most of the inbox work time (mean 37%, SD 11%). A total of 3 patterns of inbox work emerged: inbox work mostly outside work hours, inbox work mostly during work hours, and inbox work extending after hours that were mostly contiguous to work hours. Across these 3 groups, physiologic stress patterns showed 3 periods in which stress increased: in the first hour of work, early in the afternoon, and in the evening. Physicians in group 1 had the longest average stress duration during work hours (80 out of 243 min of valid HRV data; P=.02), as measured by physiological sensors. Inbox work duration, the rate of EHR window switching (moving from one screen to another), the proportion of inbox work done outside of work hours, inbox work batching, and the day of the week were each independently associated with daily stress duration (marginal R²=15%). Individual-level random effects were significant and explained most of the variation in stress (conditional R²=98%).

    Conclusions

    This study is among the first to demonstrate associations between electronic inbox work and physiological stress. We identified 3 potentially modifiable factors associated with stress: EHR window switching, inbox work duration, and inbox work outside work hours. Organizations seeking to reduce physician stress may consider system-based changes to reduce EHR window switching or inbox work duration or the incorporation of inbox management time into work hours.
  2. Abstract

    In the statistical analysis of genome-wide association data, it is challenging to precisely localize the variants that affect complex traits, due to linkage disequilibrium, and to maximize power while limiting spurious findings. Here we report on KnockoffZoom: a flexible method that localizes causal variants at multiple resolutions by testing the conditional associations of genetic segments of decreasing width, while provably controlling the false discovery rate. Our method utilizes artificial genotypes as negative controls and is equally valid for quantitative and binary phenotypes, without requiring any assumptions about their genetic architectures. Instead, we rely on well-established genetic models of linkage disequilibrium. We demonstrate that our method can detect more associations than mixed effects models and achieve fine-mapping precision, at comparable computational cost. Lastly, we apply KnockoffZoom to data from 350k subjects in the UK Biobank and report many new findings.

     
  3. Marginal screening is a widely applied technique to handily reduce the dimensionality of the data when the number of potential features overwhelms the sample size. Because of the nature of the marginal screening procedures, they are also known for their difficulty in identifying the so‐called hidden variables that are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable in the screening stage has two undesirable consequences: (1) important features are missed out in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by some recent work in conditional screening, we propose a data‐driven conditional screening algorithm, which is computationally efficient, enjoys the sure screening property under weaker assumptions on the model and works robustly in a variety of settings to reduce false negatives of hidden variables. A numerical comparison with alternative screening procedures is also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example. Copyright © 2016 John Wiley & Sons, Ltd.

     
  4. Abstract

    Objective

    We investigated the relationship between early life growth patterns and blood telomere length (TL) in adulthood using conditional measures of lean and fat mass growth to evaluate potentially sensitive periods of early life growth.

    Methods

    This study included data from 1562 individuals (53% male; age 20‐22 years) participating in the Cebu Longitudinal Health and Nutrition Survey, located in metropolitan Cebu, Philippines. Primary exposures included length‐for‐age z‐score (HAZ) and weight‐for‐age z‐score (WAZ) at birth and conditional measures of linear growth and weight gain during four postnatal periods: 0‐6, 6‐12, and 12‐24 months, and 24 months to 8.5 years. TL was measured at ~21 years of age. We estimated associations using linear regression.

    Results

    The study sample had an average gestational age (38.5 ± 2 weeks) and birth size (HAZ = –0.2 ± 1.1, WAZ = –0.7 ± 1.0), but by age 8.5 years had stunted linear growth (HAZ = –2.1 ± 0.9) and borderline low weight (WAZ = –1.9 ± 1.0) relative to World Health Organization references. Heavier birth weight was associated with longer TL in early adulthood (P = .03), but this association was attenuated when maternal age at birth was included in the model (P = .07). Accelerated linear growth between 6 and 12 months was associated with longer TL in adulthood (P = .006), whereas weight gain between 12 and 24 months was associated with shorter TL in adulthood (P = .047).

    Conclusions

    In Cebu, individuals who were born heavier have longer TL in early adulthood, but birthweight itself may not explain the association. Findings suggest that childhood growth is associated with the cellular senescence process in adulthood, implying early life well‐being may be linked to adult health.

     
  5. Abstract

    Community structure is a fundamental topological characteristic of optimally organized brain networks. Currently, there is no clear standard or systematic approach for selecting the most appropriate community detection method. Furthermore, the impact of method choice on the accuracy and robustness of estimated communities (and network modularity), as well as method‐dependent relationships between network communities and cognitive and other individual measures, are not well understood. This study analyzed large datasets of real brain networks (estimated from resting‐state fMRI from N = 5251 pre/early adolescents in the Adolescent Brain Cognitive Development [ABCD] study), and N = 5338 synthetic networks with heterogeneous, data‐inspired topologies, with the goal to investigate and compare three classes of community detection methods: (i) modularity maximization‐based (Newman and Louvain), (ii) probabilistic (Bayesian inference within the framework of stochastic block modeling (SBM)), and (iii) geometric (based on graph Ricci flow). Extensive comparisons between methods and their individual accuracy (relative to the ground truth in synthetic networks), and reliability (when applied to multiple fMRI runs from the same brains) suggest that the underlying brain network topology plays a critical role in the accuracy, reliability and agreement of community detection methods. Consistent method (dis)similarities, and their correlations with topological properties, were estimated across fMRI runs. Based on synthetic graphs, most methods performed similarly and had comparable high accuracy only in some topological regimes, specifically those corresponding to developed connectomes with at least quasi‐optimal community organization. In contrast, in densely and/or weakly connected networks with difficult to detect communities, the methods yielded highly dissimilar results, with Bayesian inference within SBM having significantly higher accuracy compared to all others.
Associations between method‐specific modularity and demographic, anthropometric, physiological and cognitive parameters showed mostly method invariance but some method dependence as well. Although method sensitivity to different levels of community structure may in part explain method‐dependent associations between modularity estimates and parameters of interest, method dependence also highlights potential issues of reliability and reproducibility. These findings suggest that a probabilistic approach, such as Bayesian inference in the framework of SBM, may provide consistently reliable estimates of community structure across network topologies. In addition, to maximize robustness of biological inferences, identified network communities and their cognitive, behavioral and other correlates should be confirmed with multiple reliable detection methods.

     