skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 5:00 PM ET until 11:00 PM ET on Friday, June 21 due to maintenance. We apologize for the inconvenience.

Title: Detecting Rare and Common Haplotype–Environment Interaction under Uncertainty of Gene–Environment Independence Assumption

Finding rare variants and gene–environment interactions (GXE) is critical in dissecting complex diseases. We consider the problem of detecting GXE where G is a rare haplotype and E is a nongenetic factor. Such methods typically assume G-E independence, which may not hold in many applications. A pertinent example is lung cancer—there is evidence that variants on Chromosome 15q25.1 interact with smoking to affect the risk. However, these variants are associated with smoking behavior rendering the assumption of G-E independence inappropriate. With the motivation of detecting GXE under G-E dependence, we extend an existing approach, logistic Bayesian LASSO, which assumes G-E independence (LBL-GXE-I) by modeling G-E dependence through a multinomial logistic regression (referred to as LBL-GXE-D). Unlike LBL-GXE-I, LBL-GXE-D controls type I error rates in all situations; however, it has reduced power when G-E independence holds. To control type I error without sacrificing power, we further propose a unified approach, LBL-GXE, to incorporate uncertainty in the G-E independence assumption by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that LBL-GXE has power similar to that of LBL-GXE-I when G-E independence holds, yet has well-controlled type I errors in all situations. To illustrate the utility of LBL-GXE, we analyzed a lung cancer dataset and found several significant interactions in the 15q25.1 region, including one between a specific rare haplotype and smoking.

more » « less
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Medium: X Size: p. 344-355
["p. 344-355"]
Sponsoring Org:
National Science Foundation
More Like this
  1. Multiple papers have studied the use of gene‐environment (GE) independence to enhance power for testing gene‐environment interaction in case‐control studies. However, studies that evaluate the role ofGEindependence in a meta‐analysis framework are limited. In this paper, we extend the single‐study empirical Bayes type shrinkage estimators proposed by Mukherjee and Chatterjee (2008) to a meta‐analysis setting that adjusts for uncertainty regarding the assumption ofGEindependence across studies. We use the retrospective likelihood framework to derive an adaptive combination of estimators obtained under the constrained model (assumingGEindependence) and unconstrained model (without assumptions ofGEindependence) with weights determined by measures ofGEassociation derived from multiple studies. Our simulation studies indicate that this newly proposed estimator has improved average performance across different simulation scenarios than the standard alternative of using inverse variance (covariance) weighted estimators that combines study‐specific constrained, unconstrained, or empirical Bayes estimators. The results are illustrated by meta‐analyzing 6 different studies of type 2 diabetes investigating interactions between genetic markers on the obesity relatedFTOgene and environmental factors body mass index and age.

    more » « less
  2. Background

    Risk-based screening for lung cancer is currently being considered in several countries; however, the optimal approach to determine eligibility remains unclear. Ensemble machine learning could support the development of highly parsimonious prediction models that maintain the performance of more complex models while maximising simplicity and generalisability, supporting the widespread adoption of personalised screening. In this work, we aimed to develop and validate ensemble machine learning models to determine eligibility for risk-based lung cancer screening.

    Methods and findings

    For model development, we used data from 216,714 ever-smokers recruited between 2006 and 2010 to the UK Biobank prospective cohort and 26,616 high-risk ever-smokers recruited between 2002 and 2004 to the control arm of the US National Lung Screening (NLST) randomised controlled trial. The NLST trial randomised high-risk smokers from 33 US centres with at least a 30 pack-year smoking history and fewer than 15 quit-years to annual CT or chest radiography screening for lung cancer. We externally validated our models among 49,593 participants in the chest radiography arm and all 80,659 ever-smoking participants in the US Prostate, Lung, Colorectal and Ovarian (PLCO) Screening Trial. The PLCO trial, recruiting from 1993 to 2001, analysed the impact of chest radiography or no chest radiography for lung cancer screening. We primarily validated in the PLCO chest radiography arm such that we could benchmark against comparator models developed within the PLCO control arm. Models were developed to predict the risk of 2 outcomes within 5 years from baseline: diagnosis of lung cancer and death from lung cancer. We assessed model discrimination (area under the receiver operating curve, AUC), calibration (calibration curves and expected/observed ratio), overall performance (Brier scores), and net benefit with decision curve analysis.

    Models predicting lung cancer death (UCL-D) and incidence (UCL-I) using 3 variables—age, smoking duration, and pack-years—achieved or exceeded parity in discrimination, overall performance, and net benefit with comparators currently in use, despite requiring only one-quarter of the predictors. In external validation in the PLCO trial, UCL-D had an AUC of 0.803 (95% CI: 0.783, 0.824) and was well calibrated with an expected/observed (E/O) ratio of 1.05 (95% CI: 0.95, 1.19). UCL-I had an AUC of 0.787 (95% CI: 0.771, 0.802), an E/O ratio of 1.0 (95% CI: 0.92, 1.07). The sensitivity of UCL-D was 85.5% and UCL-I was 83.9%, at 5-year risk thresholds of 0.68% and 1.17%, respectively, 7.9% and 6.2% higher than the USPSTF-2021 criteria at the same specificity. The main limitation of this study is that the models have not been validated outside of UK and US cohorts.


    We present parsimonious ensemble machine learning models to predict the risk of lung cancer in ever-smokers, demonstrating a novel approach that could simplify the implementation of risk-based lung cancer screening in multiple settings.

    more » « less
  3. The statistical practice of modeling interaction with two linear main effects and a product term is ubiquitous in the statistical and epidemiological literature. Most data modelers are aware that the misspecification of main effects can potentially cause severe type I error inflation in tests for interactions, leading to spurious detection of interactions. However, modeling practice has not changed. In this article, we focus on the specific situation where the main effects in the model are misspecified as linear terms and characterize its impact on common tests for statistical interaction. We then propose some simple alternatives that fix the issue of potential type I error inflation in testing interaction due to main effect misspecification. We show that when using the sandwich variance estimator for a linear regression model with a quantitative outcome and two independent factors, both the Wald and score tests asymptotically maintain the correct type I error rate. However, if the independence assumption does not hold or the outcome is binary, using the sandwich estimator does not fix the problem. We further demonstrate that flexibly modeling the main effect under a generalized additive model can largely reduce or often remove bias in the estimates and maintain the correct type I error rate for both quantitative and binary outcomes regardless of the independence assumption. We show, under the independence assumption and for a continuous outcome, overfitting and flexibly modeling the main effects does not lead to power loss asymptotically relative to a correctly specified main effect model. Our simulation study further demonstrates the empirical fact that using flexible models for the main effects does not result in a significant loss of power for testing interaction in general. Our results provide an improved understanding of the strengths and limitations for tests of interaction in the presence of main effect misspecification. Using data from a large biobank study “The Michigan Genomics Initiative”, we present two examples of interaction analysis in support of our results.

    more » « less
  4. Schwartz, Russell (Ed.)
    Abstract Motivation While gene–environment (GxE) interactions contribute importantly to many different phenotypes, detecting such interactions requires well-powered studies and has proven difficult. To address this, we combine two approaches to improve GxE power: simultaneously evaluating multiple phenotypes and using a two-step analysis approach. Previous work shows that the power to identify a main genetic effect can be improved by simultaneously analyzing multiple related phenotypes. For a univariate phenotype, two-step methods produce higher power for detecting a GxE interaction compared to single step analysis. Therefore, we propose a two-step approach to test for an overall GxE effect for multiple phenotypes. Results Using simulations we demonstrate that, when more than one phenotype has GxE effect (i.e. GxE pleiotropy), our approach offers substantial gain in power (18–43%) to detect an aggregate-level GxE effect for a multivariate phenotype compared to an analogous two-step method to identify GxE effect for a univariate phenotype. We applied the proposed approach to simultaneously analyze three lipids, LDL, HDL and Triglyceride with the frequency of alcohol consumption as environmental factor in the UK Biobank. The method identified two loci with an overall GxE effect on the vector of lipids, one of which was missed by the competing approaches. Availability and implementation We provide an R package MPGE implementing the proposed approach which is available from CRAN: Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  5. The explosion of biobank data offers unprecedented opportunities for gene-environment interaction (GxE) studies of complex diseases because of the large sample sizes and the rich collection in genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in G×E assessment, especially for set-based G×E variance component (VC) tests, which are a widely used strategy to boost overall G×E signals and to evaluate the joint G×E effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a S calable E xact A l G orithm for L arge-scale set-based G× E tests, to permit G×E VC tests for biobank-scale data. SEAGLE employs modern matrix computations to calculate the test statistic and p -value of the GxE VC test in a computationally efficient fashion, without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes in the order of 10 5 , is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate the performance of SEAGLE using extensive simulations. We illustrate its utility by conducting genome-wide gene-based G×E analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index. 
    more » « less