skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: SEAGLE: A Scalable Exact Algorithm for Large-Scale Set-Based Gene-Environment Interaction Tests in Biobank Data
The explosion of biobank data offers unprecedented opportunities for gene-environment interaction (GxE) studies of complex diseases because of the large sample sizes and the rich collection in genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in G×E assessment, especially for set-based G×E variance component (VC) tests, which are a widely used strategy to boost overall G×E signals and to evaluate the joint G×E effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a S calable E xact A l G orithm for L arge-scale set-based G× E tests, to permit G×E VC tests for biobank-scale data. SEAGLE employs modern matrix computations to calculate the test statistic and p -value of the GxE VC test in a computationally efficient fashion, without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes in the order of 10 5 , is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate the performance of SEAGLE using extensive simulations. We illustrate its utility by conducting genome-wide gene-based G×E analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index.  more » « less
Award ID(s):
1760374
PAR ID:
10338019
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Frontiers in Genetics
Volume:
12
ISSN:
1664-8021
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Our knowledge of the contribution of genetic interactions (epistasis) to variation in human complex traits remains limited, partly due to the lack of efficient, powerful, and interpretable algorithms to detect interactions. Recently proposed approaches for set-based association tests show promise in improving the power to detect epistasis by examining the aggregated effects of multiple variants. Nevertheless, these methods either do not scale to large Biobank data sets or lack interpretability. We propose QuadKAST, a scalable algorithm focused on testing pairwise interaction effects (quadratic effects) within small to medium-sized sets of genetic variants (window size ≤100) on a trait and provide quantified interpretation of these effects. Comprehensive simulations show that QuadKAST is well-calibrated. Additionally, QuadKAST is highly sensitive in detecting loci with epistatic signals and accurate in its estimation of quadratic effects. We applied QuadKAST to 52 quantitative phenotypes measured in ≈300,000 unrelated white British individuals in the UK Biobank to test for quadratic effects within each of 9515 protein-coding genes. We detect 32 trait-gene pairs across 17 traits and 29 genes that demonstrate statistically significant signals of quadratic effects (accounting for the number of genes and traits tested). Across these trait-gene pairs, the proportion of trait variance explained by quadratic effects is comparable to additive effects, with five pairs having a ratio >1. Our method enables the detailed investigation of epistasis on a large scale, offering new insights into its role and importance. 
    more » « less
  2. Schwartz, Russell (Ed.)
    Abstract Motivation While gene–environment (GxE) interactions contribute importantly to many different phenotypes, detecting such interactions requires well-powered studies and has proven difficult. To address this, we combine two approaches to improve GxE power: simultaneously evaluating multiple phenotypes and using a two-step analysis approach. Previous work shows that the power to identify a main genetic effect can be improved by simultaneously analyzing multiple related phenotypes. For a univariate phenotype, two-step methods produce higher power for detecting a GxE interaction compared to single step analysis. Therefore, we propose a two-step approach to test for an overall GxE effect for multiple phenotypes. Results Using simulations we demonstrate that, when more than one phenotype has GxE effect (i.e. GxE pleiotropy), our approach offers substantial gain in power (18–43%) to detect an aggregate-level GxE effect for a multivariate phenotype compared to an analogous two-step method to identify GxE effect for a univariate phenotype. We applied the proposed approach to simultaneously analyze three lipids, LDL, HDL and Triglyceride with the frequency of alcohol consumption as environmental factor in the UK Biobank. The method identified two loci with an overall GxE effect on the vector of lipids, one of which was missed by the competing approaches. Availability and implementation We provide an R package MPGE implementing the proposed approach which is available from CRAN: https://cran.r-project.org/web/packages/MPGE/index.html Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract Our knowledge of non-linear genetic effects on complex traits remains limited, in part, due to the modest power to detect such effects. While kernel-based tests offer a versatile approach to test for non-linear relationships between sets of genetic variants and traits, current approaches cannot be applied to Biobank-scale datasets containing hundreds of thousands of individuals. We propose, FastKAST, a kernel-based approach that can test for non-linear effects of a set of variants on a quantitative trait. FastKAST provides calibrated hypothesis tests while enabling analysis of Biobank-scale datasets with hundreds of thousands of unrelated individuals from a homogeneous population. We apply FastKAST to 53 quantitative traits measured across ≈ 300 K unrelated white British individuals in the UK Biobank to detect sets of variants with non-linear effects at genome-wide significance. 
    more » « less
  4. Emerging large-scale biobanks pairing genotype data with phenotype data present new opportunities to prioritize shared genetic associations across multiple phenotypes for molecular validation. Past research, by our group and others, has shown gene-level tests of association produce biologically interpretable characterization of the genetic architecture of a given phenotype. Here, we present a new method, Ward clustering to identify Internal Node branch length outliers using Gene Scores (WINGS), for identifying shared genetic architecture among multiple phenotypes. The objective of WINGS is to identify groups of phenotypes, or “clusters,” sharing a core set of genes enriched for mutations in cases. We validate WINGS using extensive simulation studies and then combine gene-level association tests with WINGS to identify shared genetic architecture among 81 case-control and seven quantitative phenotypes in 349,468 European-ancestry individuals from the UK Biobank. We identify eight prioritized phenotype clusters and recover multiple published gene-level associations within prioritized clusters. 
    more » « less
  5. Lipka, Alexander (Ed.)
    Abstract Many genetic models (including models for epistatic effects as well as genetic-by-environment) involve covariance structures that are Hadamard products of lower rank matrices. Implementing these models require factorizing large Hadamard product matrices. The available algorithms for factorization do not scale well for big data, making the use of some of these models not feasible with large sample sizes. Here, based on properties of Hadamard products and (related) Kronecker products we propose an algorithm that produces an approximate decomposition that is orders of magnitude faster than the standard eigenvalue decomposition. In this article, we describe the algorithm, show how it can be used to factorize large Hadamard product matrices, present benchmarks, and illustrate the use of the method by presenting an analysis of data from the northern testing locations of the G×E project from the Genomes-to-Fields Initiative (n∼60,000). We implemented the proposed algorithm in the open-source ‘tensorEVD’ R-package. 
    more » « less