
Title: Speeding up Monte Carlo simulations for the adaptive sum of powered score test with importance sampling

A central but challenging problem in genetic studies is to test for (usually weak) associations between a complex trait (e.g., a disease status) and sets of multiple genetic variants. Because no uniformly most powerful test exists, data‐adaptive tests, such as the adaptive sum of powered score (aSPU) test, are advantageous in maintaining high power against a wide range of alternatives. However, many adaptive tests like aSPU have no closed‐form expression for calculating their p‐values accurately and analytically, so Monte Carlo (MC) simulations are often used instead. These simulations can be time consuming at the stringent significance levels (e.g., 5e‐8) used in genome‐wide association studies (GWAS): estimating such a small p‐value requires a huge number of MC simulations (e.g., 1e+10). As an alternative, we propose using importance sampling to speed up such calculations. We develop some theory to motivate a proposed algorithm for the aSPU test, and show that the proposed method is computationally more efficient than the standard MC simulations. Using both simulated and real data, we demonstrate the superior performance of the new method over the standard MC simulations.
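To illustrate why importance sampling helps with very small p‐values, here is a minimal Python sketch. It is not the algorithm proposed in the paper and does not implement the aSPU test. It assumes a toy statistic, the sum of squares of two independent standard normal scores, chosen because its exact tail probability is exp(-t/2) and the estimate can therefore be checked. The scaled normal proposal, the threshold, and the function names are all illustrative assumptions.

```python
import math
import random


def tail_prob_plain_mc(threshold, n_sims, rng):
    """Plain MC estimate of P(Z1^2 + Z2^2 > threshold) with Z ~ N(0, I_2)."""
    hits = 0
    for _ in range(n_sims):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if z1 * z1 + z2 * z2 > threshold:
            hits += 1
    return hits / n_sims


def tail_prob_importance(threshold, n_sims, scale, rng):
    """Importance-sampling estimate of the same tail probability.

    Draws come from the (assumed) proposal N(0, scale^2 * I_2), which places
    far more mass beyond the threshold than the null distribution does. Each
    draw that exceeds the threshold is weighted by the density ratio
    null / proposal so that the estimator remains unbiased.
    """
    total = 0.0
    for _ in range(n_sims):
        z1, z2 = rng.gauss(0.0, scale), rng.gauss(0.0, scale)
        stat = z1 * z1 + z2 * z2
        if stat > threshold:
            # Ratio of N(0, I_2) to N(0, scale^2 * I_2) densities at (z1, z2).
            total += scale**2 * math.exp(-0.5 * stat * (1.0 - 1.0 / scale**2))
    return total / n_sims


threshold = 40.0                       # toy cutoff; exact p-value is about 2e-9
exact = math.exp(-threshold / 2.0)     # chi-square(2) survival function
rng = random.Random(0)                 # fixed seed for reproducibility
n_sims = 100_000

plain = tail_prob_plain_mc(threshold, n_sims, rng)
weighted = tail_prob_importance(threshold, n_sims, math.sqrt(threshold / 2.0), rng)

print(f"exact      = {exact:.3e}")
print(f"plain MC   = {plain:.3e}")
print(f"importance = {weighted:.3e}")
```

With the same number of draws, plain MC will almost always observe no exceedances at this threshold and so cannot resolve the p‐value, whereas the weighted estimator typically lands within a few percent of the exact value. How well this carries over to a real adaptive test depends on choosing a good proposal distribution, which is the part the paper's theory addresses.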

Award ID(s):
1846747 1712717 1659328
Publisher / Repository:
Oxford University Press
Medium: X Size: p. 261-273
Sponsoring Org:
National Science Foundation
More Like this
  1. The explosion of biobank data offers unprecedented opportunities for gene-environment interaction (G×E) studies of complex diseases because of the large sample sizes and the rich collection of genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in G×E assessment, especially for set-based G×E variance component (VC) tests, which are a widely used strategy to boost overall G×E signals and to evaluate the joint G×E effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based G×E tests, to permit G×E VC tests for biobank-scale data. SEAGLE employs modern matrix computations to calculate the test statistic and p-value of the G×E VC test in a computationally efficient fashion, without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes on the order of 10^5, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate the performance of SEAGLE using extensive simulations. We illustrate its utility by conducting genome-wide gene-based G×E analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index.
  2. Abstract

    Inferring the frequency and mode of hybridization among closely related organisms is an important step for understanding the process of speciation and can help to uncover reticulated patterns of phylogeny more generally. Phylogenomic methods to test for the presence of hybridization come in many varieties and typically operate by leveraging expected patterns of genealogical discordance in the absence of hybridization. An important assumption made by these tests is that the data (genes or SNPs) are independent given the species tree. However, when the data are closely linked, it is especially important to consider their nonindependence. Recently, deep learning techniques such as convolutional neural networks (CNNs) have been used to perform population genetic inferences with linked SNPs coded as binary images. Here, we use CNNs for selecting among candidate hybridization scenarios using the tree topology (((P1,P2),P3), Out) and a matrix of pairwise nucleotide divergence (dXY) calculated in windows across the genome. Using coalescent simulations to train and independently test a neural network, we showed that our method, HyDe‐CNN, was able to accurately perform model selection for hybridization scenarios across a wide breadth of parameter space. We then used HyDe‐CNN to test models of admixture in Heliconius butterflies and compared it to phylogeny‐based introgression statistics. Given the flexibility of our approach, the dropping cost of long‐read sequencing and the continued improvement of CNN architectures, we anticipate that inferences of hybridization using deep learning methods like ours will help researchers to better understand patterns of admixture in their study organisms.

  3. Abstract Motivation

    CpG sites within the same genomic region often share similar methylation patterns and tend to be co-regulated by multiple genetic variants that may interact with one another.


    We propose a multi-trait methylation random field (multi-MRF) method to evaluate the joint association between a set of CpG sites and a set of genetic variants. The proposed method has several advantages. First, it is a multi-trait method that allows flexible correlation structures between neighboring CpG sites (e.g. distance-based correlation). Second, it is also a multi-locus method that integrates the effect of multiple common and rare genetic variants. Third, it models the methylation traits with a beta distribution to characterize their bimodal and interval properties. Through simulations, we demonstrated that the proposed method had improved power over some existing methods under various disease scenarios. We further illustrated the proposed method via an application to a study of congenital heart defects (CHDs) with 83 cardiac tissue samples. Our results suggested that gene BACE2, a methylation quantitative trait locus (QTL) candidate, colocalized with expression QTLs in artery tibial and harbored genetic variants with nominally significant associations in two genome-wide association studies of CHD.

    Availability and implementation

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  4. Abstract

    Modeling and drawing inference on the joint associations between single‐nucleotide polymorphisms and a disease has sparked interest in genome‐wide association studies. In the motivating Boston Lung Cancer Survival Cohort (BLCSC) data, the presence of a large number of single‐nucleotide polymorphisms of interest, though smaller than the sample size, challenges inference on their joint associations with the disease outcome. In similar settings, we find that neither the debiased lasso approach (van de Geer et al., 2014), which assumes sparsity on the inverse information matrix, nor the standard maximum likelihood method can yield confidence intervals with satisfactory coverage probabilities for generalized linear models. Under this “large n, diverging p” scenario, we propose an alternative debiased lasso approach that directly inverts the Hessian matrix without imposing the matrix sparsity assumption, which further reduces bias compared to the original debiased lasso and ensures valid confidence intervals with nominal coverage probabilities. We establish the asymptotic distributions of any linear combinations of the parameter estimates, which lays the theoretical ground for drawing inference. Simulations show that the proposed refined debiased estimating method performs well in removing bias and yields honest confidence interval coverage. We use the proposed method to analyze the aforementioned BLCSC data, a large‐scale hospital‐based epidemiology cohort study investigating the joint effects of genetic variants on lung cancer risks.

  5. Abstract Aim

    To test whether fungal communities associated with the widespread seagrass Syringodium isoetifolium can be differentiated on either side of Wallace's line, a boundary line separating Asian and Australasian fauna. Additionally, we examine whether host multilocus genotype predicts fungal community composition.


    A total of 77 samples were collected from 14 sampling sites spanning the Indonesian archipelago.


    We sequenced the fungal ITS1 gene using Illumina MiSeq technology and used a clustering‐free Divisive Amplicon Denoising Algorithm to infer ribosomal sequence variants. Data were analysed via non‐metric multidimensional scaling, Mantel tests and permutational multivariate analysis of variance. Binary and quantitative null models were used to determine whether results significantly deviated from random. Host genotype was determined by genotyping at 18 microsatellite loci, and standard genetic analysis was performed in the R package APE.


    Significant differences in fungal community composition were detected on either side of Wallace's line (p < .001, R2 = .040). A significant distance decay of similarity pattern was observed between ribosomal sequence variants and geographical distance (p = .001, R2 = .227), and several fungal ribosomal sequence variants were significantly associated with sampling sites found either east or west of Wallace's line.

    Main conclusions

    Fungi are generally considered to have excellent dispersal potentials and marine fungi have the potential to disperse far and wide in an environment that has no obvious barriers to dispersal. Despite this assumed excellent dispersal potential, we show that fungal communities on either side of Wallace's line are significantly different from one another. We speculate that limited dispersal and differences in habitat type are responsible for the observed pattern. Work examining biogeographical patterns in marine fungi is still in its infancy and further research is required to fully understand marine fungal biogeography.
