
Title: Supervised structural learning of semiparametric regression on high‐dimensional correlated covariates with applications to eQTL studies

Expression quantitative trait loci (eQTL) studies use regression models to explain the variance of gene expression with genetic loci or single nucleotide polymorphisms (SNPs). However, regression models for eQTL are challenged by high‐dimensional, non‐sparse, and correlated SNPs with small effects, and by nonlinear relationships between responses and SNPs. Principal component analysis is commonly used for dimension reduction, but it does not consider the responses. As an unsupervised learning method, it often performs poorly when the focus is on discovering the response‐covariate relationship. We propose a new supervised structural dimension reduction method for semiparametric regression models with high‐dimensional and correlated covariates; we extract low‐dimensional latent features from a vast number of correlated SNPs while accounting for their relationships, possibly nonlinear, with gene expression. Our model identifies important SNPs associated with gene expression and estimates the association parameters via a likelihood‐based algorithm. An application to GTEx data on a cancer‐related gene is presented, in which our method detects 18 novel eQTLs. In addition, extensive simulations show that our method outperforms competing methods in bias, efficiency, and computational cost.

Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Statistics in Medicine
Page Range / eLocation ID:
p. 3145-3163
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Genome‐wide expression quantitative trait loci (eQTL) mapping explores the relationship between gene expression and DNA variants, such as single‐nucleotide polymorphisms (SNPs), to understand the genetic basis of human diseases. Due to the large number of genes and SNPs that need to be assessed, current methods for eQTL mapping often suffer from low detection power, especially for identifying trans‐eQTLs. In this paper, we propose performing SNP ranking based on the higher criticism (HC) statistic, a summary statistic developed for large‐scale signal detection. We illustrate how HC‐based SNP ranking can effectively prioritize eQTL signals over noise, greatly reduce the burden of joint modeling, and improve the power of eQTL mapping. Numerical results in simulation studies demonstrate the superior performance of our method compared to existing methods. The proposed method is also evaluated in an analysis of HapMap eQTL data, and the results are compared to a database of known eQTLs.

  2. Abstract

    Genetic association signals have been mostly found in noncoding regions through genome-wide association studies (GWAS), suggesting a role for gene expression regulation in human diseases and traits. However, there has been limited success in colocalizing expression quantitative trait loci (eQTLs) with disease-associated variants. Mediated expression score regression (MESC) is a recently proposed method to quantify the proportion of trait heritability mediated by genetically regulated gene expression (GReX). Applications of MESC to GWAS results have yielded low estimates of mediated heritability for many traits. As MESC relies on stringent independence assumptions between cis-eQTL effects, gene effects, and nonmediated SNP effects, it may fail to characterize the true relationships between those effect sizes, which leads to biased results. Here, we examine the robustness of MESC to investigate whether the low fraction of mediated heritability inferred by MESC reflects biological reality for complex traits or is an underestimate caused by model misspecification. Our results suggest that MESC may produce biased estimates of mediated heritability: misspecification of gene annotations leads to underestimation, whereas misspecification of SNP annotations may lead to overestimation. Furthermore, errors in eQTL effect estimates may lead to underestimation of mediated heritability.

  3. Abstract

    Genome‐wide association studies (GWAS) are used to investigate genetic variants contributing to complex traits. Despite the discovery of many loci, a large proportion of “missing” heritability remains unexplained. Gene–gene interactions may help explain some of this gap. Traditionally, gene–gene interactions have been evaluated using parametric statistical methods such as linear and logistic regression, with multifactor dimensionality reduction (MDR) used to address sparseness of data in high dimensions. We propose a method for the analysis of gene–gene interactions across independent single‐nucleotide polymorphisms (SNPs) in two genes. Typical methods for this problem use statistics based on an asymptotic chi‐squared mixture distribution, which is not easy to use. Here, we propose a Kullback–Leibler‐type statistic, which follows an asymptotic, positive, normal distribution under the null hypothesis of no relationship between SNPs in the two genes, and is normally distributed under the alternative hypothesis. The performance of the proposed method is evaluated by simulation studies, which show promising results. The method is also used to analyze real data and identifies gene–gene interactions among RAB3A, MADD, and PTPRN on type 2 diabetes (T2D) status.

  4. Abstract

    Background: Large-scale genome-wide association studies have successfully identified many genetic variants significantly associated with Alzheimer’s disease (AD), such as rs429358, rs11038106, rs723804, rs13591776, and more. The next key step is to understand the function of these SNPs and the downstream biology through which they affect the development of AD. However, this remains a challenging task due to the tissue-specific nature of transcriptomic and proteomic data and the limited availability of brain tissue. In this paper, instead of using coupled transcriptomic data, we performed an integrative analysis of existing GWAS findings and expression quantitative trait loci (eQTL) results from AD-related brain regions to estimate the transcriptomic alterations in the AD brain. Results: We used the summary-based Mendelian randomization method along with the heterogeneity in dependent instruments method and identified 32 genes with potentially altered levels in the temporal cortex region. Among these, 10 were further validated using real gene expression data collected from the temporal cortex region, and 19 SNPs from the NECTIN and TOMM40 genes were found to be associated with multiple temporal cortex imaging phenotypes. Conclusion: Significant pathways from enriched gene networks included neutrophil degranulation, cell surface interactions at the vascular wall, and regulation of TP53 activity, which remain relatively underexplored in Alzheimer’s disease. The results also point to the need to incorporate trans-eQTL effects into this integrative analysis.
  5. Abstract

    The regulation of gene expression is central to many biological processes. Gene regulatory networks (GRNs) link transcription factors (TFs) to their target genes and represent maps of potential transcriptional regulation. Here, we analyzed a large number of publicly available maize (Zea mays) transcriptome data sets, including >6000 RNA sequencing samples, to generate 45 coexpression-based GRNs that represent potential regulatory relationships between TFs and other genes in different populations of samples (cross-tissue, cross-genotype, and tissue-and-genotype samples). While these networks are all enriched for biologically relevant interactions, different networks capture distinct TF-target associations and biological processes. By examining the power of our coexpression-based GRNs to accurately predict covarying TF-target relationships in natural variation data sets, we found that presence/absence changes, rather than quantitative changes, in TF gene expression are more likely to be associated with changes in target gene expression. Integrating information from our TF-target predictions and previous expression quantitative trait loci (eQTL) mapping results provided support for 68 TFs underlying 74 previously identified trans-eQTL hotspots spanning a variety of metabolic pathways. This study highlights the utility of developing multiple GRNs within a species to detect putative regulators of important plant pathways and provides potential targets for breeding or biotechnological applications.
