Title: BayesKAT: bayesian optimal kernel-based test for genetic association studies reveals joint genetic effects in complex diseases
Abstract: Genome-wide Association Studies (GWAS) methods have identified individual single-nucleotide polymorphisms (SNPs) significantly associated with specific phenotypes. Nonetheless, many complex diseases are polygenic and are controlled by multiple genetic variants that are usually non-linearly dependent. These genetic variants are marginally less effective and remain undetected in GWAS analysis. Kernel-based tests (KBT), which evaluate the joint effect of a group of genetic variants, are therefore critical for complex disease analysis. However, choosing different kernel functions in KBT can significantly influence the type I error control and power, and selecting the optimal kernel remains a statistically challenging task. A few existing methods suffer from inflated type I errors, limited scalability, inferior power or issues of ambiguous conclusions. Here, we present a new Bayesian framework, BayesKAT (https://github.com/wangjr03/BayesKAT), which overcomes these kernel specification issues by selecting the optimal composite kernel adaptively from the data while testing genetic associations simultaneously. Furthermore, BayesKAT implements a scalable computational strategy to boost its applicability, especially for high-dimensional cases where other methods become less effective. Based on a series of performance comparisons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations and controlling type I errors simultaneously. Applied to a variety of groups of functionally related genetic variants based on biological pathways, co-expression gene modules and protein complexes, BayesKAT deciphers the complex genetic basis and provides mechanistic insights into human diseases.
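To make the idea of a kernel-based test with a composite kernel concrete, the following is a minimal sketch, not the BayesKAT method itself. It assumes a generic score-type statistic Q = r' K r on the centered phenotype r, a fixed (not Bayesian-selected) weighting of linear, quadratic and Gaussian kernels, and a permutation p-value. All function names, the trace normalization and the Gaussian bandwidth are illustrative assumptions rather than anything taken from the paper or its repository.

```python
import numpy as np

def linear_kernel(G):
    # Similarity as the inner product of genotype vectors.
    return G @ G.T

def quadratic_kernel(G):
    # Adds pairwise interaction terms between variants.
    return (1.0 + G @ G.T) ** 2

def gaussian_kernel(G, bandwidth):
    # Squared Euclidean distances, clipped at 0 against rounding error.
    sq = np.sum(G ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (G @ G.T), 0.0)
    return np.exp(-d2 / bandwidth)

def composite_kernel(G, weights):
    # Weighted sum of three candidate kernels (weights are an assumption,
    # fixed by the caller; BayesKAT instead learns them from the data).
    n, p = G.shape
    kernels = [linear_kernel(G), quadratic_kernel(G), gaussian_kernel(G, float(p))]
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    # Rescale each kernel to trace n so the weights are comparable.
    return sum(wi * K * (n / np.trace(K)) for wi, K in zip(w, kernels))

def kernel_score_statistic(y, K):
    # Q = r' K r with r the mean-centered phenotype (no covariates).
    r = y - y.mean()
    return float(r @ K @ r)

def permutation_pvalue(y, K, n_perm=999, seed=0):
    # Shuffle phenotypes to break any genotype-phenotype link.
    rng = np.random.default_rng(seed)
    observed = kernel_score_statistic(y, K)
    exceed = sum(
        kernel_score_statistic(rng.permutation(y), K) >= observed
        for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)
```

Here G would be an n-by-p matrix of genotype codes (0, 1 or 2) for one variant group and y a length-n quantitative phenotype. Changing the weights changes which kinds of joint effects the statistic is sensitive to, which is the kernel-choice problem the abstract describes.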
Award ID(s):
1942143
PAR ID:
10501842
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Briefings in Bioinformatics
Volume:
25
Issue:
3
ISSN:
1467-5463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract The genome‐wide association studies (GWAS) typically use linear or logistic regression models to identify associations between phenotypes (traits) and genotypes (genetic variants) of interest. However, the use of regression with the additive assumption has potential limitations. First, the normality assumption of residuals is the one that is rarely seen in practice, and deviation from normality increases the Type‐I error rate. Second, building a model based on such an assumption ignores genetic structures, like, dominant, recessive, and protective‐risk cases. Ignoring genetic variants may result in spurious conclusions about the associations between a variant and a trait. We propose an assumption‐free model built upon data‐consistent inversion (DCI), which is a recently developed measure‐theoretic framework utilized for uncertainty quantification. This proposed DCI‐derived model builds a nonparametric distribution on model inputs that propagates to the distribution of observed data without the required normality assumption of residuals in the regression model. This characteristic enables the proposed DCI‐derived model to cover all genetic variants without emphasizing on additivity of the classic‐GWAS model. Simulations and a replication GWAS with data from the COPDGene demonstrate the ability of this model to control the Type‐I error rate at least as well as the classic‐GWAS (additive linear model) approach while having similar or greater power to discover variants in different genetic modes of transmission. 
  2. The MHC region is highly associated with autoimmune and infectious diseases. Here we conduct an in-depth interrogation of associations between genetic variation, gene expression and disease. We create a comprehensive map of regulatory variation in the MHC region using WGS from 419 individuals to call eight-digit HLA types and RNA-seq data from matched iPSCs. Building on this regulatory map, we explored GWAS signals for 4083 traits, detecting colocalization for 180 disease loci with eQTLs. We show that eQTL analyses taking HLA type haplotypes into account have substantially greater power compared with only using single variants. We examined the association between the 8.1 ancestral haplotype and delayed colonization in Cystic Fibrosis, postulating that downregulation of RNF5 expression is the likely causal mechanism. Our study provides insights into the genetic architecture of the MHC region and pinpoints disease associations that are due to differential expression of HLA genes and non-HLA genes. 
  3. Abstract Biobanks often contain several phenotypes relevant to diseases such as major depressive disorder (MDD), with partly distinct genetic architectures. Researchers face complex tradeoffs between shallow (large sample size, low specificity/sensitivity) and deep (small sample size, high specificity/sensitivity) phenotypes, and the optimal choices are often unclear. Here we propose to integrate these phenotypes to combine the benefits of each. We use phenotype imputation to integrate information across hundreds of MDD-relevant phenotypes, which significantly increases genome-wide association study (GWAS) power and polygenic risk score (PRS) prediction accuracy of the deepest available MDD phenotype in UK Biobank, LifetimeMDD. We demonstrate that imputation preserves specificity in its genetic architecture using a novel PRS-based pleiotropy metric. We further find that integration via summary statistics also enhances GWAS power and PRS predictions, but can introduce nonspecific genetic effects depending on input. Our work provides a simple and scalable approach to improve genetic studies in large biobanks by integrating shallow and deep phenotypes. 
  4. SARS-CoV-2 has caused symptomatic COVID-19 and widespread death across the globe. We sought to determine genetic variants contributing to COVID-19 susceptibility and hospitalization in a large biobank linked to a national United States health system. We identified 19,168 (3.7%) lab-confirmed COVID-19 cases among Million Veteran Program participants between March 1, 2020, and February 2, 2021, including 11,778 Whites, 4,893 Blacks, and 2,497 Hispanics. A multi-population genome-wide association study (GWAS) for COVID-19 outcomes identified four independent genetic variants (rs8176719, rs73062389, rs60870724, and rs73910904) contributing to COVID-19 positivity, including one novel locus found exclusively among Hispanics. We replicated eight of nine previously reported genetic associations at an alpha of 0.05 in at least one population-specific or the multi-population meta-analysis for one of the four MVP COVID-19 outcomes. We used rs8176719 and three additional variants to accurately infer ABO blood types. We found that A, AB, and B blood types were associated with testing positive for COVID-19 compared with O blood type with the highest risk for the A blood group. We did not observe any genome-wide significant associations for COVID-19 severity outcomes among those testing positive. Our study replicates prior GWAS findings associated with testing positive for COVID-19 among mostly White samples and extends findings at three loci to Black and Hispanic individuals. We also report a new locus among Hispanics requiring further investigation. These findings may aid in the identification of novel therapeutic agents to decrease the morbidity and mortality of COVID-19 across all major ancestral populations. 
  5. Abstract In standard genome-wide association studies (GWAS), the standard association test is underpowered to detect associations between loci with multiple causal variants with small effect sizes. We propose a statistical method, Model-based Association test Reflecting causal Status (MARS), that finds associations between variants in risk loci and a phenotype, considering the causal status of variants, only requiring the existing summary statistics to detect associated risk loci. Utilizing extensive simulated data and real data, we show that MARS increases the power of detecting true associated risk loci compared to previous approaches that consider multiple variants, while controlling the type I error. 