Title: Efficient variance components analysis across millions of genomes
Abstract

While variance components analysis has emerged as a powerful tool in complex trait genetics, existing methods for fitting variance components do not scale well to large-scale datasets of genetic variation. Here, we present a method for variance components analysis that is accurate and efficient: capable of estimating one hundred variance components on a million individuals genotyped at a million SNPs in a few hours. We illustrate the utility of our method in estimating and partitioning variation in a trait explained by genotyped SNPs (SNP-heritability). Analyzing 22 traits with genotypes from 300,000 individuals across about 8 million common and low frequency SNPs, we observe that per-allele squared effect size increases with decreasing minor allele frequency (MAF) and linkage disequilibrium (LD) consistent with the action of negative selection. Partitioning heritability across 28 functional annotations, we observe enrichment of heritability in FANTOM5 enhancers in asthma, eczema, thyroid and autoimmune disorders.

 
Award ID(s):
1943497 1705121
NSF-PAR ID:
10182886
Publisher / Repository:
Nature Publishing Group
Journal Name:
Nature Communications
Volume:
11
Issue:
1
ISSN:
2041-1723
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Motivation

    Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide single nucleotide polymorphism (SNP) variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear mixed models (LMMs) have emerged as a key tool for heritability estimation, where the parameters of the LMMs, i.e. the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens.

    Results

    We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments estimator that has a runtime complexity of O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log_3 N, log_3 M)).

    We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500 000 individuals and 100 000 SNPs in 38 min.

    Availability and implementation

    The RHE-reg software is made freely available to the research community at: https://github.com/sriramlab/RHE-reg.

     
  2. INTRODUCTION

    Thousands of genetic variants have been associated with human diseases and traits through genome-wide association studies (GWASs). Translating these discoveries into improved therapeutics requires discerning which variants among hundreds of candidates are causally related to disease risk. To date, only a handful of causal variants have been confirmed. Here, we leverage 100 million years of mammalian evolution to address this major challenge.

    RATIONALE

    We compared genomes from hundreds of mammals and identified bases with unusually few variants (evolutionarily constrained). Constraint is a measure of functional importance that is agnostic to cell type or developmental stage. It can be applied to investigate any heritable disease or trait and is complementary to resources using cell type– and time point–specific functional assays like Encyclopedia of DNA Elements (ENCODE) and Genotype-Tissue Expression (GTEx).

    RESULTS

    Using constraint calculated across placental mammals, 3.3% of bases in the human genome are significantly constrained, including 57.6% of coding bases. Most constrained bases (80.7%) are noncoding. Common variants (allele frequency ≥ 5%) and low-frequency variants (0.5% ≤ allele frequency < 5%) are depleted for constrained bases (1.85 versus 3.26% expected by chance, P < 2.2 × 10^−308). Pathogenic ClinVar variants are more constrained than benign variants (P < 2.2 × 10^−16). The most constrained common variants are more enriched for disease single-nucleotide polymorphism (SNP)–heritability in 63 independent GWASs. The enrichment of SNP-heritability in constrained regions is greater (7.8-fold) than previously reported in mammals and is even higher in primates (11.1-fold). It exceeds the enrichment of SNP-heritability in nonsynonymous coding variants (7.2-fold) and fine-mapped expression quantitative trait loci (eQTL)–SNPs (4.8-fold). The enrichment peaks near constrained bases, with a log-linear decrease of SNP-heritability enrichment as a function of the distance to a constrained base. Zoonomia constraint scores improve functionally informed fine-mapping. Variants at sites constrained in mammals and primates have greater posterior inclusion probabilities and higher per-SNP contributions. In addition, using both constraint and functional annotations improves polygenic risk score accuracy across a range of traits. Finally, incorporating constraint information into the analysis of noncoding somatic variants in medulloblastomas identifies new candidate driver genes.

    CONCLUSION

    Genome-wide measures of evolutionary constraint can help discern which variants are functionally important. This information may accelerate the translation of genomic discoveries into the biological, clinical, and therapeutic knowledge that is required to understand and treat human disease.

    Using evolutionary constraint in genomic studies of human diseases. (A) Constraint was calculated across 240 mammal species, including 43 primates (teal line). (B) Pathogenic ClinVar variants (N = 73,885) are more constrained across mammals than benign variants (N = 231,642; P < 2.2 × 10^−16). (C) More-constrained bases are more enriched for trait-associated variants (63 GWASs). (D) Enrichment of heritability is higher in constrained regions than in functional annotations (left), even in a joint model with 106 annotations (right). (E) Fine-mapping (PolyFun) using a model that includes constraint scores identifies an experimentally validated association at rs1421085. Error bars represent 95% confidence intervals. BMI, body mass index; LF, low frequency; PIP, posterior inclusion probability.
  3. Abstract

    Aggression is a quantitative trait deeply entwined with individual fitness. Mapping the genomic architecture underlying such traits is complicated by complex inheritance patterns, social structure, pedigree information and gene pleiotropy. Here, we leveraged the pedigree of a reintroduced population of grey wolves (Canis lupus) in Yellowstone National Park, Wyoming, USA, to examine the heritability of and the genetic variation associated with aggression. Since their reintroduction, many ecological and behavioural aspects have been documented, providing unmatched records of aggressive behaviour across multiple generations of a wild population of wolves. Using a linear mixed model, a robust genetic relationship matrix, 12,288 single nucleotide polymorphisms (SNPs) and 111 wolves, we estimated the SNP‐based heritability of aggression to be 37% and an additional 14% of the phenotypic variation explained by shared environmental exposures. We identified 598 SNP genotypes from 425 grey wolves to resolve a consensus pedigree that was included in a heritability analysis of 141 individuals with SNP genotype, metadata and aggression data. The pedigree‐based heritability estimate for aggression is 14%, and an additional 16% of the phenotypic variation was explained by shared environmental exposures. We find strong effects of breeding status and relative pack size on aggression. Through an integrative approach, these results provide a framework for understanding the genetic architecture of a complex trait that influences individual fitness, with linkages to reproduction, in a social carnivore. Along with a few other studies, we show here the incredible utility of a pedigreed natural population for dissecting a complex, fitness‐related behavioural trait.

     
  4. Abstract

    As climate changes, understanding the genetic basis of local adaptation in plants becomes an ever more pressing issue. Combining genotype‐environment association (GEA) with genotype–phenotype association (GPA) analysis has an exciting potential to uncover the genetic basis of environmental responses. We use these approaches to identify genetic variants linked to local adaptation to drought in Pinus ponderosa. Over 4 million single nucleotide polymorphisms (SNPs) were identified using 223 individuals from across the Sierra Nevada of California. 927,740 (22.3%) SNPs were retained after filtering for proximity to genes and used in our association analyses. We found 1374 SNPs associated with five major climate variables, with the largest number (1151) associated with April 1st snowpack. We also conducted a greenhouse study with various drought‐tolerance traits measured in first‐year seedlings of a subset of the genotyped trees. 796 SNPs were associated with control‐condition trait values, while 1149 were associated with responsiveness of these traits to drought. While no individual SNPs were associated with both the environmental variables and the measured traits, several annotated genes were associated with both, particularly those involved in cell wall formation, biotic and abiotic stress responses, and ubiquitination. However, the functions of many of the associated genes have not yet been determined due to the lack of gene annotation information for conifers. Future studies are needed to assess the developmental roles and ecological significance of these unknown genes.

     
  5. SUMMARY

    Genome‐wide association (GWA) studies can identify quantitative trait loci (QTL) putatively underlying traits of interest, and nested association mapping (NAM) can further assess allelic series. Near‐isogenic lines (NILs) can be used to characterize, dissect and validate QTL, but the development of NILs is costly. Previous studies have utilized limited numbers of NILs and introgression donors. We characterized a panel of 1270 maize NILs derived from crosses between 18 diverse inbred lines and the recurrent inbred parent B73, referred to as the nested NILs (nNILs). The nNILs were phenotyped for flowering time, height and resistance to three foliar diseases, and genotyped with genotyping‐by‐sequencing. Across traits, broad‐sense heritability (0.4–0.8) was relatively high. The 896 genotyped nNILs contain 2638 introgressions, which span the entire genome with substantial overlap within and among allele donors. GWA with the whole panel identified 29 QTL for height and disease resistance with allelic variation across donors. To date, this is the largest and most diverse publicly available panel of maize NILs to be phenotypically and genotypically characterized. The nNILs are a valuable resource for the maize community, providing an extensive collection of introgressions from the founders of the maize NAM population in a B73 background combined with data on six agronomically important traits and from genotyping‐by‐sequencing. We demonstrate that the nNILs can be used for QTL mapping and allelic testing. The majority of nNILs had four or fewer introgressions, and could readily be used for future fine mapping studies.

     
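The randomized method-of-moments estimator summarized in item 1 above can be illustrated with a minimal single-component sketch. This is not the RHE-reg implementation. It assumes the model y ~ N(0, sg2 * K + se2 * I) with K = Z Z^T / M for a standardized genotype matrix Z, and it estimates tr(K^2) from B random probe vectors so that K is never formed explicitly. The function names `simulate` and `rhe_sketch`, the simulation settings, and the use of dense matrix products are assumptions made for illustration; the speedup from the structure of the genotype matrix is not implemented here.

```python
import numpy as np


def simulate(N, M, h2, seed=0):
    """Hypothetical simulator: standardized genotypes and a trait with heritability h2."""
    rng = np.random.default_rng(seed)
    maf = rng.uniform(0.1, 0.5, size=M)                  # per-SNP minor allele frequency
    G = rng.binomial(2, maf, size=(N, M)).astype(float)  # genotypes coded 0/1/2
    Z = (G - G.mean(axis=0)) / G.std(axis=0)             # column-standardize
    beta = rng.normal(0.0, np.sqrt(h2 / M), size=M)      # per-SNP effect sizes
    y = Z @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=N)
    return Z, y - y.mean()


def rhe_sketch(Z, y, B=10, seed=0):
    """Method-of-moments variance components with a randomized trace estimate.

    Solves the 2x2 normal equations
        [tr(K^2)  tr(K)] [sg2]   [y^T K y]
        [tr(K)    N    ] [se2] = [y^T y  ]
    using only products with Z and Z^T, at O(N * M * B) cost.
    """
    rng = np.random.default_rng(seed)
    N, M = Z.shape
    tr_K = np.sum(Z * Z) / M              # exact: tr(Z Z^T) / M
    W = rng.standard_normal((N, B))       # B random probe vectors
    KW = Z @ (Z.T @ W) / M                # K @ W without forming K
    tr_K2 = np.sum(KW * KW) / B           # E[w^T K^2 w] = tr(K^2)
    Zty = Z.T @ y
    yKy = Zty @ Zty / M                   # y^T K y
    A = np.array([[tr_K2, tr_K], [tr_K, float(N)]])
    b = np.array([yKy, y @ y])
    sg2, se2 = np.linalg.solve(A, b)
    return sg2, se2, sg2 / (sg2 + se2)    # last value is the heritability estimate
```

For example, `Z, y = simulate(2000, 1000, 0.5)` followed by `rhe_sketch(Z, y, B=50)` returns the two variance components and their ratio. Larger B reduces the noise contributed by the trace estimate at a proportional increase in runtime.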