
Title: Haplotype associated RNA expression (HARE) improves prediction of complex traits in maize

Genomic prediction typically relies on associations between single-site polymorphisms and traits of interest. This representation of genomic variability has been successful for predicting many complex traits. However, it usually cannot capture the combination of alleles in haplotypes, and it has generated little insight into the biological function of polymorphisms. Here we present a novel and cost-effective method for imputing cis haplotype associated RNA expression (HARE), study its transferability across tissues, and evaluate genomic prediction models within and across populations. HARE focuses on tightly linked cis-acting causal variants in the immediate vicinity of the gene, while excluding trans effects from diffusion and metabolism. Therefore, HARE estimates were more transferable across different tissues and populations than measured transcript expression. We also showed that HARE estimates captured one-third of the variation in gene expression. HARE estimates were used in genomic prediction models evaluated within and across two diverse maize panels, a diverse association panel (Goodman Association panel) and a large half-sib panel (Nested Association Mapping panel), to predict 26 complex traits. HARE resulted in up to 15% higher prediction accuracy than control approaches that preserved haplotype structure, suggesting that HARE carried functional information in addition to information about haplotype structure. The largest increase was observed when the model was trained in the Nested Association Mapping panel and tested in the Goodman Association panel. Additionally, HARE yielded higher within-population prediction accuracy than measured expression values. The accuracy achieved by measured expression varied across tissues, whereas the accuracy achieved by HARE was more stable. Therefore, imputing RNA expression of genes by haplotype is stable, cost-effective, and transferable across populations.
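
The across-population evaluation described in the abstract (train in one panel, test in another, score by prediction accuracy) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are simulated, ridge regression stands in for whatever prediction model the authors used, prediction accuracy is taken to be the Pearson correlation between predicted and observed trait values, and the names `simulate_panel`, `ridge_fit`, `predict`, and `pearson` are hypothetical.

```python
import random
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def ridge_fit(X, y, lam=1.0):
    """Solve (X'X + lam*I) beta = X'y by Gaussian elimination (no intercept)."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for c in range(p):
        # Partial pivoting keeps the elimination numerically stable.
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for i in range(p - 1, -1, -1):  # back substitution
        tail = sum(A[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = (b[i] - tail) / A[i][i]
    return beta


def predict(X, beta):
    return [sum(bi * xi for bi, xi in zip(beta, row)) for row in X]


def simulate_panel(n, effects, rng):
    """Simulated stand-in for per-gene expression estimates and a trait."""
    X = [[rng.gauss(0.0, 1.0) for _ in effects] for _ in range(n)]
    y = [sum(e * x for e, x in zip(effects, row)) + rng.gauss(0.0, 1.0)
         for row in X]
    return X, y


rng = random.Random(0)
effects = [1.0, -0.8, 0.5, 0.0, 0.3]  # hypothetical per-gene effects

# Train in one simulated panel, test in a second, independent one.
X_train, y_train = simulate_panel(200, effects, rng)
X_test, y_test = simulate_panel(100, effects, rng)

beta_hat = ridge_fit(X_train, y_train, lam=1.0)
accuracy = pearson(predict(X_test, beta_hat), y_test)
print(f"across-panel prediction accuracy (Pearson r): {accuracy:.2f}")
```

In this sketch both panels share the same simulated effects, so accuracy is limited only by noise and sample size; real panels differ in allele frequencies and haplotype structure, which is where transferability of the predictors matters.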

Author(s) / Creator(s):
; ; ;
Hake, Sarah
Journal Name:
PLOS Genetics
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Qu, Li-Jia (Ed.)

    Pleiotropy—when a single gene controls two or more seemingly unrelated traits—has been shown to impact genes with effects on flowering time, leaf architecture, and inflorescence morphology in maize. However, the genome-wide impact of biological pleiotropy across all maize phenotypes is largely unknown. Here, we investigate the extent to which biological pleiotropy impacts phenotypes within maize using GWAS summary statistics reanalyzed from previously published metabolite, field, and expression phenotypes across the Nested Association Mapping population and Goodman Association Panel. Through phenotypic saturation of 120,597 traits, we obtain over 480 million significant quantitative trait nucleotides. We estimate that only 1.56–32.3% of intervals show some degree of pleiotropy. We then assess the relationship between pleiotropy and various biological features such as gene expression, chromatin accessibility, sequence conservation, and enrichment for gene ontology terms. We find very little relationship between pleiotropy and these variables when compared to permuted pleiotropy. We hypothesize that biological pleiotropy of common alleles is not widespread in maize and is highly impacted by nuisance terms such as population structure and linkage disequilibrium. Natural selection on large standing natural variation in maize populations may target wide and large effect variants, leaving the prevalence of detectable pleiotropy relatively low.

  2. Abstract

    Accelerating biomass improvement is a major goal of Miscanthus breeding. The development and implementation of genomic‐enabled breeding tools, like marker‐assisted selection (MAS) and genomic selection, has the potential to improve the efficiency of Miscanthus breeding. The present study conducted genome‐wide association (GWA) and genomic prediction of biomass yield and 14 yield‐component traits in Miscanthus sacchariflorus. We evaluated a diversity panel with 590 accessions of M. sacchariflorus grown across 4 years in one subtropical and three temperate locations and genotyped with 268,109 single‐nucleotide polymorphisms (SNPs). The GWA study identified a total of 835 significant SNPs and 674 candidate genes across all traits and locations. Of the significant SNPs identified, 280 were localized in mapped quantitative trait loci intervals and proximal to SNPs identified for similar traits in previously reported Miscanthus studies, providing additional support for the importance of these genomic regions for biomass yield. Our study gave insights into the genetic basis for yield‐component traits in M. sacchariflorus that may facilitate marker‐assisted breeding for biomass yield. Genomic prediction accuracy for the yield‐related traits ranged from 0.15 to 0.52 across all locations and genetic groups. Prediction accuracies within the six genetic groupings of M. sacchariflorus were limited due to low sample sizes. Nevertheless, the Korea/NE China/Russia (N = 237) genetic group had the highest prediction accuracy of all genetic groups (ranging 0.26–0.71), suggesting that with adequate sample sizes, there is strong potential for genomic selection within the genetic groupings of M. sacchariflorus. This study indicated that MAS and genomic prediction will likely be beneficial for conducting population‐improvement of M. sacchariflorus.

  3. Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene–variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual’s genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at . 
  4. Abstract

    Understanding the consequences of exotic diseases on native forests is important to evolutionary ecology and conservation biology because exotic pathogens have drastically altered US eastern deciduous forests. Cornus florida L. (flowering dogwood tree) is one such species facing heavy mortality. Characterizing the genetic structure of C. florida populations and identifying the genetic signature of adaptation to dogwood anthracnose (an exotic pathogen responsible for high mortality) remain vital for conservation efforts. By integrating genetic data from genotype by sequencing (GBS) of 289 trees across the host species range and distribution of disease, we evaluated the spatial patterns of genetic variation and population genetic structure of C. florida and compared the pattern to the distribution of dogwood anthracnose. Using genome‐wide association study and gradient forest analysis, we identified genetic loci under selection and associated with ecological and diseased regions. The results revealed signals of weak genetic differentiation of three or more subgroups nested within two clusters, explaining up to 2%–6% of genetic variation. The groups largely corresponded to the regions within and outside the eastern Hot‐Continental ecoregion, which also overlapped with areas within and outside the main distribution of dogwood anthracnose. The fungal sequences contained in the GBS data of sampled trees bolstered visual records of disease at sampled locations and were congruent with the reported range of Discula destructiva, suggesting that fungal sequences within‐host genomic data were informative for detecting or predicting disease. The genetic diversity between populations at diseased vs. disease‐free sites across the range of C. florida showed no significant difference.
    We identified 72 single‐nucleotide polymorphisms (SNPs) from 68 loci putatively under selection, some of which exhibited abrupt turnover in allele frequencies along the borders of the Hot‐Continental ecoregion and the range of dogwood anthracnose. One such candidate SNP was independently identified in two prior studies as a possible L‐type lectin‐domain containing receptor kinase. Although diseased and disease‐free areas do not significantly differ in genetic diversity, overall there are slight trends to indicate marginally smaller amounts of genetic diversity in disease‐affected areas. Our results were congruent with previous studies that were based on a limited number of genetic markers in revealing high genetic variation and weak population structure in C. florida.

  5. Abstract

    Crucial to variety improvement programs is the reliable and accurate prediction of a genotype’s performance across environments. However, because genotype by environment (G×E) interaction dictates how changes in the expression and function of genes influence target traits in different environments, the prediction performance of genomic selection (GS) using single-environment models often falls short. Furthermore, despite the successes of genome-wide association studies (GWAS), the genetic insights derived from genome-to-phenome mapping have not yet been incorporated in predictive analytics, making GS models that use a Gaussian kernel primarily an estimator of genomic similarity rather than of the underlying genetic characteristics of the populations. Here, we developed a GS framework that, in addition to capturing the overall genomic relationship, can capitalize on the signal of genetic associations with phenotypic variation as well as the genetic characteristics of the populations. The capacity to predict the performance of populations across environments was demonstrated by an overall gain in predictability of up to 31% for the winter wheat DH population. Compared to Gaussian kernels, our multi-environment weighted kernels could better leverage the significance of genetic associations and yielded a marked improvement of 4–33% in prediction accuracy for half-sib families. Furthermore, the flexibility incorporated in our Bayesian implementation provides the generalizable capacity required for predicting multiple genetically highly heterogeneous populations across environments, allowing reliable GS for genetic improvement programs that have no access to genetically uniform material.
