Title: A Maize Practical Haplotype Graph Leverages Diverse NAM Assemblies
As a result of millions of years of transposon activity, multiple rounds of ancient polyploidization, and large populations that preserve diversity, maize has an extremely structurally diverse genome, evidenced by high-quality genome assemblies that capture substantial levels of both tropical and temperate diversity. We generated a pangenome representation (the Practical Haplotype Graph, PHG) of these assemblies in a database, representing the pangenome haplotype diversity and providing an initial estimate of structural diversity. We leveraged the pangenome to accurately impute haplotypes and genotypes of taxa using various kinds of sequence data, ranging from whole-genome sequencing (WGS) to extremely low-coverage genotyping-by-sequencing (GBS). We imputed the genotypes of the recombinant inbred lines of the NAM population with over 99% mean accuracy, while unrelated germplasm attained a mean imputation accuracy of 92% or 95% when using GBS or WGS data, respectively. Most of the imputation errors occur in haplotypes within European or tropical germplasm, which have yet to be represented in the maize PHG database. The PHG also stores the imputation data in a 30,000-fold more space-efficient manner than a standard genotype file, a key improvement when dealing with large-scale data.
Award ID(s):
1822330
NSF-PAR ID:
10283585
Journal Name:
bioRxiv
ISSN:
2692-8205
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Monitoring genetic diversity in wild populations is a central goal of ecological and evolutionary genetics and is critical for conservation biology. However, genetic studies of nonmodel organisms generally lack access to species-specific genotyping methods (e.g. array-based genotyping) and must instead use sequencing-based approaches. Although costs are decreasing, high-coverage whole-genome sequencing (WGS), which produces the highest confidence genotypes, remains expensive. More economical reduced representation sequencing approaches fail to capture much of the genome, which can hinder downstream inference. Low-coverage WGS combined with imputation using a high-confidence reference panel is a cost-effective alternative, but the accuracy of genotyping using low-coverage WGS and imputation in nonmodel populations is still largely uncharacterized. Here, we empirically tested the accuracy of low-coverage sequencing (0.1–10×) and imputation in two natural populations, one with a large (n = 741) reference panel, rhesus macaques (Macaca mulatta), and one with a smaller (n = 68) reference panel, gelada monkeys (Theropithecus gelada). Using samples sequenced to coverage as low as 0.5×, we could impute genotypes at >95% of the sites in the reference panel with high accuracy (median r² ≥ 0.92). We show that low-coverage imputed genotypes can reliably calculate genetic relatedness and population structure. Based on these data, we also provide best practices and recommendations for researchers who wish to deploy this approach in other populations, with all code available on GitHub (https://github.com/mwatowich/LoCSI-for-non-model-species). Our results endorse accurate and effective genotype imputation from low-coverage sequencing, enabling the cost-effective generation of population-scale genetic datasets necessary for tackling many pressing challenges of wildlife conservation.

     
  2. Abstract Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals¹. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
  3. Koepfli, Klaus-Peter (Ed.)
    Abstract Genomics research has relied principally on the establishment and curation of a reference genome for the species. However, it is increasingly recognized that a single reference genome cannot fully describe the extent of genetic variation within many widely distributed species. Pangenome representations are based on high-quality genome assemblies of multiple individuals and intended to represent the broadest possible diversity within a species. A Bovine Pangenome Consortium (BPC) has recently been established to begin assembling genomes from more than 600 recognized breeds of cattle, together with other related species to provide information on ancestral alleles and haplotypes. Previously reported de novo genome assemblies for Angus, Brahman, Hereford, and Highland breeds of cattle are part of the initial BPC effort. The present report describes a complete single haplotype assembly at chromosome-scale for a fullblood Simmental cow from an F1 bison–cattle hybrid fetus by trio binning. Simmental cattle, also known as Fleckvieh due to their red and white spots, originated in central Europe in the 1830s as a triple-purpose breed selected for draught, meat, and dairy production. There are over 50 million Simmental cattle in the world, known today for their fast growth and beef yields. This assembly (ARS_Simm1.0) is similar in length to the other bovine assemblies at 2.86 Gb, with a scaffold N50 of 102 Mb (max scaffold 156.8 Mb) and meets or exceeds the continuity of the best Bos taurus reference assemblies to date. 
  4. Abstract The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society¹,². However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals³,⁴. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome⁵. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity⁶. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
  5. Nathan Springer (Ed.)
    Methyl salicylate is an important inter- and intra-plant signaling molecule, but is deemed undesirable by humans when it accumulates to high levels in ripe fruits. Balancing the tradeoff between consumer satisfaction and overall plant health is challenging as the mechanisms regulating volatile levels have not yet been fully elucidated. In this study, we investigated the accumulation of methyl salicylate in ripe fruits of tomatoes that belong to the red-fruited clade. We determine the genetic diversity and the interaction of four known loci controlling methyl salicylate levels in ripe fruits. In addition to Non-Smoky Glucosyl Transferase 1 (NSGT1), we uncovered extensive genome structural variation (SV) at the Methylesterase (MES) locus. This locus contains four tandemly duplicated Methylesterase genes and genome sequence investigations at the locus identified nine distinct haplotypes. Based on gene expression and results from biparental crosses, functional and non-functional haplotypes for MES were identified. The combination of the non-functional MES haplotype 2 and the non-functional NSGT1 haplotype IV or V in a GWAS panel showed high methyl salicylate levels in ripe fruits, particularly in accessions from Ecuador, demonstrating a strong interaction between these two loci and suggesting an ecological advantage. The genetic variation at the other two known loci, Salicylic Acid Methyl Transferase 1 (SAMT1) and tomato UDP Glycosyl Transferase 5 (SlUGT5), did not explain volatile variation in the red-fruited tomato germplasm, suggesting a minor role in methyl salicylate production in red-fruited tomato. Lastly, we found that most heirloom and modern tomato accessions carried a functional MES and a non-functional NSGT1 haplotype, ensuring acceptable levels of methyl salicylate in fruits. Yet, future selection of the functional NSGT1 allele could potentially improve flavor in the modern germplasm. 