Title: A Reference Genome Sequence for Giant Sequoia
Abstract The giant sequoia (Sequoiadendron giganteum) of California are massive, long-lived trees that grow along the U.S. Sierra Nevada mountains. Genomic data are limited in giant sequoia, and producing a reference genome sequence has been an important goal to allow marker development for restoration and management. Using deep-coverage Illumina and Oxford Nanopore sequencing, combined with Dovetail chromosome conformation capture libraries, the genome was assembled into 11 chromosome-scale scaffolds containing 8.125 Gbp of sequence. Iso-Seq transcripts, assembled from three distinct tissues, were used as evidence to annotate a total of 41,632 protein-coding genes. The genome was found to contain over 900 complete or partial predicted NLR genes, distributed unevenly across all 11 chromosomes and in 63 orthogroups; 375 of these are supported by annotation derived from protein evidence and gene modeling. This giant sequoia reference genome sequence represents the first genome sequenced in the Cupressaceae family, and lays a foundation for using genomic tools to aid in giant sequoia conservation and management.
Award ID(s):
1744309
NSF-PAR ID:
10308625
Journal Name:
G3 Genes|Genomes|Genetics
Volume:
10
Issue:
11
ISSN:
2160-1836
Sponsoring Org:
National Science Foundation
More Like this
  1. INTRODUCTION

    To faithfully distribute genetic material to daughter cells during cell division, spindle fibers must couple to DNA by means of a structure called the kinetochore, which assembles at each chromosome’s centromere. Human centromeres are located within large arrays of tandemly repeated DNA sequences known as alpha satellite (αSat), which often span millions of base pairs on each chromosome. Arrays of αSat are frequently surrounded by other types of tandem satellite repeats, which have poorly understood functions, along with nonrepetitive sequences, including transcribed genes. Previous genome sequencing efforts have been unable to generate complete assemblies of satellite-rich regions because of their scale and repetitive nature, limiting the ability to study their organization, variation, and function.

    RATIONALE

    Pericentromeric and centromeric (peri/centromeric) satellite DNA sequences have remained almost entirely missing from the assembled human reference genome for the past 20 years. Using a complete, telomere-to-telomere (T2T) assembly of a human genome, we developed and deployed tailored computational approaches to reveal the organization and evolutionary patterns of these satellite arrays at both large and small length scales. We also performed experiments to map precisely which αSat repeats interact with kinetochore proteins. Last, we compared peri/centromeric regions among multiple individuals to understand how these sequences vary across diverse genetic backgrounds.

    RESULTS

    Satellite repeats constitute 6.2% of the T2T-CHM13 genome assembly, with αSat representing the single largest component (2.8% of the genome). By studying the sequence relationships of αSat repeats in detail across each centromere, we found genome-wide evidence that human centromeres evolve through “layered expansions.” Specifically, distinct repetitive variants arise within each centromeric region and expand through mechanisms that resemble successive tandem duplications, whereas older flanking sequences shrink and diverge over time. We also revealed that the most recently expanded repeats within each αSat array are more likely to interact with the inner kinetochore protein Centromere Protein A (CENP-A), which coincides with regions of reduced CpG methylation. This suggests a strong relationship between local satellite repeat expansion, kinetochore positioning, and DNA hypomethylation. Furthermore, we uncovered large and unexpected structural rearrangements that affect multiple satellite repeat types, including active centromeric αSat arrays. Last, by comparing sequence information from nearly 1600 individuals’ X chromosomes, we observed that individuals with recent African ancestry possess the greatest genetic diversity in the region surrounding the centromere, which sometimes contains a predominantly African αSat sequence variant.

    CONCLUSION

    The genetic and epigenetic properties of centromeres are closely interwoven through evolution. These findings raise important questions about the specific molecular mechanisms responsible for the relationship between inner kinetochore proteins, DNA hypomethylation, and layered αSat expansions. Even more questions remain about the function and evolution of non-αSat repeats. To begin answering these questions, we have produced a comprehensive encyclopedia of peri/centromeric sequences in a human genome, and we demonstrated how these regions can be studied with modern genomic tools. Our work also illuminates the rich genetic variation hidden within these formerly missing regions of the genome, which may contribute to health and disease. This unexplored variation underlines the need for more T2T human genome assemblies from genetically diverse individuals.

    Gapless assemblies illuminate centromere evolution. (Top) The organization of peri/centromeric satellite repeats. (Bottom left) A schematic portraying (i) evidence for centromere evolution through layered expansions and (ii) the localization of inner-kinetochore proteins in the youngest, most recently expanded repeats, which coincide with a region of DNA hypomethylation. (Bottom right) An illustration of the global distribution of chrX centromere haplotypes, showing increased diversity in populations with recent African ancestry.
  2. Abstract Background

    The Aldabra giant tortoise (Aldabrachelys gigantea) is one of only two giant tortoise species left in the world. The species is endemic to Aldabra Atoll in Seychelles and is listed as Vulnerable on the International Union for Conservation of Nature Red List (v2.3) due to its limited distribution and threats posed by climate change. Genomic resources for A. gigantea are lacking, hampering conservation efforts for both wild and ex situ populations. A high-quality genome would also open avenues to investigate the genetic basis of the species’ exceptionally long life span.

    Findings

    We produced the first chromosome-level de novo genome assembly of A. gigantea using PacBio High-Fidelity sequencing and high-throughput chromosome conformation capture. We produced a 2.37-Gbp assembly with a scaffold N50 of 148.6 Mbp and a resolution into 26 chromosomes. RNA sequencing–assisted gene model prediction identified 23,953 protein-coding genes and 1.1 Gbp of repetitive sequences. Synteny analyses among turtle genomes revealed high levels of chromosomal collinearity even among distantly related taxa. To assess the utility of the high-quality assembly for species conservation, we performed a low-coverage resequencing of 30 individuals from wild populations and two zoo individuals. Our genome-wide population structure analyses detected genetic population structure in the wild and identified the most likely origin of the zoo-housed individuals. We further identified putatively deleterious mutations to be monitored.

    Conclusions

    We establish a high-quality chromosome-level reference genome for A. gigantea and one of the most complete turtle genomes available. We show that low-coverage whole-genome resequencing, for which alignment to the reference genome is a necessity, is a powerful tool to assess the population structure of the wild population and reveal the geographic origins of ex situ individuals relevant for genetic diversity management and rewilding efforts.

     
  3. Abstract Background

    The release of the first reference genome of walnut (Juglans regia L.) enabled many achievements in the characterization of walnut genetic and functional variation. However, it is highly fragmented, preventing the integration of genetic, transcriptomic, and proteomic information to fully elucidate walnut biological processes.

    Findings

    Here, we report the new chromosome-scale assembly of the walnut reference genome (Chandler v2.0) obtained by combining Oxford Nanopore long-read sequencing with chromosome conformation capture (Hi-C) technology. Relative to the previous reference genome, the new assembly features an 84.4-fold increase in N50 size, with the 16 chromosomal pseudomolecules assembled and representing 95% of its total length. Using full-length transcripts from single-molecule real-time sequencing, we predicted 37,554 gene models, with a mean gene length higher than in the previous gene annotations. Most of the new protein-coding genes (90%) present both start and stop codons, which represents a significant improvement compared with Chandler v1.0 (only 48%). We then tested the potential impact of the new chromosome-level genome on different areas of walnut research. By studying the proteome changes occurring during male flower development, we observed that the virtual proteome obtained from Chandler v2.0 presents fewer artifacts than the previous reference genome, enabling the identification of a new potential pollen allergen in walnut. Also, the new chromosome-scale genome facilitates in-depth studies of intraspecies genetic diversity by revealing previously undetected autozygous regions in Chandler, likely resulting from inbreeding, and 195 genomic regions highly differentiated between Western and Eastern walnut cultivars.

    Conclusion

    Overall, Chandler v2.0 will serve as a valuable resource to better understand and explore walnut biology.
  4. Abstract Hares (genus Lepus) provide clear examples of repeated and often massive introgressive hybridization and striking local adaptations. Genomic studies on this group have so far relied on comparisons to the European rabbit (Oryctolagus cuniculus) reference genome. Here, we report the first de novo draft reference genome for a hare species, the mountain hare (Lepus timidus), and evaluate the efficacy of whole-genome re-sequencing analyses using the new reference versus using the rabbit reference genome. The genome was assembled using the ALLPATHS-LG protocol with a combination of overlapping pair and mate-pair Illumina sequencing (77x coverage). The assembly contained 32,294 scaffolds with a total length of 2.7 Gb and a scaffold N50 of 3.4 Mb. Re-scaffolding based on the rabbit reference reduced the total number of scaffolds to 4,205 with a scaffold N50 of 194 Mb. A correspondence was found between 22 of these hare scaffolds and the rabbit chromosomes, based on gene content and direct alignment. We annotated 24,578 protein coding genes by combining ab-initio predictions, homology search, and transcriptome data, of which 683 were solely derived from hare-specific transcriptome data. The hare reference genome is therefore a new resource to discover and investigate hare-specific variation. Similar estimates of heterozygosity and inferred demographic history profiles were obtained when mapping hare whole-genome re-sequencing data to the new hare draft genome or to alternative references based on the rabbit genome. Our results validate previous reference-based strategies and suggest that the chromosome-scale hare draft genome should enable chromosome-wide analyses and genome scans on hares. 
  5. The family Asfarviridae is a group of nucleo-cytoplasmic large DNA viruses (NCLDVs) of which African swine fever virus (ASFV) is well-characterized. Recently the discovery of several Asfarviridae members other than ASFV has suggested that this family represents a diverse and cosmopolitan group of viruses, but the genomics and distribution of this family have not been studied in detail. To this end we analyzed five complete genomes and 35 metagenome-assembled genomes (MAGs) of viruses from this family to shed light on their evolutionary relationships and environmental distribution. The Asfarvirus MAGs derive from diverse marine, freshwater, and terrestrial habitats, underscoring the broad environmental distribution of this family. We present phylogenetic analyses using conserved marker genes and whole-genome comparison of pairwise average amino acid identity (AAI) values, revealing a high level of genomic divergence across disparate Asfarviruses. Further, we found that Asfarviridae genomes encode genes with diverse predicted metabolic roles and detectable sequence homology to proteins in bacteria, archaea, and eukaryotes, highlighting the genomic chimerism that is a salient feature of NCLDVs. Our read mapping from Tara Oceans metagenomic data also revealed that three Asfarviridae MAGs were present in multiple marine samples, indicating that they are widespread in the ocean. In one of these MAGs we identified four marker genes with > 95% AAI to genes sequenced from a virus that infects the dinoflagellate Heterocapsa circularisquama (HcDNAV). This suggests a potential host for this MAG, which would thereby represent a reference genome of a dinoflagellate-infecting giant virus. Together, these results show that Asfarviridae are ubiquitous, show sequence divergence similar to that of other NCLDV families, and include several members that are widespread in the ocean and potentially infect ecologically important protists.