skip to main content

Title: The genome of the American groundhog, Marmota monax
We sequenced the genome of the North American groundhog, Marmota monax , also known as the woodchuck. Our sequencing strategy included a combination of short, high-quality Illumina reads plus long reads generated by both Pacific Biosciences and Oxford Nanopore instruments. Assembly of the combined data produced a genome of 2.74 Gbp in total length, with an N50 contig size of 1,094,236 bp. To annotate the genome, we mapped the genes from another M. monax genome and from the closely related Alpine marmot, Marmota marmota , onto our assembly, resulting in 20,559 annotated protein-coding genes and 28,135 transcripts. The genome assembly and annotation are available in GenBank under BioProject PRJNA587092 .
; ; ; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
Sponsoring Org:
National Science Foundation
More Like this
  1. Bread wheat (Triticum aestivum) is a major food crop and an important plant system for agricultural genetics research. However, due to the complexity and size of its allohexaploid genome, genomic resources are limited compared to other major crops. The IWGSC recently published a reference genome and associated annotation (IWGSC CS v1.0, Chinese Spring) that has been widely adopted and utilized by the wheat community. Although this reference assembly represents all three wheat subgenomes at chromosome-scale, it was derived from short reads, and thus is missing a substantial portion of the expected 16 Gbp of genomic sequence. We earlier published an independent wheat assembly (Triticum_aestivum_3.1, Chinese Spring) that came much closer in length to the expected genome size, although it was only a contig-level assembly lacking gene annotations. Here, we describe a reference-guided effort to scaffold those contigs into chromosome-length pseudomolecules, add in any missing sequence that was unique to the IWGSC CS v1.0 assembly, and annotate the resulting pseudomolecules with genes. Our updated assembly, Triticum_aestivum_4.0, contains 15.07 Gbp of non-gap sequence anchored to chromosomes, which is 1.2 Gbps more than the previous reference assembly. It includes 108,639 genes unambiguously localized to chromosomes, including over 2,000 genes that were previously unplaced.more »We also discovered more than 5,700 additional gene copies, facilitating the accurate annotation of functional gene duplications including at the Ppd-B1 photoperiod response locus.« less
  2. Abstract Background Modern sequencing technologies should make the assembly of the relatively small mitochondrial genomes an easy undertaking. However, few tools exist that address mitochondrial assembly directly. Results As part of the Vertebrate Genomes Project (VGP) we develop mitoVGP, a fully automated pipeline for similarity-based identification of mitochondrial reads and de novo assembly of mitochondrial genomes that incorporates both long (> 10 kbp, PacBio or Nanopore) and short (100–300 bp, Illumina) reads. Our pipeline leads to successful complete mitogenome assemblies of 100 vertebrate species of the VGP. We observe that tissue type and library size selection have considerable impact on mitogenome sequencing and assembly. Comparing our assemblies to purportedly complete reference mitogenomes based on short-read sequencing, we identify errors, missing sequences, and incomplete genes in those references, particularly in repetitive regions. Our assemblies also identify novel gene region duplications. The presence of repeats and duplications in over half of the species herein assembled indicates that their occurrence is a principle of mitochondrial structure rather than an exception, shedding new light on mitochondrial genome evolution and organization. Conclusions Our results indicate that even in the “simple” case of vertebrate mitogenomes the completeness of many currently available reference sequences can be further improved, and cautionmore »should be exercised before claiming the complete assembly of a mitogenome, particularly from short reads alone.« less
  3. Abstract The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society 1,2 . However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals 3,4 . Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome 5 . To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity 6 . Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genesmore »have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.« less
  4. Abstract Taro (Colocasia esculenta) is a food staple widely cultivated in the humid tropics of Asia, Africa, Pacific and the Caribbean. One of the greatest threats to taro production is Taro Leaf Blight caused by the oomycete pathogen Phytophthora colocasiae. Here we describe a de novo taro genome assembly and use it to analyze sequence data from a Taro Leaf Blight resistant mapping population. The genome was assembled from linked-read sequences (10x Genomics; ∼60x coverage) and gap-filled and scaffolded with contigs assembled from Oxford Nanopore Technology long-reads and linkage map results. The haploid assembly was 2.45 Gb total, with a maximum contig length of 38 Mb and scaffold N50 of 317,420 bp. A comparison of family-level (Araceae) genome features reveals the repeat content of taro to be 82%, >3.5x greater than in great duckweed (Spirodela polyrhiza), 23%. Both genomes recovered a similar percent of Benchmarking Universal Single-copy Orthologs, 80% and 84%, based on a 3,236 gene database for monocot plants. A greater number of nucleotide-binding leucine-rich repeat disease resistance genes were present in genomes of taro than the duckweed, ∼391 vs. ∼70 (∼182 and ∼46 complete). The mapping population data revealed 16 major linkage groups with 520 markers, and 10more »quantitative trait loci (QTL) significantly associated with Taro Leaf Blight disease resistance. The genome sequence of taro enhances our understanding of resistance to TLB, and provides markers that may accelerate breeding programs. This genome project may provide a template for developing genomic resources in other understudied plant species.« less
  5. Abstract Background The pond snail, Lymnaea stagnalis ( L. stagnalis ), has served as a valuable model organism for neurobiology studies due to its simple and easily accessible central nervous system (CNS). L. stagnalis has been widely used to study neuronal networks and recently gained popularity for study of aging and neurodegenerative diseases. However, previous transcriptome studies of L. stagnalis CNS have been exclusively carried out on adult L. stagnalis only. As part of our ongoing effort studying L. stagnalis neuronal growth and connectivity at various developmental stages, we provide the first age-specific transcriptome analysis and gene annotation of young (3 months), adult (6 months), and old (18 months) L. stagnalis CNS. Results Using the above three age cohorts, our study generated 55–69 millions of 150 bp paired-end RNA sequencing reads using the Illumina NovaSeq 6000 platform. Of these reads, ~ 74% were successfully mapped to the reference genome of L. stagnalis . Our reference-based transcriptome assembly predicted 42,478 gene loci, of which 37,661 genes encode coding sequences (CDS) of at least 100 codons. In addition, we provide gene annotations using Blast2GO and functional annotations using Pfam for ~ 95% of these sequences, contributing to the largest number of annotated genes in L. stagnalis CNS somore »far. Moreover, among 242 previously cloned L. stagnalis genes, we were able to match ~ 87% of them in our transcriptome assembly, indicating a high percentage of gene coverage. The expressional differences for innexins, FMRFamide, and molluscan insulin peptide genes were validated by real-time qPCR. Lastly, our transcriptomic analyses revealed distinct, age-specific gene clusters, differentially expressed genes, and enriched pathways in young, adult, and old CNS. More specifically, our data show significant changes in expression of critical genes involved in transcription factors, metabolisms (e.g. cytochrome P450), extracellular matrix constituent, and signaling receptor and transduction (e.g. receptors for acetylcholine, N-Methyl-D-aspartic acid, and serotonin), as well as stress- and disease-related genes in young compared to either adult or old snails. Conclusions Together, these datasets are the largest and most updated L. stagnalis CNS transcriptomes, which will serve as a resource for future molecular studies and functional annotation of transcripts and genes in L. stagnalis .« less