

Title: Benchmarking Oxford Nanopore read assemblers for high-quality molluscan genomes
Choosing the optimum assembly approach is essential to achieving a high-quality genome assembly suitable for comparative and evolutionary genomic investigations. Significant recent progress in long-read sequencing technologies such as PacBio and Oxford Nanopore Technologies (ONT) has also brought about a large variety of assemblers. Although these have been extensively tested on model species such as Homo sapiens and Drosophila melanogaster, such benchmarking has not been done in Mollusca, which lacks widely adopted model species. Molluscan genomes are notoriously rich in repeats and are often highly heterozygous, making their assembly challenging. Here, we benchmarked 10 assemblers based on ONT raw reads from two published molluscan genomes of differing properties, the gastropod Chrysomallon squamiferum (356.6 Mb, 1.59% heterozygosity) and the bivalve Mytilus coruscus (1593 Mb, 1.94% heterozygosity). By optimizing the assembly pipeline, we greatly improved both genomes from previously published versions. Our results suggested that 40–50X of ONT reads are sufficient for high-quality genomes, with Flye being the recommended assembler for compact and less heterozygous genomes exemplified by C. squamiferum, while NextDenovo excelled for more repetitive and heterozygous molluscan genomes exemplified by M. coruscus. A phylogenomic analysis using the two updated genomes with 32 other published high-quality lophotrochozoan genomes resulted in maximum support across all nodes, and we show that improved genome quality also leads to more complete matrices for phylogenomic inferences. Our benchmarking will ensure efficiency in future assemblies for molluscs and perhaps also for other marine phyla with few genomes available. This article is part of the Theo Murphy meeting issue ‘Molluscan genomics: broad insights and future directions for a neglected phylum’.
Award ID(s):
1846174
NSF-PAR ID:
10290884
Journal Name:
Philosophical Transactions of the Royal Society B: Biological Sciences
Volume:
376
Issue:
1825
ISSN:
0962-8436
Page Range / eLocation ID:
20200160
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Udall, J (Ed.)
    Abstract Availability of readily transformable germplasm, as well as efficient pipelines for gene discovery, are notable bottlenecks in the application of genome editing in potato. To study and introduce traits such as resistance against biotic and abiotic factors, tuber quality traits and self-fertility, model germplasm that is amenable to gene editing and regeneration is needed. Cultivated potato is a heterozygous autotetraploid and its genetic redundancy and complexity make studying gene function challenging. Genome editing is simpler at the diploid level, with fewer allelic variants to consider. A readily transformable diploid potato would be further complemented by genomic resources that could aid in high throughput functional analysis. The heterozygous Solanum tuberosum Group Phureja clone 1S1 has a high regeneration rate, self-fertility, desirable tuber traits and is amenable to Agrobacterium-mediated transformation. We leveraged its amenability to Agrobacterium-mediated transformation to create a Cas9 constitutively expressing line for use in viral vector-based gene editing. To create a contiguous genome assembly, a homozygous doubled monoploid of 1S1 (DM1S1) was sequenced using 44 Gbp of long reads generated from Oxford Nanopore Technologies (ONT), yielding a 736 Mb assembly that encoded 31,145 protein-coding genes. The final assembly for DM1S1 represents a nearly complete genic space, shown by the presence of 99.6% of the genes in the Benchmarking Universal Single Copy Orthologs (BUSCO) set. Variant analysis with Illumina reads from 1S1 was used to deduce its alternate haplotype. These genetic and genomic resources provide a toolkit for applications of genome editing in both basic and applied research of potato.
  2. Genome assembly can be challenging for species that are characterized by high amounts of polymorphism, heterozygosity, and large effective population sizes. High levels of heterozygosity can result in genome mis-assemblies and a larger than expected genome size due to the haplotig versions of a single locus being assembled as separate loci. Here, we describe the first chromosome-level genome for the eastern oyster, Crassostrea virginica. Publicly released and annotated in 2017, the assembly has a scaffold N50 of 54 Mb and is over 97.3% complete based on BUSCO analysis. The genome assembly for the eastern oyster is a critical resource for foundational research into molluscan adaptation to a changing environment and for selective breeding for the aquaculture industry. Subsequent resequencing data suggested the presence of haplotigs in the original assembly, and we developed a post hoc method to break up chimeric contigs and mask haplotigs in published heterozygous genomes and evaluated improvements to the accuracy of downstream analysis. Masking haplotigs had a large impact on SNP discovery and estimates of nucleotide diversity and had more subtle and nuanced effects on estimates of heterozygosity, population structure analysis, and outlier detection. We show that haplotig masking can be a powerful tool for improving genomic inference, and we present an open, reproducible resource for the masking of haplotigs in any published genome.
  3. Mollusca is the second most species-rich phylum and includes animals as disparate as octopuses, clams, and chitons. Dozens of molluscan genomes are available, but only one representative of the subphylum Aculifera, the sister taxon to all other molluscs, has been sequenced to date, hindering comparative and evolutionary studies. To facilitate evolutionary studies across Mollusca, we sequenced the genome of a second aculiferan mollusc, the lepidopleurid chiton Hanleya hanleyi (Bean 1844), using a hybrid approach combining Oxford Nanopore and Illumina reads. After purging redundant haplotigs and removing contamination from this 1.3% heterozygous genome, we produced a 2.5 Gbp haploid assembly (>4X the size of the other chiton genome sequenced to date) with an N50 of 65.0 Kbp. Despite a fragmented assembly, the genome is rather complete (92.0% of BUSCOs detected; 79.4% complete plus 12.6% fragmented). Remarkably, the genome has the highest repeat content of any molluscan genome reported to date (>66%). Our gene annotation pipeline predicted 69,284 gene models (92.9% of BUSCOs detected; 81.8% complete plus 11.1% fragmented) of which 35,362 were supported by transcriptome and/or protein evidence. Phylogenomic analysis recovered Polyplacophora sister to all other sampled molluscs with maximal support. The Hanleya genome will be a valuable resource for studies of molluscan biology with diverse potential applications ranging from evolutionary and comparative genomics to molecular ecology. 
  4. Hoffmann, Federico (Ed.)
    Abstract The first insect genome assembly (Drosophila melanogaster) was published two decades ago. Today, nuclear genome assemblies are available for a staggering 601 insect species representing 20 orders. In this study, we analyzed the most-contiguous assembly for each species and provide a “state-of-the-field” perspective, emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technologies. Relative to species richness, genomic efforts have been biased toward four orders (Diptera, Hymenoptera, Collembola, and Phasmatodea), Coleoptera are underrepresented, and 11 orders still lack a publicly available genome assembly. The average insect genome assembly is 439.2 Mb in length with 87.5% of single-copy benchmarking genes intact. Most notable has been the impact of long-read sequencing; assemblies that incorporate long reads are ∼48× more contiguous than those that do not. We offer four recommendations as we collectively continue building insect genome resources: 1) seek better integration between independent research groups and consortia, 2) balance future sampling between filling taxonomic gaps and generating data for targeted questions, 3) take advantage of long-read sequencing technologies, and 4) expand and improve gene annotations. 
  5. Abstract

    The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers failed to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent overlaps in the string graph. We use a grouping method to assign reads to different haplotype groups. PECAT efficiently assembles diploid genomes using Nanopore R9, PacBio CLR or Nanopore R10 reads only. PECAT generates more contiguous haplotype-specific contigs compared to other assemblers. Especially, PECAT achieves nearly haplotype-resolved assembly on B. taurus (Bison × Simmental) using Nanopore R9 reads and phase block NG50 with 59.4/58.0 Mb for HG002 using Nanopore R10 reads.

     