The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, “telomere-to-telomere” genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT “Duplex” sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used “Pore-C” chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
Benchmarking Oxford Nanopore read assemblers for high-quality molluscan genomes
Choosing the optimum assembly approach is essential to achieving a high-quality genome assembly suitable for comparative and evolutionary genomic investigations. Significant recent progress in long-read sequencing technologies such as PacBio and Oxford Nanopore Technologies (ONT) has also brought about a large variety of assemblers. Although these have been extensively tested on model species such as Homo sapiens and Drosophila melanogaster, such benchmarking has not been done in Mollusca, which lacks widely adopted model species. Molluscan genomes are notoriously rich in repeats and are often highly heterozygous, making their assembly challenging. Here, we benchmarked 10 assemblers based on ONT raw reads from two published molluscan genomes of differing properties, the gastropod Chrysomallon squamiferum (356.6 Mb, 1.59% heterozygosity) and the bivalve Mytilus coruscus (1593 Mb, 1.94% heterozygosity). By optimizing the assembly pipeline, we greatly improved both genomes from previously published versions. Our results suggested that 40–50X of ONT reads are sufficient for high-quality genomes, with Flye being the recommended assembler for compact and less heterozygous genomes exemplified by C. squamiferum, while NextDenovo excelled for more repetitive and heterozygous molluscan genomes exemplified by M. coruscus. A phylogenomic analysis using the two updated genomes with 32 other published high-quality lophotrochozoan genomes resulted in maximum support across all nodes, and we show that improved genome quality also leads to more complete matrices for phylogenomic inferences. Our benchmarking will ensure efficiency in future assemblies for molluscs and perhaps also for other marine phyla with few genomes available. This article is part of the Theo Murphy meeting issue ‘Molluscan genomics: broad insights and future directions for a neglected phylum’.
- Award ID(s): 1846174
- PAR ID: 10290884
- Date Published:
- Journal Name: Philosophical Transactions of the Royal Society B: Biological Sciences
- Volume: 376
- Issue: 1825
- ISSN: 0962-8436
- Page Range / eLocation ID: 20200160
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Genome assembly can be challenging for species that are characterized by high amounts of polymorphism, heterozygosity, and large effective population sizes. High levels of heterozygosity can result in genome mis-assemblies and a larger than expected genome size due to the haplotig versions of a single locus being assembled as separate loci. Here, we describe the first chromosome-level genome for the eastern oyster, Crassostrea virginica. Publicly released and annotated in 2017, the assembly has a scaffold N50 of 54 Mb and is over 97.3% complete based on BUSCO analysis. The genome assembly for the eastern oyster is a critical resource for foundational research into molluscan adaptation to a changing environment and for selective breeding for the aquaculture industry. Subsequent resequencing data suggested the presence of haplotigs in the original assembly, and we developed a post hoc method to break up chimeric contigs and mask haplotigs in published heterozygous genomes and evaluated improvements to the accuracy of downstream analysis. Masking haplotigs had a large impact on SNP discovery and estimates of nucleotide diversity and had more subtle and nuanced effects on estimates of heterozygosity, population structure analysis, and outlier detection. We show that haplotig masking can be a powerful tool for improving genomic inference, and we present an open, reproducible resource for the masking of haplotigs in any published genome.
-
Udall, J (Ed.) Abstract: Availability of readily transformable germplasm, as well as efficient pipelines for gene discovery, are notable bottlenecks in the application of genome editing in potato. To study and introduce traits such as resistance against biotic and abiotic factors, tuber quality traits and self-fertility, model germplasm that is amenable to gene editing and regeneration is needed. Cultivated potato is a heterozygous autotetraploid, and its genetic redundancy and complexity make studying gene function challenging. Genome editing is simpler at the diploid level, with fewer allelic variants to consider. A readily transformable diploid potato would be further complemented by genomic resources that could aid in high throughput functional analysis. The heterozygous Solanum tuberosum Group Phureja clone 1S1 has a high regeneration rate, self-fertility, desirable tuber traits and is amenable to Agrobacterium-mediated transformation. We leveraged its amenability to Agrobacterium-mediated transformation to create a Cas9 constitutively expressing line for use in viral vector-based gene editing. To create a contiguous genome assembly, a homozygous doubled monoploid of 1S1 (DM1S1) was sequenced using 44 Gbp of long reads generated from Oxford Nanopore Technologies (ONT), yielding a 736 Mb assembly that encoded 31,145 protein-coding genes. The final assembly for DM1S1 represents a nearly complete genic space, shown by the presence of 99.6% of the genes in the Benchmarking Universal Single Copy Orthologs (BUSCO) set. Variant analysis with Illumina reads from 1S1 was used to deduce its alternate haplotype. These genetic and genomic resources provide a toolkit for applications of genome editing in both basic and applied research of potato.
-
Hoffmann, Federico (Ed.) Abstract: The first insect genome assembly (Drosophila melanogaster) was published two decades ago. Today, nuclear genome assemblies are available for a staggering 601 insect species representing 20 orders. In this study, we analyzed the most-contiguous assembly for each species and provide a “state-of-the-field” perspective, emphasizing taxonomic representation, assembly quality, gene completeness, and sequencing technologies. Relative to species richness, genomic efforts have been biased toward four orders (Diptera, Hymenoptera, Collembola, and Phasmatodea), Coleoptera are underrepresented, and 11 orders still lack a publicly available genome assembly. The average insect genome assembly is 439.2 Mb in length with 87.5% of single-copy benchmarking genes intact. Most notable has been the impact of long-read sequencing; assemblies that incorporate long reads are ∼48× more contiguous than those that do not. We offer four recommendations as we collectively continue building insect genome resources: 1) seek better integration between independent research groups and consortia, 2) balance future sampling between filling taxonomic gaps and generating data for targeted questions, 3) take advantage of long-read sequencing technologies, and 4) expand and improve gene annotations.
-
Abstract: The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers failed to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent overlaps in the string graph. We use a grouping method to assign reads to different haplotype groups. PECAT efficiently assembles diploid genomes using Nanopore R9, PacBio CLR or Nanopore R10 reads only. PECAT generates more contiguous haplotype-specific contigs compared to other assemblers. In particular, PECAT achieves a nearly haplotype-resolved assembly of B. taurus (Bison × Simmental) using Nanopore R9 reads and a phase block NG50 of 59.4/58.0 Mb for HG002 using Nanopore R10 reads.