Long-read sequencing is revolutionizing
With populations of threatened and endangered species declining worldwide, efforts are being made to generate high quality genomic records of these species before they are lost forever. Here, we demonstrate that data from single Oxford Nanopore Technologies (ONT) MinION flow cells can, even in the absence of highly accurate short DNA-read polishing, produce high quality de novo plant genome assemblies adequate for downstream analyses, such as synteny and ploidy evaluations, paleodemographic analyses, and phylogenomics. This study focuses on three North American ash tree species in the genus
- Award ID(s):
- 2139311
- NSF-PAR ID:
- 10484561
- Publisher / Repository:
- Nature Publishing Group
- Date Published:
- Journal Name:
- Communications Biology
- Volume:
- 7
- Issue:
- 1
- ISSN:
- 2399-3642
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract de-novo genome assemblies, with continued advancements making it more readily available for previously understudied, non-model organisms. Stony corals are one such example, with long-readde-novo genome assemblies now starting to be publicly available, opening the door for a wide array of ‘omics-based research. Here we present a newde-novo genome assembly for the endangered Caribbean star coral,Orbicella faveolata , using PacBio circular consensus reads. Our genome assembly improved the contiguity (51 versus 1,933 contigs) and complete and single copy BUSCO orthologs (93.6% versus 85.3%, database metazoa_odb10), compared to the currently available reference genome generated using short-read methodologies. Our newde-novo assembled genome also showed comparable quality metrics to other coral long-read genomes. Telomeric repeat analysis identified putative chromosomes in our scaffolded assembly, with these repeats at either one, or both ends, of scaffolded contigs. We identified 32,172 protein coding genes in our assembly through use of long-read RNA sequencing (ISO-seq) of additionalO. faveolata fragments exposed to a range of abiotic and biotic treatments, and publicly available short-read RNA-seq data. With anthropogenic influences heavily affectingO. faveolata , as well as its increasing incorporation into reef restoration activities, this updated genome resource can be used for population genomics and other ‘omics analyses to aid in the conservation of this species. -
Abstract Summary Genomics has become an essential technology for surveilling emerging infectious disease outbreaks. A range of technologies and strategies for pathogen genome enrichment and sequencing are being used by laboratories worldwide, together with different and sometimes ad hoc, analytical procedures for generating genome sequences. A fully integrated analytical process for raw sequence to consensus genome determination, suited to outbreaks such as the ongoing COVID-19 pandemic, is critical to provide a solid genomic basis for epidemiological analyses and well-informed decision making. We have developed a web-based platform and integrated bioinformatic workflows that help to provide consistent high-quality analysis of SARS-CoV-2 sequencing data generated with either the Illumina or Oxford Nanopore Technologies (ONT). Using an intuitive web-based interface, this workflow automates data quality control, SARS-CoV-2 reference-based genome variant and consensus calling, lineage determination and provides the ability to submit the consensus sequence and necessary metadata to GenBank, GISAID and INSDC raw data repositories. We tested workflow usability using real world data and validated the accuracy of variant and lineage analysis using several test datasets, and further performed detailed comparisons with results from the COVID-19 Galaxy Project workflow. Our analyses indicate that EC-19 workflows generate high-quality SARS-CoV-2 genomes. Finally, we share a perspective on patterns and impact observed with Illumina versus ONT technologies on workflow congruence and differences.
Availability and implementation https://edge-covid19.edgebioinformatics.org, and https://github.com/LANL-Bioinformatics/EDGE/tree/SARS-CoV2.
Supplementary information Supplementary data are available at Bioinformatics online.
-
null (Ed.)Choosing the optimum assembly approach is essential to achieving a high-quality genome assembly suitable for comparative and evolutionary genomic investigations. Significant recent progress in long-read sequencing technologies such as PacBio and Oxford Nanopore Technologies (ONT) has also brought about a large variety of assemblers. Although these have been extensively tested on model species such as Homo sapiens and Drosophila melanogaster , such benchmarking has not been done in Mollusca, which lacks widely adopted model species. Molluscan genomes are notoriously rich in repeats and are often highly heterozygous, making their assembly challenging. Here, we benchmarked 10 assemblers based on ONT raw reads from two published molluscan genomes of differing properties, the gastropod Chrysomallon squamiferum (356.6 Mb, 1.59% heterozygosity) and the bivalve Mytilus coruscus (1593 Mb, 1.94% heterozygosity). By optimizing the assembly pipeline, we greatly improved both genomes from previously published versions. Our results suggested that 40–50X of ONT reads are sufficient for high-quality genomes, with Flye being the recommended assembler for compact and less heterozygous genomes exemplified by C. squamiferum , while NextDenovo excelled for more repetitive and heterozygous molluscan genomes exemplified by M. coruscus . A phylogenomic analysis using the two updated genomes with 32 other published high-quality lophotrochozoan genomes resulted in maximum support across all nodes, and we show that improved genome quality also leads to more complete matrices for phylogenomic inferences. Our benchmarking will ensure efficiency in future assemblies for molluscs and perhaps also for other marine phyla with few genomes available. This article is part of the Theo Murphy meeting issue ‘Molluscan genomics: broad insights and future directions for a neglected phylum’.more » « less
-
Abstract Although plastid genome (plastome) structure is highly conserved across most seed plants, investigations during the past two decades have revealed several disparately related lineages that experienced substantial rearrangements. Most plastomes contain a large inverted repeat and two single‐copy regions, and a few dispersed repeats; however, the plastomes of some taxa harbour long repeat sequences (>300 bp). These long repeats make it challenging to assemble complete plastomes using short‐read data, leading to misassemblies and consensus sequences with spurious rearrangements. Single‐molecule, long‐read sequencing has the potential to overcome these challenges, yet there is no consensus on the most effective method for accurately assembling plastomes using long‐read data. We generated a pipeline,
p last idG enomeA ssemblyU singL ong‐read data (ptGAUL), to address the problem of plastome assembly using long‐read data from Oxford Nanopore Technologies (ONT) or Pacific Biosciences platforms. We demonstrated the efficacy of the ptGAUL pipeline using 16 published long‐read data sets. We showed that ptGAUL quickly produces accurate and unbiased assemblies using only ~50× coverage of plastome data. Additionally, we deployed ptGAUL to assemble four newJuncus (Juncaceae) plastomes using ONT long reads. Our results revealed many long repeats and rearrangements inJuncus plastomes compared with basal lineages of Poales. The ptGAUL pipeline is available on GitHub:https://github.com/Bean061/ptgaul . -
Abstract Metagenomic Hi-C (metaHi-C) can identify contig-to-contig relationships with respect to their proximity within the same physical cell. Shotgun libraries in metaHi-C experiments can be constructed by next-generation sequencing (short-read metaHi-C) or more recent third-generation sequencing (long-read metaHi-C). However, all existing metaHi-C analysis methods are developed and benchmarked on short-read metaHi-C datasets and there exists much room for improvement in terms of more scalable and stable analyses, especially for long-read metaHi-C data. Here we report MetaCC, an efficient and integrative framework for analyzing both short-read and long-read metaHi-C datasets. MetaCC outperforms existing methods on normalization and binning. In particular, the MetaCC normalization module, named NormCC, is more than 3000 times faster than the current state-of-the-art method HiCzin on a complex wastewater dataset. When applied to one sheep gut long-read metaHi-C dataset, MetaCC binning module can retrieve 709 high-quality genomes with the largest species diversity using one single sample, including an expansion of five uncultured members from the order
Erysipelotrichales , and is the only binner that can recover the genome of one important speciesBacteroides vulgatus . Further plasmid analyses reveal that MetaCC binning is able to capture multi-copy plasmids.