skip to main content

Title: Evaluating Illumina-, Nanopore-, and PacBio-based genome assembly strategies with the bald notothen, Trematomus borchgrevinki
Abstract For any genome-based research, a robust genome assembly is required. De novo assembly strategies have evolved with changes in DNA sequencing technologies and have been through at least three phases: i) short-read only, ii) short- and long-read hybrid, and iii) long-read only assemblies. Each of the phases has their own error model. We hypothesized that hidden scaffolding errors in short-read assembly and erroneous long-read contigs degrades the quality of short- and long-read hybrid assemblies. We assembled the genome of T. borchgrevinki from data generated during each of the three phases and assessed the quality problems we encountered. We developed strategies such as k-mer-assembled region replacement, parameter optimization, and long-read sampling to address the error models. We demonstrated that a k-mer based strategy improved short-read assemblies as measured by BUSCO while mate-pair libraries introduced hidden scaffolding errors and perturbed BUSCO scores. Further, we found that although hybrid assemblies can generate higher contiguity they tend to suffer from lower quality. In addition, we found long-read only assemblies can be optimized for contiguity by sub-sampling length-restricted raw reads. Our results indicate that long-read contig assembly is the current best choice and that assemblies from phase I and phase II were of lower quality.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
G3 Genes|Genomes|Genetics
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Shao, Mingfu (Ed.)
    Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at . 
    more » « less
  2. null (Ed.)
    Abstract Genome sequences provide genomic maps with a single-base resolution for exploring genetic contents. Sequencing technologies, particularly long reads, have revolutionized genome assemblies for producing highly continuous genome sequences. However, current long-read sequencing technologies generate inaccurate reads that contain many errors. Some errors are retained in assembled sequences, which are typically not completely corrected by using either long reads or more accurate short reads. The issue commonly exists, but few tools are dedicated for computing error rates or determining error locations. In this study, we developed a novel approach, referred to as k-mer abundance difference (KAD), to compare the inferred copy number of each k-mer indicated by short reads and the observed copy number in the assembly. Simple KAD metrics enable to classify k-mers into categories that reflect the quality of the assembly. Specifically, the KAD method can be used to identify base errors and estimate the overall error rate. In addition, sequence insertion and deletion as well as sequence redundancy can also be detected. Collectively, KAD is valuable for quality evaluation of genome assemblies and, potentially, provides a diagnostic tool to aid in precise error correction. KAD software has been developed to facilitate public uses. 
    more » « less
  3. null (Ed.)
    Background Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing viral roles across diverse ecosystems, its analyses differ in many ways from those used for microbes. To date, viromics benchmarking has covered read pre-processing, assembly, relative abundance, read mapping thresholds and diversity estimation, but other steps would benefit from benchmarking and standardization. Here we use in silico-generated datasets and an extensive literature survey to evaluate and highlight how dataset composition (i.e., viromes vs bulk metagenomes) and assembly fragmentation impact (i) viral contig identification tool, (ii) virus taxonomic classification, and (iii) identification and curation of auxiliary metabolic genes (AMGs). Results The in silico benchmarking of five commonly used virus identification tools show that gene-content-based tools consistently performed well for long (≥3 kbp) contigs, while k -mer- and blast-based tools were uniquely able to detect viruses from short (≤3 kbp) contigs. Notably, however, the performance increase of k -mer- and blast-based tools for short contigs was obtained at the cost of increased false positives (sometimes up to ∼5% for virome and ∼75% bulk samples), particularly when eukaryotic or mobile genetic element sequences were included in the test datasets. For viral classification, variously sized genome fragments were assessed using gene-sharing network analytics to quantify drop-offs in taxonomic assignments, which revealed correct assignations ranging from ∼95% (whole genomes) down to ∼80% (3 kbp sized genome fragments). A similar trend was also observed for other viral classification tools such as VPF-class, ViPTree and VIRIDIC, suggesting that caution is warranted when classifying short genome fragments and not full genomes. Finally, we highlight how fragmented assemblies can lead to erroneous identification of AMGs and outline a best-practices workflow to curate candidate AMGs in viral genomes assembled from metagenomes. Conclusion Together, these benchmarking experiments and annotation guidelines should aid researchers seeking to best detect, classify, and characterize the myriad viruses ‘hidden’ in diverse sequence datasets. 
    more » « less
  4. Abstract

    Genomic resources across squamate reptiles (lizards and snakes) have lagged behind other vertebrate systems and high-quality reference genomes remain scarce. Of the 23 chromosome-scale reference genomes across the order, only 12 of the ~60 squamate families are represented. Within geckos (infraorder Gekkota), a species-rich clade of lizards, chromosome-level genomes are exceptionally sparse representing only two of the seven extant families. Using the latest advances in genome sequencing and assembly methods, we generated one of the highest-quality squamate genomes to date for the leopard gecko, Eublepharis macularius (Eublepharidae). We compared this assembly to the previous, short-read only, E. macularius reference genome published in 2016 and examined potential factors within the assembly influencing contiguity of genome assemblies using PacBio HiFi data. Briefly, the read N50 of the PacBio HiFi reads generated for this study was equal to the contig N50 of the previous E. macularius reference genome at 20.4 kilobases. The HiFi reads were assembled into a total of 132 contigs, which was further scaffolded using HiC data into 75 total sequences representing all 19 chromosomes. We identified 9 of the 19 chromosomal scaffolds were assembled as a near-single contig, whereas the other 10 chromosomes were each scaffolded together from multiple contigs. We qualitatively identified that the percent repeat content within a chromosome broadly affects its assembly contiguity prior to scaffolding. This genome assembly signifies a new age for squamate genomics where high-quality reference genomes rivaling some of the best vertebrate genome assemblies can be generated for a fraction of previous cost estimates. This new E. macularius reference assembly is available on NCBI at JAOPLA010000000.

    more » « less
  5. Abstract

    Long-read sequencing is revolutionizingde-novogenome assemblies, with continued advancements making it more readily available for previously understudied, non-model organisms. Stony corals are one such example, with long-readde-novogenome assemblies now starting to be publicly available, opening the door for a wide array of ‘omics-based research. Here we present a newde-novogenome assembly for the endangered Caribbean star coral,Orbicella faveolata, using PacBio circular consensus reads. Our genome assembly improved the contiguity (51 versus 1,933 contigs) and complete and single copy BUSCO orthologs (93.6% versus 85.3%, database metazoa_odb10), compared to the currently available reference genome generated using short-read methodologies. Our newde-novoassembled genome also showed comparable quality metrics to other coral long-read genomes. Telomeric repeat analysis identified putative chromosomes in our scaffolded assembly, with these repeats at either one, or both ends, of scaffolded contigs. We identified 32,172 protein coding genes in our assembly through use of long-read RNA sequencing (ISO-seq) of additionalO. faveolatafragments exposed to a range of abiotic and biotic treatments, and publicly available short-read RNA-seq data. With anthropogenic influences heavily affectingO. faveolata, as well as itsincreasing incorporation into reef restoration activities, this updated genome resource can be used for population genomics and other ‘omics analyses to aid in the conservation of this species.

    more » « less