Title: Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software

Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD‐HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD‐HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.
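To make the idea of "true genome fragments" concrete, here is a minimal Python sketch of an in silico double digest and a simple recovery score. It is an illustration only, not the authors' simulation pipeline: the abstract does not name the enzymes or the scoring rule, so the enzyme pair (EcoRI, MseI), the function names `cut_positions`, `double_digest` and `recovered_fraction`, and the exact-match criterion are all assumptions made for this example.

```python
import re

# Assumed enzyme pair for illustration: (recognition site, cut offset within the site).
ECORI = ("GAATTC", 1)  # G^AATTC
MSEI = ("TTAA", 1)     # T^TAA


def cut_positions(seq, site, offset):
    """Return the cut coordinate for every occurrence of `site` (overlaps allowed)."""
    pattern = f"(?={re.escape(site)})"
    return [m.start() + offset for m in re.finditer(pattern, seq)]


def double_digest(seq, enzyme_a, enzyme_b, min_len=1, max_len=10**9):
    """Return fragments bounded by one cut from each enzyme, within a size window."""
    cuts = sorted(
        [(p, "A") for p in cut_positions(seq, *enzyme_a)]
        + [(p, "B") for p in cut_positions(seq, *enzyme_b)]
    )
    fragments = []
    # Adjacent cuts define a fragment; keep it only if the two ends differ in enzyme.
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        if left != right and min_len <= end - start <= max_len:
            fragments.append(seq[start:end])
    return fragments


def recovered_fraction(true_fragments, contigs):
    """Proportion of true fragments that appear exactly among the assembled contigs."""
    if not true_fragments:
        return 0.0
    contig_set = set(contigs)
    return sum(f in contig_set for f in true_fragments) / len(true_fragments)


# Toy "genome" with two EcoRI sites flanking one MseI site.
genome = "AAA" + "GAATTC" + "CCCC" + "TTAA" + "CCCC" + "GAATTC" + "TT"
truth = double_digest(genome, ECORI, MSEI)
score = recovered_fraction(truth, ["AATTCCCCCT", "XXXX"])
```

Both assumed recognition sites are palindromic, so scanning one strand finds every site. A real comparison would likely score contigs by alignment rather than exact string identity, and would need to handle reverse complements and the mutations introduced in the simulations.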

Journal Name:
Molecular Ecology Resources
Page Range / eLocation ID:
p. 360-370
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The emergence of third‐generation sequencing (3GS; long‐reads) is bringing closer the goal of chromosome‐size fragments in de novo genome assemblies. This allows the exploration of new and broader questions on genome evolution for a number of nonmodel organisms. However, long‐read technologies result in higher sequencing error rates and therefore impose an elevated cost of sufficient coverage to achieve high enough quality. In this context, hybrid assemblies, combining short‐reads and long‐reads, provide an alternative efficient and cost‐effective approach to generate de novo, chromosome‐level genome assemblies. The array of available software programs for hybrid genome assembly, sequence correction and manipulation is constantly being expanded and improved. This makes it difficult for nonexperts to find efficient, fast and tractable computational solutions for genome assembly, especially in the case of nonmodel organisms lacking a reference genome or one from a closely related species. In this study, we review and test the most recent pipelines for hybrid assemblies, comparing the model organism Drosophila melanogaster to a nonmodel cactophilic Drosophila, D. mojavensis. We show that it is possible to achieve excellent contiguity on this nonmodel organism using the dbg2olc pipeline.

  2. Abstract

    Transcriptome quality control is an important step in RNA‐Seq experiments. However, the quality of de novo assembled transcriptomes is difficult to assess, due to the lack of a reference genome to compare the assembly to. We developed a method to assess and improve the quality of de novo assembled transcriptomes by focusing on the removal of chimeric sequences. These chimeric sequences can result from misassembled contigs that merge two transcripts into one. The developed method is incorporated into a pipeline, which we named Bellerophon, that is broadly applicable and easy to use. Bellerophon first uses the quality assessment tool TransRate to indicate the quality, after which it uses a transcripts per million (TPM) filter to remove lowly expressed contigs and CD‐HIT‐EST to remove highly identical contigs. To validate the quality of this method, we performed three benchmark experiments: (1) a computational creation of chimeras, (2) identification of chimeric contigs in a transcriptome assembly, (3) a simulated RNA‐Seq experiment using a known reference transcriptome. Overall, the Bellerophon pipeline was able to remove between 40% and 91.9% of the chimeras in transcriptome assemblies and removed more chimeric than nonchimeric contigs. Thus, the Bellerophon sequence of filtration steps is a broadly applicable solution to improve transcriptome assemblies.

  3. Abstract

    Long-read sequencing is revolutionizing de-novo genome assemblies, with continued advancements making it more readily available for previously understudied, non-model organisms. Stony corals are one such example, with long-read de-novo genome assemblies now starting to be publicly available, opening the door for a wide array of ‘omics-based research. Here we present a new de-novo genome assembly for the endangered Caribbean star coral, Orbicella faveolata, using PacBio circular consensus reads. Our genome assembly improved the contiguity (51 versus 1,933 contigs) and complete and single copy BUSCO orthologs (93.6% versus 85.3%, database metazoa_odb10), compared to the currently available reference genome generated using short-read methodologies. Our new de-novo assembled genome also showed comparable quality metrics to other coral long-read genomes. Telomeric repeat analysis identified putative chromosomes in our scaffolded assembly, with these repeats at either one, or both ends, of scaffolded contigs. We identified 32,172 protein coding genes in our assembly through use of long-read RNA sequencing (ISO-seq) of additional O. faveolata fragments exposed to a range of abiotic and biotic treatments, and publicly available short-read RNA-seq data. With anthropogenic influences heavily affecting O. faveolata, as well as its increasing incorporation into reef restoration activities, this updated genome resource can be used for population genomics and other ‘omics analyses to aid in the conservation of this species.

  4. Background

    Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enabling de novo assembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes.


    Here we evaluate de novo assembly methods of 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10 kb) and error rates to evaluate which are the best suited for PCR-amplified metagenomes.


    Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets versus unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCR cycles performed, and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole genome amplification (single-cell SPAdes) frequently improved assembly of contigs ≥10 kb by 10 to 100-fold for low input metagenomes.


    PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improved de novo genome assembly from PCR-amplified datasets, and enable better genome recovery from low input metagenomes.

  5. Abstract Background

    Systems-level analyses, such as differential gene expression analysis, co-expression analysis, and metabolic pathway reconstruction, depend on the accuracy of the transcriptome. Multiple tools exist to perform transcriptome assembly from RNAseq data. However, assembling high quality transcriptomes is still not a trivial problem. This is especially the case for non-model organisms where adequate reference genomes are often not available. Different methods produce different transcriptome models and there is no easy way to determine which are more accurate. Furthermore, having alternative-splicing events exacerbates such difficult assembly problems. While benchmarking transcriptome assemblies is critical, this is also not trivial due to the general lack of true reference transcriptomes.


    In this study, we first provide a pipeline to generate a simulated benchmark transcriptome and corresponding RNAseq data. Using the simulated benchmarking datasets, we compared the performance of various transcriptome assembly approaches including both de novo and genome-guided methods. The results showed that the assembly performance deteriorates significantly when alternative transcripts (isoforms) exist or, for genome-guided methods, when the reference is not available from the same genome. To improve the transcriptome assembly performance, leveraging the overlapping predictions between different assemblies, we present a new consensus-based ensemble transcriptome assembly approach, ConSemble.


    Without using a reference genome, ConSemble using four de novo assemblers achieved an accuracy up to twice as high as any de novo assemblers we compared. When a reference genome is available, ConSemble using four genome-guided assemblies removed many incorrectly assembled contigs with minimal impact on correctly assembled contigs, achieving higher precision and accuracy than individual genome-guided methods. Furthermore, ConSemble using de novo assemblers matched or exceeded the best performing genome-guided assemblers even when the transcriptomes included isoforms. We thus demonstrated that the ConSemble consensus strategy both for de novo and genome-guided assemblers can improve transcriptome assembly. The RNAseq simulation pipeline, the benchmark transcriptome datasets, and the script to perform the ConSemble assembly are all freely available from:
