Title: Swapping Metagenomics Preprocessing Pipeline Components Offers Speed and Sensitivity Increases
ABSTRACT

Increasing data volumes on high-throughput sequencing instruments such as the NovaSeq 6000 lead to long computational bottlenecks for common metagenomics data preprocessing tasks such as adaptor and primer trimming and host removal. Here, we test whether faster, recently developed computational tools (Fastp and Minimap2) can replace widely used choices (Atropos and Bowtie2), obtaining dramatic accelerations with additional sensitivity and minimal loss of specificity for these tasks. Furthermore, the taxonomic tables resulting from downstream processing provide biologically comparable results. However, we demonstrate that for taxonomic assignment, Bowtie2’s specificity is still required. We suggest that periodic reevaluation of pipeline components, together with improvements to standardized APIs for chaining them together, will greatly enhance the efficiency of common bioinformatics tasks while also facilitating incorporation of further optimized steps running on GPUs, FPGAs, or other architectures. We also note that a detailed exploration of available algorithms and pipeline components is an important step that should be taken before optimizing less efficient algorithms on advanced or nonstandard hardware.

IMPORTANCE

In shotgun metagenomics studies that seek to relate changes in microbial DNA across samples, processing the data on a computer often takes longer than obtaining the data from the sequencing instrument. Recently developed software packages that perform individual steps in the data-processing pipeline offer speed advantages in principle, but in practice they may contain pitfalls that prevent their use; for example, they may make approximations that introduce unacceptable errors into the data. Here, we show that different choices of these components can speed up overall data processing by 5-fold or more on the same hardware while maintaining a high degree of correctness, greatly reducing the time taken to interpret results. This is an important step for using the data in clinical settings, where the time taken to obtain results may be critical for guiding treatment.
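To make the component swap concrete, the sketch below chains the faster tools in the way the abstract describes: fastp for adaptor and quality trimming, then Minimap2 with samtools for host read removal. It is a minimal illustration using standard flags for each tool; the file names (sample_R1.fastq.gz, host.fa, and so on) and thread count are hypothetical, and this is not the authors' exact command set.

```python
# Minimal sketch of the faster preprocessing path: fastp in place of Atropos,
# Minimap2 (plus samtools) in place of Bowtie2 for host removal.
import subprocess

THREADS = "8"  # hypothetical; size to your hardware

# 1) Adaptor and quality trimming with fastp.
subprocess.run([
    "fastp",
    "-i", "sample_R1.fastq.gz", "-I", "sample_R2.fastq.gz",
    "-o", "trimmed_R1.fastq.gz", "-O", "trimmed_R2.fastq.gz",
    "--detect_adapter_for_pe",   # infer adapters from paired-end overlap
    "--thread", THREADS,
], check=True)

# 2) Host removal: map against the host genome with Minimap2's short-read
#    preset, then keep only pairs where both mates are unmapped
#    (SAM flag 12 = 0x4 | 0x8), excluding secondary alignments (0x100).
minimap = subprocess.Popen(
    ["minimap2", "-ax", "sr", "-t", THREADS, "host.fa",
     "trimmed_R1.fastq.gz", "trimmed_R2.fastq.gz"],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "fastq", "-f", "12", "-F", "256",
     "-1", "clean_R1.fastq.gz", "-2", "clean_R2.fastq.gz", "-"],
    stdin=minimap.stdout, check=True,
)
minimap.stdout.close()
minimap.wait()
```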
Award ID(s):
2120019
NSF-PAR ID:
10347860
Author(s) / Creator(s):
Editor(s):
Mackelprang, Rachel
Date Published:
Journal Name:
mSystems
Volume:
7
Issue:
2
ISSN:
2379-5077
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Environmental DNA (eDNA) metabarcoding is a promising method to monitor species and community diversity that is rapid, affordable and non‐invasive. The longstanding needs of the eDNA community are modular informatics tools, comprehensive and customizable reference databases, flexibility across high‐throughput sequencing platforms, fast multilocus metabarcode processing and accurate taxonomic assignment. Improvements in bioinformatics tools make addressing each of these demands within a single toolkit a reality.

    The new modular metabarcode sequence toolkit Anacapa (https://github.com/limey-bean/Anacapa/) addresses the above needs, allowing users to build comprehensive reference databases and assign taxonomy to raw multilocus metabarcode sequence data. A novel aspect of Anacapa is its database-building module, “Creating Reference libraries Using eXisting tools” (CRUX), which generates comprehensive reference databases for specific user‐defined metabarcoding loci. The Quality Control and ASV Parsing module sorts and processes multiple metabarcoding loci and processes merged, unmerged and unpaired reads, maximizing recovered diversity. DADA2 then detects amplicon sequence variants (ASVs), and the Anacapa Classifier module aligns these ASVs to CRUX-generated reference databases using Bowtie2. Lastly, taxonomy is assigned to ASVs with confidence scores using a Bayesian Lowest Common Ancestor (BLCA) method. The Anacapa Toolkit also includes an R package, ranacapa, for automated results exploration through standard biodiversity statistical analysis.

    Benchmarking tests verify that the Anacapa Toolkit effectively and efficiently generates comprehensive reference databases that capture taxonomic diversity, and can assign taxonomy to both MiSeq- and HiSeq-length sequence data. We demonstrate the value of the Anacapa Toolkit in assigning taxonomy to seawater eDNA samples collected in southern California.

    The Anacapa Toolkit improves the functionality of eDNA analysis and streamlines biodiversity assessment and management by generating metabarcode-specific databases, processing multilocus data, retaining a larger proportion of sequencing reads and expanding non‐traditional eDNA targets. All components of the Anacapa Toolkit are open and available in a virtual container to ease installation.
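    For intuition about the final classification step, the toy function below performs a plain lowest-common-ancestor (LCA) walk over the taxonomy paths of an ASV's reference hits: assignment descends rank by rank until the hits disagree. Anacapa's BLCA additionally applies Bayesian confidence weighting to the hits, which this sketch deliberately omits; the example taxa are hypothetical.

```python
# Plain LCA over the taxonomy paths of an ASV's reference hits. Anacapa's
# BLCA also weights hits with Bayesian confidence scores; this toy version
# illustrates only the consensus walk.
def lowest_common_ancestor(taxonomy_paths):
    """Each path is a list of ranks, e.g. [domain, phylum, ..., species]."""
    consensus = []
    for ranks in zip(*taxonomy_paths):   # walk rank by rank across all hits
        if len(set(ranks)) == 1:         # every hit agrees at this rank
            consensus.append(ranks[0])
        else:
            break                        # first disagreement: stop here
    return consensus

hits = [
    ["Eukaryota", "Chordata", "Actinopteri", "Perciformes", "Serranidae"],
    ["Eukaryota", "Chordata", "Actinopteri", "Perciformes", "Scombridae"],
]
print(lowest_common_ancestor(hits))
# ['Eukaryota', 'Chordata', 'Actinopteri', 'Perciformes']
```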

     
  2. Fraser, Claire M. (Ed.)
    ABSTRACT

    Metagenomics is a powerful method for interpreting the ecological roles and physiological capabilities of mixed microbial communities. Yet, many tools for processing metagenomic data are neither designed to consider eukaryotes nor built for ever-increasing amounts of sequence data. EukHeist is an automated pipeline to retrieve eukaryotic and prokaryotic metagenome-assembled genomes (MAGs) from large-scale metagenomic sequence data sets. We developed the EukHeist workflow specifically to process large amounts of metagenomic and/or metatranscriptomic sequence data in an automated and reproducible fashion. Here, we applied EukHeist to the large-size-fraction data (0.8–2,000 µm) from Tara Oceans to recover both eukaryotic and prokaryotic MAGs, which we refer to as TOPAZ (Tara Oceans Particle-Associated MAGs). The TOPAZ MAGs consisted of >900 environmentally relevant eukaryotic MAGs and >4,000 bacterial and archaeal MAGs. The bacterial and archaeal TOPAZ MAGs expand the phylogenetic diversity of likely particle- and host-associated taxa. We use these MAGs to demonstrate an approach for inferring the putative trophic mode of the recovered eukaryotic MAGs. We also identify ecological cohorts of co-occurring MAGs, which are driven by specific environmental factors and putative host-microbe associations. Together, these data add to the growing body of environmentally relevant eukaryotic genomic resources. Complementary and expanded databases of MAGs, such as those provided through scalable pipelines like EukHeist, stand to advance our understanding of eukaryotic diversity through increased coverage of genomic representatives across the tree of life.

    IMPORTANCE

    Single-celled eukaryotes play ecologically significant roles in the marine environment, yet fundamental questions about their biodiversity, ecological function, and interactions remain. Environmental sequencing enables researchers to document naturally occurring protistan communities without culturing bias, yet metagenomic and metatranscriptomic sequencing approaches cannot separate individual species from communities. To more completely capture the genomic content of mixed protistan populations, we can create bins of sequences that represent the same organism (metagenome-assembled genomes [MAGs]). We developed the EukHeist pipeline, which automates the binning of population-level eukaryotic and prokaryotic genomes from metagenomic reads. We show which protistan communities are present and what trophic roles they play in the ocean. Scalable computational tools like EukHeist may accelerate the identification of meaningful genetic signatures from large data sets and complement researchers’ efforts to leverage MAG databases for addressing ecological questions, resolving evolutionary relationships, and discovering potentially novel biodiversity.
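    As a toy illustration of the binning idea described above, the sketch below clusters contigs by tetranucleotide composition, one of the signals genome binners rely on. Production binners used within pipelines like EukHeist also exploit read coverage across samples, which this sketch omits; the contig sequences and bin count are hypothetical inputs.

```python
# Toy composition-based binning: contigs from the same genome tend to share
# k-mer composition, so clustering tetranucleotide-frequency vectors groups
# them into candidate bins. Real binners also use per-sample read coverage.
from itertools import product

import numpy as np
from sklearn.cluster import KMeans

KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def tetra_freqs(seq):
    """Normalized tetranucleotide-frequency vector for one contig."""
    counts = np.zeros(len(KMERS))
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])      # skips k-mers containing N, etc.
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total else counts

def bin_contigs(contigs, n_bins):
    """Cluster contig sequences into n_bins candidate genome bins."""
    features = np.array([tetra_freqs(c) for c in contigs])
    return KMeans(n_clusters=n_bins, n_init=10).fit_predict(features)
```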

     
  3. Abstract Background

    With the advent of metagenomics, the importance of microorganisms and how their interactions are relevant to ecosystem resilience, sustainability, and human health has become evident. Cataloging and preserving biodiversity is paramount not only for the Earth’s natural systems but also for discovering solutions to challenges that we face as a growing civilization. Metagenomics pertains to the in silico study of all microorganisms within an ecological community in situ; however, many software suites recover only prokaryotes and have limited to no support for viruses and eukaryotes.

    Results

    In this study, we introduce the Viral Eukaryotic Bacterial Archaeal (VEBA) open-source software suite developed to recover genomes from all domains. To our knowledge, VEBA is the first end-to-end metagenomics suite that can directly recover, quality assess, and classify prokaryotic, eukaryotic, and viral genomes from metagenomes. VEBA implements a novel iterative binning procedure and hybrid sample-specific/multi-sample framework that yields more genomes than any existing methodology alone. VEBA includes a consensus microeukaryotic database containing proteins from existing databases to optimize microeukaryotic gene modeling and taxonomic classification. VEBA also provides a unique clustering-based dereplication strategy allowing for sample-specific genomes and genes to be directly compared across non-overlapping biological samples. Finally, VEBA is the only pipeline that automates the detection of candidate phyla radiation bacteria and implements the appropriate genome quality assessments. VEBA’s capabilities are demonstrated by reanalyzing 3 existing public datasets, which recovered a total of 948 MAGs (458 prokaryotic, 8 eukaryotic, and 482 viral) including several uncharacterized organisms and organisms with no public genome representatives.
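    The abstract does not spell out VEBA's dereplication algorithm, so the code below is only a hedged sketch of what clustering-based dereplication generally involves: genomes are visited in order of a quality score, and each either joins the cluster of a sufficiently similar existing representative or founds a new cluster. The ani() function here is a crude k-mer Jaccard stand-in; real pipelines estimate average nucleotide identity with dedicated tools such as FastANI or skani.

```python
# Greedy, quality-ordered dereplication sketch. The similarity function is a
# toy k-mer Jaccard proxy, not a true ANI estimate.
def kmer_set(seq, k=15):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def ani(a, b, k=15):
    """Toy similarity proxy: Jaccard index of the two k-mer sets."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def dereplicate(genomes, quality, threshold=0.95):
    """genomes: {name: sequence}; quality: {name: score}.

    Returns {representative: [member names]}, visiting genomes from
    highest to lowest quality so representatives are the best genomes.
    """
    clusters = {}
    for name in sorted(genomes, key=quality.get, reverse=True):
        for rep in clusters:
            if ani(genomes[name], genomes[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]      # no match: found a new cluster
    return clusters
```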

    Conclusions

    The VEBA software suite allows for the in silico recovery of microorganisms from all domains of life by integrating cutting-edge algorithms in novel ways. VEBA fully integrates both end-to-end and task-specific metagenomic analysis in a modular architecture that minimizes dependencies and maximizes productivity. VEBA thus contributes not only seamless end-to-end metagenomics analysis to the community but also the flexibility for users to perform specific analytical tasks. VEBA allows for the automation of several metagenomics steps and shows that new information can be recovered from existing datasets.

     
  4.
    Abstract As computational biologists continue to be inundated by ever-increasing amounts of metagenomic data, developing data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to metagenomics. For instance, sketching algorithms such as MinHash have seen rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal-processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
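    As a concrete example of the sketching idea, the code below implements a bottom-s MinHash sketch, the construction behind tools such as Mash: keep only the s smallest hash values of each sequence's k-mer set, then estimate Jaccard similarity from the sketches alone. The parameter choices (k=21, s=1000) are illustrative defaults, not prescriptions.

```python
# Bottom-s MinHash: the s smallest k-mer hashes of a sequence form a sketch,
# and the fraction of shared hashes among the s smallest hashes of the union
# of two sketches estimates the Jaccard similarity of the full k-mer sets.
import hashlib

def bottom_sketch(seq, k=21, s=1000):
    """Return the s smallest 64-bit k-mer hashes of seq, sorted."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(),
            "big")
        for i in range(len(seq) - k + 1)
    }
    return sorted(hashes)[:s]

def estimate_jaccard(sketch_a, sketch_b, s=1000):
    """Estimate Jaccard similarity of two sequences from their sketches."""
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:s]
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union_bottom if h in shared) / len(union_bottom)
```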
  5. Abstract Summary

    Over the past decade, short-read sequence alignment has become a mature technology. Optimized algorithms, careful software engineering and high-speed hardware have contributed to greatly increased throughput and accuracy. With these improvements, many opportunities for performance optimization have emerged. In this review, we examine three general-purpose short-read alignment tools—BWA-MEM, Bowtie 2 and Arioc—with a focus on performance optimization. We analyze the performance-related behavior of the algorithms and heuristics each tool implements, with the goal of arriving at practical methods of improving processing speed and accuracy. We indicate where an aligner's default behavior may result in suboptimal performance, explore the effects of computational constraints such as end-to-end mapping and alignment scoring threshold, and discuss sources of imprecision in the computation of alignment scores and mapping quality. With this perspective, we describe an approach to tuning short-read aligner performance to meet specific data-analysis and throughput requirements while avoiding potential inaccuracies in subsequent analysis of alignment results. Finally, we illustrate how this approach avoids easily overlooked pitfalls and leads to verifiable improvements in alignment speed and accuracy.
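    As a hedged illustration of the tuning knobs the review discusses, the commands below vary Bowtie 2's mapping mode and minimum-score function and raise BWA-MEM's minimum output alignment score. The file and index names are hypothetical, and the specific values are examples rather than recommendations.

```python
# Examples of aligner tuning knobs: Bowtie 2 mapping mode and score-min
# function, and BWA-MEM's output score threshold (-T, default 30).
import subprocess

reads, index, ref = "reads.fq", "genome_index", "genome.fa"  # hypothetical

# Bowtie 2, end-to-end mode with a stricter minimum-score function than the
# default (L,-0.6,-0.6); marginal alignments are discarded earlier.
subprocess.run(["bowtie2", "-x", index, "-U", reads,
                "--end-to-end", "--score-min", "L,-0.3,-0.3",
                "-S", "end_to_end.sam"], check=True)

# Bowtie 2, local mode: read ends may be soft-clipped, which can rescue
# reads with adapter or low-quality tails at some cost in speed.
subprocess.run(["bowtie2", "-x", index, "-U", reads,
                "--local", "-S", "local.sam"], check=True)

# BWA-MEM: raise the minimum score for reported alignments, suppressing
# low-confidence hits that downstream analysis might misinterpret.
with open("bwa_mem.sam", "w") as out:
    subprocess.run(["bwa", "mem", "-T", "40", ref, reads],
                   stdout=out, check=True)
```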

    Contact

    richard.wilton@jhu.edu

    Supplementary information

    Appendices referenced in this article are available at Bioinformatics online.

     