skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Vargas: heuristic-free alignment for assessing linear and graph read aligners
Abstract Motivation Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Results Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Availability and implementation Source code implemented in C ++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1350041 1920103
PAR ID:
10157427
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Kelso, Janet (Ed.)
    Abstract Motivation Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists. Results In this article, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability and implementation All the data used for preparing the results of this paper can be found with 10.5281/zenodo.4902332. PuffAligner is a free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  2. Aligning to a linear reference genome can result in a higher percentage of reads going unmapped or being incorrectly mapped owing to variations not captured by the reference, otherwise known as reference bias. Recently, in efforts to mitigate reference bias, there has been a movement to switch to using pangenomes, a collection of genomes, as the reference. In this paper, we introduce Moni-align, the first short-read pangenome aligner built on the r-index, a variation of the classical FM-index that can index collections of genomes in O(r)-space, whereris the number of runs in the Burrows–Wheeler transform. Moni-align uses a seed-and-extend strategy for aligning reads, utilizing maximal exact matches as seeds, which can be efficiently obtained with ther-index. Using both simulated and real short-read data sets, we demonstrate that Moni-align achieves alignment accuracy comparable to vg map and vg giraffe, the leading pangenome aligners. Although currently best suited for aligning to localized pangenomes owing to computational constraints, Moni-align offers a robust foundation for future optimizations that could further broaden its applicability. 
    more » « less
  3. Boeva, Valentina (Ed.)
    Abstract Summary Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation. Availability and implementation UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  4. Martelli, Pier Luigi (Ed.)
    Abstract MotivationPairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA’s O(s2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. ResultsIn this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA’s time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. Availability and implementationAll code is publicly available at https://github.com/smarco/BiWFA-paper. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract Motivation Oxford Nanopore Technologies sequencing devices support adaptive sequencing, in which undesired reads can be ejected from a pore in real time. This feature allows targeted sequencing aided by computational methods for mapping partial reads, rather than complex library preparation protocols. However, existing mapping methods either require a computationally expensive base-calling procedure before using aligners to map partial reads or work well only on small genomes. Results In this work, we present a new streaming method that can map nanopore raw signals for real-time selective sequencing. Rather than converting read signals to bases, we propose to convert reference genomes to signals and fully operate in the signal space. Our method features a new way to index reference genomes using k-d trees, a novel seed selection strategy and a seed chaining algorithm tailored toward the current signal characteristics. We implemented the method as a tool Sigmap. Then we evaluated it on both simulated and real data and compared it to the state-of-the-art nanopore raw signal mapper Uncalled. Our results show that Sigmap yields comparable performance on mapping yeast simulated raw signals, and better mapping accuracy on mapping yeast real raw signals with a 4.4× speedup. Moreover, our method performed well on mapping raw signals to genomes of size >100 Mbp and correctly mapped 11.49% more real raw signals of green algae, which leads to a significantly higher F1-score (0.9354 versus 0.8660). Availability and implementation Sigmap code is accessible at https://github.com/haowenz/sigmap. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less