skip to main content


Title: Vargas: heuristic-free alignment for assessing linear and graph read aligners
Abstract Motivation Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Results Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Availability and implementation Source code implemented in C ++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1350041 1920103
NSF-PAR ID:
10157427
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Kelso, Janet (Ed.)
    Abstract Motivation Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists. Results In this article, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability and implementation All the data used for preparing the results of this paper can be found with 10.5281/zenodo.4902332. PuffAligner is a free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  2. Abstract

    Accurate alignment of transcribed RNA to reference genomes is a critical step in the analysis of gene expression, which in turn has broad applications in biomedical research and in the basic sciences. We reveal that widely used splice-aware aligners, such as STAR and HISAT2, can introduce erroneous spliced alignments between repeated sequences, leading to the inclusion of falsely spliced transcripts in RNA-seq experiments. In some cases, the ‘phantom’ introns resulting from these errors make their way into widely-used genome annotation databases. To address this issue, we present EASTR (Emending Alignments of Spliced Transcript Reads), a software tool that detects and removes falsely spliced alignments or transcripts from alignment and annotation files. EASTR improves the accuracy of spliced alignments across diverse species, including human, maize, andArabidopsis thaliana, by detecting sequence similarity between intron-flanking regions. We demonstrate that applying EASTR before transcript assembly substantially reduces false positive introns, exons, and transcripts, improving the overall accuracy of assembled transcripts. Additionally, we show that EASTR’s application to reference annotation databases can detect and correct likely cases of mis-annotated transcripts.

     
    more » « less
  3. Boeva, Valentina (Ed.)
    Abstract Summary Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation. Availability and implementation UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  4. Abstract Summary

    Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation.

    Availability and implementation

    UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  5. Abstract Motivation

    Whole metagenome shotgun sequencing is a powerful approach for assaying the functional potential of microbial communities. We currently lack tools that efficiently and accurately align DNA reads against protein references, the technique necessary for constructing a functional profile. Here, we present PALADIN—a novel modification of the Burrows-Wheeler Aligner that provides accurate alignment, robust reporting capabilities and orders-of-magnitude improved efficiency by directly mapping in protein space.

    Results

    We compared the accuracy and efficiency of PALADIN against existing tools that employ nucleotide or protein alignment algorithms. Using simulated reads, PALADIN consistently outperformed the popular DNA read mappers BWA and NovoAlign in detected proteins, percentage of reads mapped and ontological similarity. We also compared PALADIN against four existing protein alignment tools: BLASTX, RAPSearch2, DIAMOND and Lambda, using empirically obtained reads. PALADIN yielded results seven times faster than the best performing alternative, DIAMOND and nearly 8000 times faster than BLASTX. PALADIN's accuracy was comparable to all tested solutions.

    Availability and Implementation

    PALADIN was implemented in C, and its source code and documentation are available at https://github.com/twestbrookunh/paladin

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less