This content will become publicly available on January 1, 2023

Title: WITCH: Improved Multiple Sequence Alignment Through Weighted Consensus Hidden Markov Model Alignment
Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers “full-length,” represents this “backbone alignment” using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k > 1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
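The three steps named in the abstract (weight the HMMs, keep the top k, combine their alignments by weight) can be illustrated with a minimal sketch. This is not WITCH's actual procedure: the weighting rule (2 raised to the bit score, normalized), the function names, and the per-residue vote are all assumptions made for illustration, and the simple vote does not guarantee that the chosen columns stay in order, which a real consensus step would have to enforce.

```python
from collections import defaultdict

def hmm_weights(bit_scores):
    """Map each HMM's bit score to a weight; weights sum to 1.
    Assumed rule for illustration only: weight proportional to 2**score."""
    top = max(bit_scores.values())
    # Subtract the maximum before exponentiating to avoid overflow.
    raw = {h: 2.0 ** (s - top) for h, s in bit_scores.items()}
    total = sum(raw.values())
    return {h: r / total for h, r in raw.items()}

def top_k(weights, k):
    """Return the k HMM names with the largest weights, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

def weighted_consensus(placements, weights):
    """placements[h][i] is the backbone column that HMM h assigns to
    query residue i (None for an insertion). Each residue takes the
    column with the largest total weight among the given HMMs."""
    n = len(next(iter(placements.values())))
    consensus = []
    for i in range(n):
        votes = defaultdict(float)
        for h, cols in placements.items():
            votes[cols[i]] += weights[h]
        consensus.append(max(votes, key=votes.get))
    return consensus

# Hypothetical bit scores for one query sequence against three HMMs.
weights = hmm_weights({"A": 10, "B": 9, "C": 2})
selected = top_k(weights, 2)
# Hypothetical column assignments from the two selected HMMs.
placements = {"A": [3, 4, 7], "B": [3, 5, 7]}
print(selected, weighted_consensus(placements, weights))
```

Using k > 1 lets a lower-ranked HMM outvote the top one only where several models agree, which is the intuition behind combining alignments rather than trusting a single best hit.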
Authors:
; ;
Award ID(s):
2006069
Publication Date:
NSF-PAR ID:
10328453
Journal Name:
Journal of Computational Biology
Volume:
29
Issue:
0
Page Range or eLocation-ID:
1-20
ISSN:
1557-8666
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Summary

    Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation.

    Availability and implementation

    UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  2. Boeva, Valentina (Ed.)
    Summary Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation. Availability and implementation UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP. Supplementary information Supplementary data are available at Bioinformatics online.
  3. Pairwise sequence alignment is one of the most computationally intensive kernels in genomic data analysis, accounting for more than 90% of the runtime for key bioinformatics applications. This method is particularly expensive for third generation sequences due to the high computational cost of analyzing sequences of length between 1Kb and 1Mb. Given the quadratic overhead of exact pairwise algorithms for long alignments, the community primarily relies on approximate algorithms that search only for high-quality alignments and stop early when one is not found. In this work, we present the first GPU optimization of the popular X-drop alignment algorithm, that we named LOGAN. Results show that our high performance multi-GPU implementation achieves up to 181.6 GCUPS and speed-ups up to 6.6× and 30.7× using 1 and 6 NVIDIA Tesla V100, respectively, over the state-of-the-art software running on two IBM Power9 processors using 168 CPU threads, with equivalent accuracy. We also demonstrate a 2.3× LOGAN speed-up versus ksw2, a state-of-the-art vectorized algorithm for sequence alignment implemented in minimap2, a long-read mapping software. To highlight the impact of our work on a real-world application, we couple LOGAN with a many-to-many long-read alignment software called BELLA, and demonstrate that our implementation improves the overall BELLA runtime by up to 10.6×. Finally, we adapt the Roofline model for LOGAN and demonstrate that our implementation is near optimal on the NVIDIA Tesla V100s.
  4. Aligning DNA sequences to an annotated reference is a key step for genotyping in biology. Recent scientific studies have demonstrated improved inference by aligning reads to a variation graph, i.e., a reference sequence augmented with known genetic variations. Given a variation graph in the form of a directed acyclic string graph, the sequence to graph alignment problem seeks to find the best matching path in the graph for an input query sequence. Solving this problem exactly using a sequential dynamic programming algorithm takes quadratic time in terms of the graph size and query length, making it difficult to scale to high throughput DNA sequencing data. In this work, we propose the first parallel algorithm for computing sequence to graph alignments that leverages multiple cores and single-instruction multiple-data (SIMD) operations. We take advantage of the available inter-task parallelism, and provide a novel blocked approach to compute the score matrix while ensuring high memory locality. Using a 48-core Intel Xeon Skylake processor, the proposed algorithm achieves peak performance of 317 billion cell updates per second (GCUPS), and demonstrates near linear weak and strong scaling on up to 48 cores. It delivers significant performance gains compared to existing algorithms, and results in run-time reduction from multiple days to three hours for the problem of optimally aligning high coverage long (PacBio/ONT) or short (Illumina) DNA reads to an MHC human variation graph containing 10 million vertices.
  5. Abstract Background Inertial measurement units (IMUs) with high-resolution sensors such as accelerometers are now used extensively to study fine-scale behavior in a wide range of marine and terrestrial animals. Robust and practical methods are required for the computationally-demanding analysis of the resulting large datasets, particularly for automating classification routines that construct behavioral time series and time-activity budgets. Magnetometers are used increasingly to study behavior, but it is not clear how these sensors contribute to the accuracy of behavioral classification methods. Development of effective classification methodology is key to understanding energetic and life-history implications of foraging and other behaviors. Methods We deployed accelerometers and magnetometers on four species of free-ranging albatrosses and evaluated the ability of unsupervised hidden Markov models (HMMs) to identify three major modalities in their behavior: ‘flapping flight’, ‘soaring flight’, and ‘on-water’. The relative contribution of each sensor to classification accuracy was measured by comparing HMM-inferred states with expert classifications identified from stereotypic patterns observed in sensor data. Results HMMs provided a flexible and easily interpretable means of classifying behavior from sensor data. Model accuracy was high overall (92%), but varied across behavioral states (87.6, 93.1 and 91.7% for ‘flapping flight’, ‘soaring flight’ and ‘on-water’, respectively). Models built on accelerometer data alone were as accurate as those that also included magnetometer data; however, the latter were useful for investigating slow and periodic behaviors such as dynamic soaring at a fine scale. Conclusions The use of IMUs in behavioral studies produces large data sets, necessitating the development of computationally-efficient methods to automate behavioral classification in order to synthesize and interpret underlying patterns. HMMs provide an accessible and robust framework for analyzing complex IMU datasets and comparing behavioral variation among taxa across habitats, time and space.
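
The early-stopping idea behind the X-drop heuristic mentioned in item 3 above can be shown with a small sketch. It is a deliberately simplified, single-threaded, ungapped extension and is not LOGAN's GPU implementation; the function name, the scoring values, and the example sequences are assumptions chosen for illustration.

```python
def xdrop_extend(a, b, x, match=1, mismatch=-1):
    """Extend an ungapped alignment from the start of a and b.
    Stop as soon as the running score falls more than x below the
    best score seen so far. Returns (best_score, best_length)."""
    best = score = 0
    best_len = 0
    for i, (ca, cb) in enumerate(zip(a, b), start=1):
        score += match if ca == cb else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score > x:
            break  # the drop exceeds x, so give up on this extension
    return best, best_len

# A small x abandons the extension at the mismatching stretch;
# a larger x tolerates it and recovers the longer alignment.
print(xdrop_extend("AAXXAAAA", "AAYYAAAA", x=1))
print(xdrop_extend("AAXXAAAA", "AAYYAAAA", x=5))
```

The choice of x trades work for sensitivity: a small threshold stops quickly on poor candidates, while a large one keeps extending through low-scoring regions.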