NSF PAR Search | NSF Public Access Repository

EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment

https://doi.org/10.1186/s13015-023-00247-x

Shen, Chengze; Liu, Baqiao; Williams, Kelly P.; Warnow, Tandy (December 2023, Algorithms for Molecular Biology)

Abstract Background: Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity. Results: We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.e., a constraint alignment). EMMA builds on MAFFT--add, which is also designed to add sequences into a given constraint alignment. EMMA improves on MAFFT--add methods by using a divide-and-conquer framework to scale its most accurate version, MAFFT-linsi--add, to constraint alignments with many sequences. We show that EMMA has an accuracy advantage over other techniques for adding sequences into alignments under many realistic conditions and can scale to large datasets with high accuracy (hundreds of thousands of sequences). EMMA is available at https://github.com/c5shen/EMMA. Conclusions: EMMA is a new tool that provides high accuracy and scalability for adding sequences into an existing alignment.
HMMerge: an ensemble method for multiple sequence alignment

https://doi.org/10.1093/bioadv/vbad052

Park, Minhyuk; Warnow, Tandy (April 2023, Bioinformatics Advances)
Lengauer, Thomas (Ed.)

Abstract Motivation: Despite advances in method development for multiple sequence alignment over the last several decades, the alignment of datasets exhibiting substantial sequence length heterogeneity, especially when the input sequences include very short sequences (either as a result of sequencing technologies or of large deletions during evolution), remains an inadequately solved problem. Results: We present HMMerge, a method to compute an alignment of datasets exhibiting high sequence length heterogeneity, or to add short sequences into a given ‘backbone’ alignment. HMMerge builds on the technique from its predecessor alignment methods, UPP and WITCH, which build an ensemble of profile HMMs to represent the backbone alignment and add the remaining sequences into the backbone alignment using the ensemble. HMMerge differs from UPP and WITCH by building a new ‘merged’ HMM from the ensemble, and then using that merged HMM to align the query sequences. We show that HMMerge is competitive with WITCH, with an advantage over WITCH when adding very short sequences into backbone alignments. Availability and implementation: HMMerge is freely available at https://github.com/MinhyukPark/HMMerge. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
UPP2: fast and accurate alignment of datasets with fragmentary sequences

https://doi.org/10.1093/bioinformatics/btad007

Park, Minhyuk; Ivanovic, Stefan; Chu, Gillian; Shen, Chengze; Warnow, Tandy (January 2023, Bioinformatics)
Marschall, Tobias (Ed.)

Abstract Motivation: Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets. Results: We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity and is among the most accurate otherwise. Availability and implementation: https://github.com/gillichu/sepp. Supplementary information: Supplementary data are available at Bioinformatics online.
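The abstract describes UPP as first aligning the full-length sequences and then adding the remaining sequences to that alignment. A minimal sketch of that initial partition follows. The function name `split_by_length` and the 25% tolerance around the median length are illustrative assumptions, not details taken from the abstract or from the UPP2 code.

```python
from statistics import median


def split_by_length(seqs, tolerance=0.25):
    """Partition sequences into 'full-length' and 'remaining' sets.

    Hypothetical helper: a sequence counts as full-length when its length
    lies within `tolerance` (a fraction) of the median length. The default
    of 0.25 is an assumption made for this sketch.
    """
    if not seqs:
        return {}, {}
    med = median(len(s) for s in seqs.values())
    low, high = med * (1 - tolerance), med * (1 + tolerance)
    full_length = {name: s for name, s in seqs.items() if low <= len(s) <= high}
    remaining = {name: s for name, s in seqs.items() if name not in full_length}
    return full_length, remaining


example = {
    "seq_a": "A" * 100,
    "seq_b": "A" * 110,
    "seq_c": "A" * 95,
    "fragment": "A" * 20,
}
backbone_candidates, queries = split_by_length(example)
```

In a pipeline of this shape, the first set would be aligned by an MSA method and the second set added afterwards; both of those steps are outside this sketch.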
WITCH-NG: efficient and accurate alignment of datasets with sequence length heterogeneity

https://doi.org/10.1093/bioadv/vbad024

Liu, Baqiao; Warnow, Tandy (March 2023, Bioinformatics Advances)
Lengauer, Thomas (Ed.)

Abstract Summary: Multiple sequence alignment is a basic part of many bioinformatics pipelines, including in phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions in the evolutionary history of the sequences and the inclusion of unassembled reads or incompletely assembled sequences in the input. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity, with UPP one of the first methods to achieve good accuracy, and WITCH a recent improvement on UPP for accuracy. In this article, we show how we can speed up WITCH. Our improvement includes replacing a critical step in WITCH (currently performed using a heuristic search) by a polynomial time exact algorithm using Smith–Waterman. Our new method, WITCH-NG (i.e. ‘next generation WITCH’) achieves the same accuracy but is substantially faster. WITCH-NG is available at https://github.com/RuneBlaze/WITCH-NG. Availability and implementation: The datasets used in this study are from prior publications and are freely available in public repositories, as indicated in the Supplementary Materials. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
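For readers unfamiliar with Smith–Waterman, the sketch below is the textbook dynamic program for the best local alignment score between two strings. It runs in time proportional to the product of the two lengths. It is not WITCH-NG's implementation, and the abstract does not say how the algorithm is applied inside WITCH-NG; the function name and the scoring values (match 2, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty).

    Textbook Smith-Waterman; the scoring parameters are illustrative only.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamp at 0 so an alignment may start anywhere (local alignment).
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
embedded = smith_waterman_score("XXACGTYY", "ZZACGTWW")
unrelated = smith_waterman_score("AAAA", "TTTT")
```

Only two rows of the table are kept in memory because the score alone is returned; recovering the aligned region itself would require storing the full table or traceback pointers.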
TIPP3 and TIPP3-fast: Improved abundance profiling in metagenomics

https://doi.org/10.1371/journal.pcbi.1012593

Shen, Chengze; Wedell, Eleanor; Pop, Mihai; Warnow, Tandy (April 2025, PLOS Computational Biology)
Zhu, Shanfeng (Ed.)
We present TIPP3 and TIPP3-fast, new tools for abundance profiling in metagenomic datasets. Like its predecessor, TIPP2, the TIPP3 pipeline uses a maximum likelihood approach to place reads into labeled taxonomies using marker genes, but it achieves superior accuracy to TIPP2 by enabling the use of much larger taxonomies through improved algorithmic techniques. We show that TIPP3 is generally more accurate than leading methods for abundance profiling in two important contexts: when reads come from genomes not already in a public database (i.e., novel genomes) and when reads contain sequencing errors. We also show that TIPP3-fast has slightly lower accuracy than TIPP3, but is also generally more accurate than other leading methods and uses a small fraction of TIPP3’s runtime. Additionally, we highlight the potential benefits of restricting abundance profiling methods to those reads that map to marker genes (i.e., using a filtered marker-gene based analysis), which we show typically improves accuracy. TIPP3 is freely available at https://github.com/c5shen/TIPP3.
Free, publicly-accessible full text available April 4, 2026
BSCAMPP: Batch-Scaled Phylogenetic Placement on Large Trees

https://doi.org/10.1109/TCBBIO.2025.3562281

Wedell, Eleanor; Shen, Chengze; Warnow, Tandy (January 2025, IEEE Transactions on Computational Biology and Bioinformatics)

Full Text Available
Large-Scale Multiple Sequence Alignment and the Maximum Weight Trace Alignment Merging Problem

https://doi.org/10.1109/TCBB.2022.3191848

Zaharias, Paul; Smirnov, Vladimir; Warnow, Tandy (May 2023, IEEE/ACM Transactions on Computational Biology and Bioinformatics)
SCAMPP: Scaling Alignment-Based Phylogenetic Placement to Large Trees

https://doi.org/10.1109/TCBB.2022.3170386

Wedell, Eleanor; Cai, Yirong; Warnow, Tandy (March 2023, IEEE/ACM Transactions on Computational Biology and Bioinformatics)

Full Text Available
Recent progress on methods for estimating and updating large phylogenies

https://doi.org/10.1098/rstb.2021.0244

Zaharias, Paul; Warnow, Tandy (October 2022, Philosophical Transactions of the Royal Society B: Biological Sciences)

With the increased availability of sequence data and even of fully sequenced and assembled genomes, phylogeny estimation of very large trees (even of hundreds of thousands of sequences) is now a goal for some biologists. Yet, the construction of these phylogenies is a complex pipeline presenting analytical and computational challenges, especially when the number of sequences is very large. In the past few years, new methods have been developed that aim to enable highly accurate phylogeny estimations on these large datasets, including divide-and-conquer techniques for multiple sequence alignment and/or tree estimation, methods that can estimate species trees from multi-locus datasets while addressing heterogeneity due to biological processes (e.g. incomplete lineage sorting and gene duplication and loss), and methods to add sequences into large gene trees or species trees. Here we present some of these recent advances and discuss opportunities for future improvements. This article is part of a discussion meeting issue ‘Genomic population structures of microbial pathogens’.
Full Text Available
MAGUS+eHMMs: improved multiple sequence alignment accuracy for fragmentary sequences

Shen, Chengze; Zaharias, Paul; Warnow, Tandy (January 2022, Bioinformatics)
Boeva, Valentina (Ed.)
Abstract Summary: Multiple sequence alignment is an initial step in many bioinformatics pipelines, including phylogeny estimation, protein structure prediction and taxonomic identification of reads produced in amplicon or metagenomic datasets, etc. Yet, alignment estimation is challenging on datasets that exhibit substantial sequence length heterogeneity, and especially when the datasets have fragmentary sequences as a result of including reads or contigs generated by next-generation sequencing technologies. Here, we examine techniques that have been developed to improve alignment estimation when datasets contain substantial numbers of fragmentary sequences. We find that MAGUS, a recently developed MSA method, is fairly robust to fragmentary sequences under many conditions, and that using a two-stage approach where MAGUS is used to align selected ‘backbone sequences’ and the remaining sequences are added into the alignment using ensembles of Hidden Markov Models further improves alignment accuracy. The combination of MAGUS with the ensemble of eHMMs (i.e. MAGUS+eHMMs) clearly improves on UPP, the previous leading method for aligning datasets with high levels of fragmentation. Availability and implementation: UPP is available on https://github.com/smirarab/sepp, and MAGUS is available on https://github.com/vlasmirnov/MAGUS. MAGUS+eHMMs can be performed by running MAGUS to obtain the backbone alignment, and then using the backbone alignment as an input to UPP. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available