NSF PAR Search | NSF Public Access Repository

Fast multiple sequence alignment via multi-armed bandits

https://doi.org/10.1093/bioinformatics/btae225

Mazooji, Kayvon; Shomorony, Ilan (June 2024, Bioinformatics)

Abstract. Summary: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences. Availability and implementation: The code used to produce the results in this paper is available on GitHub at https://github.com/ilanshom/adaptiveMSA.
LexicHash: sequence similarity estimation via lexicographic comparison of hashes

https://doi.org/10.1093/bioinformatics/btad652

Greenberg, Grant; Ravi, Aditya Narayan; Shomorony, Ilan (November 2023, Bioinformatics)
Alkan, Can (Ed.)
Abstract. Motivation: Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. Results: In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how “lexicographically similar” the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision–recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n^2) scaling associated with pairwise similarity search. Availability and implementation: LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
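The standard MinHash procedure described in this abstract (hash every k-mer, keep the minimum per hash function, count matching minima) can be sketched as follows. This is a minimal illustration of plain MinHash only, not of LexicHash or its hash design. The function names are hypothetical, and the "many distinct hash functions" are simulated here by seeding BLAKE2b with an integer, which is an assumption of this sketch.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Hash a k-mer to a 64-bit integer; each seed acts as a distinct hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k, num_hashes):
    """Store one min-hash per hash function: the smallest hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(hash_kmer(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of hash functions whose min-hashes match between two sketches."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_a = "ACGTACGTTAGCCGATAGCTAGGCTA"
    read_b = "ACGTACGTTAGCCGTTAGCTAGGCTA"  # read_a with one substitution
    sa = minhash_sketch(read_a, k=5, num_hashes=128)
    sb = minhash_sketch(read_b, k=5, num_hashes=128)
    print("estimated similarity:", estimate_similarity(sa, sb))
```

Two identical reads always produce an estimate of 1.0. Raising `k` makes each match stronger evidence of a true overlap but makes matches rarer on noisy reads, which is the precision versus recall tradeoff the abstract describes.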
Full Text Available
Dynamic DBSCAN with Euler Tour Sequences

Shin, Seiyun; Shomorony, Ilan; Macgregor, Peter (May 2025, Proceedings of Machine Learning Research)

Free, publicly-accessible full text available May 3, 2026
Prioritized Federated Learning: Leveraging Non-Priority Clients for Targeted Model Improvement

Ravi, Aditya Narayan; Shomorony, Ilan (September 2024, Transactions on Machine Learning Research)

Full Text Available
Substring Density Estimation From Traces

https://doi.org/10.1109/TIT.2024.3418377

Mazooji, Kayvon; Shomorony, Ilan (August 2024, IEEE Transactions on Information Theory)
Capacity of Frequency-based Channels: Encoding Information in Molecular Concentrations

Gerzon, Yuval; Shomorony, Ilan; Weinberger, Nir (July 2024, Proceedings of IEEE International Symposium on Information Theory)

Full Text Available
Faster Maximum Inner Product Search in High Dimensions

Tiwari, Mo; Kang, Ryan; Lee, Jaeyong; Lee, Donghyun; Piech, Christopher J; Thrun, Sebastian; Shomorony, Ilan; Zhang, Martin J (June 2024, ICML 2024)

Full Text Available
An Information Theory for Out-of-Order Media With Applications in DNA Data Storage

https://doi.org/10.1109/TMBMC.2024.3403759

Ravi, Aditya Narayan; Vahid, Alireza; Shomorony, Ilan (June 2024, IEEE Transactions on Molecular, Biological, and Multi-Scale Communications)

Full Text Available
An Instance-Based Approach to the Trace Reconstruction Problem

https://doi.org/10.1109/CISS59072.2024.10480213

Mazooji, Kayvon; Shomorony, Ilan (March 2024, Proceedings of the Conference on Information Sciences and Systems)

Full Text Available
DNA Merge-Sort: A Family of Nested Varshamov-Tenengolts Reassembly Codes for Out-of-Order Media

https://doi.org/10.1109/TCOMM.2023.3335409

Nassirpour, Sajjad; Shomorony, Ilan; Vahid, Alireza (March 2024, IEEE Transactions on Communications)

Full Text Available
