NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

The Heaviest Induced Ancestors Problem: Better Data Structures and Applications

https://doi.org/10.1007/s00453-022-00955-7

Abedin, Paniz; Hooshmand, Sahar; Ganguly, Arnab; Thankachan, Sharma V. (July 2022, Algorithmica)

Full Text Available
On the Complexity of Recognizing Wheeler Graphs

https://doi.org/10.1007/s00453-021-00917-5

Gibney, Daniel; Thankachan, Sharma V. (March 2022, Algorithmica)

Full Text Available
An Ultra-Fast and Parallelizable Algorithm for Finding $$k$$-Mismatch Shortest Unique Substrings

https://doi.org/10.1109/TCBB.2020.2968531

Allen, Daniel R.; Thankachan, Sharma V.; Xu, Bojian (January 2021, IEEE/ACM Transactions on Computational Biology and Bioinformatics)
null (Ed.)
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity at O(kn), where n is the string length. When , which is the hard case, our new proposal significantly improves the any-case O(n^2) time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200MB for any , while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology. We also give a theoretical bound that the k-mismatch shortest unique substring finding problem can be solved using O(n log^k n) time and O(n) space, asymptotically much better than the one we implemented, serving as a new discovery of interest.
more » « less
Full Text Available
Simple Reductions from Formula-SAT to Pattern Matching on Labeled Graphs and Subtree Isomorphism

https://doi.org/10.1137/1.9781611976496.26

Gibney, Daniel; Hoppenworth, Gary; V. Thankachan, Sharma (January 2021, 4th Symposium on Simplicity in Algorithms, SOSA 2021)
null (Ed.)
Full Text Available
An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

https://doi.org/10.1186/s12859-020-03738-5

Chockalingam, Sriram P.; Pannu, Jodh; Hooshmand, Sahar; Thankachan, Sharma V.; Aluru, Srinivas (November 2020, BMC Bioinformatics)
null (Ed.)
Abstract Background Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS , and its k -mismatch counterpart, ACS k , have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACS k takes O ( n log k n ) time and hence impractical for large datasets, multiple heuristics that can approximate ACS k have been introduced. Results In this paper, we present a novel linear-time heuristic to approximate ACS k , which is faster than computing the exact ACS k while being closer to the exact ACS k values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy, runtime and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions Our method produces a better approximation for ACS k and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs .
more » « less
Full Text Available
Sequential and parallel algorithms for all-pair $k$ -mismatch maximal common substrings

https://doi.org/10.1016/j.jpdc.2020.05.018

Chockalingam, Sriram P.; Thankachan, Sharma V.; Aluru, Srinivas (October 2020, Journal of Parallel and Distributed Computing)

Full Text Available
On the Complexity of BWT-Runs Minimization via Alphabet Reordering

https://doi.org/10.4230/LIPIcs.ESA.2020.15

W. Bentley, Jason; Daniel, Gibney; V. Thankachan, Sharma (January 2020, 28th Annual European Symposium on Algorithms, ESA 2020)

Full Text Available
The Fine-Grained Complexity of Median and Center String Problems Under Edit Distance

Hoppenworth, Gary; Bentley, Jason W.; Gibney, Daniel; V. Thankachan, Sharma (January 2020, 28th Annual European Symposium on Algorithms, ESA 2020)
null (Ed.)
Full Text Available
FM-Index Reveals the Reverse Suffix Array

Ganguly, Arnab; Gibney, Daniel; Sahar, Hooshmand; M. Oguzhan, Kulekci; V. Thankachan, Sharma (January 2020, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark)
null (Ed.)
Full Text Available
On the Hardness and Inapproximability of Recognizing Wheeler Graphs

https://doi.org/10.4230/LIPIcs.ESA.2019.51

Gibney, Daniel; Thankachan, Sharma V. (January 2019, 27th Annual European Symposium on Algorithms (ESA) 2019)

In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs and that Wheeler graphs can be indexed in a way which is space-efficient. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We present - The problem of recognizing whether a given graph G=(V,E) is a Wheeler graph is NP-complete for any edge label alphabet of size sigma >= 2, even when G is a DAG. This holds even on a restricted, subset of graphs called d-NFA's for d >= 5. This is in contrast to recent results demonstrating the problem can be solved in polynomial time for d-NFA's where d <= 2. We also show the recognition problem can be solved in linear time for sigma =1; - There exists an 2^{e log sigma + O(n + e)} time exact algorithm where n = |V| and e = |E|. This algorithm relies on graph isomorphism being computable in strictly sub-exponential time; - We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to remove the minimum number of edges in order to obtain a Wheeler graph. We show WGV is APX-hard, even when G is a DAG, implying there exists a constant C >= 1 for which there is no C-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C >= 1, it is NP-hard to find a C-approximation; - We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler Graph (the dual of the WGV). In contrast to WGV, we prove that the WS problem is in APX for sigma=O(1); The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial-time solvable, raising the open question of which parameters determine this problem's difficulty.
more » « less
Full Text Available

« Prev Next »

Search for: All records