NSF PAR Search | NSF Public Access Repository

On the Hardness of Sequence Alignment on De Bruijn Graphs

https://doi.org/10.1089/cmb.2022.0411

Gibney, Daniel; Thankachan, Sharma V.; Aluru, Srinivas (December 2022, Journal of Computational Biology)

Full Text Available
Algorithms for Colinear Chaining with Overlaps and Gap Costs

https://doi.org/10.1089/cmb.2022.0266

Jain, Chirag; Gibney, Daniel; Thankachan, Sharma V. (November 2022, Journal of Computational Biology)

Full Text Available
Feasibility of flow decomposition with subpath constraints in linear time

Gibney, Daniel; Thankachan, Sharma V.; Aluru, S. (January 2022, 22nd International Workshop on Algorithms in Bioinformatics, Leibniz International Proceedings in Informatics)
Boucher, Christina; Rahmann, Sven (Eds.)
Full Text Available
Co-linear Chaining with Overlaps and Gap Costs

https://doi.org/10.1007/978-3-031-04749-7_15

Jain, Chirag; Gibney, Daniel; Thankachan, Sharma V. (January 2022, Research in Computational Molecular Biology - 26th Annual International Conference, RECOMB 2022)

Full Text Available
The Complexity of Approximate Pattern Matching on de Bruijn Graphs

https://doi.org/10.1007/978-3-031-04749-7_16

Gibney, Daniel; Thankachan, Sharma V.; Aluru, Srinivas (January 2022, Research in Computational Molecular Biology - 26th Annual International Conference, RECOMB 2022)

Full Text Available
An Ultra-Fast and Parallelizable Algorithm for Finding k-Mismatch Shortest Unique Substrings

https://doi.org/10.1109/TCBB.2020.2968531

Allen, Daniel R.; Thankachan, Sharma V.; Xu, Bojian (January 2021, IEEE/ACM Transactions on Computational Biology and Bioinformatics)
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented for solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution. The result is a new algorithm with expected time complexity O(n log^k n) and a practical space complexity of O(kn), where n is the string length. When k > 0, which is the hard case, the new proposal significantly improves on the any-case O(n^2) time complexity of the prior best method for k-mismatch shortest unique substring finding. Experiments show that the new algorithm is practical to implement and is significantly faster than the implementation of the prior best solution when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds, compared to 174.37 seconds with the prior best solution. Further, significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, which yields a further significant practical performance improvement. For example, when using 8 cores, both parallel implementations processed a 10 MB sample DNA sequence with k=2 in less than 1/4 of the serial implementation's time. In an age where instances with thousands of gigabytes of RAM are readily available through Cloud infrastructure providers, the trade-off of additional memory usage for significantly improved processing times is likely to be desirable to, and needed by, many users. For example, the best prior solution may take years to process a 200 MB DNA sample for any k > 0, while the new proposal, using 24 cores, can process a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud.
It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those applying the measure to long sequences in fields such as computational biology. We also give a theoretical bound showing that the k-mismatch shortest unique substring finding problem can be solved in O(n log^k n) time and O(n) space, which is asymptotically much better than the solution we implemented and is a new result of independent interest.
Full Text Available
Efficient Data Structures for Range Shortest Unique Substring Queries

https://doi.org/10.3390/a13110276

Abedin, Paniz; Ganguly, Arnab; Pissis, Solon P.; Thankachan, Sharma V. (November 2020, Algorithms)
Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T that efficiently answers online queries of the following type: given a range [α,β], return a shortest substring T[i,j] of T with exactly one occurrence in [α,β]. We present an O(n log n)-word data structure with O(log_w n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction that allows us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(√n log^ε n) query time, where ε > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
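The query defined in this abstract can be illustrated with a brute-force sketch in Python. It is not the data structure from the paper: it rescans the range on every query and is only meant to make the definition concrete. The function name `range_sus` is hypothetical, and the sketch assumes that an "occurrence in [α,β]" means an occurrence lying entirely inside T[α,β], with ties broken by the leftmost starting position.

```python
from collections import Counter


def range_sus(t, alpha, beta):
    """Brute-force Range Shortest Unique Substring query (illustrative only).

    Positions are 1-indexed and inclusive, as in the abstract.
    Assumption: an occurrence counts only if it lies entirely in T[alpha, beta].
    Returns (i, j) such that T[i, j] is a shortest substring with exactly
    one occurrence in the range; ties go to the leftmost start.
    """
    if not (1 <= alpha <= beta <= len(t)):
        raise ValueError("range must satisfy 1 <= alpha <= beta <= len(t)")
    window = t[alpha - 1:beta]
    m = len(window)
    # Try lengths in increasing order so the first hit is a shortest one.
    for length in range(1, m + 1):
        counts = Counter(window[s:s + length] for s in range(m - length + 1))
        for s in range(m - length + 1):
            if counts[window[s:s + length]] == 1:
                i = alpha + s
                return i, i + length - 1
    # Unreachable: the whole window always occurs exactly once in itself.
    raise AssertionError("no unique substring found")
```

For example, `range_sus("abab", 1, 4)` considers "a" and "b" (each occurring twice) before finding that "ba" occurs once. Each query costs roughly cubic time in the range length in the worst case, which is the cost the data structures in the abstract are designed to avoid.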
Full Text Available
An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction

https://doi.org/10.1186/s12859-020-03738-5

Chockalingam, Sriram P.; Pannu, Jodh; Hooshmand, Sahar; Thankachan, Sharma V.; Aluru, Srinivas (November 2020, BMC Bioinformatics)
Abstract
Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACS_k takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACS_k have been introduced.
Results: In this paper, we present a novel linear-time heuristic to approximate ACS_k, which is faster than computing the exact ACS_k while being closer to the exact ACS_k values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction.
Conclusions: Our method produces a better approximation for ACS_k and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
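The abstract does not define ACS_k, so the Python sketch below rests on an assumption: it uses the commonly cited definition, in which ACS_k(X, Y) is the average, over all start positions i in X, of the length of the longest substring of X starting at i that matches some substring of Y with at most k mismatches. The sketch is a naive exact computation for illustration only. It is neither the linear-time heuristic the paper presents nor the O(n log^k n) exact algorithm, and the function names are hypothetical.

```python
def longest_k_mismatch_match(x, i, y, k):
    """Length of the longest substring of x starting at index i (0-indexed)
    that matches a substring of y with at most k mismatches."""
    best = 0
    for j in range(len(y)):
        mismatches = 0
        length = 0
        # Extend the alignment x[i:], y[j:] until the (k+1)-th mismatch.
        while i + length < len(x) and j + length < len(y):
            if x[i + length] != y[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best


def acs_k(x, y, k=0):
    """Naive ACS_k(x, y) under the assumed definition: the mean of the
    longest k-mismatch match lengths over all start positions in x."""
    if not x:
        raise ValueError("x must be non-empty")
    total = sum(longest_k_mismatch_match(x, i, y, k) for i in range(len(x)))
    return total / len(x)
```

With k=0 this reduces to the plain ACS measure. The running time is cubic in the worst case for sequences of similar length, which shows why faster approximations are of interest for large datasets.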
Full Text Available
Sequential and parallel algorithms for all-pair k-mismatch maximal common substrings

https://doi.org/10.1016/j.jpdc.2020.05.018

Chockalingam, Sriram P.; Thankachan, Sharma V.; Aluru, Srinivas (October 2020, Journal of Parallel and Distributed Computing)

Full Text Available
On the Complexity of BWT-Runs Minimization via Alphabet Reordering

https://doi.org/10.4230/LIPIcs.ESA.2020.15

Bentley, Jason W.; Gibney, Daniel; Thankachan, Sharma V. (January 2020, 28th Annual European Symposium on Algorithms, ESA 2020)

Full Text Available