Title: An Ultra-Fast and Parallelizable Algorithm for Finding $k$-Mismatch Shortest Unique Substrings
This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity at O(kn), where n is the string length. When , which is the hard case, our new proposal significantly improves the any-case O(n^2) time complexity of the prior best method for k-mismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200MB for any , while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud. 
It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology. We also give a theoretical bound that the k-mismatch shortest unique substring finding problem can be solved using O(n log^k n) time and O(n) space, asymptotically much better than the one we implemented, serving as a new discovery of interest.
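To make the problem statement concrete, here is a minimal brute-force sketch in Python. It is not the algorithm described in the abstract. It assumes the usual definition, which the abstract does not spell out: a substring is k-mismatch unique if every other substring of the same length, starting at a different position, is at Hamming distance greater than k from it. The function names are hypothetical, and the running time is far worse than the O(n log^k n) expected time reported above, so it is only suitable for tiny inputs.

```python
from typing import Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_mismatch_unique(t: str, i: int, length: int, k: int) -> bool:
    """True if t[i:i+length] has no other occurrence within k mismatches.

    Assumed definition: every same-length substring starting at j != i
    must be at Hamming distance strictly greater than k.
    """
    s = t[i:i + length]
    return all(
        hamming(s, t[j:j + length]) > k
        for j in range(len(t) - length + 1)
        if j != i
    )


def shortest_k_mismatch_unique(t: str, k: int) -> Optional[Tuple[int, int]]:
    """Leftmost shortest k-mismatch unique substring as (start, length).

    Uses 0-based start positions. Returns None only for the empty string,
    because the whole string is trivially unique.
    """
    n = len(t)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            if is_k_mismatch_unique(t, i, length, k):
                return i, length
    return None


if __name__ == "__main__":
    print(shortest_k_mismatch_unique("aab", 0))
    print(shortest_k_mismatch_unique("aab", 1))
```

With k = 0 this reduces to the exact shortest unique substring problem; raising k can only make the answer longer, because more candidate substrings count as approximate repeats.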
We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new time-space tradeoffs for this problem and the generalization of finding k-mismatch SUSs. Our new results include the first algorithm for finding a k-mismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in O(n (L/m) log L) time using random access to T, or in O(n (L/m) log^2(L) log log σ) time using O((L/m) log^2 L) sequential passes over T. We then describe how, for constant k, with high probability, we can find a k-mismatch SUS in O(n^{1+ϵ} L/m) time using O(n^ϵ L/m) sequential passes over T, again using only m words of workspace. Finally, we also describe a deterministic algorithm that takes O(n τ log σ log n) time to find an SUS using O(n/τ) words of workspace, where τ is a parameter.
Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α,β], return a shortest substring T[i,j] of T with exactly one occurrence in [α,β]. We present an O(n log n)-word data structure with O(log_w n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(n log^ϵ n) query time, where ϵ > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
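The range query defined above can be illustrated with a naive baseline that scans the window on every query. This is only a sketch of the query semantics, not the data structures from the abstract. It assumes that an occurrence counts only if it lies entirely inside [α,β], and it keeps the 1-based inclusive positions used in the text; the function names are hypothetical.

```python
from typing import Optional, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    return sum(
        text.startswith(pattern, s)
        for s in range(len(text) - len(pattern) + 1)
    )


def range_sus(t: str, alpha: int, beta: int) -> Optional[Tuple[int, int]]:
    """Leftmost shortest substring T[i,j] with exactly one occurrence in T[alpha,beta].

    Positions are 1-based and inclusive, as in the text. Assumption: only
    occurrences fully contained in the range are counted.
    """
    window = t[alpha - 1:beta]
    m = len(window)
    for length in range(1, m + 1):
        for s in range(m - length + 1):
            if count_occurrences(window, window[s:s + length]) == 1:
                i = alpha + s
                return i, i + length - 1
    return None


if __name__ == "__main__":
    print(range_sus("abaab", 1, 5))
    print(range_sus("abaab", 3, 4))
```

Each query here costs time polynomial in the window length, which is the cost the preprocessed data structures are designed to avoid.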
Akmal, Shyan; Vassilevska Williams, Virginia; Wein, Nicole (Schloss Dagstuhl – Leibniz-Zentrum für Informatik); Bringmann, Karl; Grohe, Martin; Puppis, Gabriele; Svensson, Ola (Eds.)
In the k-Disjoint Shortest Paths (k-DSP) problem, we are given a graph G (with positive edge weights) on n nodes and m edges with specified source vertices s_1, … , s_k, and target vertices t_1, … , t_k, and are tasked with determining if G contains vertex-disjoint (s_i,t_i)-shortest paths. For any constant k, it is known that k-DSP can be solved in polynomial time over undirected graphs and directed acyclic graphs (DAGs). However, the exact time complexity of k-DSP remains mysterious, with large gaps between the fastest known algorithms and best conditional lower bounds. In this paper, we obtain faster algorithms for important cases of k-DSP, and present better conditional lower bounds for k-DSP and its variants.
Previous work solved 2-DSP over weighted undirected graphs in O(n⁷) time, and weighted DAGs in O(mn) time. For the main result of this paper, we present optimal linear-time algorithms for solving 2-DSP on weighted undirected graphs and DAGs. Our linear-time algorithms are algebraic, however, and so only solve the detection rather than the search version of 2-DSP (we show how to solve the search version in O(mn) time, which is faster than the previous best runtime in weighted undirected graphs, but only matches the previous best runtime for DAGs).
We also obtain a faster algorithm for k-Edge Disjoint Shortest Paths (k-EDSP) in DAGs, the variant of k-DSP where one seeks edge-disjoint instead of vertex-disjoint paths between sources and their corresponding targets. Algorithms for k-EDSP on DAGs from previous work take Ω(m^k) time. We show that k-EDSP can be solved over DAGs in O(mn^{k-1}) time, matching the fastest known runtime for solving k-DSP over DAGs.
Previous work established conditional lower bounds for solving k-DSP and its variants via reductions from detecting cliques in graphs. Prior work implied that k-Clique can be reduced to 2k-DSP in DAGs and undirected graphs with O((kn)²) nodes. We improve this reduction, by showing how to reduce from k-Clique to k-DSP in DAGs and undirected graphs with O((kn)²) nodes (halving the number of paths needed in the reduced instance). A variant of k-DSP is the k-Disjoint Paths (k-DP) problem, where the solution paths no longer need to be shortest paths. Previous work reduced from k-Clique to p-DP in DAGs with O(kn) nodes, for p = k + k(k-1)/2. We improve this by showing a reduction from k-Clique to p-DP, for p = k + ⌊k²/4⌋.
Under the k-Clique Hypothesis from fine-grained complexity, our results establish better conditional lower bounds for k-DSP for all k ≥ 4, and better conditional lower bounds for p-DP for all p ≤ 4031. Notably, our work gives the first nontrivial conditional lower bounds for 4-DP in DAGs and 4-DSP in undirected graphs and DAGs. Before our work, nontrivial conditional lower bounds were only known for k-DP and k-DSP on such graphs when k ≥ 6.
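As a small worked comparison of the two k-Clique to p-DP reductions mentioned above, the following Python snippet evaluates the previous path count p = k + k(k-1)/2 and the improved count p = k + ⌊k²/4⌋ for a few values of k. It only does the arithmetic on the formulas quoted in the abstract; the function names are hypothetical.

```python
def paths_previous(k: int) -> int:
    """Number of paths p in the earlier reduction: p = k + k(k-1)/2."""
    # k * (k - 1) is always even, so integer division is exact here.
    return k + k * (k - 1) // 2


def paths_improved(k: int) -> int:
    """Number of paths p in the improved reduction: p = k + floor(k^2 / 4)."""
    return k + (k * k) // 4


if __name__ == "__main__":
    print("k  previous  improved")
    for k in range(2, 9):
        print(f"{k}  {paths_previous(k):8d}  {paths_improved(k):8d}")
```

For k = 4, for instance, the earlier reduction needs 10 paths and the improved one needs 8; the quadratic term drops from roughly k²/2 to k²/4, so the gap widens as k grows.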
Biswas, A.S.; Dory, M.; Ghaffari, M.; Mitrovic, S.; Nazari, Y. (33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 2021))
Over the past decade, there has been increasing interest in distributed/parallel algorithms for processing large-scale graphs. By now, we have quite fast algorithms (usually sublogarithmic-time and often poly(log log n)-time, or even faster) for a number of fundamental graph problems in the massively parallel computation (MPC) model. This model is a widely adopted theoretical abstraction of MapReduce-style settings, where a number of machines communicate in an all-to-all manner to process large-scale data. Contributing to this line of work on MPC graph algorithms, we present poly(log k) ∈ poly(log log n) round MPC algorithms for computing O(k^{1+o(1)})-spanners in the strongly sublinear regime of local memory. To the best of our knowledge, these are the first sublogarithmic-time MPC algorithms for spanner construction.
As primary applications of our spanners, we get two important implications, as follows:
- For the MPC setting, we get an O(log^2 log n)-round algorithm for O(log^{1+o(1)} n) approximation of all pairs shortest paths (APSP) in the near-linear regime of local memory. To the best of our knowledge, this is the first sublogarithmic-time MPC algorithm for distance approximations.
- Our result above also extends to the Congested Clique model of distributed computing, with the same round complexity and approximation guarantee. This gives the first sublogarithmic algorithm for approximating APSP in weighted graphs in the Congested Clique model.
Background: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACS_k, have been shown to produce results as effective as multiple-sequence-alignment-based methods for reconstruction of phylogeny trees. Since computing ACS_k takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACS_k have been introduced. Results: In this paper, we present a novel linear-time heuristic to approximate ACS_k, which is faster than computing the exact ACS_k while being closer to the exact ACS_k values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. Conclusions: Our method produces a better approximation for ACS_k and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
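For readers unfamiliar with the measure, the sketch below computes ACS_k by brute force in Python. It is neither the heuristic from this abstract nor the O(n log^k n) exact method. It assumes the standard definition, which the abstract does not restate: for each position i of X, take the length of the longest substring of X starting at i that matches some substring of Y with at most k mismatches, then average these lengths over all positions of X. The function names are hypothetical, and X is assumed to be non-empty.

```python
def longest_k_mismatch_match(x: str, i: int, y: str, k: int) -> int:
    """Longest extension from x[i] that matches somewhere in y with at most k mismatches."""
    best = 0
    for j in range(len(y)):
        mismatches = 0
        length = 0
        # Extend the alignment of x[i:] against y[j:] until the (k+1)-th mismatch.
        while i + length < len(x) and j + length < len(y):
            if x[i + length] != y[j + length]:
                if mismatches == k:
                    break
                mismatches += 1
            length += 1
        best = max(best, length)
    return best


def acs_k(x: str, y: str, k: int) -> float:
    """Average, over positions of x, of the longest k-mismatch match length in y.

    Assumes x is non-empty. Brute force, cubic time in the worst case.
    """
    total = sum(longest_k_mismatch_match(x, i, y, k) for i in range(len(x)))
    return total / len(x)


if __name__ == "__main__":
    print(acs_k("abc", "axc", 0))
    print(acs_k("abc", "axc", 1))
```

The measure is not symmetric in X and Y, which is why distance formulas built on it typically combine both directions.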
Allen, Daniel R., Thankachan, Sharma V., and Xu, Bojian. An Ultra-Fast and Parallelizable Algorithm for Finding $k$-Mismatch Shortest Unique Substrings. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Retrieved from https://par.nsf.gov/biblio/10286103. doi:10.1109/TCBB.2020.2968531.