We extend recent results regarding finding shortest unique substrings (SUSs) to obtain new timespace tradeoffs for this problem and the generalization of finding kmismatch SUSs. Our new results include the first algorithm for finding a kmismatch SUS in sublinear space, which we obtain by extending an algorithm by Senanayaka (2019) and combining it with a result on sketching by Gawrychowski and Starikovskaya (2019). We first describe how, given a text T of length n and m words of workspace, with high probability we can find an SUS of length L in O(n(L/m)logL) time using random access to T, or inmore »
An UltraFast and Parallelizable Algorithm for Finding $k$Mismatch Shortest Unique Substrings
This paper revisits the kmismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the kmismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity at O(kn), where n is the string length. When , which is the hard case, our new proposal significantly improves the anycase O(n^2) time complexity of the prior best method for kmismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than more »
 Publication Date:
 NSFPAR ID:
 10286103
 Journal Name:
 IEEE/ACM Transactions on Computational Biology and Bioinformatics
 Page Range or eLocationID:
 1 to 1
 ISSN:
 15455963
 Sponsoring Org:
 National Science Foundation
More Like this


Memoryhard functions (MHFs) are a key cryptographic primitive underlying the design of moderately expensive password hashing algorithms and egalitarian proofs of work. Over the past few years several increasingly stringent goals for an MHF have been proposed including the requirement that the MHF have high sequential spacetime (ST) complexity, parallel spacetime complexity, amortized areatime (aAT) complexity and sustained space complexity. DataIndependent Memory Hard Functions (iMHFs) are of special interest in the context of password hashing as they naturally resist sidechannel attacks. iMHFs can be specified using a directed acyclic graph (DAG) $G$ with $N=2^n$ nodes and low indegree and themore »

Abstract Background Alignmentfree methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS , and its k mismatch counterpart, ACS k , have been shown to produce results as effective as multiplesequence alignment based methods for reconstruction of phylogeny trees. Since computing ACS k takes O ( n log k n ) time and hence impractical for large datasets, multiple heuristics that can approximate ACS k have been introduced. Results In this paper, we present a novel lineartime heuristic tomore »

Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which wemore »

Consider an algorithm performing a computation on a huge random object (for example a random graph or a "long" random walk). Is it necessary to generate the entire object prior to the computation, or is it possible to provide query access to the object and sample it incrementally "onthefly" (as requested by the algorithm)? Such an implementation should emulate the random object by answering queries in a manner consistent with an instance of the random object sampled from the true distribution (or close to it). This paradigm is useful when the algorithm is sublinear and thus, sampling the entire objectmore »