
Title: Efficient Data Structures for Range Shortest Unique Substring Queries
Let T[1,n] be a string of length n and T[i,j] be the substring of T starting at position i and ending at position j. A substring T[i,j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α,β], return a shortest substring T[i,j] of T with exactly one occurrence in [α,β]. We present an O(n log n)-word data structure with O(log_w n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al., ICALP 2018]. Additionally, we present an O(n)-word data structure with O(√n log^ϵ n) query time, where ϵ > 0 is an arbitrarily small constant. The latter data structure relies heavily on another geometric data structure [Nekrich and Navarro, SWAT 2012].
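As a concrete reference for the query semantics, the sketch below answers a single query by brute force. It assumes 0-indexed positions and that an occurrence of a substring counts only when it lies entirely inside [α,β]; it is a naive baseline illustrating the problem definition, not the paper's data structure, and the function name is ours.

```python
def rsus_query_bruteforce(T: str, alpha: int, beta: int):
    """Answer one Range Shortest Unique Substring query by brute force.

    T is 0-indexed; the query range [alpha, beta] is inclusive.
    An occurrence is counted only if it lies entirely inside [alpha, beta]
    (an assumption about the occurrence model).
    Returns (i, j) such that T[i:j+1] is a shortest substring with exactly
    one occurrence in the range, or None if no such substring exists.
    """
    for length in range(1, beta - alpha + 2):        # try lengths shortest first
        for i in range(alpha, beta - length + 2):    # candidate start positions
            pattern = T[i:i + length]
            # count occurrences of `pattern` fully contained in [alpha, beta]
            occ = sum(1 for p in range(alpha, beta - length + 2)
                      if T[p:p + length] == pattern)
            if occ == 1:
                return (i, i + length - 1)
    return None


if __name__ == "__main__":
    T = "abab"
    # prints (1, 2): "ba" occurs exactly once in [0, 3],
    # while every single character occurs twice there.
    print(rsus_query_bruteforce(T, 0, 3))
```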
Authors:
Award ID(s):
1704552
Publication Date:
NSF-PAR ID:
10286663
Journal Name:
Algorithms
Volume:
13
Issue:
11
Page Range or eLocation-ID:
276
ISSN:
1999-4893
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper revisits the k-mismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity at O(kn), where n is the string length. When k > 0, which is the hard case, our new proposal significantly improves the any-case O(n^2) time complexity of the prior best method for k-mismatch shortest unique substring finding. An experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the trade-off of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200 MB for any k, while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud. It is expected that this new efficient and practical algorithm for k-mismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology. We also give a theoretical bound that the k-mismatch shortest unique substring finding problem can be solved using O(n log^k n) time and O(n) space, asymptotically much better than the one we implemented, serving as a new discovery of interest.
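To make the k-mismatch notion of uniqueness concrete, here is a small sketch in which a substring is treated as k-mismatch unique if every other equal-length substring differs from it in more than k positions. This brute force only illustrates the definition; the cited algorithm (and its per-position variant) is far more involved, and all function names here are ours.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def is_k_mismatch_unique(T: str, i: int, j: int, k: int) -> bool:
    """True if T[i:j+1] has Hamming distance > k to every other
    equal-length substring of T (0-indexed, inclusive j)."""
    length = j - i + 1
    sub = T[i:j + 1]
    for p in range(len(T) - length + 1):
        if p != i and hamming(sub, T[p:p + length]) <= k:
            return False
    return True


def shortest_k_mismatch_unique(T: str, k: int):
    """Return (i, j) of a shortest k-mismatch unique substring, or None.
    Roughly O(n^3) time; purely a reference for the definition."""
    n = len(T)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            if is_k_mismatch_unique(T, i, i + length - 1, k):
                return (i, i + length - 1)
    return None
```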
  2. We present a general framework of designing efficient dynamic approximate algorithms for optimization on undirected graphs. In particular, we develop a technique that, given any problem that admits a certain notion of vertex sparsifiers, gives data structures that maintain approximate solutions in sub-linear update and query time. We illustrate the applicability of our paradigm to the following problems. (1) A fully-dynamic algorithm that approximates all-pair maximum-flows/minimum-cuts up to a nearly logarithmic factor in $\tilde{O}(n^{2/3})$ amortized time against an oblivious adversary, and $\tilde{O}(m^{3/4})$ time against an adaptive adversary. (2) An incremental data structure that maintains $O(1)$-approximate shortest path in $n^{o(1)}$ time per operation, as well as fully dynamic approximate all-pair shortest path and transshipment in $\tilde{O}(n^{2/3+o(1)})$ amortized time per operation. (3) A fully-dynamic algorithm that approximates all-pair effective resistance up to a $(1+\epsilon)$ factor in $\tilde{O}(n^{2/3+o(1)} \epsilon^{-O(1)})$ amortized update time per operation. The key tool behind result (1) is the dynamic maintenance of an algorithmic construction due to Madry [FOCS '10], which partitions a graph into a collection of simpler graph structures (known as j-trees) and approximately captures the cut-flow and metric structure of the graph. The $O(1)$-approximation guarantee of (2) is obtained by adapting the distance oracles of [Thorup-Zwick, JACM '05]. Result (3) is obtained by invoking the random-walk based spectral vertex sparsifier of [Durfee et al., STOC '19] in a hierarchical manner, while carefully keeping track of the recourse among levels in the hierarchy.
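To make the update/query interface behind these bounds concrete, the following naive, exact baseline simply recomputes a BFS on every distance query. It is not the paper's technique, only a contrast to the sublinear approximate guarantees above, and all class and method names are ours.

```python
from collections import defaultdict, deque


class NaiveDynamicDistances:
    """Exact fully-dynamic shortest-path baseline on an unweighted graph.

    Edge updates are O(1); each distance query runs a fresh BFS in O(n + m)
    time. This only illustrates the interface; the paper's goal is to answer
    approximate queries in sublinear time instead.
    """

    def __init__(self):
        self.adj = defaultdict(set)

    def insert_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def delete_edge(self, u, v):
        self.adj[u].discard(v)
        self.adj[v].discard(u)

    def distance(self, s, t):
        """BFS from s; returns hop distance to t, or None if disconnected."""
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if u == t:
                return dist[u]
            for v in self.adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return None
```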
  3. Let f be a drawing in the Euclidean plane of a graph G, which is understood to be a 1-dimensional simplicial complex. We assume that every edge of G is drawn by f as a curve of constant algebraic complexity, and the ratio of the length of the longest simple path to the length of the shortest edge is poly(n). In the drawing f, a path P of G, or its image in the drawing π = f(P), is β-stretch if π is a simple (non-self-intersecting) curve, and for every pair of distinct points p ∈ P and q ∈ P, the length of the sub-curve of π connecting f(p) with f(q) is at most β∥f(p) − f(q)∥, where ∥.∥ denotes the Euclidean distance. We introduce and study the β-stretch Path Problem (βSP for short), in which we are given a pair of vertices s and t of G, and we are to decide whether in the given drawing of G there exists a β-stretch path P connecting s and t. We also output P if it exists. The βSP quantifies a notion of “near straightness” for paths in a graph G, motivated by gerrymandering regions in a map, where edges of G represent natural geographical/political boundaries that may be chosen to bound election districts. The notion of a β-stretch path naturally extends to cycles, and the extension gives a measure of how gerrymandered a district is. Furthermore, we show that the extension is closely related to several studied measures of local fatness of geometric shapes. We prove that βSP is strongly NP-complete. We complement this result by giving a quasi-polynomial time algorithm that, for a given ε > 0, β ∈ O(poly(log |V (G)|)), and s, t ∈ V (G), outputs a β-stretch path between s and t, if a (1 − ε)β-stretch path between s and t exists in the drawing.
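The β-stretch condition itself is easy to state in code. The sketch below checks it for a drawn path given as a polyline, sampling only the polyline's vertices rather than all pairs of points on the curve, which is a simplification of the definition; it illustrates the condition only and is not the paper's algorithm, and the function name is ours.

```python
import math
from itertools import combinations


def is_beta_stretch_at_vertices(polyline, beta):
    """Check the beta-stretch condition of a drawn path at its vertices only:
    for every pair of vertices p, q, the arc length of the curve between them
    must be at most beta * ||p - q||.

    `polyline` is a list of (x, y) points giving the drawing of the path.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    # prefix[i] = arc length from polyline[0] to polyline[i]
    prefix = [0.0]
    for a, b in zip(polyline, polyline[1:]):
        prefix.append(prefix[-1] + dist(a, b))

    for i, j in combinations(range(len(polyline)), 2):
        chord = dist(polyline[i], polyline[j])
        arc = prefix[j] - prefix[i]
        if chord == 0.0:              # curve revisits a point: not simple
            return False
        if arc > beta * chord:
            return False
    return True
```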
  4. In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT-related structures can be represented as Wheeler graphs and that Wheeler graphs can be indexed in a way which is space-efficient. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We present the following results:
     - The problem of recognizing whether a given graph G=(V,E) is a Wheeler graph is NP-complete for any edge label alphabet of size sigma >= 2, even when G is a DAG. This holds even on a restricted subset of graphs called d-NFA's for d >= 5. This is in contrast to recent results demonstrating the problem can be solved in polynomial time for d-NFA's where d <= 2. We also show the recognition problem can be solved in linear time for sigma = 1.
     - There exists a 2^{e log sigma + O(n + e)} time exact algorithm, where n = |V| and e = |E|. This algorithm relies on graph isomorphism being computable in strictly sub-exponential time.
     - We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to remove the minimum number of edges in order to obtain a Wheeler graph. We show WGV is APX-hard, even when G is a DAG, implying there exists a constant C >= 1 for which there is no C-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C >= 1, it is NP-hard to find a C-approximation.
     - We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler graph (the dual of WGV). In contrast to WGV, we prove that the WS problem is in APX for sigma = O(1).
     The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial-time solvable, raising the open question of which parameters determine this problem's difficulty.
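The hardness here lies in finding a suitable node ordering; verifying a candidate ordering against the usual Wheeler graph axioms is straightforward, as the following quadratic-time sketch shows (function name and input format are ours).

```python
def is_wheeler_order(nodes, edges):
    """Check whether a given total order on nodes makes (V, E) a Wheeler graph.

    `nodes` is a list giving the proposed order; `edges` is a list of
    (u, v, label) triples. The recognition problem in the abstract asks
    whether *some* such order exists; checking one candidate order, as
    done here, is easy.
    """
    rank = {v: i for i, v in enumerate(nodes)}
    indeg = {v: 0 for v in nodes}
    for _, v, _ in edges:
        indeg[v] += 1

    # Axiom 1: nodes with in-degree 0 precede all nodes with incoming edges.
    last_source = max((rank[v] for v in nodes if indeg[v] == 0), default=-1)
    first_sink = min((rank[v] for v in nodes if indeg[v] > 0), default=len(nodes))
    if last_source > first_sink:
        return False

    # Axioms 2 and 3: for edges (u1, v1, a1) and (u2, v2, a2),
    #   a1 < a2             implies v1 < v2, and
    #   a1 == a2 and u1 < u2 implies v1 <= v2.
    for (u1, v1, a1) in edges:
        for (u2, v2, a2) in edges:
            if a1 < a2 and rank[v1] >= rank[v2]:
                return False
            if a1 == a2 and rank[u1] < rank[u2] and rank[v1] > rank[v2]:
                return False
    return True
```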
  5. The 2-Wasserstein distance (or RMS distance) is a useful measure of similarity between probability distributions with exciting applications in machine learning. For discrete distributions, the problem of computing this distance can be expressed in terms of finding a minimum-cost perfect matching on a complete bipartite graph given by two multisets of points A, B ⊂ ℝ², with |A| = |B| = n, where the ground distance between any two points is the squared Euclidean distance between them. Although there is a near-linear time relative ∊-approximation algorithm for the case where the ground distance is Euclidean (Sharathkumar and Agarwal, JACM 2020), all existing relative ∊-approximation algorithms for the RMS distance take Ω(n^{3/2}) time. This is primarily because, unlike Euclidean distance, squared Euclidean distance is not a metric. In this paper, for the RMS distance, we present a new ∊-approximation algorithm that runs in O(n^{5/4} poly(log n, 1/∊)) time. Our algorithm is inspired by a recent approach for finding a minimum-cost perfect matching in bipartite planar graphs (Asathulla et al., TALG 2020). Their algorithm depends heavily on the existence of sublinear-sized vertex separators as well as shortest path data structures that require planarity. Surprisingly, we are able to design a similar algorithm for a complete geometric graph that is far from planar and does not have any vertex separators. Central components of our algorithm include a quadtree-based distance that approximates the squared Euclidean distance and a data structure that supports both Hungarian search and augmentation in sublinear time.
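For small inputs, the quantity being approximated can be computed exactly with a cubic-time assignment solver, which makes the objective concrete. The sketch below does this with SciPy, taking the RMS value to be the square root of the average matched squared distance (one common normalization, assumed here); it is a baseline for the definition, not the paper's algorithm, and the function name is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def rms_distance_exact(A, B):
    """Exact 2-Wasserstein (RMS) distance between two equal-size point
    multisets in the plane, via the Hungarian-style assignment solver.

    A and B are arrays of shape (n, 2). The cost of matching a to b is the
    squared Euclidean distance ||a - b||^2; the returned value is the square
    root of the average matched cost. Runs in O(n^3) time.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n = len(A)
    # cost[i, j] = squared Euclidean distance between A[i] and B[j]
    diff = A[:, None, :] - B[None, :, :]
    cost = np.einsum("ijk,ijk->ij", diff, diff)
    rows, cols = linear_sum_assignment(cost)
    return float(np.sqrt(cost[rows, cols].sum() / n))
```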