The ranked (or top
Fully Functional Parameterized Suffix Trees in Compact Space
Two equal length strings are a parameterized match (pmatch) iff there exists a onetoone function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC' 93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(nlog n) bits of space. It can report any entry in (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i,i+P1] and P are a pmatch. The PST can count the number of occurrences of P in T in time O(Plog σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is, can we obtain a compressed version of PST that takes space close to the text’s size of nlogσ bits and still support all three functionalities mentioned earlier? In SODA' 17, Ganguly et al. answered this question partially by presenting an O(nlogσ) bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(nlog n) bits (when logσ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant.
 Space O(n logσ) bits and query time O(log_σ^ε n);
 Space O(n logσlog log_σ n) bits and query time O(log log_σ n); and
 Space O(n logσ log^ε_σ n) bits and query time O(1).
The first tradeoff is an improvement over Ganguly et al.’s result, whereas our third tradeoff matches the optimal time performance of Baker’s PST while squeezing the space by a factor roughly log_σ n. We highlight that our tradeoffs match the spaceandtime bounds of the bestknown compressed text indexes for exact pattern matching and further improvement is highly unlikely.
more »
« less
 Award ID(s):
 2137057
 NSFPAR ID:
 10406160
 Editor(s):
 Mikolaj Bojanczyk; Emanuela Merelli; David P. Woodruff
 Date Published:
 Journal Name:
 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)
 Volume:
 229
 Page Range / eLocation ID:
 65:165:18
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

k ) document retrieval problem is defined as follows: preprocess a collection{T_{1},T_{2},… ,T_{d}} ofd strings (called documents) of total lengthn into a data structure, such that for any given query(P,k) , whereP is a string (called pattern) of lengthp ≥ 1 andk ∈ [1,d] is an integer, the identifiers of thosek documents that are most relevant toP can be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented anO(n) space (in words) data structure withO(p+k logk) query time. The query time was later improved toO(p+k) [SODA 2012] and further toO(p/ log_{σn+k}) [SIAM Journal on Computing 2017] by Navarro and Nekrich, whereσ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takesO(n) space and answer queries inO(p/B + log_{B}n + k/B+ log^{*}(n/B) ) I/Os, whereB is the block size. The second one takesO(n log^{*}(n/B) ) space and answer queries in optimalO(p/B + log_{B}n + k/B) I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted topk document retrieval, we present anO(n log(d/B)) space data structure with optimal query cost. 
In recent years several compressed indexes based on variants of the BurrowsWheeler transformation have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FMindex [Ferragina and Manzini, J. ACM 2005]. As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs and that Wheeler graphs can be indexed in a way which is spaceefficient. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We present  The problem of recognizing whether a given graph G=(V,E) is a Wheeler graph is NPcomplete for any edge label alphabet of size sigma >= 2, even when G is a DAG. This holds even on a restricted, subset of graphs called dNFA's for d >= 5. This is in contrast to recent results demonstrating the problem can be solved in polynomial time for dNFA's where d <= 2. We also show the recognition problem can be solved in linear time for sigma =1;  There exists an 2^{e log sigma + O(n + e)} time exact algorithm where n = V and e = E. This algorithm relies on graph isomorphism being computable in strictly subexponential time;  We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to remove the minimum number of edges in order to obtain a Wheeler graph. We show WGV is APXhard, even when G is a DAG, implying there exists a constant C >= 1 for which there is no Capproximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C >= 1, it is NPhard to find a Capproximation;  We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler Graph (the dual of the WGV). In contrast to WGV, we prove that the WS problem is in APX for sigma=O(1); The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomialtime solvable, raising the open question of which parameters determine this problem's difficulty.more » « less

Given a metric space ℳ = (X,δ), a weighted graph G over X is a metric tspanner of ℳ if for every u,v ∈ X, δ(u,v) ≤ δ_G(u,v) ≤ t⋅ δ(u,v), where δ_G is the shortest path metric in G. In this paper, we construct spanners for finite sets in metric spaces in the online setting. Here, we are given a sequence of points (s₁, …, s_n), where the points are presented one at a time (i.e., after i steps, we have seen S_i = {s₁, … , s_i}). The algorithm is allowed to add edges to the spanner when a new point arrives, however, it is not allowed to remove any edge from the spanner. The goal is to maintain a tspanner G_i for S_i for all i, while minimizing the number of edges, and their total weight. Under the L₂norm in ℝ^d for arbitrary constant d ∈ ℕ, we present an online (1+ε)spanner algorithm with competitive ratio O_d(ε^{d} log n), improving the previous bound of O_d(ε^{(d+1)}log n). Moreover, the spanner maintained by the algorithm has O_d(ε^{1d}log ε^{1})⋅ n edges, almost matching the (offline) optimal bound of O_d(ε^{1d})⋅ n. In the plane, a tighter analysis of the same algorithm provides an almost quadratic improvement of the competitive ratio to O(ε^{3/2}logε^{1}log n), by comparing the online spanner with an instanceoptimal spanner directly, bypassing the comparison to an MST (i.e., lightness). As a counterpart, we design a sequence of points that yields a Ω_d(ε^{d}) lower bound for the competitive ratio for online (1+ε)spanner algorithms in ℝ^d under the L₁norm. Then we turn our attention to online spanners in general metrics. Note that, it is not possible to obtain a spanner with stretch less than 3 with a subquadratic number of edges, even in the offline setting, for general metrics. We analyze an online version of the celebrated greedy spanner algorithm, dubbed ordered greedy. With stretch factor t = (2k1)(1+ε) for k ≥ 2 and ε ∈ (0,1), we show that it maintains a spanner with O(ε^{1}logε^{1})⋅ n^{1+1/k} edges and O(ε^{1}n^{1/k}log² n) lightness for a sequence of n points in a metric space. We show that these bounds cannot be significantly improved, by introducing an instance that achieves an Ω(1/k⋅ n^{1/k}) competitive ratio on both sparsity and lightness. Furthermore, we establish the tradeoff among stretch, number of edges and lightness for points in ultrametrics, showing that one can maintain a (2+ε)spanner for ultrametrics with O(ε^{1}logε^{1})⋅ n edges and O(ε^{2}) lightness.more » « less

Given a metric space ℳ = (X,δ), a weighted graph G over X is a metric tspanner of ℳ if for every u,v ∈ X, δ(u,v) ≤ δ_G(u,v) ≤ t⋅ δ(u,v), where δ_G is the shortest path metric in G. In this paper, we construct spanners for finite sets in metric spaces in the online setting. Here, we are given a sequence of points (s₁, …, s_n), where the points are presented one at a time (i.e., after i steps, we have seen S_i = {s₁, … , s_i}). The algorithm is allowed to add edges to the spanner when a new point arrives, however, it is not allowed to remove any edge from the spanner. The goal is to maintain a tspanner G_i for S_i for all i, while minimizing the number of edges, and their total weight. Under the L₂norm in ℝ^d for arbitrary constant d ∈ ℕ, we present an online (1+ε)spanner algorithm with competitive ratio O_d(ε^{d} log n), improving the previous bound of O_d(ε^{(d+1)}log n). Moreover, the spanner maintained by the algorithm has O_d(ε^{1d}log ε^{1})⋅ n edges, almost matching the (offline) optimal bound of O_d(ε^{1d})⋅ n. In the plane, a tighter analysis of the same algorithm provides an almost quadratic improvement of the competitive ratio to O(ε^{3/2}logε^{1}log n), by comparing the online spanner with an instanceoptimal spanner directly, bypassing the comparison to an MST (i.e., lightness). As a counterpart, we design a sequence of points that yields a Ω_d(ε^{d}) lower bound for the competitive ratio for online (1+ε)spanner algorithms in ℝ^d under the L₁norm. Then we turn our attention to online spanners in general metrics. Note that, it is not possible to obtain a spanner with stretch less than 3 with a subquadratic number of edges, even in the offline setting, for general metrics. We analyze an online version of the celebrated greedy spanner algorithm, dubbed ordered greedy. With stretch factor t = (2k1)(1+ε) for k ≥ 2 and ε ∈ (0,1), we show that it maintains a spanner with O(ε^{1}logε^{1})⋅ n^{1+1/k} edges and O(ε^{1}n^{1/k}log² n) lightness for a sequence of n points in a metric space. We show that these bounds cannot be significantly improved, by introducing an instance that achieves an Ω(1/k⋅ n^{1/k}) competitive ratio on both sparsity and lightness. Furthermore, we establish the tradeoff among stretch, number of edges and lightness for points in ultrametrics, showing that one can maintain a (2+ε)spanner for ultrametrics with O(ε^{1}logε^{1})⋅ n edges and O(ε^{2}) lightness.more » « less

Etessami, Kousha ; Feige, Uriel ; Puppis, Gabriele (Ed.)Many streaming algorithms provide only a highprobability relative approximation. These two relaxations, of allowing approximation and randomization, seem necessary  for many streaming problems, both relaxations must be employed simultaneously, to avoid an exponentially larger (and often trivial) space complexity. A common drawback of these randomized approximate algorithms is that independent executions on the same input have different outputs, that depend on their random coins. Pseudodeterministic algorithms combat this issue, and for every input, they output with high probability the same "canonical" solution. We consider perhaps the most basic problem in data streams, of counting the number of items in a stream of length at most n. Morris’s counter [CACM, 1978] is a randomized approximation algorithm for this problem that uses O(log log n) bits of space, for every fixed approximation factor (greater than 1). Goldwasser, Grossman, Mohanty and Woodruff [ITCS 2020] asked whether pseudodeterministic approximation algorithms can match this space complexity. Our main result answers their question negatively, and shows that such algorithms must use Ω(√{log n / log log n}) bits of space. Our approach is based on a problem that we call Shift Finding, and may be of independent interest. In this problem, one has query access to a shifted version of a known string F ∈ {0,1}^{3n}, which is guaranteed to start with n zeros and end with n ones, and the goal is to find the unknown shift using a small number of queries. We provide for this problem an algorithm that uses O(√n) queries. It remains open whether poly(log n) queries suffice; if true, then our techniques immediately imply a nearlytight Ω(log n/log log n) space bound for pseudodeterministic approximate counting.more » « less