Title: Fully Functional Parameterized Suffix Trees in Compact Space
Two equal-length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC '93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(n log n) bits of space. It can report any entry in the (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i, i+|P|-1] and P are a p-match. The PST can count the number of occurrences of P in T in O(|P| log σ) time and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is whether we can obtain a compressed version of the PST that takes space close to the text's size of n log σ bits and still supports all three functionalities mentioned above. In SODA '17, Ganguly et al. answered this question partially by presenting an O(n log σ)-bit index that supports (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(n log n) bits (when log σ = o(log n)), as follows. Here ε > 0 is an arbitrarily small constant.
- Space O(n log σ) bits and query time O(log^ε_σ n);
- Space O(n log σ log log_σ n) bits and query time O(log log_σ n); and
- Space O(n log σ log^ε_σ n) bits and query time O(1).
The first trade-off improves upon Ganguly et al.'s result, whereas the third matches the optimal time performance of Baker's PST while reducing the space by a factor of roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching, and further improvement is highly unlikely.
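As a concrete illustration of the p-match relation above, the following is a minimal Python sketch (ours, not part of the paper) that tests whether two strings are a parameterized match by checking that the induced symbol correspondence is one-to-one; the name is_pmatch is our own.

```python
def is_pmatch(s: str, t: str) -> bool:
    """Return True iff s and t are a parameterized match, i.e. some
    one-to-one renaming of symbols maps s onto t."""
    if len(s) != len(t):
        return False
    fwd, bwd = {}, {}  # candidate renaming and its inverse
    for a, b in zip(s, t):
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False  # a symbol would need two images, or two preimages
    return True

# Example: "aab" p-matches "xxy" (a -> x, b -> y) but not "xyz".
assert is_pmatch("aab", "xxy")
assert not is_pmatch("aab", "xyz")
```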
Award ID(s):
2137057
PAR ID:
10406160
Author(s) / Creator(s):
Editor(s):
Mikolaj Bojanczyk; Emanuela Merelli; David P. Woodruff
Date Published:
Journal Name:
49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)
Volume:
229
Page Range / eLocation ID:
65:1-65:18
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Beyersdorff, Olaf; Pilipczuk, Michał; Pimentel, Elaine; Thắng, Nguyễn Kim (Ed.)
    For a length-n text over an alphabet of size σ, we can encode the suffix tree data structure in 𝒪(n log σ) bits of space. It supports suffix array (SA), inverse suffix array (ISA), and longest common extension (LCE) queries in 𝒪(log^ε_σ n) time, which enables efficient pattern matching; here ε > 0 is an arbitrarily small constant. Further improvements are possible for LCE queries, where 𝒪(1)-time queries can be achieved using an index of 𝒪(n log σ) bits. However, compactly indexing a two-dimensional text (i.e., an n × n matrix) has been a major open problem. We show progress in this direction by first presenting an 𝒪(n² log σ)-bit structure supporting LCE queries in near-𝒪((log_σ n)^{2/3}) time. We then present an 𝒪(n² log σ + n² log log n)-bit structure supporting ISA queries in near-𝒪(log n ⋅ (log_σ n)^{2/3}) time. Within similar space, achieving SA queries in poly-logarithmic (even strongly sub-linear) time is a significant challenge. However, our 𝒪(n² log σ + n² log log n)-bit structure can support SA queries in 𝒪(n²/(σ log n)^c) time, where c is an arbitrarily large constant, which enables pattern matching faster than what is possible without preprocessing. We then design a repetition-aware data structure. The δ_2D compressibility measure for two-dimensional texts was recently introduced by Carfagna and Manzini [SPIRE 2023]. The measure ranges from 1 to n², with smaller δ_2D indicating a more compressible two-dimensional text. The existing data structure based on δ_2D supports only element access. We obtain the first structure based on δ_2D for LCE queries. It takes 𝒪^{~}(n^{5/3} + n^{8/5} δ_2D^{1/5}) space and answers queries in 𝒪(log n) time.
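For reference, the LCE query mentioned above can be stated on a one-dimensional text as in the following naive Python sketch of our own; the indexes in the abstract answer the same query in compact space without scanning.

```python
def lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]
    (0-indexed). A naive linear scan; compact indexes answer this
    query without scanning."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k

# Example: in "abracadabra", the suffixes starting at positions 0 and 7
# share the prefix "abra" of length 4.
assert lce("abracadabra", 0, 7) == 4
```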
  2. Censor-Hillel, Keren; Grandoni, Fabrizio; Ouaknine, Joel; Puppis, Gabriele (Ed.)
    We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters, and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that match P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (c log n)^k / k!) and query time O(m + 2^h log log n + occ), where c > 1 is a constant and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) is O(δ log(n/δ) ⋅ c^k (1 + log^k(δ log n)/k!)) and its query time is O((m + 2^h + occ) log n). The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log(n/δ)) represents the optimal space required to encode the data in terms of n and δ, which indicates how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.'s index using O(n + δ log(n/δ) ⋅ (c log δ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).
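To make the wildcard query semantics concrete, here is a minimal brute-force Python sketch of our own; the '?' wildcard symbol and the function name wildcard_occurrences are illustrative choices, not part of the index above.

```python
def wildcard_occurrences(text: str, pattern: str, wildcard: str = "?") -> list[int]:
    """Starting positions i such that text[i:i+len(pattern)] matches pattern,
    where each wildcard position matches any single character.
    Naive O(n*m) scan; the indexes above answer this after preprocessing."""
    n, m = len(text), len(pattern)
    return [
        i
        for i in range(n - m + 1)
        if all(pattern[k] == wildcard or text[i + k] == pattern[k] for k in range(m))
    ]

# Example: "a?a" occurs in "abacada" at positions 0, 2, and 4.
assert wildcard_occurrences("abacada", "a?a") == [0, 2, 4]
```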
  3. The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T_1, T_2, …, T_d} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n) space and answers queries in O(p/B + log_B n + k/B + log*(n/B)) I/Os, where B is the block size. The second one takes O(n log*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
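As a baseline for the problem statement, the following Python sketch of our own answers a top-k query by brute force, using occurrence count as an assumed relevance measure (the abstract leaves the relevance function abstract).

```python
def top_k_docs(docs: list[str], pattern: str, k: int) -> list[int]:
    """Identifiers (indices) of the k documents most relevant to pattern,
    scored here by occurrence count and reported in decreasing order of
    relevance. Brute force; the structures above avoid scanning every
    document at query time."""
    def count_occ(doc: str) -> int:
        return sum(1 for i in range(len(doc) - len(pattern) + 1)
                   if doc[i:i + len(pattern)] == pattern)

    scores = sorted(((count_occ(d), i) for i, d in enumerate(docs)),
                    key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]

# Example: "ana" occurs twice in document 1, once in document 0, never in 2.
assert top_k_docs(["xxanaxx", "anaxana", "zzz"], "ana", 2) == [1, 0]
```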
  4. We revisit the following version of the Gapped String Indexing problem, where the goal is to preprocess a text T[1..n] to enable efficient reporting of all occ occurrences of a gapped pattern P = P_1 [α..β] P_2 in T. An occurrence of P in T is defined as a pair (i, j) where the substrings T[i..i+|P_1|) and T[j..j+|P_2|) match P_1 and P_2, respectively, with a gap j − (i + |P_1|) lying within the interval [α..β]. This problem has significant applications in computational biology and text mining. A hardness result on this problem suggests that any index with polylogarithmic query time must occupy near-quadratic space. In a recent study [STACS 2024], Bille et al. presented a sub-quadratic space index using O˜(n^{2−δ/3}) space, where 0 ≤ δ ≤ 1 is a parameter fixed at the time of index construction. Its query time is O˜(|P_1| + |P_2| + n^δ · (1 + occ)), which is sub-linear per occurrence when δ < 1. We show how to achieve a gap-sensitive query time of O˜(|P_1| + |P_2| + n^δ · (1 + occ^{1−δ}) + ∑_{g ∈ [α..β]} occ_g · g^δ) using the same space, where occ_g denotes the number of occurrences with gap g. This is faster when there are many occurrences with small gaps.
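To spell out the occurrence definition (pairs (i, j) with a gap in [α..β]), here is a minimal brute-force Python sketch of our own, not the index described above.

```python
def gapped_occurrences(text: str, p1: str, p2: str, alpha: int, beta: int):
    """All pairs (i, j) such that text[i:i+len(p1)] == p1,
    text[j:j+len(p2)] == p2, and the gap j - (i + len(p1)) lies in
    [alpha, beta]. Brute force over all match positions."""
    starts1 = [i for i in range(len(text) - len(p1) + 1) if text.startswith(p1, i)]
    starts2 = [j for j in range(len(text) - len(p2) + 1) if text.startswith(p2, j)]
    return [(i, j) for i in starts1 for j in starts2
            if alpha <= j - (i + len(p1)) <= beta]

# Example: in "abXXcdXcd", "ab" and "cd" occur with gaps 2 and 5;
# only the gap-2 occurrence falls within [1..3].
assert gapped_occurrences("abXXcdXcd", "ab", "cd", 1, 3) == [(0, 4)]
```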
  5. For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable tradeoff curve between time and space. State-of-the-art hash tables offer the following guarantee: if keys/values are Θ(log n) bits each, then it is possible to achieve constant-time insertions/deletions/queries while wasting only O(log log n) bits of space per key when compared to the information-theoretic optimum. Even prior to this bound being achieved, the target of O(log log n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamically-resized filters). This paper shows that O(log log n) wasted bits per key is not the end of the line for hashing. In fact, for any k ∈ [log* n], it is possible to achieve O(k)-time insertions/deletions, O(1)-time queries, and O(log^{(k)} n) = O(log log ⋯ log n) (with k iterated logarithms) wasted bits per key (all with high probability in n). This means that, each time we increase insertion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this tradeoff curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constant-time hash tables succinct. Our results hold not just for fixed-capacity hash tables, but also for hash tables that are dynamically resized (this is a fundamental departure from what is possible for filters); and for hash tables that store very large keys/values, each of which can be up to n^{o(1)} bits (this breaks with the conventional wisdom that larger keys/values should lead to more wasted bits per key). For very small keys/values, we are able to tighten our bounds to o(1) wasted bits per key, even when k = O(1). Building on this, we obtain a constant-time dynamic filter that uses n⌈log ε^{-1}⌉ + n log e + o(n) bits of space for a wide choice of
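To illustrate how quickly the O(log^{(k)} n) bound above shrinks with k, here is a small, purely illustrative Python sketch of our own computing the k-times iterated logarithm.

```python
import math

def iterated_log(n: float, k: int) -> float:
    """log^(k) n: apply the base-2 logarithm k times."""
    for _ in range(k):
        n = math.log2(n)
    return n

# For n = 2**64, each extra additive constant in update time shrinks the
# wasted bits per key roughly as 64 -> 6 -> 2.58 -> 1.37 -> ...
print([round(iterated_log(2**64, k), 2) for k in range(1, 5)])
```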