 Award ID(s):
 2317838
 NSFPAR ID:
 10485540
 Publisher / Repository:
ACM-SIAM
 Date Published:
 Journal Name:
Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Mikolaj Bojanczyk ; Emanuela Merelli ; David P. Woodruff (Ed.)
Two equal-length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC '93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1, n] over an alphabet Σ of size σ takes O(n log n) bits of space. It can report any entry in the (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i, i+|P|-1] and P are a p-match. The PST can count the number of occurrences of P in T in time O(|P| log σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is, can we obtain a compressed version of the PST that takes space close to the text's size of n log σ bits and still supports all three functionalities mentioned earlier? In SODA '17, Ganguly et al. answered this question partially by presenting an O(n log σ)-bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(n log n) bits (when log σ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant.
- Space O(n log σ) bits and query time O(log_σ^ε n);
- Space O(n log σ · log log_σ n) bits and query time O(log log_σ n); and
- Space O(n log σ · log_σ^ε n) bits and query time O(1).
The first trade-off is an improvement over Ganguly et al.'s result, whereas our third trade-off matches the optimal time performance of Baker's PST while squeezing the space by a factor of roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching, and further improvement is highly unlikely.
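The p-match relation itself is straightforward to verify for a pair of equal-length strings: greedily build the renaming function and check that it stays one-to-one. The following minimal Python sketch (our illustration of the definition above, not Baker's data structure; the helper name is_pmatch is hypothetical) does exactly that:

```python
def is_pmatch(s: str, t: str) -> bool:
    """Return True iff s and t are a parameterized match, i.e. some
    one-to-one renaming of the symbols of s yields t."""
    if len(s) != len(t):
        return False
    fwd, bwd = {}, {}  # candidate renaming and its inverse
    for a, b in zip(s, t):
        # Each symbol must map consistently (a function) and
        # injectively (no two symbols renamed to the same target).
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return False
    return True

# "aab" p-matches "xxy" (rename a -> x, b -> y), but not "xyz".
assert is_pmatch("aab", "xxy")
assert not is_pmatch("aab", "xyz")
```

The PST indexes the text so that such occurrences of a pattern can be counted and located without rescanning T for every query.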

For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable trade-off curve between time and space. State-of-the-art hash tables offer the following guarantee: if keys/values are Θ(log n) bits each, then it is possible to achieve constant-time insertions/deletions/queries while wasting only O(log log n) bits of space per key when compared to the information-theoretic optimum. Even prior to this bound being achieved, the target of O(log log n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamically-resized filters). This paper shows that O(log log n) wasted bits per key is not the end of the line for hashing. In fact, for any k ∈ [log* n], it is possible to achieve O(k)-time insertions/deletions, O(1)-time queries, and O(log^{(k)} n) = O(log log ··· log n) wasted bits per key, where the logarithm is iterated k times (all with high probability in n). This means that, each time we increase insertion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this trade-off curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constant-time hash tables succinct. Our results hold not just for fixed-capacity hash tables, but also for hash tables that are dynamically resized (this is a fundamental departure from what is possible for filters); and for hash tables that store very large keys/values, each of which can be up to n^{o(1)} bits (this breaks with the conventional wisdom that larger keys/values should lead to more wasted bits per key). For very small keys/values, we are able to tighten our bounds to o(1) wasted bits per key, even when k = O(1). Building on this, we obtain a constant-time dynamic filter that uses n log ε^{-1} + n log e + o(n) bits of space for a wide choice of false-positive rates ε.
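To see how quickly the O(log^{(k)} n) bound collapses, here is a small Python sketch (ours, for illustration; iterated_log is a hypothetical helper, and the base-2 logarithm stands in for the asymptotic log) that evaluates the k-times-iterated logarithm:

```python
import math

def iterated_log(n: float, k: int) -> float:
    """Evaluate log^(k) n: the base-2 logarithm applied k times."""
    for _ in range(k):
        n = math.log2(n)
    return n

# Even for n = 2**32, a few iterations flatten the bound:
for k in range(1, 5):
    print(k, iterated_log(2**32, k))  # 32.0, 5.0, ~2.32, ~1.22
```

Each extra iteration is an exponential reduction, which is why an additive constant in insertion/deletion time buys an exponential drop in wasted bits per key.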


Many problems on data streams have been studied at two extremes of difficulty: either allowing randomized algorithms, in the static setting (where they should err with bounded probability on the worst-case stream); or when only deterministic and infallible algorithms are required. Some recent works have considered the adversarial setting, in which a randomized streaming algorithm must succeed even on data streams provided by an adaptive adversary that can see the intermediate outputs of the algorithm. In order to better understand the differences between these models, we study a streaming task called "Missing Item Finding." In this problem, for r < n, one is given a data stream a_1, ..., a_r of elements in [n] (possibly with repetitions), and must output some x ∈ [n] which does not equal any of the a_i. We prove that, for r = n^{Θ(1)} and δ = 1/poly(n), the space required for randomized algorithms that solve this problem in the static setting with error δ is Θ(polylog(n)); for algorithms in the adversarial setting with error δ, Θ((1 + r^2/n) polylog(n)); and for deterministic algorithms, Θ(r/polylog(n)). Because our adversarially robust algorithm relies on free access to a string of O(r log n) random bits, we investigate a "random start" model of streaming algorithms where all random bits used are included in the space cost. Here we find a conditional lower bound on the space usage, which depends on the space that would be needed for a pseudo-deterministic algorithm to solve the problem. We also prove an Ω(r/polylog(n)) lower bound for the space needed by a streaming algorithm with error < 1/2^{polylog(n)} against "white-box" adversaries that can see the internal state of the algorithm, but not predict its future random decisions.
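For a concrete baseline, the naive deterministic solution to Missing Item Finding records every element it sees in a bitmap of Θ(n) bits and then outputs any unseen value; the bounds above show how much less space suffices, especially with randomness. A minimal Python sketch of this baseline (ours, for illustration only; the function name is hypothetical):

```python
def missing_item_finding(stream, n: int) -> int:
    """Naive deterministic baseline: mark each seen element of [n]
    in a Theta(n)-bit table, then return any unmarked value.
    Assumes the stream has fewer than n elements."""
    seen = [False] * n  # index i-1 marks whether element i appeared
    for a in stream:
        seen[a - 1] = True
    return next(i + 1 for i, bit in enumerate(seen) if not bit)

# Example: n = 5, stream (2, 2, 4, 1); both 3 and 5 are valid answers.
print(missing_item_finding([2, 2, 4, 1], n=5))  # -> 3
```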

Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different tradeoffs have been proposed.
This article introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to 1 - o(1) while being stable, meaning that the position where an element is stored only ever changes when resizes occur. In fact, in the setting where keys are Θ(log n) bits, the space guarantee that Iceberg hashing offers, namely that it uses at most \(\log \binom{U}{n} + O(n \log \log n)\) bits to store n items from a universe U, matches a lower bound by Demaine et al. that applies to any stable hash table.
Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high-probability guarantees, can be applied to any hash table that makes use of the front-yard/backyard paradigm for hash-table design.
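For intuition about that lower bound, \(\log \binom{U}{n}\) is the information-theoretic minimum number of bits needed to record which n keys from a universe of size U are present; whatever a table uses beyond it (here, O(n log log n) bits) is the "waste." A small Python sketch (ours; min_bits is a hypothetical helper) evaluates the leading term via lgamma:

```python
import math

def min_bits(U: int, n: int) -> float:
    """log2 C(U, n): the information-theoretic minimum number of
    bits needed to store an n-element subset of a size-U universe."""
    return (math.lgamma(U + 1) - math.lgamma(n + 1)
            - math.lgamma(U - n + 1)) / math.log(2)

# Hypothetical parameters: 2**20 keys drawn from a 32-bit universe.
U, n = 2**32, 2**20
print(min_bits(U, n) / n)  # ~13.4 bits per key, versus 32 bits naively
```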