skip to main content


Title: Optimal Document Exchange and New Codes for Small Number of Insertions and Deletions
We give the first communication-optimal document exchange protocol. For any n and k more » « less
Award ID(s):
1814603 1750808 1910588 1618280
NSF-PAR ID:
10121529
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE Symposium on Foundations of Computer Science
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce synchronization strings , which provide a novel way to efficiently deal with synchronization errors , i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to cope with than more commonly considered Hamming-type errors , i.e., symbol substitutions and erasures. For every ε > 0, synchronization strings allow us to index a sequence with an ε -O(1) -size alphabet, such that one can efficiently transform k synchronization errors into (1 + ε)k Hamming-type errors . This powerful new technique has many applications. In this article, we focus on designing insdel codes , i.e., error correcting block codes (ECCs) for insertion-deletion channels. While ECCs for both Hamming-type errors and synchronization errors have been intensely studied, the latter has largely resisted progress. As Mitzenmacher puts it in his 2009 survey [30]: “ Channels with synchronization errors...are simply not adequately understood by current theory. Given the near-complete knowledge, we have for channels with erasures and errors...our lack of understanding about channels with synchronization errors is truly remarkable. ” Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate insdel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regimes, these codes are polynomially far from the optimal rate-distance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago. A straightforward application of our synchronization strings-based indexing method gives a simple black-box construction that transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs into the realm of insdel codes. Most notably, for the complete noise spectrum, we obtain efficient “near-MDS” insdel codes, which get arbitrarily close to the optimal rate-distance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0, we give a family of insdel codes achieving a rate of 1 - δ - ε over a constant-size alphabet that efficiently corrects a δ fraction of insertions or deletions. 
    more » « less
  2. We introduce synchronization strings, which provide a novel way of efficiently dealing with synchronization errors, i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to deal with than more commonly considered half-errors, i.e., symbol corruptions and erasures. For every ε > 0, synchronization strings allow to index a sequence with an ε-O(1) size alphabet such that one can efficiently transform k synchronization errors into (1 + ε)k half-errors. This powerful new technique has many applications. In this paper we focus on designing insdel codes, i.e., error correcting block codes (ECCs) for insertion deletion channels. While ECCs for both half-errors and synchronization errors have been intensely studied, the later has largely resisted progress. As Mitzenmacher puts it in his 2009 survey: "Channels with synchronization errors ... are simply not adequately understood by current theory. Given the near-complete knowledge we have for channels with erasures and errors ... our lack of understanding about channels with synchronization errors is truly remarkable." Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate indel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regime these codes are polynomially far from the optimal rate-distance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago. A straight forward application of our synchronization strings based indexing method gives a simple black-box construction which transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs over large constant alphabets into the realm of insdel codes. Most notably, for the complete noise spectrum we obtain efficient "near-MDS" insdel codes which get arbitrarily close to the optimal rate-distance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0 we give insdel codes achieving a rate of 1 - ξ - ε over a constant size alphabet that efficiently correct a δ´ fraction of insertions or deletions. 
    more » « less
  3. This paper introduces a new data-structural object that we call the tiny pointer. In many applications, traditional log n-bit pointers can be replaced with o(log n)-bit tiny pointers at the cost of only a constant-factor time overhead and a small probability of failure. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixed-size tiny pointers (i.e., settings in which all of the tiny pointers must be the same size) and variable-size tiny pointers (i.e., settings in which the average tiny-pointer size must be small, but some tiny pointers can be larger). If a tiny pointer references an element in an array filled to load factor 1 — δ, then the optimal tiny-pointer size is Θ(log log log n + log δ-1) bits in the fixed-size case, and Θ(log δ-1) expected bits in the variable-size case. Our tiny-pointer constructions also require us to revisit several classic problems having to do with balls and bins; these results may be of independent interest. Using tiny pointers, we revisit five classic data-structure problems. We show that: • A data structure storing n v-bit values for n keys with constant-time modifications/queries can be implemented to take space nv + O(n log(r) n) bits, for any constant r > 0, as long as the user stores a tiny pointer of expected size O(1) with each key—here, log(r) n is the r-th iterated logarithm. • Any binary search tree can be made succinct with constant-factor time overhead, and can even be made to be within O(n) bits of optimal if we allow for O(log* n)-time modifications—this holds even for rotation-based trees such as the splay tree and the red-black tree. • Any fixed-capacity key-value dictionary can be made stable (i.e., items do not move once inserted) with constant-time overhead and 1 + o(1) space overhead. • Any key-value dictionary that requires uniform-size values can be made to support arbitrary-size values with constant-time overhead and with an additional space consumption of log(r) n + O(log j) bits per j-bit value for an arbitrary constant r > 0 of our choice. • Given an external-memory array A of size (1 + ε)n containing a dynamic set of up to n key-value pairs, it is possible to maintain an internal-memory stash of size O(n log ε-1) bits so that the location of any key-value pair in A can be computed in constant time (and with no IOs). These are all well studied and classic problems, and in each case tiny pointers allow for us to take a natural space-inefficient solution that uses pointers and make it space-efficient for free. 
    more » « less
  4. There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a stream with insertions and deletions achieving a multiplicative approximation and sublinear space; such an algorithm is impossible for deterministic algorithms. We also give a general technique that translates any two-player deterministic communication lower bound to a lower bound for randomized algorithms robust to a white-box adversary. In particular, our results show that for all p ≥ 0, there exists a constant Cp > 1 such that any Cp-approximation algorithm for Fp moment estimation in insertion-only streams with a white-box adversary requires Ω(n) space for a universe of size n. Similarly, there is a constant C > 1 such that any C-approximation algorithm in an insertion-only stream for matrix rank requires Ω(n) space with a white-box adversary. These results do not contradict our upper bounds since they assume the adversary has unbounded computational power. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. Finally, we prove a lower bound of Ω(log n) bits for the fundamental problem of deterministic approximate counting in a stream of 0’s and 1’s, which holds even if we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound for approximate counting with additional information was previously unknown, and in our context, it shows a separation between multiplayer deterministic maximum communication and the white-box space complexity of a streaming algorithm 
    more » « less
  5. For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable tradeoff curve between time and space. State-of-the-art hash tables offer the following guarantee: If keys/values are Θ(logn) bits each, then it is possible to achieve constant-time insertions/deletions/queries while wasting only O(loglogn) bits of space per key when compared to the information-theoretic optimum. Even prior to this bound being achieved, the target of O(log log n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamically-resized filters). This paper shows that O(log log n) wasted bits per key is not the end of the line for hashing. In fact, for any k ∈ [log∗ n], it is possible to achieve O(k)-time insertions/deletions, O(1)-time queries, and O(log(k) n) = Ologlog···logn 􏰟 􏰞􏰝 􏰠 k wasted bits per key (all with high probability in n). This means that, each time we increase inser- tion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this tradeoff curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constant-time hash tables succinct. Our results hold not just for fixed-capacity hash tables, but also for hash tables that are dynamically resized (this is a fundamental departure from what is possible for filters); and for hash tables that store very large keys/values, each of which can be up to no(1) bits (this breaks with the conventional wisdom that larger keys/values should lead to more wasted bits per key). For very small keys/values, we are able to tighten our bounds to o(1) wasted bits per key, even when k = O(1). Building on this, we obtain a constant-time dynamic filter that uses n􏰕logε−1􏰖+nloge+o(n) bits of space for a wide choice of 
    more » « less