We introduce synchronization strings , which provide a novel way to efficiently deal with synchronization errors , i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to cope with than more commonly considered Hammingtype errors , i.e., symbol substitutions and erasures. For every ε > 0, synchronization strings allow us to index a sequence with an ε O(1) size alphabet, such that one can efficiently transform k synchronization errors into (1 + ε)k Hammingtype errors . This powerful new technique has many applications. In this article, we focus on designing insdel codes , i.e., error correcting block codes (ECCs) for insertiondeletion channels. While ECCs for both Hammingtype errors and synchronization errors have been intensely studied, the latter has largely resisted progress. As Mitzenmacher puts it in his 2009 survey [30]: “ Channels with synchronization errors...are simply not adequately understood by current theory. Given the nearcomplete knowledge, we have for channels with erasures and errors...our lack of understanding about channels with synchronization errors is truly remarkable. ” Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate insdel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regimes, these codes are polynomially far from the optimal ratedistance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago. A straightforward application of our synchronization stringsbased indexing method gives a simple blackbox construction that transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs into the realm of insdel codes. Most notably, for the complete noise spectrum, we obtain efficient “nearMDS” insdel codes, which get arbitrarily close to the optimal ratedistance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0, we give a family of insdel codes achieving a rate of 1  δ  ε over a constantsize alphabet that efficiently corrects a δ fraction of insertions or deletions.
more »
« less
Optimal Document Exchange and New Codes for Small Number of Insertions and Deletions
More Like this


We introduce synchronization strings, which provide a novel way of efficiently dealing with synchronization errors, i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to deal with than more commonly considered halferrors, i.e., symbol corruptions and erasures. For every ε > 0, synchronization strings allow to index a sequence with an εO(1) size alphabet such that one can efficiently transform k synchronization errors into (1 + ε)k halferrors. This powerful new technique has many applications. In this paper we focus on designing insdel codes, i.e., error correcting block codes (ECCs) for insertion deletion channels. While ECCs for both halferrors and synchronization errors have been intensely studied, the later has largely resisted progress. As Mitzenmacher puts it in his 2009 survey: "Channels with synchronization errors ... are simply not adequately understood by current theory. Given the nearcomplete knowledge we have for channels with erasures and errors ... our lack of understanding about channels with synchronization errors is truly remarkable." Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate indel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regime these codes are polynomially far from the optimal ratedistance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago. A straight forward application of our synchronization strings based indexing method gives a simple blackbox construction which transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs over large constant alphabets into the realm of insdel codes. Most notably, for the complete noise spectrum we obtain efficient "nearMDS" insdel codes which get arbitrarily close to the optimal ratedistance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0 we give insdel codes achieving a rate of 1  ξ  ε over a constant size alphabet that efficiently correct a δ´ fraction of insertions or deletions.more » « less

This paper introduces a new datastructural object that we call the tiny pointer. In many applications, traditional log nbit pointers can be replaced with o(log n)bit tiny pointers at the cost of only a constantfactor time overhead and a small probability of failure. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixedsize tiny pointers (i.e., settings in which all of the tiny pointers must be the same size) and variablesize tiny pointers (i.e., settings in which the average tinypointer size must be small, but some tiny pointers can be larger). If a tiny pointer references an element in an array filled to load factor 1 — δ, then the optimal tinypointer size is Θ(log log log n + log δ1) bits in the fixedsize case, and Θ(log δ1) expected bits in the variablesize case. Our tinypointer constructions also require us to revisit several classic problems having to do with balls and bins; these results may be of independent interest. Using tiny pointers, we revisit five classic datastructure problems. We show that: • A data structure storing n vbit values for n keys with constanttime modifications/queries can be implemented to take space nv + O(n log(r) n) bits, for any constant r > 0, as long as the user stores a tiny pointer of expected size O(1) with each key—here, log(r) n is the rth iterated logarithm. • Any binary search tree can be made succinct with constantfactor time overhead, and can even be made to be within O(n) bits of optimal if we allow for O(log* n)time modifications—this holds even for rotationbased trees such as the splay tree and the redblack tree. • Any fixedcapacity keyvalue dictionary can be made stable (i.e., items do not move once inserted) with constanttime overhead and 1 + o(1) space overhead. • Any keyvalue dictionary that requires uniformsize values can be made to support arbitrarysize values with constanttime overhead and with an additional space consumption of log(r) n + O(log j) bits per jbit value for an arbitrary constant r > 0 of our choice. • Given an externalmemory array A of size (1 + ε)n containing a dynamic set of up to n keyvalue pairs, it is possible to maintain an internalmemory stash of size O(n log ε1) bits so that the location of any keyvalue pair in A can be computed in constant time (and with no IOs). These are all well studied and classic problems, and in each case tiny pointers allow for us to take a natural spaceinefficient solution that uses pointers and make it spaceefficient for free.more » « less

There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a blackbox adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the whitebox adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1heavy hitters problem that outperforms the optimal deterministic MisraGries algorithm on long streams. If the whitebox adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a stream with insertions and deletions achieving a multiplicative approximation and sublinear space; such an algorithm is impossible for deterministic algorithms. We also give a general technique that translates any twoplayer deterministic communication lower bound to a lower bound for randomized algorithms robust to a whitebox adversary. In particular, our results show that for all p ≥ 0, there exists a constant Cp > 1 such that any Cpapproximation algorithm for Fp moment estimation in insertiononly streams with a whitebox adversary requires Ω(n) space for a universe of size n. Similarly, there is a constant C > 1 such that any Capproximation algorithm in an insertiononly stream for matrix rank requires Ω(n) space with a whitebox adversary. These results do not contradict our upper bounds since they assume the adversary has unbounded computational power. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. Finally, we prove a lower bound of Ω(log n) bits for the fundamental problem of deterministic approximate counting in a stream of 0’s and 1’s, which holds even if we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound for approximate counting with additional information was previously unknown, and in our context, it shows a separation between multiplayer deterministic maximum communication and the whitebox space complexity of a streaming algorithmmore » « less

For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable tradeoff curve between time and space. Stateoftheart hash tables offer the following guarantee: If keys/values are Θ(logn) bits each, then it is possible to achieve constanttime insertions/deletions/queries while wasting only O(loglogn) bits of space per key when compared to the informationtheoretic optimum. Even prior to this bound being achieved, the target of O(log log n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamicallyresized filters). This paper shows that O(log log n) wasted bits per key is not the end of the line for hashing. In fact, for any k ∈ [log∗ n], it is possible to achieve O(k)time insertions/deletions, O(1)time queries, and O(log(k) n) = Ologlog···logn k wasted bits per key (all with high probability in n). This means that, each time we increase inser tion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this tradeoff curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constanttime hash tables succinct. Our results hold not just for fixedcapacity hash tables, but also for hash tables that are dynamically resized (this is a fundamental departure from what is possible for filters); and for hash tables that store very large keys/values, each of which can be up to no(1) bits (this breaks with the conventional wisdom that larger keys/values should lead to more wasted bits per key). For very small keys/values, we are able to tighten our bounds to o(1) wasted bits per key, even when k = O(1). Building on this, we obtain a constanttime dynamic filter that uses nlogε−1+nloge+o(n) bits of space for a wide choice ofmore » « less