We introduce synchronization strings, which provide a novel way of efficiently dealing with synchronization errors, i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to deal with than more commonly considered halferrors, i.e., symbol corruptions and erasures. For every ε > 0, synchronization strings allow to index a sequence with an εO(1) size alphabet such that one can efficiently transform k synchronization errors into (1 + ε)k halferrors. This powerful new technique has many applications. In this paper we focus on designing insdel codes, i.e., error correcting block codes (ECCs) for insertion deletion channels.
While ECCs for both halferrors and synchronization errors have been intensely studied, the later has largely resisted progress. As Mitzenmacher puts it in his 2009 survey: "Channels with synchronization errors ... are simply not adequately understood by current theory. Given the nearcomplete knowledge we have for channels with erasures and errors ... our lack of understanding about channels with synchronization errors is truly remarkable." Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate indel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regime these codes are polynomially far from the optimal ratedistance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago.
A straight forward application of our synchronization strings based indexing method gives a simple blackbox construction which transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs over large constant alphabets into the realm of insdel codes. Most notably, for the complete noise spectrum we obtain efficient "nearMDS" insdel codes which get arbitrarily close to the optimal ratedistance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0 we give insdel codes achieving a rate of 1  ξ  ε over a constant size alphabet that efficiently correct a δ´ fraction of insertions or deletions.
more »
« less
Nearlinear time insertiondeletion codes and (1+ ε)approximating edit distance via indexing
We introduce fastdecodable indexing schemes for edit distance which can be used to speed up edit distance computations to nearlinear time if one of the strings is indexed by an indexing string I. In particular, for every length n and every ε >0, one can in near linear time construct a string I ∈ Σ′n with Σ′ = Oε(1), such that, indexing any string S ∈ Σn, symbolbysymbol, with I results in a string S′ ∈ Σ″n where Σ″ = Σ × Σ′ for which edit distance computations are easy, i.e., one can compute a (1+ε)approximation of the edit distance between S′ and any other string in O(n (log n)) time.
Our indexing schemes can be used to improve the decoding complexity of stateoftheart error correcting codes for insertions and deletions. In particular, they lead to nearlinear time decoding algorithms for the insertiondeletion codes of [Haeupler, Shahrasbi; STOC ‘17] and faster decoding algorithms for listdecodable insertiondeletion codes of [Haeupler, Shahrasbi, Sudan; ICALP ‘18]. Interestingly, the latter codes are a crucial ingredient in the construction of fastdecodable indexing schemes.
more »
« less
 NSFPAR ID:
 10121505
 Date Published:
 Journal Name:
 ACM Symposium on Theory of Computing
 Page Range / eLocation ID:
 697 to 708
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We introduce synchronization strings , which provide a novel way to efficiently deal with synchronization errors , i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to cope with than more commonly considered Hammingtype errors , i.e., symbol substitutions and erasures. For every ε > 0, synchronization strings allow us to index a sequence with an ε O(1) size alphabet, such that one can efficiently transform k synchronization errors into (1 + ε)k Hammingtype errors . This powerful new technique has many applications. In this article, we focus on designing insdel codes , i.e., error correcting block codes (ECCs) for insertiondeletion channels. While ECCs for both Hammingtype errors and synchronization errors have been intensely studied, the latter has largely resisted progress. As Mitzenmacher puts it in his 2009 survey [30]: “ Channels with synchronization errors...are simply not adequately understood by current theory. Given the nearcomplete knowledge, we have for channels with erasures and errors...our lack of understanding about channels with synchronization errors is truly remarkable. ” Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate insdel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regimes, these codes are polynomially far from the optimal ratedistance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago. A straightforward application of our synchronization stringsbased indexing method gives a simple blackbox construction that transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs into the realm of insdel codes. Most notably, for the complete noise spectrum, we obtain efficient “nearMDS” insdel codes, which get arbitrarily close to the optimal ratedistance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0, we give a family of insdel codes achieving a rate of 1  δ  ε over a constantsize alphabet that efficiently corrects a δ fraction of insertions or deletions.more » « less

TaShma, Amnon (Ed.)Locally Decodable Codes (LDCs) are errorcorrecting codes C:Σⁿ → Σ^m, encoding messages in Σⁿ to codewords in Σ^m, with superfast decoding algorithms. They are important mathematical objects in many areas of theoretical computer science, yet the best constructions so far have codeword length m that is superpolynomial in n, for codes with constant query complexity and constant alphabet size. In a very surprising result, BenSasson, Goldreich, Harsha, Sudan, and Vadhan (SICOMP 2006) show how to construct a relaxed version of LDCs (RLDCs) with constant query complexity and almost linear codeword length over the binary alphabet, and used them to obtain significantlyimproved constructions of Probabilistically Checkable Proofs. In this work, we study RLDCs in the standard Hammingerror setting, and introduce their variants in the insertion and deletion (Insdel) error setting. Standard LDCs for Insdel errors were first studied by Ostrovsky and PaskinCherniavsky (Information Theoretic Security, 2015), and are further motivated by recent advances in DNA random access biotechnologies. Our first result is an exponential lower bound on the length of Hamming RLDCs making 2 queries (even adaptively), over the binary alphabet. This answers a question explicitly raised by Gur and Lachish (SICOMP 2021) and is the first exponential lower bound for RLDCs. Combined with the results of BenSasson et al., our result exhibits a "phasetransition"type behavior on the codeword length for some constantquery complexity. We achieve these lower bounds via a transformation of RLDCs to standard Hamming LDCs, using a careful analysis of restrictions of message bits that fix codeword bits. We further define two variants of RLDCs in the Insdelerror setting, a weak and a strong version. On the one hand, we construct weak Insdel RLDCs with almost linear codeword length and constant query complexity, matching the parameters of the Hamming variants. On the other hand, we prove exponential lower bounds for strong Insdel RLDCs. These results demonstrate that, while these variants are equivalent in the Hamming setting, they are significantly different in the insdel setting. Our results also prove a strict separation between Hamming RLDCs and Insdel RLDCs.more » « less

null (Ed.)The GilbertVarshamov bound (nonconstructively) establishes the existence of binary codes of distance 1/2ε and rate Ω(ε 2 ) (where an upper bound of O(ε 2 log(1/ε)) is known). TaShma [STOC 2017] gave an explicit construction of εbalanced binary codes, where any two distinct codewords are at a distance between 1/2ε/2 and 1/2+ε/2, achieving a near optimal rate of Ω(ε 2+β ), where β→ 0 as ε→ 0. We develop unique and list decoding algorithms for (a slight modification of) the family of codes constructed by TaShma, in the adversarial error model. We prove the following results for εbalanced codes with block length N and rate Ω(ε 2+β ) in this family: For all , there are explicit codes which can be uniquely decoded up to an error of half the minimum distance in time N Oε,β(1) . For any fixed constant β independent of ε, there is an explicit construction of codes which can be uniquely decoded up to an error of half the minimum distance in time (log(1/ε)) O(1) ·N Oβ(1) . For any , there are explicit εbalanced codes with rate Ω(ε 2+β ) which can be list decoded up to error 1/2ε ' in time N Oε,ε' ,β(1), where ε ' ,β→ 0 as ε→ 0. The starting point of our algorithms is the framework for list decoding directsum codes develop in Alev et al. [SODA 2020], which uses the SumofSquares SDP hierarchy. The rates obtained there were quasipolynomial in ε. Here, we show how to overcome the far from optimal rates of this framework obtaining unique decoding algorithms for explicit binary codes of near optimal rate. These codes are based on simple modifications of TaShma's construction.more » « less

null (Ed.)A collection of sets displays a proximity gap with respect to some property if for every set in the collection, either (i) all members are δclose to the property in relative Hamming distance or (ii) only a tiny fraction of members are δclose to the property. In particular, no set in the collection has roughly half of its members δclose to the property and the others δfar from it. We show that the collection of affine spaces displays a proximity gap with respect to ReedSolomon (RS) codes, even over small fields, of size polynomial in the dimension of the code, and the gap applies to any δ smaller than the Johnson/GuruswamiSudan listdecoding bound of the RS code. We also show nearoptimal gap results, over fields of (at least) linear size in the RS code dimension, for δ smaller than the unique decoding radius. Concretely, if δ is smaller than half the minimal distance of an RS code V ⊂ Fq n , every affine space is either entirely δclose to the code, or alternatively at most an ( n/q)fraction of it is δclose to the code. Finally, we discuss several applications of our proximity gap results to distributed storage, multiparty cryptographic protocols, and concretely efficient proof systems. We prove the proximity gap results by analyzing the execution of classical algebraic decoding algorithms for ReedSolomon codes (due to BerlekampWelch and GuruswamiSudan) on a formal element of an affine space. This involves working with ReedSolomon codes whose base field is an (infinite) rational function field. Our proofs are obtained by developing an extension (to function fields) of a strategy of Arora and Sudan for analyzing lowdegree tests.more » « less