skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Near-linear time insertion-deletion codes and (1+ ε)-approximating edit distance via indexing
We introduce fast-decodable indexing schemes for edit distance which can be used to speed up edit distance computations to near-linear time if one of the strings is indexed by an indexing string I. In particular, for every length n and every ε >0, one can in near linear time construct a string I ∈ Σ′n with |Σ′| = Oε(1), such that, indexing any string S ∈ Σn, symbol-by-symbol, with I results in a string S′ ∈ Σ″n where Σ″ = Σ × Σ′ for which edit distance computations are easy, i.e., one can compute a (1+ε)-approximation of the edit distance between S′ and any other string in O(n (log n)) time. Our indexing schemes can be used to improve the decoding complexity of state-of-the-art error correcting codes for insertions and deletions. In particular, they lead to near-linear time decoding algorithms for the insertion-deletion codes of [Haeupler, Shahrasbi; STOC ‘17] and faster decoding algorithms for list-decodable insertion-deletion codes of [Haeupler, Shahrasbi, Sudan; ICALP ‘18]. Interestingly, the latter codes are a crucial ingredient in the construction of fast-decodable indexing schemes.  more » « less
Award ID(s):
1910588 1814603 1750808 1618280
PAR ID:
10121505
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
ACM Symposium on Theory of Computing
Page Range / eLocation ID:
697 to 708
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper the theory and design of codes capable of correcting t insertion/deletion of the symbol 0 in each and every bucket of zeros (i. e., zeros in between two consecutive ones) are studied. It is shown that this problem is related to the zero error capacity achieving codes in limited magnitude error channel. Close to optimal non-systematic code designs and the encoding/decoding algorithms are described. 
    more » « less
  2. Tauman Kalai, Yael (Ed.)
    The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ε)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ε > 0; then, our algorithm takes Õ(n/a + ak³) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ((nk_I)/k_S + k_S k_I³)-time (1,1+ε)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ε)-approximation in sublinear time, even for a favorable choice of k. 
    more » « less
  3. null (Ed.)
    The Gilbert-Varshamov bound (non-constructively) establishes the existence of binary codes of distance 1/2-ε and rate Ω(ε 2 ) (where an upper bound of O(ε 2 log(1/ε)) is known). Ta-Shma [STOC 2017] gave an explicit construction of ε-balanced binary codes, where any two distinct codewords are at a distance between 1/2-ε/2 and 1/2+ε/2, achieving a near optimal rate of Ω(ε 2+β ), where β→ 0 as ε→ 0. We develop unique and list decoding algorithms for (a slight modification of) the family of codes constructed by Ta-Shma, in the adversarial error model. We prove the following results for ε-balanced codes with block length N and rate Ω(ε 2+β ) in this family: -For all , there are explicit codes which can be uniquely decoded up to an error of half the minimum distance in time N Oε,β(1) . -For any fixed constant β independent of ε, there is an explicit construction of codes which can be uniquely decoded up to an error of half the minimum distance in time (log(1/ε)) O(1) ·N Oβ(1) . -For any , there are explicit ε-balanced codes with rate Ω(ε 2+β ) which can be list decoded up to error 1/2-ε ' in time N Oε,ε' ,β(1), where ε ' ,β→ 0 as ε→ 0. The starting point of our algorithms is the framework for list decoding direct-sum codes develop in Alev et al. [SODA 2020], which uses the Sum-of-Squares SDP hierarchy. The rates obtained there were quasipolynomial in ε. Here, we show how to overcome the far from optimal rates of this framework obtaining unique decoding algorithms for explicit binary codes of near optimal rate. These codes are based on simple modifications of Ta-Shma's construction. 
    more » « less
  4. Mulzer, Wolfgang; Phillips, Jeff M (Ed.)
    We define and investigate the Fréchet edit distance problem. Given two polygonal curves π and σ and a non-negative threshhold value δ, we seek the minimum number of edits to σ such that the Fréchet distance between the edited σ and π is at most δ. For the edit operations we consider three cases, namely, deletion of vertices, insertion of vertices, or both. For this basic problem we consider a number of variants. Specifically, we provide polynomial time algorithms for both discrete and continuous Fréchet edit distance variants, as well as hardness results for weak Fréchet edit distance variants. 
    more » « less
  5. We introduce synchronization strings , which provide a novel way to efficiently deal with synchronization errors , i.e., insertions and deletions. Synchronization errors are strictly more general and much harder to cope with than more commonly considered Hamming-type errors , i.e., symbol substitutions and erasures. For every ε > 0, synchronization strings allow us to index a sequence with an ε -O(1) -size alphabet, such that one can efficiently transform k synchronization errors into (1 + ε)k Hamming-type errors . This powerful new technique has many applications. In this article, we focus on designing insdel codes , i.e., error correcting block codes (ECCs) for insertion-deletion channels. While ECCs for both Hamming-type errors and synchronization errors have been intensely studied, the latter has largely resisted progress. As Mitzenmacher puts it in his 2009 survey [30]: “ Channels with synchronization errors...are simply not adequately understood by current theory. Given the near-complete knowledge, we have for channels with erasures and errors...our lack of understanding about channels with synchronization errors is truly remarkable. ” Indeed, it took until 1999 for the first insdel codes with constant rate, constant distance, and constant alphabet size to be constructed and only since 2016 are there constructions of constant rate insdel codes for asymptotically large noise rates. Even in the asymptotically large or small noise regimes, these codes are polynomially far from the optimal rate-distance tradeoff. This makes the understanding of insdel codes up to this work equivalent to what was known for regular ECCs after Forney introduced concatenated codes in his doctoral thesis 50 years ago. A straightforward application of our synchronization strings-based indexing method gives a simple black-box construction that transforms any ECC into an equally efficient insdel code with only a small increase in the alphabet size. This instantly transfers much of the highly developed understanding for regular ECCs into the realm of insdel codes. Most notably, for the complete noise spectrum, we obtain efficient “near-MDS” insdel codes, which get arbitrarily close to the optimal rate-distance tradeoff given by the Singleton bound. In particular, for any δ ∈ (0,1) and ε > 0, we give a family of insdel codes achieving a rate of 1 - δ - ε over a constant-size alphabet that efficiently corrects a δ fraction of insertions or deletions. 
    more » « less