skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Efficient Parallel Output-Sensitive Edit Distance
In this paper, we study efficient parallel edit distance algorithms, both in theory and in practice. Given two strings A[1..n] and B[1..m], and a set of operations allowed to edit the strings, the edit distance between A and B is the minimum number of operations required to transform A into B. In this paper, we use edit distance to refer to the Levenshtein distance, which allows for unit-cost single-character edits (insertions, deletions, substitutions). Sequentially, a standard Dynamic Programming (DP) algorithm solves edit distance with Θ(nm) cost. In many real-world applications, the strings to be compared are similar to each other and have small edit distances. To achieve highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve asymptotically better cost bounds than the standard Θ(nm) algorithm when the edit distance is small. We study four algorithms in the paper, including three algorithms based on Breadth-First Search (BFS), and one algorithm based on Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries needed in the algorithm: the classic solution using parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution proposed by Apostolico et al., and we propose a non-trivial adaption to make it output-sensitive. All of the algorithms studied in this paper have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space. We test and compare our algorithms on both synthetic data and real-world data, including DNA sequences, Wikipedia texts, GitHub repositories, etc. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. On cases with fewer than 10⁵ edits, our algorithm can process input sequences of size 10⁹ in about ten seconds, while ParlayLib can only process sequences of sizes up to 10⁶ in the same amount of time. By comparing our algorithms, we also provide a better understanding of the choice of algorithms for different input patterns. We believe that our paper is the first systematic study in the theory and practice of parallel edit distance.  more » « less
Award ID(s):
2238358 2227669 2103483
PAR ID:
10489409
Author(s) / Creator(s):
; ; ; ;
Editor(s):
Gørtz, Inge Li; Farach-Colton, Martin; Puglisi, Simon J.; Herman, Grzegorz
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Journal Name:
31st Annual European Symposium on Algorithms (ESA 2023)
ISBN:
978-3-95977-295-2
Subject(s) / Keyword(s):
Edit Distance Parallel Algorithms String Algorithms Dynamic Programming Pattern Matching Theory of computation → Parallel algorithms
Format(s):
Medium: X
Location:
Amsterdam
Sponsoring Org:
National Science Foundation
More Like this
  1. Tauman Kalai, Yael (Ed.)
    The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ε)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ε > 0; then, our algorithm takes Õ(n/a + ak³) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ((nk_I)/k_S + k_S k_I³)-time (1,1+ε)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ε)-approximation in sublinear time, even for a favorable choice of k. 
    more » « less
  2. null (Ed.)
    The edit distance between two strings is defined as the smallest number of insertions , deletions , and substitutions that need to be made to transform one of the strings to another one. Approximating edit distance in subquadratic time is “one of the biggest unsolved problems in the field of combinatorial pattern matching” [37]. Our main result is a quantum constant approximation algorithm for computing the edit distance in truly subquadratic time. More precisely, we give an quantum algorithm that approximates the edit distance within a factor of 3. We further extend this result to an quantum algorithm that approximates the edit distance within a larger constant factor. Our solutions are based on a framework for approximating edit distance in parallel settings. This framework requires as black box an algorithm that computes the distances of several smaller strings all at once. For a quantum algorithm, we reduce the black box to metric estimation and provide efficient algorithms for approximating it. We further show that this framework enables us to approximate edit distance in distributed settings. To this end, we provide a MapReduce algorithm to approximate edit distance within a factor of , with sublinearly many machines and sublinear memory. Also, our algorithm runs in a logarithmic number of rounds. 
    more » « less
  3. We introduce the notion of Levenshtein graphs, an analog to Hamming graphs but using the edit distance instead of the Hamming distance; in particular, vertices in Levenshtein graphs may be strings (i.e., words or sequences of characters in a reference alphabet) of possibly different lengths. We study various properties of these graphs, including a necessary and sufficient condition for their shortest path distance to be identical to the edit distance, and characterize their automorphism group and determining number. We also bound the metric dimension (i.e. minimum resolving set size) of Levenshtein graphs. Regarding the latter, recall that a run is a string composed of identical characters. We construct a resolving set of two-run strings and an algorithm that computes the edit distance between a string of length k and any single-run or two-run string in operations. 
    more » « less
  4. In this paper we define and investigate the Fréchet edit distance problem. Here, given two polygonal curves $$\pi$$ and $$\sigma$$ and a threshhold value $$\delta$$ , we seek the minimum number of edits to $$\sigma$$ such that the Fréchet distance between the edited curve and $$\pi$$ is at most $$\delta$$. For the edit operations we consider three cases, namely, deletion of vertices, insertion of vertices, or both. For this basic problem we consider a number of variants. Specifically, we provide polynomial time algorithms for both discrete and continuous Fréchet edit distance variants, as well as hardness results for weak Fréchet edit distance variants. 
    more » « less
  5. One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu’s parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm’s performance. 
    more » « less