skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, June 13 until 2:00 AM ET on Friday, June 14 due to maintenance. We apologize for the inconvenience.


This content will become publicly available on September 4, 2024

Title: Efficient Parallel Output-Sensitive Edit Distance
In this paper, we study efficient parallel edit distance algorithms, both in theory and in practice. Given two strings A[1..n] and B[1..m], and a set of operations allowed to edit the strings, the edit distance between A and B is the minimum number of operations required to transform A into B. In this paper, we use edit distance to refer to the Levenshtein distance, which allows for unit-cost single-character edits (insertions, deletions, substitutions). Sequentially, a standard Dynamic Programming (DP) algorithm solves edit distance with Θ(nm) cost. In many real-world applications, the strings to be compared are similar to each other and have small edit distances. To achieve highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve asymptotically better cost bounds than the standard Θ(nm) algorithm when the edit distance is small. We study four algorithms in the paper, including three algorithms based on Breadth-First Search (BFS), and one algorithm based on Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries needed in the algorithm: the classic solution using parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution proposed by Apostolico et al., and we propose a non-trivial adaption to make it output-sensitive. All of the algorithms studied in this paper have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space. We test and compare our algorithms on both synthetic data and real-world data, including DNA sequences, Wikipedia texts, GitHub repositories, etc. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. On cases with fewer than 10⁵ edits, our algorithm can process input sequences of size 10⁹ in about ten seconds, while ParlayLib can only process sequences of sizes up to 10⁶ in the same amount of time. By comparing our algorithms, we also provide a better understanding of the choice of algorithms for different input patterns. We believe that our paper is the first systematic study in the theory and practice of parallel edit distance.  more » « less
Award ID(s):
2238358 2227669 2103483
NSF-PAR ID:
10489409
Author(s) / Creator(s):
; ; ; ;
Editor(s):
Gørtz, Inge Li; Farach-Colton, Martin; Puglisi, Simon J.; Herman, Grzegorz
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Journal Name:
31st Annual European Symposium on Algorithms (ESA 2023)
Subject(s) / Keyword(s):
["Edit Distance","Parallel Algorithms","String Algorithms","Dynamic Programming","Pattern Matching","Theory of computation → Parallel algorithms"]
Format(s):
Medium: X
Location:
Amsterdam
Sponsoring Org:
National Science Foundation
More Like this
  1. Tauman Kalai, Yael (Ed.)
    The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ε)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ε > 0; then, our algorithm takes Õ(n/a + ak³) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ((nk_I)/k_S + k_S k_I³)-time (1,1+ε)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ε)-approximation in sublinear time, even for a favorable choice of k. 
    more » « less
  2. null (Ed.)
    The edit distance between two strings is defined as the smallest number of insertions , deletions , and substitutions that need to be made to transform one of the strings to another one. Approximating edit distance in subquadratic time is “one of the biggest unsolved problems in the field of combinatorial pattern matching” [37]. Our main result is a quantum constant approximation algorithm for computing the edit distance in truly subquadratic time. More precisely, we give an quantum algorithm that approximates the edit distance within a factor of 3. We further extend this result to an quantum algorithm that approximates the edit distance within a larger constant factor. Our solutions are based on a framework for approximating edit distance in parallel settings. This framework requires as black box an algorithm that computes the distances of several smaller strings all at once. For a quantum algorithm, we reduce the black box to metric estimation and provide efficient algorithms for approximating it. We further show that this framework enables us to approximate edit distance in distributed settings. To this end, we provide a MapReduce algorithm to approximate edit distance within a factor of , with sublinearly many machines and sublinear memory. Also, our algorithm runs in a logarithmic number of rounds. 
    more » « less
  3. Graph processing recently received intensive interests in light of a wide range of needs to understand relationships. It is well-known for the poor locality and high memory bandwidth requirement. In conventional architectures, they incur a significant amount of data movements and energy consumption which motivates several hardware graph processing accelerators. The current graph processing accelerators rely on memory access optimizations or placing computation logics close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massive parallel analog operations with low hardware and energy cost. The analog computation is suitable for graph processing because: 1) The algorithms are iterative and could inherently tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed in sparse matrix vector multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engine (GE). The core graph computations are performed in sparse matrix format in GEs (ReRAM crossbars). The vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize the massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by GEs, the gain of performing parallel operations overshadows the wastes due to sparsity. The experiment results show that GRAPHR achieves a 16.01X (up to 132.67X) speedup and a 33.82X energy saving on geometric mean compared to a CPU baseline system. Compared to GPU, GRAPHR achieves 1.69X to 2.19X speedup and consumes 4.77X to 8.91X less energy. GRAPHR gains a speedup of 1.16X to 4.12X, and is 3.67X to 10.96X more energy efficiency compared to PIM-based architecture. 
    more » « less
  4. Abstract Background

    Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existingk-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps.

    Results

    In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be$$(d_1, d_2)$$(d1,d2)-sensitive if any two sequences within an edit distance of$$d_1$$d1are mapped into at least one shared bucket, and any two sequences with distance at least$$d_2$$d2are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of$$(d_1,d_2)$$(d1,d2)and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal.

    Conclusion

    These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.

     
    more » « less
  5. Motivation: Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. Results: We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation: Data and source code for reproducing the experiments are available at: https:// github.com/Kingsford-Group/gtedemedtest/. 
    more » « less