We introduce the notion of Levenshtein graphs, an analog to Hamming graphs but using the edit distance instead of the Hamming distance; in particular, vertices in Levenshtein graphs may be strings (i.e., words or sequences of characters in a reference alphabet) of possibly different lengths. We study various properties of these graphs, including a necessary and sufficient condition for their shortest path distance to be identical to the edit distance, and characterize their automorphism group and determining number. We also bound the metric dimension (i.e. minimum resolving set size) of Levenshtein graphs. Regarding the latter, recall that a run is a string composed of identical characters. We construct a resolving set of two-run strings and an algorithm that computes the edit distance between a string of length k and any single-run or two-run string in operations.
more »
« less
On the Complexity of Sequence to Graph Alignment
Availability of extensive genetics data across multiple individuals and populations is driving the growing importance of graph based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Though research on sequence to graph alignments is nascent, it can draw from related work on pattern matching in hypertext. In this paper, we study sequence to graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in query alone, graph alone, or in both. We prove that when changes are permitted in graphs either standalone or in conjunction with changes in the query, the sequence to graph alignment problem is NP -complete under both Hamming and edit distance models for alphabets of size ≥2 . For the case where only changes to the sequence are permitted, we present an O(|V|+m|E|) time algorithm, where m denotes the query size, and V and E denote the vertex and edge sets of the graph, respectively. Our result is generalizable to both linear and affine gap penalty functions, and improves upon the run-time complexity of existing algorithms.
more »
« less
- Award ID(s):
- 1816027
- PAR ID:
- 10122205
- Date Published:
- Journal Name:
- Lecture Notes in Computer Science
- Volume:
- 11467
- Page Range / eLocation ID:
- 85-100
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Tauman Kalai, Yael (Ed.)The edit distance between strings classically assigns unit cost to every character insertion, deletion, and substitution, whereas the Hamming distance only allows substitutions. In many real-life scenarios, insertions and deletions (abbreviated indels) appear frequently but significantly less so than substitutions. To model this, we consider substitutions being cheaper than indels, with cost 1/a for a parameter a ≥ 1. This basic variant, denoted ED_a, bridges classical edit distance (a = 1) with Hamming distance (a → ∞), leading to interesting algorithmic challenges: Does the time complexity of computing ED_a interpolate between that of Hamming distance (linear time) and edit distance (quadratic time)? What about approximating ED_a? We first present a simple deterministic exact algorithm for ED_a and further prove that it is near-optimal assuming the Orthogonal Vectors Conjecture. Our main result is a randomized algorithm computing a (1+ε)-approximation of ED_a(X,Y), given strings X,Y of total length n and a bound k ≥ ED_a(X,Y). For simplicity, let us focus on k ≥ 1 and a constant ε > 0; then, our algorithm takes Õ(n/a + ak³) time. Unless a = Õ(1), in which case ED_a resembles the standard edit distance, and for the most interesting regime of small enough k, this running time is sublinear in n. We also consider a very natural version that asks to find a (k_I, k_S)-alignment, i.e., an alignment with at most k_I indels and k_S substitutions. In this setting, we give an exact algorithm and, more importantly, an Õ((nk_I)/k_S + k_S k_I³)-time (1,1+ε)-bicriteria approximation algorithm. The latter solution is based on the techniques we develop for ED_a for a = Θ(k_S/k_I), and its running time is again sublinear in n whenever k_I ≪ k_S and the overall distance is small enough. These bounds are in stark contrast to unit-cost edit distance, where state-of-the-art algorithms are far from achieving (1+ε)-approximation in sublinear time, even for a favorable choice of k.more » « less
-
This article presents I/O-efficient algorithms for topologically sorting a directed acyclic graph and for the more general problem identifying and topologically sorting the strongly connected components of a directed graph G = ( V, E ). Both algorithms are randomized and have I/O-costs O ( sort ( E ) · poly(log V)), with high probability, where sort ( E ) = O( E / B log M / B ( E/B )) is the I/O cost of sorting an | E |-element array on a machine with size- B blocks and size- M cache/internal memory. These are the first algorithms for these problems that do not incur at least one I/O per vertex, and as such these are the first I/O-efficient algorithms for sparse graphs. By applying the technique of time-forward processing, these algorithms also imply I/O-efficient algorithms for most problems on directed acyclic graphs, such as shortest paths, as well as the single-source reachability problem on arbitrary directed graphs.more » « less
-
null (Ed.)A collection of sets displays a proximity gap with respect to some property if for every set in the collection, either (i) all members are δ-close to the property in relative Hamming distance or (ii) only a tiny fraction of members are δ-close to the property. In particular, no set in the collection has roughly half of its members δ-close to the property and the others δ-far from it. We show that the collection of affine spaces displays a proximity gap with respect to Reed-Solomon (RS) codes, even over small fields, of size polynomial in the dimension of the code, and the gap applies to any δ smaller than the Johnson/Guruswami-Sudan list-decoding bound of the RS code. We also show near-optimal gap results, over fields of (at least) linear size in the RS code dimension, for δ smaller than the unique decoding radius. Concretely, if δ is smaller than half the minimal distance of an RS code V ⊂ Fq n , every affine space is either entirely δ-close to the code, or alternatively at most an ( n/q)-fraction of it is δ-close to the code. Finally, we discuss several applications of our proximity gap results to distributed storage, multi-party cryptographic protocols, and concretely efficient proof systems. We prove the proximity gap results by analyzing the execution of classical algebraic decoding algorithms for Reed-Solomon codes (due to Berlekamp-Welch and Guruswami-Sudan) on a formal element of an affine space. This involves working with Reed-Solomon codes whose base field is an (infinite) rational function field. Our proofs are obtained by developing an extension (to function fields) of a strategy of Arora and Sudan for analyzing low-degree tests.more » « less
-
Mulzer, Wolfgang; Phillips, Jeff M (Ed.)It is unlikely that the discrete Fréchet distance between two curves of length n can be computed in strictly subquadratic time. We thus consider the setting where one of the curves, P, is known in advance. In particular, we wish to construct data structures (distance oracles) of near-linear size that support efficient distance queries with respect to P in sublinear time. Since there is evidence that this is impossible for query curves of length Θ(n^α), for any α > 0, we focus on query curves of (small) constant length, for which we are able to devise distance oracles with the desired bounds. We extend our tools to handle subcurves of the given curve, and even arbitrary vertex-to-vertex subcurves of a given geometric tree. That is, we construct an oracle that can quickly compute the distance between a short polygonal path (the query) and a path in the preprocessed tree between two query-specified vertices. Moreover, we define a new family of geometric graphs, t-local graphs (which strictly contains the family of geometric spanners with constant stretch), for which a similar oracle exists: we can preprocess a graph G in the family, so that, given a query segment and a pair u,v of vertices in G, one can quickly compute the smallest discrete Fréchet distance between the segment and any (u,v)-path in G. The answer is exact, if t = 1, and approximate if t > 1.more » « less
An official website of the United States government

