String edit distances have been used for decades in applications ranging from spelling correction and web search suggestions to DNA analysis. Most string edit distances are variations of the Levenshtein distance and consider only single-character edits. In forensic applications polymorphic genetic markers such as short tandem repeats (STRs) are used. At these repetitive motifs the DNA copying errors consist of more than just single base differences. More often the phenomenon of “stutter” is observed, where the number of repeated units differs (by whole units) from the template. To adapt the Levenshtein distance to be suitable for forensic applications where DNA sequence similarity is of interest, a generalized string edit distance is defined that accommodates the addition or deletion of whole motifs in addition to single-nucleotide edits. A dynamic programming implementation is developed for computing this distance between sequences. The novelty of this algorithm is in handling the complex interactions that arise between multiple- and single-character edits. Forensic examples illustrate the purpose and use of the Restricted Forensic Levenshtein (RFL) distance measure, but applications extend to sequence alignment and string similarity in other biological areas, as well as dynamic programming algorithms more broadly.
more »
« less
This content will become publicly available on April 1, 2026
SecureED: Secure Multiparty Edit Distance for Genomic Sequences
DNA edit distance (ED) measures the minimum number of single nucleotide insertions, substitutions, or deletions required to convert a DNA sequence into another. ED has broad applications in healthcare such as sequence alignment, genome assembly, functional annotation, and drug discovery. Privacy-preserving computation is essential in this context to protect sensitive genomic data. Nonetheless, the existing secure DNA edit distance solutions lack efficiency when handling large data sequences or resort to approximations and fail to accurately compute the metric. In this work, we introduce ScureED, a protocol that tackles these limitations, resulting in a significant performance enhancement of approximately 2-24 times compared to existing methods. Our protocol computes a secure ED between two genomes, each comprising 1,000 letters, in just a few seconds. The underlying technique of our protocol is a novel approach that transforms the established approximate matching technique (i.e., the Ukkonen algorithm) into exact matching, exploiting the inherent similarity in human DNA to achieve cost-effectiveness. Furthermore, we introduce various optimizations tailored for secure computation in scenarios with a limited input domain, such as DNA sequences composed solely of the four nucleotide letters.
more »
« less
- Award ID(s):
- 2115075
- PAR ID:
- 10612524
- Publisher / Repository:
- Proceedings on Privacy Enhancing Technologies
- Date Published:
- Journal Name:
- Proceedings on Privacy Enhancing Technologies
- Volume:
- 2025
- Issue:
- 2
- ISSN:
- 2299-0984
- Page Range / eLocation ID:
- 479 to 493
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Gørtz, Inge Li; Farach-Colton, Martin; Puglisi, Simon J.; Herman, Grzegorz (Ed.)In this paper, we study efficient parallel edit distance algorithms, both in theory and in practice. Given two strings A[1..n] and B[1..m], and a set of operations allowed to edit the strings, the edit distance between A and B is the minimum number of operations required to transform A into B. In this paper, we use edit distance to refer to the Levenshtein distance, which allows for unit-cost single-character edits (insertions, deletions, substitutions). Sequentially, a standard Dynamic Programming (DP) algorithm solves edit distance with Θ(nm) cost. In many real-world applications, the strings to be compared are similar to each other and have small edit distances. To achieve highly practical implementations, we focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve asymptotically better cost bounds than the standard Θ(nm) algorithm when the edit distance is small. We study four algorithms in the paper, including three algorithms based on Breadth-First Search (BFS), and one algorithm based on Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin algorithm. We implement three different data structures for the longest common prefix (LCP) queries needed in the algorithm: the classic solution using parallel suffix array, and two hash-based solutions proposed in this paper. Our DaC-based solution is inspired by the output-insensitive solution proposed by Apostolico et al., and we propose a non-trivial adaption to make it output-sensitive. All of the algorithms studied in this paper have good theoretical guarantees, and they achieve different tradeoffs between work (total number of operations), span (longest dependence chain in the computation), and space. We test and compare our algorithms on both synthetic data and real-world data, including DNA sequences, Wikipedia texts, GitHub repositories, etc. Our BFS-based algorithms outperform the existing parallel edit-distance implementation in ParlayLib in all test cases. On cases with fewer than 10⁵ edits, our algorithm can process input sequences of size 10⁹ in about ten seconds, while ParlayLib can only process sequences of sizes up to 10⁶ in the same amount of time. By comparing our algorithms, we also provide a better understanding of the choice of algorithms for different input patterns. We believe that our paper is the first systematic study in the theory and practice of parallel edit distance.more » « less
-
Klumpp, Stefan (Ed.)Dense arrangements of binding sites within nucleotide sequences can collectively influence downstream transcription rates or initiate biomolecular interactions. For example, natural promoter regions can harbor many overlapping transcription factor binding sites that influence the rate of transcription initiation. Despite the prevalence of overlapping binding sites in nature, rapid design of nucleotide sequences with many overlapping sites remains a challenge. Here, we show that this is an NP-hard problem, coined here as the nucleotide String Packing Problem (SPP). We then introduce a computational technique that efficiently assembles sets of DNA-protein binding sites into dense, contiguous stretches of double-stranded DNA. For the efficient design of nucleotide sequences spanning hundreds of base pairs, we reduce the SPP to an Orienteering Problem with integer distances, and then leverage modern integer linear programming solvers. Our method optimally packs sets of 20–100 binding sites into dense nucleotide arrays of 50–300 base pairs in 0.05–10 seconds. Unlike approximation algorithms or meta-heuristics, our approach finds provably optimal solutions. We demonstrate how our method can generate large sets of diverse sequences suitable for library generation, where the frequency of binding site usage across the returned sequences can be controlled by modulating the objective function. As an example, we then show how adding additional constraints, like the inclusion of sequence elements with fixed positions, allows for the design of bacterial promoters. The nucleotide string packing approach we present can accelerate the design of sequences with complex DNA-protein interactions. When used in combination with synthesis and high-throughput screening, this design strategy could help interrogate how complex binding site arrangements impact either gene expression or biomolecular mechanisms in varied cellular contexts.more » « less
-
Private matching for compute (PMC) establishes a match between two datasets owned by mutually distrusted parties (C and P) and allows the parties to input more data for the matched records for arbitrary downstream secure computation without rerunning the private matching component. The state-of-the-art PMC protocols only support two parties and assume that both parties can participate in computationally intensive secure computation. We observe that such operational overhead limits the adoption of these protocols to solely powerful entities as small data owners or devices with minimal computing power will not be able to participate. We introduce two protocols to delegate PMC from party P to untrusted cloud servers, called delegates, allowing multiple smaller P parties to provide inputs containing identifiers and associated values. Our Delegated Private Matching for Compute protocols, called DPMC and DsPMC, establish a join between the datasets of party C and multiple delegators P based on multiple identifiers and compute secret shares of associated values for the identifiers that the parties have in common. We introduce a rerandomizable encrypted oblivious pseudorandom function (OPRF) primitive, called EO, which allows two parties to encrypt, mask, and shuffle their data. Note that EO may be of independent interest. Our DsPMC protocol limits the leakages of DPMC by combining our EO scheme and secure three-party shuffling. Finally, our implementation demonstrates the efficiency of our constructions by outperforming related works by approximately 10x for the total protocol execution and by at least 20x for the computation on the delegators.more » « less
-
Availability of extensive genetics data across multiple individuals and populations is driving the growing importance of graph based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Though research on sequence to graph alignments is nascent, it can draw from related work on pattern matching in hypertext. In this paper, we study sequence to graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in query alone, graph alone, or in both. We prove that when changes are permitted in graphs either standalone or in conjunction with changes in the query, the sequence to graph alignment problem is NP -complete under both Hamming and edit distance models for alphabets of size ≥2 . For the case where only changes to the sequence are permitted, we present an O(|V|+m|E|) time algorithm, where m denotes the query size, and V and E denote the vertex and edge sets of the graph, respectively. Our result is generalizable to both linear and affine gap penalty functions, and improves upon the run-time complexity of existing algorithms.more » « less
An official website of the United States government
