The k-deck problem is concerned with finding the smallest positive integer S(k) such that there exist at least two strings of length S(k) that share the same k-deck, i.e., the multiset of subsequences of length k. We introduce the new problem of gapped k-deck reconstruction: For a given gap parameter s, we seek the smallest positive integer Gs(k) such that there exist at least two distinct strings of length Gs(k) that cannot be distinguished based on a "gapped" set of k-subsequences. The gap constraint requires the elements in the subsequences to be at least s positions apart within the original string. Our results are as follows. First, we show how to construct sequences sharing the same 2-gapped k-deck using a nontrivial modification of the recursive Morse-Thue string construction procedure. This establishes the first known constructive upper bound on G2(k). Second, we further improve this bound using the approach by Dudik and Schulman.
more »
« less
The Gapped k-Deck Problem
The k-deck problem is concerned with finding the smallest positive integer S(k) such that there exist at least two strings of length S(k) that share the same k-deck, i.e., the multiset of subsequences of length k. We introduce the new problem of gapped k-deck reconstruction: For a given gap parameter s, we seek the smallest positive integer G s (k) such that there exist at least two distinct strings of length G s (k) that cannot be distinguished based on a "gapped" set of k-subsequences. The gap constraint requires the elements in the subsequences to be at least s positions apart within the original string. Our results are as follows. First, we show how to construct sequences sharing the same 2-gapped k-deck using a nontrivial modification of the recursive Morse-Thue string construction procedure. This establishes the first known constructive upper bound on G 2 (k). Second, we further improve this bound using the approach by Dudik and Schulman [6].
more »
« less
- Award ID(s):
- 2008125
- PAR ID:
- 10356401
- Date Published:
- Journal Name:
- International Symposium on Information Theory and its Applications
- ISSN:
- 2689-5838
- Page Range / eLocation ID:
- 49-54
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract MotivationModern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. ResultsWe propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. Availability and implementationSubseqHash is freely available at https://github.com/Shao-Group/subseqhash.more » « less
-
One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu’s parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm’s performance.more » « less
-
One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu’s parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm’s performance.more » « less
-
Gørtz, Inge Li; Farach-Colton, Martin; Puglisi, Simon J; Herman, Grzegorz (Ed.)The k-Detour problem is a basic path-finding problem: given a graph G on n vertices, with specified nodes s and t, and a positive integer k, the goal is to determine if G has an st-path of length exactly dist(s,t) + k, where dist(s,t) is the length of a shortest path from s to t. The k-Detour problem is NP-hard when k is part of the input, so researchers have sought efficient parameterized algorithms for this task, running in f(k)poly(n) time, for f(⋅) as slow-growing as possible. We present faster algorithms for k-Detour in undirected graphs, running in 1.853^k poly(n) randomized and 4.082^kpoly(n) deterministic time. The previous fastest algorithms for this problem took 2.746^k poly(n) randomized and 6.523^k poly(n) deterministic time [Bezáková-Curticapean-Dell-Fomin, ICALP 2017]. Our algorithms use the fact that detecting a path of a given length in an undirected graph is easier if we are promised that the path belongs to what we call a "bipartitioned" subgraph, where the nodes are split into two parts and the path must satisfy constraints on those parts. Previously, this idea was used to obtain the fastest known algorithm for finding paths of length k in undirected graphs [Björklund-Husfeldt-Kaski-Koivisto, JCSS 2017], intuitively by looking for paths of length k in randomly bipartitioned subgraphs. Our algorithms for k-Detour stem from a new application of this idea, which does not involve choosing the bipartitioned subgraphs randomly. Our work has direct implications for the k-Longest Detour problem, another related path-finding problem. In this problem, we are given the same input as in k-Detour, but are now tasked with determining if G has an st-path of length at least dist(s,t)+k. Our results for k-Detour imply that we can solve k-Longest Detour in 3.432^k poly(n) randomized and 16.661^k poly(n) deterministic time. The previous fastest algorithms for this problem took 7.539^k poly(n) randomized and 42.549^k poly(n) deterministic time [Fomin et al., STACS 2022].more » « less
An official website of the United States government

