skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: The Gapped k-Deck Problem
The k-deck problem is concerned with finding the smallest positive integer S(k) such that there exist at least two strings of length S(k) that share the same k-deck, i.e., the multiset of subsequences of length k. We introduce the new problem of gapped k-deck reconstruction: For a given gap parameter s, we seek the smallest positive integer Gs(k) such that there exist at least two distinct strings of length Gs(k) that cannot be distinguished based on a "gapped" set of k-subsequences. The gap constraint requires the elements in the subsequences to be at least s positions apart within the original string. Our results are as follows. First, we show how to construct sequences sharing the same 2-gapped k-deck using a nontrivial modification of the recursive Morse-Thue string construction procedure. This establishes the first known constructive upper bound on G2(k). Second, we further improve this bound using the approach by Dudik and Schulman.  more » « less
Award ID(s):
2008125
PAR ID:
10356397
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the ISIT
Page Range / eLocation ID:
49-54
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The k-deck problem is concerned with finding the smallest positive integer S(k) such that there exist at least two strings of length S(k) that share the same k-deck, i.e., the multiset of subsequences of length k. We introduce the new problem of gapped k-deck reconstruction: For a given gap parameter s, we seek the smallest positive integer G s (k) such that there exist at least two distinct strings of length G s (k) that cannot be distinguished based on a "gapped" set of k-subsequences. The gap constraint requires the elements in the subsequences to be at least s positions apart within the original string. Our results are as follows. First, we show how to construct sequences sharing the same 2-gapped k-deck using a nontrivial modification of the recursive Morse-Thue string construction procedure. This establishes the first known constructive upper bound on G 2 (k). Second, we further improve this bound using the approach by Dudik and Schulman [6]. 
    more » « less
  2. Abstract MotivationModern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. ResultsWe propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. Availability and implementationSubseqHash is freely available at https://github.com/Shao-Group/subseqhash. 
    more » « less
  3. One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu’s parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm’s performance. 
    more » « less
  4. One of the most critical problems in the field of string algorithms is the longest common subsequence problem (LCS). The problem is NP-hard for an arbitrary number of strings but can be solved in polynomial time for a fixed number of strings. In this paper, we select a typical parallel LCS algorithm and integrate it into our large-scale string analysis algorithm library to support different types of large string analysis. Specifically, we take advantage of the high-level parallel language, Chapel, to integrate Lu and Liu’s parallel LCS algorithm into Arkouda, an open-source framework. Through Arkouda, data scientists can easily handle large string analytics on the back-end high-performance computing resources from the front-end Python interface. The Chapel-enabled parallel LCS algorithm can identify the longest common subsequences of two strings, and experimental results are given to show how the number of parallel resources and the length of input strings can affect the algorithm’s performance. 
    more » « less
  5. The goal of the trace reconstruction problem is to recover a string x E {0, 1} given many independent traces of x, where a trace is a subsequence obtained from deleting bits of x independently with some given probability. In this paper we consider two kinds of algorithms for the trace reconstruction problem. We first observe that the state-of-the-art result of Chase (STOC 2021), which is based on statistics of arbitrary length-k subsequences, can also be obtained by considering the “k-mer statistics”, i.e., statistics regarding occurrences of contiguous k-bit strings (a.k.a, k-mers) in the initial string x, for k = Mazooji and Shomorony (ISIT 2023) show that such statistics (called k-mer density map) can be estimated within accuracy from poly(n, 2k, l/e) traces. We call an algorithm to be k-mer-based if it reconstructs x given estimates of the k-mer density map. Such algorithms essentially capture all the analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem we know of so far. Our first, and technically more involved, result shows that any k-mer-based algorithm for trace reconstruction must use exp n)) traces, under the assumption that the estimator requires poly(2k, 1 e) traces, thus establishing the optimality of this number of traces. Our analysis also shows that the analysis technique used by Chase is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound. Our second, simple, result considers the performance of the Maximum Likelihood Estimator (MLE), which specifically picks the source string that has the maximum likelihood to generate the samples (traces). We show that the MLE algorithm uses a nearly optimal number of traces, i.e., up to a factor of n in the number of samples needed for an optimal algorithm, and show that this factor of n loss may be necessary under general “model estimation” settings. 
    more » « less