This content will become publicly available on July 7, 2025
On k-Mer-Based and Maximum Likelihood Estimation Algorithms for Trace Reconstruction
The goal of the trace reconstruction problem is to recover a string x ∈ {0,1}^n given many independent traces of x, where a trace is a subsequence obtained by deleting bits of x independently with some given probability. In this paper we consider two kinds of algorithms for the trace reconstruction problem. We first observe that the state-of-the-art result of Chase (STOC 2021), which is based on statistics of arbitrary length-k subsequences, can also be obtained by considering the “k-mer statistics”, i.e., statistics regarding occurrences of contiguous k-bit strings (a.k.a. k-mers) in the initial string x, for k = 2n^{1/5}. Mazooji and Shomorony (ISIT 2023) show that such statistics (called the k-mer density map) can be estimated within ε accuracy from poly(n, 2^k, 1/ε) traces. We call an algorithm k-mer-based if it reconstructs x given estimates of the k-mer density map. Such algorithms essentially capture all the analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem that we know of so far. Our first, and technically more involved, result shows that any k-mer-based algorithm for trace reconstruction must use exp(Ω(n^{1/5})) traces, under the assumption that the estimator requires poly(2^k, 1/ε) traces, thus establishing the optimality of this number of traces. Our analysis also shows that the analysis technique used by Chase is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound. Our second, simple, result considers the performance of the Maximum Likelihood Estimator (MLE), which picks the source string that has the maximum likelihood of generating the samples (traces). We show that the MLE algorithm uses a nearly optimal number of traces, i.e., up to a factor of n in the number of samples needed for an optimal algorithm, and show that this factor of n loss may be necessary under general “model estimation” settings.
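The deletion channel and the MLE described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm or analysis: it is a brute-force illustration that enumerates all 2^n candidate strings, so it is only feasible for tiny n. The function names (`sample_trace`, `count_embeddings`, `log_likelihood`, `mle_reconstruct`) and the parameter `q` for the deletion probability are hypothetical names chosen for this example. It relies on the standard fact that, for a deletion channel with deletion probability q, the probability of observing a trace y of length m from a string x of length n equals the number of ways y embeds in x as a subsequence, times q^(n-m) (1-q)^m.

```python
import itertools
import math
import random


def sample_trace(x, q, rng):
    """Deletion channel: drop each bit of x independently with probability q."""
    return "".join(b for b in x if rng.random() >= q)


def count_embeddings(x, y):
    """Count the ways y occurs as a (not necessarily contiguous) subsequence of x."""
    # ways[j] = number of embeddings of y[:j] into the prefix of x processed so far
    ways = [1] + [0] * len(y)
    for bit in x:
        # Go through j in decreasing order so one bit of x is matched at most once
        for j in range(len(y), 0, -1):
            if y[j - 1] == bit:
                ways[j] += ways[j - 1]
    return ways[len(y)]


def log_likelihood(x, traces, q):
    """Log-probability that the deletion channel (0 < q < 1) maps x to the given traces."""
    total = 0.0
    for y in traces:
        c = count_embeddings(x, y)
        if c == 0:
            return -math.inf  # y cannot be produced from x
        kept, deleted = len(y), len(x) - len(y)
        total += math.log(c) + deleted * math.log(q) + kept * math.log(1 - q)
    return total


def mle_reconstruct(traces, n, q):
    """Brute-force MLE over all n-bit strings (illustration only, exponential in n)."""
    candidates = ("".join(bits) for bits in itertools.product("01", repeat=n))
    return max(candidates, key=lambda x: log_likelihood(x, traces, q))


if __name__ == "__main__":
    rng = random.Random(0)
    x = "01101001"
    traces = [sample_trace(x, 0.2, rng) for _ in range(200)]
    print(mle_reconstruct(traces, len(x), 0.2))
```

With few traces the maximizer need not equal the source string. How many traces the MLE needs is the question the abstract's second result addresses.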
- Award ID(s): 2152413
- PAR ID: 10575297
- Publisher / Repository: IEEE Press
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Kumar, Amit; Ron-Zewi, Noga (Ed.) The goal of trace reconstruction is to reconstruct an unknown n-bit string x given only independent random traces of x, where a random trace of x is obtained by passing x through a deletion channel. A Statistical Query (SQ) algorithm for trace reconstruction is an algorithm which can only access statistical information about the distribution of random traces of x rather than individual traces themselves. Such an algorithm is said to be 𝓁-local if each of its statistical queries corresponds to an 𝓁-junta function over some block of 𝓁 consecutive bits in the trace. Since several, but not all, known algorithms for trace reconstruction fall under the local statistical query paradigm, it is interesting to understand the abilities and limitations of local SQ algorithms for trace reconstruction. In this paper we establish nearly-matching upper and lower bounds on local Statistical Query algorithms for both worst-case and average-case trace reconstruction. For the worst-case problem, we show that there is an Õ(n^{1/5})-local SQ algorithm that makes all its queries with tolerance τ ≥ 2^{-Õ(n^{1/5})}, and also that any Õ(n^{1/5})-local SQ algorithm must make some query with tolerance τ ≤ 2^{-Ω̃(n^{1/5})}. For the average-case problem, we show that there is an O(log n)-local SQ algorithm that makes all its queries with tolerance τ ≥ 1/poly(n), and also that any O(log n)-local SQ algorithm must make some query with tolerance τ ≤ 1/poly(n).
- Tree trace reconstruction aims to learn the binary node labels of a tree, given independent samples of the tree passed through an appropriately defined deletion channel. In recent work, Davies, Rácz, and Rashtchian [10] used combinatorial methods to show that $$\exp({\mathrm{O}} (k \log_{k} n))$$ samples suffice to reconstruct a complete k-ary tree with n nodes with high probability. We provide an alternative proof of this result, which allows us to generalize it to a broader class of tree topologies and deletion models. In our proofs we introduce the notion of a subtrace, which enables us to connect with and generalize recent mean-based complex analytic algorithms for string trace reconstruction.
- We study the problem of finding all $$k$$-periods of a length-$$n$$ string $$S$$, presented as a data stream. $$S$$ is said to have $$k$$-period $$p$$ if its prefix of length $$n-p$$ differs from its suffix of length $$n-p$$ in at most $$k$$ locations. We give a one-pass streaming algorithm that computes the $$k$$-periods of a string $$S$$ using $$\operatorname{poly}(k, \log n)$$ bits of space, for $$k$$-periods of length at most $$\frac{n}{2}$$. We also present a two-pass streaming algorithm that computes $$k$$-periods of $$S$$ using $$\operatorname{poly}(k, \log n)$$ bits of space, regardless of period length. We complement these results with comparable lower bounds.
- Location estimation is one of the most basic questions in parametric statistics. Suppose we have a known distribution density f, and we get n i.i.d. samples from f(x − μ) for some unknown shift μ. The task is to estimate μ to high accuracy with high probability. The maximum likelihood estimator (MLE) is known to be asymptotically optimal as n → ∞, but what is possible for finite n? In this paper, we give two location estimators that are optimal under different criteria: 1) an estimator that has minimax-optimal estimation error subject to succeeding with probability 1 − δ and 2) a confidence interval estimator which, subject to its output interval containing μ with probability at least 1 − δ, has the minimum expected squared interval width among all shift-invariant estimators. The latter construction can be generalized to minimizing the expectation of any loss function on the interval width.