Trace Reconstruction from Local Statistical Queries
The goal of trace reconstruction is to reconstruct an unknown n-bit string x given only independent random traces of x, where a random trace of x is obtained by passing x through a deletion channel. A Statistical Query (SQ) algorithm for trace reconstruction is an algorithm which can only access statistical information about the distribution of random traces of x rather than individual traces themselves. Such an algorithm is said to be 𝓁-local if each of its statistical queries corresponds to an 𝓁-junta function over some block of 𝓁 consecutive bits in the trace. Since several, but not all, known algorithms for trace reconstruction fall under the local statistical query paradigm, it is interesting to understand the abilities and limitations of local SQ algorithms for trace reconstruction. In this paper we establish nearly matching upper and lower bounds on local Statistical Query algorithms for both worst-case and average-case trace reconstruction. For the worst-case problem, we show that there is an Õ(n^{1/5})-local SQ algorithm that makes all of its queries with tolerance τ ≥ 2^{-Õ(n^{1/5})}, and also that any Õ(n^{1/5})-local SQ algorithm must make some query with tolerance τ ≤ 2^{-Ω̃(n^{1/5})}. For the average-case problem, we show that there is an O(log n)-local SQ algorithm that makes all of its queries with tolerance τ ≥ 1/poly(n), and also that any O(log n)-local SQ algorithm must make some query with tolerance τ ≤ 1/poly(n).
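To make the query model concrete, here is a minimal Python sketch. All names and conventions (including zero-padding of short traces) are our own assumptions rather than the paper's; a genuine SQ oracle would return the expectation of the query function directly, up to the stated tolerance τ, instead of averaging sampled traces:

```python
import random

def trace(x, q):
    """One pass of the bit string x through a deletion channel:
    each bit is deleted independently with probability q."""
    return [b for b in x if random.random() > q]

def local_query(x, q, ell, start, f, num_traces):
    """Empirically answer one ell-local statistical query: estimate
    E[f(window)], where window is the block of ell consecutive trace
    bits beginning at position `start` (short traces are zero-padded).
    An SQ oracle would return this expectation up to tolerance tau."""
    total = 0.0
    for _ in range(num_traces):
        t = trace(x, q)
        window = t[start:start + ell]
        window += [0] * (ell - len(window))  # pad if the trace is too short
        total += f(window)
    return total / num_traces

# Example query: probability that the first three bits of a trace are all 1.
x = [1, 0, 1, 1, 0, 1, 1, 1]
print(local_query(x, q=0.1, ell=3, start=0, f=lambda w: all(w), num_traces=10_000))
```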
- PAR ID:
- 10583833
- Editor(s):
- Kumar, Amit; Ron-Zewi, Noga
- Publisher / Repository:
- Schloss Dagstuhl – Leibniz-Zentrum für Informatik
- Date Published:
- Volume:
- 317
- ISSN:
- 1868-8969
- ISBN:
- 978-3-95977-348-5
- Page Range / eLocation ID:
- 52:1-52:24
- Subject(s) / Keyword(s):
- trace reconstruction; statistical queries; algorithmic statistics; Mathematics of computing → Probabilistic inference problems
- Format(s):
- Medium: X; Size: 24 pages, 997028 bytes; Other: application/pdf
- Size(s):
- 24 pages; 997028 bytes
- Right(s):
- Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
- Sponsoring Org:
- National Science Foundation
More Like this
-
The goal of the trace reconstruction problem is to recover a string x ∈ {0,1}^n given many independent traces of x, where a trace is a subsequence obtained by deleting each bit of x independently with some given probability. In this paper we consider two kinds of algorithms for the trace reconstruction problem. We first observe that the state-of-the-art result of Chase (STOC 2021), which is based on statistics of arbitrary length-k subsequences, can also be obtained by considering "k-mer statistics", i.e., statistics regarding occurrences of contiguous k-bit strings (a.k.a. k-mers) in the original string x, for k = Θ̃(n^{1/5}). Mazooji and Shomorony (ISIT 2023) show that such statistics (called the k-mer density map) can be estimated within accuracy ε from poly(n, 2^k, 1/ε) traces. We call an algorithm k-mer-based if it reconstructs x given estimates of the k-mer density map. Such algorithms essentially capture all the analyses in the worst-case and smoothed-complexity models of the trace reconstruction problem known so far. Our first, and technically more involved, result shows that any k-mer-based algorithm for trace reconstruction must use exp(Ω̃(n^{1/5})) traces, under the assumption that the estimator requires poly(2^k, 1/ε) traces, thus establishing the optimality of this number of traces. Our analysis also shows that the analysis technique used by Chase is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound. Our second, simpler, result considers the performance of the Maximum Likelihood Estimator (MLE), which picks the source string with the maximum likelihood of generating the observed traces. We show that the MLE algorithm uses a nearly optimal number of traces, i.e., within a factor of n of the number of samples needed by an optimal algorithm, and show that this factor-of-n loss may be necessary under general "model estimation" settings.
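To fix ideas, the sketch below gives a simplified k-mer count map and a brute-force MLE in Python. All names are illustrative; in particular, kmer_counts omits the positional information that the actual k-mer density map of Mazooji and Shomorony retains, and the brute-force mle is feasible only for very small n:

```python
from collections import Counter
from itertools import product

def kmer_counts(x, k):
    """Occurrence counts of each contiguous k-bit substring (k-mer) of x.
    A simplified stand-in for the k-mer density map: the actual map of
    Mazooji and Shomorony (ISIT 2023) also records positional information."""
    return Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))

def num_embeddings(x, y):
    """Number of ways y occurs as a subsequence of x (standard DP).
    Under deletion probability q, a length-m trace y of a length-n string x
    appears with probability num_embeddings(x, y) * (1-q)**m * q**(n-m)."""
    dp = [1] + [0] * len(y)  # dp[j] = #embeddings of y[:j] in the prefix of x read so far
    for b in x:
        for j in range(len(y), 0, -1):
            if y[j - 1] == b:
                dp[j] += dp[j - 1]
    return dp[-1]

def mle(traces, n):
    """Brute-force MLE: over all 2^n candidate strings, maximize the product
    of num_embeddings(candidate, y) over the observed traces y.  The channel
    factors (1-q)**m * q**(n-m) depend only on the trace lengths, so they do
    not affect the argmax."""
    def likelihood(cand):
        score = 1
        for y in traces:
            score *= num_embeddings(cand, y)
        return score
    return max(product([0, 1], repeat=n), key=likelihood)

print(kmer_counts([0, 1, 1, 0, 1, 1, 0], k=2))
# Counter({(0, 1): 2, (1, 1): 2, (1, 0): 2})
```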
-
Interval scheduling is a basic problem in the theory of algorithms and a classical task in combinatorial optimization. We develop a set of techniques for partitioning and grouping jobs based on their starting and ending times, which enable us to view an instance of interval scheduling on many jobs as a union of multiple interval scheduling instances, each containing only a few jobs. Instantiating these techniques in dynamic and local settings of computation leads to several new results. For (1+ε)-approximation of job scheduling of n jobs on a single machine, we develop a fully dynamic algorithm with O((log n)/ε) update and O(log n) query worst-case time. Further, we design a local computation algorithm that uses only O((log N)/ε) queries when all jobs have length at least 1 and starting/ending times within [0,N]. Our techniques are also applicable in a setting where jobs have rewards/weights. For this case we design a fully dynamic deterministic algorithm whose worst-case update and query times are poly(log n, 1/ε). Equivalently, this is the first algorithm that maintains a (1+ε)-approximation of the maximum independent set of a collection of weighted intervals with poly(log n, 1/ε) time per update/query. This is an exponential improvement in 1/ε over the running time of a randomized algorithm of Henzinger, Neumann, and Wiese [SoCG, 2020], while also removing all dependence on the values of the jobs' starting/ending times and rewards, as well as removing the need for any randomness. We also extend our approaches for interval scheduling on a single machine to the setting with M machines.
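For context on the base problem these dynamic and local algorithms approximate, the classic static greedy for unweighted single-machine interval scheduling is sketched below. This is only the textbook baseline, assuming jobs given as (start, end) pairs, not the partitioning-and-grouping machinery of the paper:

```python
def max_nonoverlapping(jobs):
    """Classic greedy for static interval scheduling on one machine:
    sort jobs by ending time and repeatedly take the first job that
    starts no earlier than the last chosen job ends.  Returns a
    maximum-size set of pairwise non-overlapping (start, end) jobs."""
    chosen, last_end = [], float("-inf")
    for start, end in sorted(jobs, key=lambda j: j[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen

print(max_nonoverlapping([(0, 3), (2, 4), (4, 6), (5, 7)]))
# [(0, 3), (4, 6)]
```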
-
Santhanam, Rahul (Ed.) Adversarially robust streaming algorithms are required to process a stream of elements and produce correct outputs, even when each stream element can be chosen as a function of earlier algorithm outputs. As with classic streaming algorithms, which must only be correct for the worst-case fixed stream, adversarially robust algorithms with access to randomness can use significantly less space than deterministic algorithms. We prove that for the Missing Item Finding problem in streaming, the space complexity also significantly depends on how adversarially robust algorithms are permitted to use randomness. (In contrast, the space complexity of classic streaming algorithms does not depend as strongly on the way randomness is used.) For Missing Item Finding on streams of length 𝓁 with elements in {1,…,n}, and ≤ 1/poly(𝓁) error, we show that when 𝓁 = O(2^√{log n}), "random seed" adversarially robust algorithms, which only use randomness at initialization, require 𝓁^Ω(1) bits of space, while "random tape" adversarially robust algorithms, which may make random decisions at any time, may use O(polylog(𝓁)) random bits. When 𝓁 is between n^Ω(1) and O(√n), "random tape" adversarially robust algorithms need 𝓁^Ω(1) space, while "random oracle" adversarially robust algorithms, which can read from a long random string for free, may use O(polylog(𝓁)) space. The space lower bound for the "random seed" case follows, by a reduction given in prior work, from a lower bound for pseudo-deterministic streaming algorithms given in this paper.
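For reference, the Missing Item Finding task itself is trivial with Θ(n) bits of space; the results above concern how much less space randomized (seed / tape / oracle) adversarially robust algorithms can get away with. The Python sketch below shows only this deterministic baseline, under our own naming, and is not an algorithm from the paper:

```python
def missing_item_finder(n):
    """Deterministic baseline for Missing Item Finding: track every seen
    element of {1, ..., n} with an n-bit set and report an unseen item on
    request.  Uses O(n) bits of space, against which the randomized
    adversarially robust algorithms above are measured."""
    seen = bytearray(n + 1)   # seen[i] == 1 iff item i has appeared

    def process(item):        # stream update
        seen[item] = 1

    def query():              # return some item not yet in the stream
        for i in range(1, n + 1):
            if not seen[i]:
                return i
        return None           # every item has appeared

    return process, query

process, query = missing_item_finder(10)
for item in [3, 7, 1]:
    process(item)
print(query())  # 2
```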