Title: Streaming Periodicity with Mismatches
We study the problem of finding all $$k$$-periods of a length-$$n$$ string $$S$$, presented as a data stream. $$S$$ is said to have $$k$$-period $$p$$ if its prefix of length $$n-p$$ differs from its suffix of length $$n-p$$ in at most $$k$$ locations. We give a one-pass streaming algorithm that computes the $$k$$-periods of a string $$S$$ using $$\mathrm{poly}(k, \log n)$$ bits of space, for $$k$$-periods of length at most $$\frac{n}{2}$$. We also present a two-pass streaming algorithm that computes $$k$$-periods of $$S$$ using $$\mathrm{poly}(k, \log n)$$ bits of space, regardless of period length. We complement these results with comparable lower bounds.
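The definition of a $$k$$-period can be illustrated with a small offline check. The sketch below is not the streaming algorithm from the paper: it stores the whole string and takes quadratic time, and the function name `k_periods` is a hypothetical name chosen for this illustration. It only makes the prefix/suffix comparison in the definition concrete.

```python
def k_periods(s: str, k: int) -> list[int]:
    """Return every p in 1..n-1 such that s has k-period p.

    Naive offline check (assumption: the whole string fits in memory).
    This is NOT the small-space streaming algorithm described above.
    """
    n = len(s)
    periods = []
    for p in range(1, n):
        # The prefix of length n-p is s[0:n-p]; the suffix of length n-p is s[p:n].
        # Count the positions at which the two differ.
        mismatches = sum(1 for i in range(n - p) if s[i] != s[i + p])
        if mismatches <= k:
            periods.append(p)
    return periods
```

For example, `k_periods("abcabcabc", 0)` returns `[3, 6]`, the exact periods of the string, and allowing more mismatches can only add further values of $$p$$ to the result.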
Award ID(s):
1649515
PAR ID:
10033555
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
RANDOM
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Etessami, Kousha; Feige, Uriel; Puppis, Gabriele (Ed.)
    Many streaming algorithms provide only a high-probability relative approximation. These two relaxations, of allowing approximation and randomization, seem necessary - for many streaming problems, both relaxations must be employed simultaneously, to avoid an exponentially larger (and often trivial) space complexity. A common drawback of these randomized approximate algorithms is that independent executions on the same input have different outputs, that depend on their random coins. Pseudo-deterministic algorithms combat this issue, and for every input, they output with high probability the same "canonical" solution. We consider perhaps the most basic problem in data streams, of counting the number of items in a stream of length at most n. Morris’s counter [CACM, 1978] is a randomized approximation algorithm for this problem that uses O(log log n) bits of space, for every fixed approximation factor (greater than 1). Goldwasser, Grossman, Mohanty and Woodruff [ITCS 2020] asked whether pseudo-deterministic approximation algorithms can match this space complexity. Our main result answers their question negatively, and shows that such algorithms must use Ω(√{log n / log log n}) bits of space. Our approach is based on a problem that we call Shift Finding, and may be of independent interest. In this problem, one has query access to a shifted version of a known string F ∈ {0,1}^{3n}, which is guaranteed to start with n zeros and end with n ones, and the goal is to find the unknown shift using a small number of queries. We provide for this problem an algorithm that uses O(√n) queries. It remains open whether poly(log n) queries suffice; if true, then our techniques immediately imply a nearly-tight Ω(log n/log log n) space bound for pseudo-deterministic approximate counting. 
  2. In a recent breakthrough, Paz and Schwartzman (SODA'17) presented a single-pass (2+epsilon)-approximation algorithm for the maximum weight matching problem in the semi-streaming model. Their algorithm uses O(n log^2 n) bits of space, for any constant epsilon>0. We present a simplified and more intuitive primal-dual analysis, for essentially the same algorithm, which also improves the space complexity to the optimal bound of O(n log n) bits - this is optimal as the output matching requires Omega(n log n) bits. 
  3. Santhanam, Rahul (Ed.)
    Adversarially robust streaming algorithms are required to process a stream of elements and produce correct outputs, even when each stream element can be chosen as a function of earlier algorithm outputs. As with classic streaming algorithms, which must only be correct for the worst-case fixed stream, adversarially robust algorithms with access to randomness can use significantly less space than deterministic algorithms. We prove that for the Missing Item Finding problem in streaming, the space complexity also significantly depends on how adversarially robust algorithms are permitted to use randomness. (In contrast, the space complexity of classic streaming algorithms does not depend as strongly on the way randomness is used.) For Missing Item Finding on streams of length 𝓁 with elements in {1,…,n}, and ≤ 1/poly(𝓁) error, we show that when 𝓁 = O(2^√{log n}), "random seed" adversarially robust algorithms, which only use randomness at initialization, require 𝓁^Ω(1) bits of space, while "random tape" adversarially robust algorithms, which may make random decisions at any time, may use O(polylog(𝓁)) random bits. When 𝓁 is between n^Ω(1) and O(√n), "random tape" adversarially robust algorithms need 𝓁^Ω(1) space, while "random oracle" adversarially robust algorithms, which can read from a long random string for free, may use O(polylog(𝓁)) space. The space lower bound for the "random seed" case follows, by a reduction given in prior work, from a lower bound for pseudo-deterministic streaming algorithms given in this paper. 
  4. Bojańczyk, Mikołaj; Merelli, Emanuela; Woodruff, David P (Ed.)
    Given n points in 𝓁_p^d, we consider the problem of partitioning points into k clusters with associated centers. The cost of a clustering is the sum of p-th powers of distances of points to their cluster centers. For p ∈ [1,2], we design sketches of size poly(log(nd),k,1/ε) such that the cost of the optimal clustering can be estimated to within factor 1+ε, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly(log(nd),k,1/ε). We also obtain a distributed memory algorithm, where the n points are arbitrarily partitioned amongst m machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on d for p ∈ [1,2). 
  5. Many problems on data streams have been studied at two extremes of difficulty: either allowing randomized algorithms, in the static setting (where they should err with bounded probability on the worst-case stream); or when only deterministic and infallible algorithms are required. Some recent works have considered the adversarial setting, in which a randomized streaming algorithm must succeed even on data streams provided by an adaptive adversary that can see the intermediate outputs of the algorithm. In order to better understand the differences between these models, we study a streaming task called “Missing Item Finding”. In this problem, for r < n, one is given a data stream a_1, …, a_r of elements in [n] (possibly with repetitions), and must output some x ∈ [n] which does not equal any of the a_i. We prove that, for r = n^Θ(1) and δ = 1/poly(n), the space required for randomized algorithms that solve this problem in the static setting with error δ is Θ(polylog(n)); for algorithms in the adversarial setting with error δ, Θ((1 + r^2/n) polylog(n)); and for deterministic algorithms, Θ(r/polylog(n)). Because our adversarially robust algorithm relies on free access to a string of O(r log n) random bits, we investigate a “random start” model of streaming algorithms where all random bits used are included in the space cost. Here we find a conditional lower bound on the space usage, which depends on the space that would be needed for a pseudo-deterministic algorithm to solve the problem. We also prove an Ω(r/polylog(n)) lower bound for the space needed by a streaming algorithm with < 1/2^polylog(n) error against “white-box” adversaries that can see the internal state of the algorithm, but not predict its future random decisions.