Title: Space-efficient Query Evaluation over Probabilistic Event Streams
Abstract: Real-time decision making in IoT applications relies upon space-efficient evaluation of queries over streaming data. To model the uncertainty in the classification of data being processed, we consider the model of probabilistic strings: sequences of discrete probability distributions over a finite set of events. We initiate the study of the space complexity of streaming computation for different classes of queries over such probabilistic strings. We first consider the problem of computing the probability that a word, sampled from the distribution defined by the probabilistic string read so far, is accepted by a given deterministic finite automaton (DFA). We show that this regular pattern matching problem can be solved using space that is only poly-logarithmic in the string length (and polynomial in the size of the DFA) if we allow a multiplicative approximation error. We then show how to generalize this result to quantitative queries specified by additive cost register automata: automata that map strings to numerical values using finite control and registers that are updated by linear transformations. Finally, we consider the case when updates in such an automaton involve tests, and in particular, when there is a counter variable that can be incremented or decremented, but decrements apply only when the counter value is non-zero. In this case, the desired answer depends on the probability distribution over the set of possible counter values, which range from 0 to n for a string of length n. Under a mild assumption, namely that the probabilities of the individual events are bounded away from 0 and 1, we show that there is an algorithm that can compute all n entries of this probability distribution vector to within additive 1/poly(n) error using space that is only Õ(√n). In establishing these results, we introduce several new technical ideas that may prove useful for designing space-efficient algorithms for other query models over probabilistic strings.
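For concreteness, here is a minimal Python sketch (all names and encodings are ours, not the paper's) of the exact dynamic program underlying the first problem: maintain one probability per DFA state and fold in each event distribution as it arrives. The exact version needs registers whose precision grows with the stream length; the paper's contribution is a multiplicative approximation that avoids this blow-up and runs in poly-logarithmic space.

```python
from collections import defaultdict

def acceptance_probability(delta, start, accepting, stream):
    """Probability that a word sampled from the probabilistic string
    seen so far is accepted by the DFA (delta: (state, event) -> state).
    Maintains one probability per DFA state and folds in each event
    distribution as it arrives."""
    v = defaultdict(float)
    v[start] = 1.0
    for dist in stream:                      # dist: event -> probability
        nxt = defaultdict(float)
        for q, pq in v.items():
            for a, pa in dist.items():
                nxt[delta[(q, a)]] += pq * pa
        v = nxt
    return sum(v[q] for q in accepting)

# DFA accepting words with an even number of 1s, over two event
# distributions; the answer is P(00) + P(11) = 0.5*0.9 + 0.5*0.1 = 0.5.
delta = {('e', 0): 'e', ('e', 1): 'o', ('o', 0): 'o', ('o', 1): 'e'}
print(acceptance_probability(delta, 'e', {'e'}, [{0: 0.5, 1: 0.5},
                                                 {0: 0.9, 1: 0.1}]))
```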
Award ID(s):
1763514
PAR ID:
10163320
Date Published:
2020
Journal Name:
LICS'20: 35th Annual ACM/IEEE Symposium on Logic in Computer Science
Page Range / eLocation ID:
74 to 87
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Etessami, Kousha; Feige, Uriel; Puppis, Gabriele (Eds.)
    Many streaming algorithms provide only a high-probability relative approximation. These two relaxations, allowing approximation and allowing randomization, seem necessary: for many streaming problems, both must be employed simultaneously to avoid an exponentially larger (and often trivial) space complexity. A common drawback of these randomized approximate algorithms is that independent executions on the same input produce different outputs, depending on their random coins. Pseudo-deterministic algorithms combat this issue: for every input, they output with high probability the same "canonical" solution. We consider perhaps the most basic problem in data streams, counting the number of items in a stream of length at most n. Morris's counter [CACM, 1978] is a randomized approximation algorithm for this problem that uses O(log log n) bits of space, for every fixed approximation factor (greater than 1). Goldwasser, Grossman, Mohanty and Woodruff [ITCS 2020] asked whether pseudo-deterministic approximation algorithms can match this space complexity. Our main result answers their question negatively, showing that such algorithms must use Ω(√(log n / log log n)) bits of space. Our approach is based on a problem that we call Shift Finding, which may be of independent interest. In this problem, one has query access to a shifted version of a known string F ∈ {0,1}^{3n}, which is guaranteed to start with n zeros and end with n ones, and the goal is to find the unknown shift using a small number of queries. We give an algorithm for this problem that uses O(√n) queries. It remains open whether poly(log n) queries suffice; if true, then our techniques immediately imply a nearly-tight Ω(log n / log log n) space bound for pseudo-deterministic approximate counting.
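    For reference, a minimal sketch of Morris's counter as described above: store only an exponent X (so O(log log n) bits), increment it with probability 2^-X, and report 2^X - 1, an unbiased estimate of the count. This base-2 variant has constant relative error; the fixed-approximation-factor versions use a base closer to 1 or average independent copies.

```python
import random

def morris_count(stream_length):
    """Morris's counter: keep only X ~ log2(count), i.e. O(log log n)
    bits. On each stream item, increment X with probability 2^-X;
    2^X - 1 is then an unbiased estimate of the true count."""
    x = 0
    for _ in range(stream_length):
        if random.random() < 2.0 ** (-x):
            x += 1
    return (1 << x) - 1

# Independent runs disagree -- exactly the behavior that
# pseudo-deterministic algorithms are meant to rule out:
print([morris_count(10_000) for _ in range(5)])
```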
  2. We study the problem of finding all k-periods of a length-n string S, presented as a data stream. S is said to have k-period p if its prefix of length n-p differs from its suffix of length n-p in at most k locations. We give a one-pass streaming algorithm that computes the k-periods of a string S using poly(k, log n) bits of space, for k-periods of length at most n/2. We also present a two-pass streaming algorithm that computes the k-periods of S using poly(k, log n) bits of space, regardless of period length. We complement these results with comparable lower bounds.
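    To pin down the definition, here is a naive offline check (ours, for illustration; it stores the whole string, in contrast to the poly(k, log n)-bit streaming algorithms above):

```python
def k_periods(s, k):
    """p is a k-period of s if s's prefix and suffix of length n - p
    (i.e. s compared against itself shifted by p) disagree in at most
    k positions. Naive check: O(n) space and O(n^2) time."""
    n = len(s)
    return [p for p in range(1, n + 1)
            if sum(s[i] != s[i + p] for i in range(n - p)) <= k]

print(k_periods("abcabcabd", 1))  # [3, 6, 8, 9]; e.g. p = 3 has the
                                  # single mismatch 'c' vs 'd'
```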
  3. Streaming codes take a string of source symbols as input and output a string of coded symbols in real time, which effectively eliminates queueing delay; they are regarded as a promising scheme for low-latency communications. Aiming to quantify the fundamental latency performance of random linear streaming codes (RLSCs) over i.i.d. symbol erasure channels, this work derives the exact error probability under simultaneous finite memory length and finite decoding deadline constraints. The result is then used to examine, for the first time in the literature, the tradeoff among memory length (complexity), decoding deadline (delay), and error probability (reliability) of RLSCs. Two critical observations are made: (i) too much memory can adversely impact performance under a finite decoding deadline constraint, a surprising finding not captured by the traditional wisdom that larger memory length monotonically improves performance in the asymptotic regime; (ii) the end-to-end delay of an RLSC is roughly 50% of that of an MDS block code under identical code rate and error probability requirements. This implies that switching from block codes to RLSCs eliminates the queueing delay (hence the roughly 50% reduction) while having little negative impact on the error probability.
  4. Finite state automata are widely used to accelerate pattern matching in many emerging application domains like DNA sequencing and XML parsing. Conventional CPUs and compute-centric accelerators are bottlenecked by memory bandwidth and irregular memory access patterns in automata processing. We present Cache Automaton, which repurposes last-level cache for automata processing, and a compiler that automates the process of mapping large real-world non-deterministic finite automata (NFAs) to the proposed architecture. Cache Automaton extends a conventional last-level cache architecture with components to accelerate two phases in NFA processing: state-match and state-transition. State-match is made efficient using a sense-amplifier cycling technique that exploits spatial locality in symbol matches. State-transition is made efficient using a new compact switch architecture. By overlapping these two phases for adjacent symbols we realize an efficient pipelined design. We evaluate two designs, one optimized for performance and the other optimized for space, across a set of 20 diverse benchmarks. The performance-optimized design provides a speedup of 15× over Micron's DRAM-based Automata Processor and 3840× over processing on a conventional x86 CPU. The proposed design utilizes on average 1.2 MB of cache space across benchmarks, while consuming 2.3 nJ of energy per input symbol. Our space-optimized design reduces cache utilization to 0.72 MB, while still providing a 9× speedup over the Automata Processor.
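    As a rough software analogue (ours, not the paper's hardware implementation) of those two phases, the sketch below simulates a homogeneous NFA, in which each state carries its own symbol set, by alternating an explicit state-match step and a state-transition step:

```python
def run_nfa(labels, edges, starts, reports, stream):
    """Toy analogue of the two phases Cache Automaton accelerates.
    States are homogeneous (each carries its own symbol set, as on
    Micron's Automata Processor); start states are treated as
    always-enabled so a match may begin at any input position.
    Illustrative dataflow only -- the paper realizes these phases
    with sense-amplifier cycling and a compact switch inside the
    cache, overlapping them in a pipeline."""
    active, hits = set(), []
    for i, sym in enumerate(stream):
        # State-match phase: states whose symbol set contains `sym`.
        matched = {q for q, syms in labels.items() if sym in syms}
        # State-transition phase: states enabled by an edge from the
        # currently active set (start states are always enabled).
        enabled = set(starts).union(*(edges.get(q, set()) for q in active))
        active = matched & enabled
        hits += [(i, q) for q in active & reports]
    return hits

# Report every occurrence of the pattern "ab":
labels = {'s0': {'a'}, 's1': {'b'}}
edges = {'s0': {'s1'}}
print(run_nfa(labels, edges, {'s0'}, {'s1'}, "xabab"))  # [(2, 's1'), (4, 's1')]
```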
  5. Santhanam, Rahul (Ed.)
    Adversarially robust streaming algorithms are required to process a stream of elements and produce correct outputs, even when each stream element can be chosen as a function of earlier algorithm outputs. As with classic streaming algorithms, which must only be correct for the worst-case fixed stream, adversarially robust algorithms with access to randomness can use significantly less space than deterministic algorithms. We prove that for the Missing Item Finding problem in streaming, the space complexity also significantly depends on how adversarially robust algorithms are permitted to use randomness. (In contrast, the space complexity of classic streaming algorithms does not depend as strongly on the way randomness is used.) For Missing Item Finding on streams of length 𝓁 with elements in {1,…,n}, and ≤ 1/poly(𝓁) error, we show that when 𝓁 = O(2^√{log n}), "random seed" adversarially robust algorithms, which only use randomness at initialization, require 𝓁^Ω(1) bits of space, while "random tape" adversarially robust algorithms, which may make random decisions at any time, may use O(polylog(𝓁)) random bits. When 𝓁 is between n^Ω(1) and O(√n), "random tape" adversarially robust algorithms need 𝓁^Ω(1) space, while "random oracle" adversarially robust algorithms, which can read from a long random string for free, may use O(polylog(𝓁)) space. The space lower bound for the "random seed" case follows, by a reduction given in prior work, from a lower bound for pseudo-deterministic streaming algorithms given in this paper. 