There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design estimates the number of distinct elements in a stream with insertions and deletions, achieving a multiplicative approximation in sublinear space; such a guarantee is impossible for deterministic algorithms. We also give a general technique that translates any two-player deterministic communication lower bound into a lower bound for randomized algorithms robust to a white-box adversary. In particular, our results show that for all p ≥ 0, there exists a constant C_p > 1 such that any C_p-approximation algorithm for F_p moment estimation in insertion-only streams with a white-box adversary requires Ω(n) space for a universe of size n. Similarly, there is a constant C > 1 such that any C-approximation algorithm in an insertion-only stream for matrix rank requires Ω(n) space with a white-box adversary. These results do not contradict our upper bounds, since they assume the adversary has unbounded computational power. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. Finally, we prove a lower bound of Ω(log n) bits for the fundamental problem of deterministic approximate counting in a stream of 0's and 1's, which holds even if we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound for approximate counting with additional information was previously unknown, and in our context, it shows a separation between multi-player deterministic maximum communication and the white-box space complexity of a streaming algorithm.
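The deterministic Misra-Gries summary that the randomized algorithm above is benchmarked against can be sketched in a few lines. This is a minimal sketch of the classic k-counter version (not the paper's white-box-robust algorithm): any item occurring more than N/(k+1) times in a stream of length N is guaranteed to survive among the counters.

```python
def misra_gries(stream, k):
    """Classic Misra-Gries summary with at most k counters.

    After one pass over a stream of length N, every item with
    frequency greater than N/(k+1) appears among the counters,
    undercounted by at most N/(k+1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Decrement every counter; drop any that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, on the stream [1]*6 + [2, 3, 4, 5] with k = 2, the item 1 (frequency 6 > 10/3) is retained, with a slightly deflated count.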
Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream may have only a few heavy items which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) eludes this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of $\ell_1$ heavy hitters and are important in streams that have a skewed distribution of weights. In addition to the upper bound, we also provide a lower bound on the message complexity that is nearly tight up to a $\log(1/\epsilon)$ factor. Finally, we use our weighted sampling algorithm to improve the message complexity of distributed $L_1$ tracking, also known as count tracking, which is a widely studied problem in distributed streaming. We also derive a tight message lower bound, which closes the message complexity of this fundamental problem.
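For intuition, weighted SWOR on a single (non-distributed) stream can be done with the classic key-based method of Efraimidis and Spirakis: draw u ~ Uniform(0,1) for each item of weight w, assign it the key u^(1/w), and keep the k items with the largest keys. The sketch below illustrates only this sampling primitive, not the message-optimal distributed protocol of the paper.

```python
import heapq
import random

def weighted_swor(stream, k, rng=random):
    """Weighted sampling without replacement (Efraimidis-Spirakis).

    Each (item, weight) pair gets key u**(1/weight), u ~ U(0,1);
    the k items with the largest keys form the sample.
    Single pass, O(k) space via a min-heap of keys."""
    heap = []  # min-heap of (key, item); root holds the smallest key
    for item, weight in stream:
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]
```

A heavy item (weight far above the rest) gets a key very close to 1 and is almost surely sampled, but at most once — the property that motivates SWOR in the abstract.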
NSF-PAR ID: 10110906
Date Published:
Journal Name: Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Page Range / eLocation ID: 218 to 235
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multithreaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
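To make the space barrier concrete, here is a naive exact in-RAM detector (a hypothetical illustration, not the paper's external-memory structure): it is accurate, sensitive, and timely, but it may hold one counter per distinct item, i.e., Ω(N) words in the worst case — exactly the cost the abstract says in-RAM algorithms avoid by sacrificing something else.

```python
def ted_exact(stream, threshold):
    """Exact, timely event detection: report an item the instant its
    count reaches threshold T = phi * N.

    No false positives or negatives and no reporting delay, but the
    counts dict can grow to one entry per distinct item (Omega(N)
    words), which is why practical in-RAM solutions approximate."""
    counts, reported = {}, []
    for i, x in enumerate(stream):
        counts[x] = counts.get(x, 0) + 1
        if counts[x] == threshold:
            reported.append((i, x))  # fires exactly at the T-th occurrence
    return reported
```

On the stream [1, 2, 1, 3, 1] with threshold 2, item 1 is reported at index 2, the moment its second occurrence arrives.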


We initiate the study of biologically inspired spiking neural networks from the perspective of streaming algorithms. Like computers, human brains face memory limitations, which pose a significant obstacle when processing large-scale and dynamically changing data. In computer science, these challenges are captured by the well-known streaming model, which can be traced back to Munro and Paterson '78 and has had significant impact in theory and beyond. In the classical streaming setting, one must compute a function f of a stream of updates 𝒮 = {u₁,…,u_m}, given restricted single-pass access to the stream. The primary complexity measure is the space used by the algorithm. In contrast to the large body of work on streaming algorithms, relatively little is known about the computational aspects of data processing in spiking neural networks. In this work, we seek to connect these two models, leveraging techniques developed for streaming algorithms to better understand neural computation. Our primary goal is to design networks for various computational tasks using as few auxiliary (non-input or output) neurons as possible. The number of auxiliary neurons can be thought of as the "space" required by the network. Previous algorithmic work in spiking neural networks has many similarities with streaming algorithms. However, the connection between these two space-limited models has not been formally addressed. We take the first steps towards understanding this connection. On the upper bound side, we design neural algorithms based on known streaming algorithms for fundamental tasks, including distinct elements, approximate median, and heavy hitters. The number of neurons in our solutions almost matches the space bounds of the corresponding streaming algorithms. As a general algorithmic primitive, we show how to implement the important streaming technique of linear sketching efficiently in spiking neural networks.
On the lower bound side, we give a generic reduction, showing that any space-efficient spiking neural network can be simulated by a space-efficient streaming algorithm. This reduction lets us translate streaming-space lower bounds into nearly matching neural-space lower bounds, establishing a close connection between the two models.
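The linear-sketching primitive mentioned above is, concretely, a random linear map applied to the stream's frequency vector; linearity is what makes sketches mergeable and robust to insertions and deletions. Below is a minimal Count-Min-style sketch in NumPy — an illustration of the standard streaming technique, not the spiking-network implementation from the paper.

```python
import numpy as np

def cm_buckets(n, d, w, seed=0):
    """Fixed random bucket assignments: d 'hash functions' mapping a
    universe of n items into w buckets, determined by the seed."""
    return np.random.default_rng(seed).integers(0, w, size=(d, n))

def cm_sketch(freq, buckets, w):
    """Count-Min sketch of a frequency vector: a linear map from
    R^n to a d x w array of bucket sums.  Linearity means
    cm_sketch(a) + cm_sketch(b) == cm_sketch(a + b)."""
    d = buckets.shape[0]
    sketch = np.zeros((d, w))
    for r in range(d):
        # Accumulate frequencies into each row's buckets (collisions add up).
        np.add.at(sketch[r], buckets[r], freq)
    return sketch

def cm_query(sketch, buckets, item):
    """Point query: minimum over rows; never underestimates when all
    frequencies are nonnegative."""
    d = buckets.shape[0]
    return min(sketch[r, buckets[r, item]] for r in range(d))
```

Because the sketch is linear, sketches of two sub-streams can be added coordinate-wise to get the sketch of their union — the mergeability used throughout streaming (and exploited in the reduction above).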

The Sparsest Cut is a fundamental optimization problem that has been extensively studied. For planar inputs the problem is in P and can be solved in Õ(n^3) time if all vertex weights are 1. Despite a significant amount of effort, the best algorithms date back to the early 90's and can only achieve an O(log n)-approximation in Õ(n) time or a 3.5-approximation in Õ(n^2) time [Rao, STOC '92]. Our main result is an Ω(n^{2−ε}) lower bound for Sparsest Cut even in planar graphs with unit vertex weights, under the (min,+)-Convolution conjecture, showing that approximations are inevitable in the near-linear time regime. To complement the lower bound, we provide a 3.3-approximation in near-linear time, improving upon the 25-year-old result of Rao in both time and accuracy. We also show that our lower bound is not far from optimal, by observing an exact algorithm with running time Õ(n^{5/2}) that improves upon the Õ(n^3) algorithm of Park and Phillips [STOC '93]. Our lower bound accomplishes a repeatedly raised challenge by being the first fine-grained lower bound for a natural planar graph problem in P. Building on our construction, we prove near-quadratic lower bounds under SETH for variants of the closest pair problem in planar graphs, and use them to show that the popular Average-Linkage procedure for Hierarchical Clustering cannot be simulated in truly subquadratic time. At the core of our constructions is a diamond-like gadget that also settles the complexity of Diameter in distributed planar networks. We prove an Ω(n/log n) lower bound on the number of communication rounds required to compute the weighted diameter of a network in the CONGEST model, even when the underlying graph is planar and all nodes are D = 4 hops away from each other.
This is the first poly(n) lower bound in the planar-distributed setting, and it complements the recent poly(D, log n) upper bounds of Li and Parter [STOC 2019] for (exact) unweighted diameter and for (1+ε)-approximate weighted diameter.