skip to main content

Weighted Reservoir Sampling from Distributed Streams
We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights are equal, is well studied, and admits tight upper and lower bounds on message complexity. For weighted sampling with replacement, there is a simple reduction to unweighted sampling with replacement. However, in many applications the stream may have only a few heavy items which may dominate a random sample when chosen with replacement. Weighted sampling without replacement (weighted SWOR) eludes this issue, since such heavy items can be sampled at most once. In this work, we present the first message-optimal algorithm for weighted SWOR from a distributed stream. Our algorithm also has optimal space and time complexity. As an application of our algorithm for weighted SWOR, we derive the first distributed streaming algorithms for tracking heavy hitters with residual error. Here the goal is to identify stream items that contribute significantly to the residual stream, once the heaviest items are removed. Residual heavy hitters generalize the notion of $\ell_1$ heavy hitters and are important in streams that have a skewed distribution of weights. In more »
Authors:
; ; ;
Award ID(s):
Publication Date:
NSF-PAR ID:
10110906
Journal Name:
Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
Page Range or eLocation-ID:
218 to 235
Sponsoring Org:
National Science Foundation
##### More Like this
1. There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in amore »
2. We initiate the study of biologically-inspired spiking neural networks from the perspective of streaming algorithms. Like computers, human brains face memory limitations, which pose a significant obstacle when processing large scale and dynamically changing data. In computer science, these challenges are captured by the well-known streaming model, which can be traced back to Munro and Paterson `78 and has had significant impact in theory and beyond. In the classical streaming setting, one must compute a function f of a stream of updates 𝒮 = {u₁,…,u_m}, given restricted single-pass access to the stream. The primary complexity measure is the space used by the algorithm. In contrast to the large body of work on streaming algorithms, relatively little is known about the computational aspects of data processing in spiking neural networks. In this work, we seek to connect these two models, leveraging techniques developed for streaming algorithms to better understand neural computation. Our primary goal is to design networks for various computational tasks using as few auxiliary (non-input or output) neurons as possible. The number of auxiliary neurons can be thought of as the "space" required by the network. Previous algorithmic work in spiking neural networks has many similarities with streaming algorithms.more »
3. The Sparsest Cut is a fundamental optimization problem that have been extensively studied. For planar inputs the problem is in P and can be solved in Õ(n 3 ) time if all vertex weights are 1. Despite a significant amount of effort, the best algorithms date back to the early 90’s and can only achieve O(log n)-approximation in Õ(n) time or 3.5-approximation in Õ(n 2 ) time [Rao, STOC92]. Our main result is an Ω(n 2−ε ) lower bound for Sparsest Cut even in planar graphs with unit vertex weights, under the (min, +)-Convolution conjecture, showing that approxima- tions are inevitable in the near-linear time regime. To complement the lower bound, we provide a 3.3-approximation in near-linear time, improving upon the 25-year old result of Rao in both time and accuracy. We also show that our lower bound is not far from optimal by observing an exact algorithm with running time Õ(n 5/2 ) improving upon the Õ(n 3 ) algorithm of Park and Phillips [STOC93]. Our lower bound accomplishes a repeatedly raised challenge by being the first fine-grained lower bound for a natural planar graph problem in P. Building on our construction we prove near-quadratic lower bounds under SETHmore »
4. In the unit-cost comparison model, a black box takes an input two items and outputs the result of the comparison. Problems like sorting and searching have been studied in this model, and it has been general- ized to include the concept of priced information, where different pairs of items (say database records) have different comparison costs. These comparison costs can be arbitrary (in which case no algorithm can be close to optimal (Charikar et al. STOC 2000)), structured (for exam- ple, the comparison cost may depend on the length of the databases (Gupta et al. FOCS 2001)), or stochastic (Angelov et al. LATIN 2008). Motivated by the database setting where the cost depends on the sizes of the items, we consider the problems of sorting and batched predecessor where two non-uniform sets of items A and B are given as input. (1) In the RAM setting, we consider the scenario where both sets have n keys each. The cost to compare two items in A is a, to compare an item of A to an item of B is b, and to compare two items in B is c. We give upper and lower bounds for the case a ≤more »
5. Given an input stream of size N , a -heavy hiter is an item that occurs at least N times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = N - th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space ((N ) words). Thus in-RAM heavy-hitters algorithms typically sacrfice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to exter- nal memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeli- ness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between report- ing delay and I/O overhead. With a small bounded reporting delay, ourmore »