Timely Reporting of Heavy Hitters using External Memory
Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature.
We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).
Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).
We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.
We implement and validate our data structures empirically using the Firehose streaming benchmark. Multithreaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
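To make the φN reporting threshold concrete, here is a minimal exact-counting sketch of timely reporting (the function name and `phi` parameter are illustrative, not from the paper; this in-RAM approach keeps a counter per distinct item, i.e., the Ω(N)-word space cost the paper's external-memory structures are designed to avoid):

```python
import math
from collections import defaultdict

def ted_report(stream, phi):
    """Report each item the moment it reaches its T = ceil(phi * N)-th
    occurrence. Exact counting: no false positives or negatives, but it
    uses up to one counter per distinct item (Omega(N) words)."""
    n = len(stream)                      # assumes N is known up front
    threshold = math.ceil(phi * n)       # T = phi * N occurrences
    counts = defaultdict(int)
    reports = []
    for t, item in enumerate(stream):
        counts[item] += 1
        if counts[item] == threshold:    # exactly the T-th occurrence
            reports.append((t, item))    # timely: emitted at timestep t
    return reports
```

For example, with φ = 0.3 on a stream of 10 items, an item is reported at its 3rd occurrence.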
NSF-PAR ID: 10298566
Journal Name: SIGMOD Record
ISSN: 1943-5835
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


The problem of recovering heavy components of a high-dimensional vector from compressed data is of great interest in broad applications, such as feature extraction under scarce computing memory and distributed learning under limited bandwidth. Recently, a compression algorithm called count sketch has gained wide popularity for recovering heavy components in various fields. In this paper, we carefully analyze count sketch and illustrate that its default recovery method, namely median filtering, has a distinct error pattern of reporting false positives. To counteract this error pattern, we propose a new scheme called zero checking, which adopts a two-step recovery approach to improve the probability of detecting false positives. Our proposed technique builds on rigorous error analysis, which enables us to optimize the selection of a key design parameter for maximum performance gain. The empirical results show that our scheme achieves better recovery accuracy than median filtering and requires fewer samples to accurately recover heavy components.
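As background for the abstract above, a minimal count sketch with the default median-filtering recovery it discusses might look like the following (class and parameter names are illustrative; the paper's zero-checking refinement is not implemented here):

```python
import random
import statistics

class CountSketch:
    """Count sketch: d rows of w signed counters. Each row hashes an item
    to a bucket and a +/-1 sign; estimates use the median across rows
    (the "median filtering" recovery analyzed in the paper)."""

    def __init__(self, rows=5, width=256, seed=0):
        rng = random.Random(seed)
        self.rows, self.width = rows, width
        self.table = [[0] * width for _ in range(rows)]
        # independent per-row seeds for the bucket hash and the sign hash
        self.seeds = [(rng.getrandbits(32), rng.getrandbits(32))
                      for _ in range(rows)]

    def _bucket(self, i, x):
        return hash((self.seeds[i][0], x)) % self.width

    def _sign(self, i, x):
        return 1 if hash((self.seeds[i][1], x)) & 1 else -1

    def update(self, x, delta=1):
        for i in range(self.rows):
            self.table[i][self._bucket(i, x)] += self._sign(i, x) * delta

    def estimate(self, x):
        # median filtering over the per-row unbiased estimates
        return statistics.median(
            self._sign(i, x) * self.table[i][self._bucket(i, x)]
            for i in range(self.rows)
        )
```

A heavy item's estimate is accurate because collisions with light items mostly cancel in sign and are suppressed by the median; querying an absent item can still return a nonzero value, which is the false-positive pattern the abstract refers to.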

There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just its output. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a stream with insertions and deletions, achieving a multiplicative approximation and sublinear space; such an algorithm is impossible for deterministic algorithms. We also give a general technique that translates any two-player deterministic communication lower bound to a lower bound for randomized algorithms robust to a white-box adversary. In particular, our results show that for all p ≥ 0, there exists a constant Cp > 1 such that any Cp-approximation algorithm for Fp-moment estimation in insertion-only streams with a white-box adversary requires Ω(n) space for a universe of size n.
Similarly, there is a constant C > 1 such that any C-approximation algorithm in an insertion-only stream for matrix rank requires Ω(n) space with a white-box adversary. These results do not contradict our upper bounds, since they assume the adversary has unbounded computational power. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. Finally, we prove a lower bound of Ω(log n) bits for the fundamental problem of deterministic approximate counting in a stream of 0's and 1's, which holds even if we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound for approximate counting with additional information was previously unknown, and in our context, it shows a separation between multi-player deterministic maximum communication and the white-box space complexity of a streaming algorithm.
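For reference, the deterministic Misra-Gries algorithm named in the abstract above can be sketched as follows (a textbook version with k counters, not the paper's white-box-robust randomized variant; function name is illustrative):

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k counters: any item occurring more than
    len(stream) / (k + 1) times is guaranteed to survive in the summary,
    and each surviving counter underestimates the true frequency by at
    most len(stream) / (k + 1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # all k counters are occupied by other items: decrement every
            # counter and evict those that reach zero
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```

Because the algorithm is deterministic, a white-box adversary can simulate it and adaptively craft a worst-case stream, which is the weakness the paper's randomized algorithm exploits to do better on long streams.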

The data management practices of large companies often prioritize more recent data, as a source of higher-accuracy prediction than outdated data. For example, the Facebook data policy retains user search histories for months, while the Google data retention policy states that browser information may be stored for up to months. These policies are captured by the sliding window model, in which only the most recent statistics form the underlying dataset. In this paper, we consider the problem of privately releasing the L2-heavy hitters in the sliding window model, which include Lp-heavy hitters for p ≤ 2 and in some sense are the strongest possible guarantees that can be achieved using polylogarithmic space, but cannot be handled by existing techniques due to the subadditivity of the L2 norm. Moreover, existing non-private sliding window algorithms use the smooth histogram framework, which has high sensitivity. To overcome these barriers, we introduce the first differentially private algorithm for L2-heavy hitters in the sliding window model by initiating a number of L2-heavy hitter algorithms across the stream with significantly lower threshold. Similarly, we augment the algorithms with an approximate frequency tracking algorithm with significantly higher accuracy. We then use smooth sensitivity and statistical distance arguments to show that we can add noise proportional to an estimation of the norm. To the best of our knowledge, our techniques are the first to privately release statistics related to a subadditive function in the sliding window model, and they may be of independent interest to future differentially private algorithmic design in the sliding window model.

System operators are often interested in extracting different feature streams from multidimensional data streams and reporting their distributions at regular intervals, including the heavy hitters that contribute to the tail portion of the feature distribution. Satisfying these requirements at increasing data rates with limited resources is challenging. This paper presents the design and implementation of Panakos, which makes the best use of available resources to report a given feature's distribution accurately, its tail contributors, and other stream statistics (e.g., cardinality, entropy, etc.). Our key idea is to leverage the skewness inherent to most feature streams in the real world. We leverage this skewness by disentangling the feature stream into hot, warm, and cold items based on their feature values. We then use different data structures for tracking objects in each category. Panakos provides solid theoretical guarantees and achieves high performance for various tasks. We have implemented Panakos in both software and hardware and compared Panakos to other state-of-the-art sketches using synthetic and real-world datasets. The experimental results demonstrate that Panakos often achieves one order of magnitude better accuracy than state-of-the-art solutions for a given memory budget.