Timely Reporting of Heavy Hitters using External Memory
Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature.
We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).
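For concreteness (illustrative numbers, not taken from the paper): on a stream of N = 10^9 items with φ = 10^-3, the reporting threshold is T = φN = 10^6, so an element must be flagged shortly after its millionth occurrence.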
Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).
We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only …
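A minimal sketch of the general flavor of such a design, in Python: exact counts kept in an in-RAM buffer and merged to an on-disk table in batches. The class name, the buffering policy, and the use of shelve are our own illustrative assumptions, not the paper's actual data structure:

    import shelve  # standard-library on-disk key-value store

    class ExternalCounter:
        """Illustrative sketch: exact counting with an in-RAM buffer
        flushed to an on-disk table in batches (not the paper's
        actual data structure)."""

        def __init__(self, path, threshold, buffer_limit=100_000):
            self.disk = shelve.open(path)     # external-memory counts (string keys)
            self.buffer = {}                  # in-RAM partial counts
            self.buffer_limit = buffer_limit  # tunes reporting delay vs. I/O cost
            self.threshold = threshold        # T = phi * N
            self.reported = set()

        def insert(self, item):
            key = str(item)  # shelve requires string keys
            self.buffer[key] = self.buffer.get(key, 0) + 1
            if len(self.buffer) >= self.buffer_limit:
                self.flush()

        def flush(self):
            # One batched merge into the on-disk table, so the cost is
            # governed by I/O bandwidth rather than per-item latency.
            for key, c in self.buffer.items():
                total = self.disk.get(key, 0) + c
                self.disk[key] = total
                if total >= self.threshold and key not in self.reported:
                    self.reported.add(key)
                    print("heavy hitter:", key)  # within one buffer of the T-th occurrence
            self.buffer.clear()

In this sketch the reporting delay is at most one buffer of insertions: raising buffer_limit batches more updates per flush (lower I/O overhead) at the cost of a longer delay, mirroring the tunable trade-off the abstract describes.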
 Publication Date:
 NSF-PAR ID:
 10298566
 Journal Name:
 SIGMOD Record
 ISSN:
 1943-5835
 Sponsoring Org:
 National Science Foundation
More Like this


There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just its output. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that non-trivial algorithms are still possible. We first give a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a …
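For reference, the classical deterministic Misra-Gries algorithm that this result is measured against; a textbook Python sketch (not the paper's randomized algorithm):

    def misra_gries(stream, k):
        """Deterministic Misra-Gries summary with k-1 counters. Every
        item with frequency > len(stream)/k survives as a candidate
        (false positives possible; a second pass can verify counts)."""
        counters = {}
        for x in stream:
            if x in counters:
                counters[x] += 1
            elif len(counters) < k - 1:
                counters[x] = 1
            else:
                # decrement all counters; drop any that reach zero
                for key in list(counters):
                    counters[key] -= 1
                    if counters[key] == 0:
                        del counters[key]
        return counters  # candidates with (under)estimated frequencies

    # Example: "a" occurs 4 of 7 times (> 7/4), so it survives with k = 4.
    candidates = misra_gries(["a", "b", "a", "c", "a", "a", "d"], k=4)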

Obeid, Iyad Selesnick (Ed.) — Electroencephalography (EEG) is a popular clinical monitoring tool used for diagnosing brain-related disorders such as epilepsy [1]. As monitoring EEGs in a critical-care setting is an expensive and tedious task, there is great interest in developing real-time EEG monitoring tools to improve patient care quality and efficiency [2]. However, clinicians require automatic seizure detection tools that provide decisions with at least 75% sensitivity and less than 1 false alarm (FA) per 24 hours [3]. Some commercial tools recently claim to reach such performance levels, including the Olympic Brainz Monitor [4] and Persyst 14 [5]. In this abstract, we describe our efforts to transform a high-performance offline seizure detection system [3] into a low-latency real-time or online seizure detection system. An overview of the system is shown in Figure 1. The main difference between an online and an offline system is that an online system must always be causal and have minimum latency, which is often defined by domain experts. The offline system, shown in Figure 2, uses two phases of deep learning models with post-processing [3]. The channel-based long short-term memory (LSTM) model (Phase 1 or P1) processes linear frequency cepstral coefficient (LFCC) [6] features from each EEG …
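A minimal sketch of the causality and latency constraint this abstract describes: decisions are emitted at a fixed hop using only past samples. The window and hop parameters and the classify callback are placeholders of our own, not the system's real LFCC/LSTM components:

    import collections

    def online_detector(sample_stream, window_len, step, classify):
        """Causal, bounded-latency loop: emit a decision after every
        `step` new samples using only samples seen so far."""
        buf = collections.deque(maxlen=window_len)
        since_last = 0
        for t, sample in enumerate(sample_stream):
            buf.append(sample)  # only past and current samples: causal
            since_last += 1
            if len(buf) == window_len and since_last >= step:
                since_last = 0
                yield t, classify(list(buf))  # latency bounded by `step`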

The problem of recovering heavy components of a high-dimensional vector from compressed data is of great interest in broad applications, such as feature extraction under scarce computing memory and distributed learning under limited bandwidth. Recently, a compression algorithm called count sketch has gained wide popularity for recovering heavy components in various fields. In this paper, we carefully analyze count sketch and show that its default recovery method, namely median filtering, has a distinct error pattern of reporting false positives. To counteract this error pattern, we propose a new scheme called zero checking, which adopts a two-step recovery approach to improve the probability of detecting false positives. Our proposed technique builds on rigorous error analysis, which enables us to optimize the selection of a key design parameter for maximum performance gain. The empirical results show that our scheme achieves better recovery accuracy than median filtering and requires fewer samples to accurately recover heavy components.
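For readers unfamiliar with count sketch, a compact Python baseline showing the structure and the median-filtering recovery this paper analyzes. The hashing here is an illustrative stand-in for proper pairwise-independent hash families, and the proposed zero-checking refinement is not reproduced:

    import random
    import statistics

    class CountSketch:
        """Textbook count sketch with median-based recovery (the
        baseline whose false-positive pattern the paper studies)."""

        def __init__(self, rows, cols, seed=0):
            rng = random.Random(seed)
            self.cols = cols
            # per-row salts standing in for independent hash functions
            self.salts = [rng.getrandbits(64) for _ in range(rows)]
            self.table = [[0] * cols for _ in range(rows)]

        def _bucket_sign(self, salt, item):
            h = hash((salt, item))  # illustrative, not provably pairwise-independent
            return h % self.cols, 1 if (h >> 32) & 1 else -1

        def update(self, item, delta=1):
            for r, salt in enumerate(self.salts):
                j, s = self._bucket_sign(salt, item)
                self.table[r][j] += s * delta

        def estimate(self, item):
            # median filtering: the default recovery method
            ests = []
            for r, salt in enumerate(self.salts):
                j, s = self._bucket_sign(salt, item)
                ests.append(s * self.table[r][j])
            return statistics.median(ests)

    # Usage: estimates concentrate around the true count of heavy items.
    cs = CountSketch(rows=5, cols=256)
    for token in ["x"] * 1000 + ["y"] * 10:
        cs.update(token)
    print(cs.estimate("x"))  # close to 1000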

The data management of large companies often prioritizes more recent data, as a source of higher-accuracy prediction than outdated data. For example, the Facebook data policy retains user search histories for months, while the Google data retention policy states that browser information may be stored for up to months. These policies are captured by the sliding window model, in which only the most recent statistics form the underlying dataset. In this paper, we consider the problem of privately releasing the L2-heavy hitters in the sliding window model, which include Lp-heavy hitters for p <= 2 and in some sense are the strongest possible guarantees that can be achieved using polylogarithmic space, but cannot be handled by existing techniques due to the subadditivity of the L2 norm. Moreover, existing non-private sliding window algorithms use the smooth histogram framework, which has high sensitivity. To overcome these barriers, we introduce the first differentially private algorithm for L2-heavy hitters in the sliding window model by initiating a number of L2-heavy hitter algorithms across the stream with a significantly lower threshold. Similarly, we augment the algorithms with an approximate frequency-tracking algorithm with significantly higher accuracy. We then use smooth sensitivity and statistical distance arguments to show …
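One way to picture the "many instances across the stream" idea is a staggered-checkpoint skeleton. This is a generic sketch under our own assumptions: the differential privacy mechanism, the L2-heavy hitter subroutine behind make_instance, and the smooth-sensitivity analysis are all omitted:

    def staggered_instances(stream, window, stride, make_instance):
        """Start a fresh heavy-hitter instance every `stride` items, so
        some live instance begins within `stride` items of any sliding
        window's left edge. Keeps at most ceil(window/stride) instances."""
        instances = []  # (start_time, instance) pairs, oldest first
        for t, item in enumerate(stream):
            if t % stride == 0:
                instances.append((t, make_instance()))
            # retire instances whose start precedes the current window
            while instances and instances[0][0] < t - window + 1:
                instances.pop(0)
            for _, inst in instances:
                inst.update(item)  # each live instance sees its suffix
        # the oldest surviving instance best approximates the window
        return instances[0][1] if instances else None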