Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n. Given the sketch and a query item y, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y. Most works to date have focused on additive εn error approximation, culminating in the KLL sketch that achieves optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from the demand to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient prior algorithms store either O(log(ε^2 n)/ε^2) or O(log^3(εn)/ε) universe items. We present a randomized sketch storing O(log^{1.5}(εn)/ε) items that can (1 ± ε)-approximate the rank of each universe item with high constant probability; this space bound is within an O(√(log(εn))) factor of optimal. Our algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments.
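To make the gap between the two guarantees concrete, here is a minimal sketch (an illustration of the definitions only, not the paper's algorithm; the stream, query value, and ε below are hypothetical) contrasting additive εn error with multiplicative (1 ± ε) error for a query near the minimum of the stream.

```python
import bisect

def exact_rank(sorted_stream, y):
    """Rank of y: number of stream elements <= y."""
    return bisect.bisect_right(sorted_stream, y)

def meets_additive(est, true_rank, eps, n):
    """Additive guarantee: |est - rank| <= eps * n."""
    return abs(est - true_rank) <= eps * n

def meets_multiplicative(est, true_rank, eps):
    """Multiplicative guarantee: |est - rank| <= eps * rank,
    i.e., est lies in [(1 - eps) * rank, (1 + eps) * rank]."""
    return abs(est - true_rank) <= eps * true_rank

# Example: a query near the minimum of the stream.  The additive
# guarantee tolerates error eps*n even when the true rank is tiny,
# while the multiplicative guarantee forces an accurate estimate.
stream = sorted(range(1, 10_001))   # n = 10,000
eps = 0.01
y = 5                               # true rank is 5
r = exact_rank(stream, y)
print(meets_additive(100, r, eps, len(stream)))   # True  (error 95 <= eps*n = 100)
print(meets_multiplicative(100, r, eps))          # False (error 95 >> eps*rank = 0.05)
```

Near the tail, an estimate that is perfectly acceptable under the additive guarantee can be wildly off in relative terms, which is exactly the regime the multiplicative-error sketch targets.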
Streaming Algorithms for Support-Aware Histograms
Histograms, i.e., piece-wise constant approximations, are a popular tool used to represent data distributions. Traditionally, the difference between the histogram and the underlying distribution (i.e., the approximation error) is measured using the L_p norm, which sums the differences between the two functions over all items in the domain. Although useful in many applications, the drawback of this error measure is that it treats approximation errors of all items in the same way, irrespective of whether the mass of an item is important for the downstream application that uses the approximation. As a result, even relatively simple distributions cannot be approximated by succinct histograms without incurring large error. In this paper, we address this issue by adapting the definition of approximation so that only the errors of the items that belong to the support of the distribution are considered. Under this definition, we develop efficient 1-pass and 2-pass streaming algorithms that compute near-optimal histograms in sub-linear space. We also present lower bounds on the space complexity of this problem. Surprisingly, under this notion of error, there is an exponential gap in the space complexity of 1-pass and 2-pass streaming algorithms. Finally, we demonstrate the utility of our algorithms on a collection of real and synthetic data sets.
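As an illustration of the error measure being changed, the following sketch (with a hypothetical bucket representation and a toy two-spike distribution) computes the classical L_1 error over the whole domain next to the support-restricted error.

```python
def histogram_value(buckets, x):
    """Evaluate a piecewise-constant histogram, given as a list of
    (start, end, value) pieces covering the domain, at point x."""
    for start, end, value in buckets:
        if start <= x <= end:
            return value
    return 0.0

def lp_error(dist, buckets, domain, p=1):
    """Classical L_p error: every item in the domain contributes."""
    return sum(abs(dist.get(x, 0.0) - histogram_value(buckets, x)) ** p
               for x in domain) ** (1.0 / p)

def support_error(dist, buckets, p=1):
    """Support-aware error: only items with nonzero mass contribute."""
    support = [x for x, m in dist.items() if m > 0]
    return sum(abs(dist[x] - histogram_value(buckets, x)) ** p
               for x in support) ** (1.0 / p)

# A sparse distribution on a large domain: two spikes far apart.
domain = range(100)
dist = {10: 0.5, 90: 0.5}
# A single-bucket histogram matching the spikes but spreading mass
# over the empty region in between.
buckets = [(0, 99, 0.5)]
print(lp_error(dist, buckets, domain))   # 49.0: penalized on 98 zero-mass items
print(support_error(dist, buckets))      # 0.0: exact on the support
```

The single-bucket histogram is a poor approximation under the classical measure solely because of the zero-mass items between the spikes, while it is exact once the error is restricted to the support.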
- Award ID(s): 2022448
- PAR ID: 10341771
- Journal Name: International Conference on Machine Learning
- Page Range / eLocation ID: 3184-3203
- Sponsoring Org: National Science Foundation
More Like this
In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on every update, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with small space and few memory writes are possible. We first demonstrate that, for the fundamental F_p moment estimation problem with p ≥ 1, any streaming algorithm that achieves a constant factor approximation must make Ω(n^{1-1/p}) internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm that also has near-optimal space complexity. Specifically, we give a (1+ε)-approximation algorithm for F_p moment estimation that uses a near-optimal Õ_ε(n^{1-1/p}) number of state changes while simultaneously achieving near-optimal space: for p ∈ [1,2), our algorithm uses poly(log n, 1/ε) bits of space, while for p > 2, it uses Õ_ε(n^{1-1/p}) space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.
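For intuition about algorithms that rarely write to memory, here is a classical point of comparison (the Morris approximate counter, not the F_p algorithm of the paper above): its single stored exponent changes only O(log n) times in expectation over n updates, whereas a naive counter changes state on every update.

```python
import random

def morris_counter(stream_length, seed=0):
    """Approximate counting with few memory writes: the stored exponent
    is only updated with probability 2**(-exponent), so the expected
    number of writes over n updates is O(log n) rather than n.
    (A classical illustration of trading accuracy for fewer state
    changes; not the F_p moment algorithm described above.)"""
    rng = random.Random(seed)
    exponent = 0          # the entire internal state
    writes = 0            # number of times the state was changed
    for _ in range(stream_length):
        if rng.random() < 2.0 ** (-exponent):
            exponent += 1
            writes += 1
    estimate = 2 ** exponent - 1   # unbiased estimate of the true count
    return estimate, writes

est, writes = morris_counter(100_000)
print(f"true count = 100000, estimate = {est}, state changes = {writes}")
```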
We study the problem of maximizing a non-monotone submodular function subject to a cardinality constraint in the streaming model. Our main contributions are two single-pass (semi-)streaming algorithms that use $$\tilde{O}(k)\cdot\mathrm{poly}(1/\varepsilon)$$ memory, where $$k$$ is the size constraint. At the end of the stream, both our algorithms post-process their data structures using any offline algorithm for submodular maximization, and obtain a solution whose approximation guarantee is $$\frac{\alpha}{1+\alpha}-\varepsilon$$, where $$\alpha$$ is the approximation ratio of the offline algorithm. If we use an exact (exponential-time) post-processing algorithm, this leads to a $$\frac{1}{2}-\varepsilon$$ approximation, which is nearly optimal. If we post-process with the algorithm of Buchbinder-Feldman '19, which achieves the state-of-the-art offline approximation guarantee of $$\alpha=0.385$$, we obtain a $0.2779$-approximation in polynomial time, improving over the previous best polynomial-time approximation of $0.1715$ due to Feldman '18. One of our algorithms is combinatorial and enjoys fast update and overall running times. Our other algorithm is based on the multilinear extension, enjoys an improved space complexity, and can be made deterministic in some settings of interest.
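To see where the constants quoted above come from: post-processing with an offline algorithm of ratio $$\alpha$$ yields $$\frac{\alpha}{1+\alpha}-\varepsilon$$, so an exact (exponential-time) post-processor with $$\alpha=1$$ gives $$\frac{1}{2}-\varepsilon$$, while plugging in $$\alpha=0.385$$ gives $$\frac{0.385}{1.385}-\varepsilon\approx 0.2779-\varepsilon$$, matching the stated polynomial-time guarantee.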
Real-time decision making in IoT applications relies upon space-efficient evaluation of queries over streaming data. To model the uncertainty in the classification of data being processed, we consider the model of probabilistic strings: sequences of discrete probability distributions over a finite set of events. We initiate the study of the space complexity of streaming computation for different classes of queries over such probabilistic strings. We first consider the problem of computing the probability that a word, sampled from the distribution defined by the probabilistic string read so far, is accepted by a given deterministic finite automaton. We show that this regular pattern matching problem can be solved using space that is only poly-logarithmic in the string length (and polynomial in the size of the DFA) if we allow a multiplicative approximation error. Then we show how to generalize this result to quantitative queries specified by additive cost register automata: automata that map strings to numerical values using finite control and registers that get updated using linear transformations. Finally, we consider the case when updates in such an automaton involve tests, and in particular, when there is a counter variable that can be either incremented or decremented, but decrements only apply when the counter value is non-zero. In this case, the desired answer depends on the probability distribution over the set of possible counter values, which range from 0 to n for a string of length n. Under a mild assumption, namely that the probabilities of the individual events are bounded away from 0 and 1, we show that there is an algorithm that can compute all n entries of this probability distribution vector to within additive 1/poly(n) error using space that is only Õ(√n). In establishing these results, we introduce several new technical ideas that may prove useful for designing space-efficient algorithms for other query models over probabilistic strings.
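The quantity being approximated in the regular pattern matching problem admits a short exact dynamic program, shown below as a reference point (with a hypothetical two-state DFA and event alphabet; this exact version stores a full distribution over DFA states per prefix rather than running in poly-logarithmic space).

```python
def acceptance_probability(transitions, start, accepting, prob_string):
    """Probability that a word drawn from a probabilistic string (one
    independent distribution over events per position) is accepted by a
    DFA.  transitions[state][event] gives the next state.  This exact
    dynamic program tracks a distribution over DFA states and uses
    space proportional to the number of states; it illustrates the
    quantity the streaming algorithm approximates, not the
    space-efficient algorithm itself."""
    state_dist = {start: 1.0}   # distribution over states after the prefix
    for event_dist in prob_string:          # one distribution per position
        next_dist = {}
        for state, p_state in state_dist.items():
            for event, p_event in event_dist.items():
                nxt = transitions[state][event]
                next_dist[nxt] = next_dist.get(nxt, 0.0) + p_state * p_event
        state_dist = next_dist
    return sum(p for q, p in state_dist.items() if q in accepting)

# Hypothetical DFA over {a, b} accepting words with an even number of b's.
transitions = {0: {"a": 0, "b": 1}, 1: {"a": 1, "b": 0}}
# Each of 4 positions emits 'b' with probability 0.3, independently.
prob_string = [{"a": 0.7, "b": 0.3}] * 4
print(acceptance_probability(transitions, 0, {0}, prob_string))  # ~0.5128
```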
We explore the use of local algorithms in the design of streaming algorithms for the Maximum Directed Cut problem. Specifically, building on the local algorithms of Buchbinder, Feldman, Seffi, and Schwartz [14] and Censor-Hillel, Levy, and Shachnai [16], we develop streaming algorithms for both adversarially and randomly ordered streams that approximate the value of the maximum directed cut in bounded-degree graphs. In n-vertex graphs, our algorithm uses O(n^{1-Ω(1)}) (sub-linear) space for adversarially ordered streams and logarithmic space for randomly ordered streams. Moreover, both algorithms require only one pass over the input stream. With a constant number of passes, we give a logarithmic-space algorithm that works even on graphs with unbounded degree on adversarially ordered streams. Our algorithms achieve any fixed constant approximation factor less than 1/2. In the single-pass setting, this is tight: known lower bounds show that obtaining any constant approximation factor greater than 1/2 is impossible without using linear space in adversarially ordered streams (Kapralov and Krachun [37]) and √n space in randomly ordered streams, even on bounded-degree graphs (Kapralov, Khanna, and Sudan [35]). In terms of techniques, our algorithms partition the vertices into a small number of different types based on the structure of their local neighborhood, ensuring that each type carries enough information about the structure to approximately simulate the local algorithm on a vertex of that type. We then develop tools to accurately estimate the frequency of each type. This allows us to simulate an execution of the local algorithm on all vertices, and thereby approximate the value of the maximum directed cut.
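To illustrate what simulating a local algorithm on a vertex means, here is a toy local rule for directed cut in which each vertex chooses its side from its own in- and out-degrees alone (a deliberately simplified stand-in, not the rule of [14]/[16] analyzed in the paper).

```python
def local_dicut(edges):
    """A toy local rule for directed cut: each vertex decides its side
    using only its own in/out-degrees (a one-hop local computation).
    Meant only to illustrate per-vertex local decisions; not the
    algorithm of [14]/[16] discussed above."""
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    vertices = set(out_deg) | set(in_deg)
    # Side S = vertices whose out-degree exceeds their in-degree.
    side_S = {v for v in vertices if out_deg.get(v, 0) > in_deg.get(v, 0)}
    # Value of the directed cut: edges leaving S and entering V \ S.
    return sum(1 for u, v in edges if u in side_S and v not in side_S)

# Small hypothetical example: a directed cycle plus a chord.
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(local_dicut(edges))   # 2
```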