Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n . Given the sketch and a query item y , one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y . Most works to date focused on additive ε n error approximation, culminating in the KLL sketch that achieved optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient algorithms due to prior work store either O (log (ε 2 n )/ε 2 ) or O (log 3 (ε n )/ε) universe items. We present a randomized sketch storing O (log 1.5 (ε n )/ε) items that can (1 ± ε)-approximate the rank of each universe item with high constant probability; this space bound is within an \(O(\sqrt {\log (\varepsilonmore »
Relative Error Streaming Quantiles
Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n. Given the sketch and a query item y, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y.
- Publication Date:
- NSF-PAR ID:
- 10358227
- Journal Name:
- ACM SIGMOD Record
- Volume:
- 51
- Issue:
- 1
- Page Range or eLocation-ID:
- 69 to 76
- ISSN:
- 0163-5808
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract
<p>This dataset includes multiple fields: (i) files for monthly and annual fields for the max curl line and the zero curl line at 0.1 degree longitudinal resolutions; (ii) files for monthly and annual GS path obtained from Altimetry and originally processed by Andres (2016) at 0.1 degree longitudinal resolution. The maximum curl line (MCL) and the zero curl line (ZCL) calculations are briefly described here and are based on the original wind data (at 1.25 x 1.25 degree) provided by the Japanese reanalysis (JRA-55; Kobayashi et al., 2015) and available at https://zenodo.org/record/8200832 (Gifford et al. 2023). For details see Gifford, 2023. </p> <p>The wind stress curl (WSC) fields used for the MCL and ZCL calculations extend from 80W to 45W and 30N to 45N at the 1.25 by 1.25-degree resolution. The MCL is defined as the maximum WSC values greater than zero within the domain per 1.25 degree longitude. As such, it is a function of longitude and is not a constant WSC value unlike the zero contour. High wind stress curl values that occurred near the coast were not included within this calculation. After MCL at the 1.25 resolution was obtained the line was smoothed with a gaussian smoothing -
Frequency estimation is one of the most fundamental problems in streaming algorithms. Given a stream S of elements from some universe U={1…n}, the goal is to compute, in a single pass, a short sketch of S so that for any element i∈U, one can estimate the number xi of times i occurs in S based on the sketch alone. Two state of the art solutions to this problems are the Count-Min and Count-Sketch algorithms. The frequency estimator x~ produced by Count-Min, using O(1/ε⋅logn) dimensions, guarantees that ∥x~−x∥∞≤ε∥x∥1 with high probability, and x~≥x holds deterministically. Also, Count-Min works under the assumption that x≥0. On the other hand, Count-Sketch, using O(1/ε2⋅logn) dimensions, guarantees that ∥x~−x∥∞≤ε∥x∥2 with high probability. A natural question is whether it is possible to design the best of both worlds sketching method, with error guarantees depending on the ℓ2 norm and space comparable to Count-Sketch, but (like Count-Min) also has the no-underestimation property. Our main set of results shows that the answer to the above question is negative. We show this in two incomparable computational models: linear sketching and streaming algorithms. We also study the complementary problem, where the sketch is required to not over-estimate, i.e., x~≤x should holdmore »
-
We initiate a systematic study of linear sketching over F_2. For a given Boolean function treated as f : F_2^n -> F_2 a randomized F_2-sketch is a distribution M over d x n matrices with elements over F_2 such that Mx suffices for computing f(x) with high probability. Such sketches for d << n can be used to design small-space distributed and streaming algorithms. Motivated by these applications we study a connection between F_2-sketching and a two-player one-way communication game for the corresponding XOR-function. We conjecture that F_2-sketching is optimal for this communication game. Our results confirm this conjecture for multiple important classes of functions: 1) low-degree F_2-polynomials, 2) functions with sparse Fourier spectrum, 3) most symmetric functions, 4) recursive majority function. These results rely on a new structural theorem that shows that F_2-sketching is optimal (up to constant factors) for uniformly distributed inputs. Furthermore, we show that (non-uniform) streaming algorithms that have to process random updates over F_2 can be constructed as F_2-sketches for the uniform distribution. In contrast with the previous work of Li, Nguyen and Woodruff (STOC'14) who show an analogous result for linear sketches over integers in the adversarial setting our result does not require themore »
-
We initiate a systematic study of linear sketching over $\ftwo$. For a given Boolean function treated as $f \colon \ftwo^n \to \ftwo$ a randomized $\ftwo$-sketch is a distribution $\mathcal M$ over $d \times n$ matrices with elements over $\ftwo$ such that $\mathcal Mx$ suffices for computing $f(x)$ with high probability. Such sketches for $d \ll n$ can be used to design small-space distributed and streaming algorithms. Motivated by these applications we study a connection between $\ftwo$-sketching and a two-player one-way communication game for the corresponding XOR-function. We conjecture that $\ftwo$-sketching is optimal for this communication game. Our results confirm this conjecture for multiple important classes of functions: 1) low-degree $\ftwo$-polynomials, 2) functions with sparse Fourier spectrum, 3) most symmetric functions, 4) recursive majority function. These results rely on a new structural theorem that shows that $\ftwo$-sketching is optimal (up to constant factors) for uniformly distributed inputs. Furthermore, we show that (non-uniform) streaming algorithms that have to process random updates over $\ftwo$ can be constructed as $\ftwo$-sketches for the uniform distribution. In contrast with the previous work of Li, Nguyen and Woodruff (STOC'14) who show an analogous result for linear sketches over integers in the adversarial setting our result does notmore »