An ε-approximate quantile sketch over a stream of n inputs approximates the rank of any query point q (that is, the number of input points less than q) up to an additive error of εn, generally with probability at least 1 − 1/poly(n), while consuming o(n) space. While the celebrated KLL sketch of Karnin, Lang, and Liberty is provably optimal over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning's t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem in an attempt to achieve better approximations than KLL on real-world data sets while maintaining similar worst-case guarantees.
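To make the additive guarantee concrete, the following minimal Python sketch compares an estimated rank against the exact rank. The summary used here is a toy offline one (every k-th element of the sorted input). It is not KLL, t-digest, or the interpolation method described above, and all function names are illustrative assumptions.

```python
from bisect import bisect_left


def exact_rank(sorted_data, q):
    """Number of input points strictly less than q."""
    return bisect_left(sorted_data, q)


def build_toy_summary(data, eps):
    """Keep every k-th element of the sorted input, with k = max(1, floor(eps * n)).

    Offline illustration only: it sorts the whole input first,
    so it is not a streaming algorithm.
    """
    ordered = sorted(data)
    k = max(1, int(eps * len(ordered)))
    return ordered[k - 1::k], k


def estimate_rank(summary, k, q):
    """Each retained element stands in for k input points."""
    return k * bisect_left(summary, q)


if __name__ == "__main__":
    data = [(37 * i) % 1000 for i in range(1000)]  # a fixed permutation of 0..999
    eps = 0.01
    summary, k = build_toy_summary(data, eps)
    ordered = sorted(data)
    worst = max(
        abs(estimate_rank(summary, k, q) - exact_rank(ordered, q))
        for q in range(-5, 1006)
    )
    print(f"kept {len(summary)} of {len(data)} points; worst rank error = {worst}")
```

Because the true rank r is estimated as k·⌊r/k⌋, the error of this toy summary is r mod k, which stays below εn whenever k ≤ εn.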
KLL± approximate quantile sketches over dynamic datasets
Recently, the long-standing problem of optimal construction of quantile sketches was resolved by Karnin, Lang, and Liberty using the KLL sketch (FOCS 2016). The KLL algorithm is restricted to online insert operations and does not support delete operations. For many real-world applications, it is necessary to support delete operations. When the data set is updated dynamically, i.e., when data elements are inserted and deleted, the quantile sketch should reflect the changes. In this paper, we propose KLL±, the first quantile approximation algorithm to operate in the bounded deletion model to account for both inserts and deletes in a given data stream. KLL± extends the functionality of KLL sketches to support arbitrary updates with small space overhead. The space bound for KLL± is [EQUATION], where ε and δ are constants that determine precision and failure probability, and α bounds the number of deletions with respect to insert operations. The experimental evaluation of KLL± highlights that, with minimal space overhead, KLL± achieves accuracy in quantile approximation comparable to that of KLL.
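The abstract says only that α bounds deletions relative to insertions. One common formalization of the bounded deletion model, assumed here rather than taken from the abstract, requires D ≤ (1 − 1/α)·I for α ≥ 1, where I and D count insertions and deletions. The Python sketch below checks that condition for a stream of updates and maintains exact ranks as a baseline. It is not the KLL± algorithm, and every name in it is hypothetical.

```python
from bisect import bisect_left, insort


def is_alpha_bounded(ops, alpha):
    """Check D <= (1 - 1/alpha) * I over the whole stream (assumed definition)."""
    inserts = sum(1 for kind, _ in ops if kind == "+")
    deletes = sum(1 for kind, _ in ops if kind == "-")
    # Multiply both sides by alpha to avoid division.
    return alpha >= 1 and deletes * alpha <= (alpha - 1) * inserts


class ExactDynamicRanks:
    """Exact baseline that stores every live item (linear space, no sketching)."""

    def __init__(self):
        self._items = []

    def update(self, kind, value):
        if kind == "+":
            insort(self._items, value)
            return
        i = bisect_left(self._items, value)
        if i == len(self._items) or self._items[i] != value:
            raise ValueError(f"cannot delete absent value {value!r}")
        self._items.pop(i)

    def rank(self, q):
        """Number of live items strictly less than q."""
        return bisect_left(self._items, q)


if __name__ == "__main__":
    ops = [("+", 5), ("+", 1), ("+", 9), ("+", 3), ("-", 5), ("-", 1)]
    print("alpha=2 bounded:", is_alpha_bounded(ops, 2))
    baseline = ExactDynamicRanks()
    for kind, value in ops:
        baseline.update(kind, value)
    print("rank of 4 among live items:", baseline.rank(4))
```

A sketch that supports deletions would be judged against a baseline like this one: after every update, its rank estimates should track the ranks of the items that are still live.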
- Award ID(s):
- 1703560
- PAR ID:
- 10316014
- Date Published:
- Journal Name:
- Proceedings of the VLDB Endowment
- Volume:
- 14
- Issue:
- 7
- ISSN:
- 2150-8097
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n. Given the sketch and a query item y, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y. Most works to date focused on additive εn error approximation, culminating in the KLL sketch that achieved optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient algorithms due to prior work store either \(O(\log(\varepsilon^2 n)/\varepsilon^2)\) or \(O(\log^3(\varepsilon n)/\varepsilon)\) universe items. We present a randomized sketch storing \(O(\log^{1.5}(\varepsilon n)/\varepsilon)\) items that can (1 ± ε)-approximate the rank of each universe item with high constant probability; this space bound is within an \(O(\sqrt{\log(\varepsilon n)})\) factor of optimal. Our algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments.
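The difference between the two error models can be illustrated with a few lines of Python. This is only a sketch of the two acceptance criteria described above, with hypothetical function names and made-up numbers; it does not implement any sketch.

```python
def within_additive(estimate, true_rank, eps, n):
    """Additive guarantee: the error may be as large as eps * n at every rank."""
    return abs(estimate - true_rank) <= eps * n


def within_relative(estimate, true_rank, eps):
    """Multiplicative guarantee: the error budget shrinks with the rank itself."""
    return abs(estimate - true_rank) <= eps * true_rank


if __name__ == "__main__":
    n, eps = 1_000_000, 0.01
    true_rank = 1_000  # an item deep in the lower tail of the stream
    for estimate in (5_000, 1_005):
        print(
            estimate,
            "additive ok:", within_additive(estimate, true_rank, eps, n),
            "relative ok:", within_relative(estimate, true_rank, eps),
        )
```

With these illustrative numbers, an estimate of 5,000 for a true rank of 1,000 satisfies the additive bound (budget εn = 10,000) but violates the multiplicative one (budget ε · rank = 10), which is why multiplicative error is the more demanding target near the tails.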
-
Per-flow size measurement is key to many streaming applications and management systems, particularly in high-speed networks. Performing such measurement on the data plane of a network device at the line rate requires on-chip memory and computing resources that are shared by other key network functions. This leads to the need for very compact and fast data structures, called sketches, which trade off space for accuracy. Such a need also arises in other application contexts for extremely large data sets. The goal of sketch design is two-fold: to measure flow size as accurately as possible and to do so as efficiently as possible (for low overhead and thus high processing throughput). Existing sketches can be broadly categorized into multi-update sketches and single-update sketches. The former are more accurate but carry larger overhead. The latter incur small overhead, but their accuracy is poor. This paper proposes a Single update Sketch with a Variable counter Structure (SSVS), a new sketch design that is several times faster than the existing multi-update sketches with comparable accuracy, and several times more accurate than the existing single-update sketches with comparable overhead. The new sketch design embodies several technical contributions that integrate the enabling properties of both multi-update and single-update sketches in a novel structure that effectively controls the measurement error with minimum processing overhead.
-
Linear sketches have been widely adopted to process fast data streams, and they can be used to accurately estimate frequencies, approximate the top-K items, and summarize data distributions. When data are sensitive, it is desirable to provide privacy guarantees for linear sketches to preserve private information while delivering useful results with theoretical bounds. We show that linear sketches can ensure privacy and maintain their unique properties with a small amount of noise added at initialization. Building on these differentially private linear sketches, we show that the state-of-the-art quantile sketch in the turnstile model can also be made private while maintaining high performance. Experiments further demonstrate that our proposed differentially private sketches are quantitatively and qualitatively similar to noise-free sketches with high utilization on synthetic and real datasets.
-
Sketches are single-pass small-space data summaries that can quickly estimate the cardinality of join queries. However, sketches are not directly applicable to join queries with dynamic filter conditions, where arbitrary selection predicates are applied, since a sketch is limited to a fixed selection. While multiple sketches for various selections can be used in combination, they each incur individual storage and maintenance costs. Alternatively, exact sketches can be built during runtime for every selection. To make this process scale, a high degree of parallelism, available in hardware accelerators such as GPUs, is required. Therefore, sketch usage for cardinality estimation in query optimization is limited. Following recent work that applies transformers to cardinality estimation, we design a novel learning-based method to approximate the sketch of any arbitrary selection, enabling sketches for join queries with filter conditions. We train a transformer on each table to estimate the sketch of any subset of the table, i.e., any arbitrary selection. Transformers achieve this by learning the joint distribution among table attributes, which is equivalent to a multidimensional sketch. Subsequently, transformers can approximate any sketch, enabling sketches for join cardinality estimation. In turn, estimating joins via approximate sketches allows tables to be modeled individually and thus scales linearly with the number of tables. We evaluate the accuracy and efficacy of approximate sketches on queries with selection predicates consisting of conjunctions of point and range conditions. Approximate sketches achieve similar accuracy to exact sketches with at least one order of magnitude less overhead.

