skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Streaming Quantiles Algorithms with Small Space and Update Time
Approximating quantiles and distributions over streaming data has been studied for roughly two decades now. Recently, Karnin, Lang, and Liberty proposed the first asymptotically optimal algorithm for doing so. This manuscript complements their theoretical result by providing a practical variants of their algorithm with improved constants. For a given sketch size, our techniques provably reduce the upper bound on the sketch error by a factor of two. These improvements are verified experimentally. Our modified quantile sketch improves the latency as well by reducing the worst-case update time from O(1ε) down to O(log1ε).  more » « less
Award ID(s):
2244899 2107239 2333887
PAR ID:
10396505
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Sensors
Volume:
22
Issue:
24
ISSN:
1424-8220
Page Range / eLocation ID:
9612
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1,..., x_n from a universe \mathcalU with total order, an additive-error quantile sketch \mathcalM allows us to approximate the rank of any query y\in \mathcalU up to additive ε n error. In 2001, Greenwald and Khanna gave a deterministic algorithm (GK sketch) that solves the ε-approximate quantiles estimation problem using O(ε^-1 łog(ε n)) space \citegreenwald2001space ; recently, this algorithm was shown to be optimal by Cormode and Vesleý in 2020 \citecormode2020tight. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1))ε^-1 łog (ε n) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the \frac11 2 ε^-1 łog(ε n) space bound in the original GK sketch~\citegreenwald2001space. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(łog(1/ε) + łog łog (ε n)) per-element for the ordinary ε-approximate quantile estimation problem. Also, for the related weighted'' quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(łog(1/ε) + łog łog (ε W_n/w_\textrmmin )), achieving an improvement over the previous upper bound of \citeassadi2023generalizing. 
    more » « less
  2. A $$(1\pm\epsilon)$$ -sparsifier of a hypergraph $G(V, E)$ is a (weighted) subgraph that preserves the value of every cut to within a $$(1\pm\epsilon)$$ -factor. It is known that every hypergraph with $$n$$ vertices admits a $$(1 \pm \epsilon)$$ -sparsifier with $$\tilde{O}(n/{\epsilon}^{2})$$ hyperedges. In this work, we explore the task of building such a sparsifier by using only linear measurements (a linear sketch) over the hyperedges of $$G$$, and provide nearly-matching upper and lower bounds for this task. Specifically, we show that there is a randomized linear sketch of size $$\tilde{O}(nr\log(m)/\epsilon^{2})$$ bits which with high probability contains sufficient information to recover a $$(1\pm\epsilon)$$ cut-sparsifier with $$\tilde{O}(n/\epsilon^{2})$$ hyperedges for any hypergraph with at most $$m$$ edges each of which has arity bounded by $$r$$. This immediately gives a dynamic streaming algorithm for hypergraph cut sparsification with an identical space complexity, improving on the previous best known bound of $$\tilde{O}(nr^{2}\log^{4}({m})/\epsilon^{2})$$ bits of space (Guha, McGregor, and Tench, PODS 2015). We complement our algorithmic result above with a nearly-matching lower bound. We show that for every $$\epsilon\in(0,1)$$, one needs $$\Omega(nr\log(m/n)/\log(n))$$ bits to construct a $$(1\pm\epsilon)$$ -sparsifier via linear sketching, thus showing that our linear sketch achieves an optimal dependence on both $$r$$ and $$\log(m)$$. The starting point for our improved algorithm is importance sampling of hyperedges based on the new notion of $$k$$ -cut strength introduced in the recent work of Quanrud (SODA 2024). The natural algorithm based on this concept leads to $$\log m$$ levels of sampling where errors can potentially accumulate, and this accounts for the polylog $(m)$ losses in the sketch size of the natural algorithm. We develop a more intricate analysis of the accumulation in error to show most levels do not contribute to the error and actual loss is only polylog $(n)$. Combining with careful preprocessing (and analysis) this enables us to get rid of all extraneous $$\log m$$ factors in the sketch size, but the quadratic dependence on $$r$$ remains. This dependence originates from use of correlated $$\ell_{0}$$ -samplers to recover a large number of low-strength edges in a hypergraph simultaneously by looking at neighborhoods of individual vertices. In graphs, this leads to discovery of $$\Omega(n)$$ edges in a single shot, whereas in hypergraphs, this may potentially only reveal $$O$$($$n$$/$$r$$) new edges, thus requiring $$\Omega(r)$$ rounds of recovery. To remedy this we introduce a new technique of random fingerprinting of hyperedges which effectively eliminates the correlations created by large arity hyperedges, and leads to a scheme for recovering hyperedges of low strength with an optimal dependence on $$r$$. Putting all these ingredients together yields our linear sketching algorithm. Our lower bound is established by a reduction from the universal relation problem in the one-way communication setting. 
    more » « less
  3. A $$(1\pm\epsilon)$$ -sparsifier of a hypergraph $G(V, E)$ is a (weighted) subgraph that preserves the value of every cut to within a $$(1\pm\epsilon)$$ -factor. It is known that every hypergraph with $$n$$ vertices admits a $$(1 \pm \epsilon)$$ -sparsifier with $$\tilde{O}(n/{\epsilon}^{2})$$ hyperedges. In this work, we explore the task of building such a sparsifier by using only linear measurements (a linear sketch) over the hyperedges of $$G$$, and provide nearly-matching upper and lower bounds for this task. Specifically, we show that there is a randomized linear sketch of size $$\tilde{O}(nr\log(m)/\epsilon^{2})$$ bits which with high probability contains sufficient information to recover a $$(1\pm\epsilon)$$ cut-sparsifier with $$\tilde{O}(n/\epsilon^{2})$$ hyperedges for any hypergraph with at most $$m$$ edges each of which has arity bounded by $$r$$. This immediately gives a dynamic streaming algorithm for hypergraph cut sparsification with an identical space complexity, improving on the previous best known bound of $$\tilde{O}(nr^{2}\log^{4}({m})/\epsilon^{2})$$ bits of space (Guha, McGregor, and Tench, PODS 2015). We complement our algorithmic result above with a nearly-matching lower bound. We show that for every $$\epsilon\in(0,1)$$, one needs $$\Omega(nr\log(m/n)/\log(n))$$ bits to construct a $$(1\pm\epsilon)$$ -sparsifier via linear sketching, thus showing that our linear sketch achieves an optimal dependence on both $$r$$ and $$\log(m)$$. The starting point for our improved algorithm is importance sampling of hyperedges based on the new notion of $$k$$ -cut strength introduced in the recent work of Quanrud (SODA 2024). The natural algorithm based on this concept leads to $$\log m$$ levels of sampling where errors can potentially accumulate, and this accounts for the polylog $(m)$ losses in the sketch size of the natural algorithm. We develop a more intricate analysis of the accumulation in error to show most levels do not contribute to the error and actual loss is only polylog $(n)$. Combining with careful preprocessing (and analysis) this enables us to get rid of all extraneous $$\log m$$ factors in the sketch size, but the quadratic dependence on $$r$$ remains. This dependence originates from use of correlated $$\ell_{0}$$ -samplers to recover a large number of low-strength edges in a hypergraph simultaneously by looking at neighborhoods of individual vertices. In graphs, this leads to discovery of $$\Omega(n)$$ edges in a single shot, whereas in hypergraphs, this may potentially only reveal $$O$$($$n$$/$$r$$) new edges, thus requiring $$\Omega(r)$$ rounds of recovery. To remedy this we introduce a new technique of random fingerprinting of hyperedges which effectively eliminates the correlations created by large arity hyperedges, and leads to a scheme for recovering hyperedges of low strength with an optimal dependence on $$r$$. Putting all these ingredients together yields our linear sketching algorithm. Our lower bound is established by a reduction from the universal relation problem in the one-way communication setting. 
    more » « less
  4. Abstract BackgroundGiven a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. ResultsIn this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in$$\mathcal {O} (|t| + |p| + \ell ^2)$$ O ( | t | + | p | + 2 ) time and$$\mathcal {O} (\ell \log \ell )$$ O ( log ) space, where |t| is the number of$$k$$ k -mers inside the sketch of the reference, |p| is the number of$$k$$ k -mers inside the read’s sketch and$$\ell$$ is the number of times that$$k$$ k -mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2. 
    more » « less
  5. Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n . Given the sketch and a query item y , one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y . Most works to date focused on additive ε n error approximation, culminating in the KLL sketch that achieved optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient algorithms due to prior work store either O (log (ε 2 n )/ε 2 ) or O (log  3 (ε n )/ε) universe items. We present a randomized sketch storing O (log  1.5 (ε n )/ε) items that can (1 ± ε)-approximate the rank of each universe item with high constant probability; this space bound is within an \(O(\sqrt {\log (\varepsilon n)}) \) factor of optimal. Our algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments. 
    more » « less