Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1,..., x_n from a universe \mathcalU with total order, an additive-error quantile sketch \mathcalM allows us to approximate the rank of any query y\in \mathcalU up to additive ε n error. In 2001, Greenwald and Khanna gave a deterministic algorithm (GK sketch) that solves the ε-approximate quantiles estimation problem using O(ε^-1 łog(ε n)) space \citegreenwald2001space ; recently, this algorithm was shown to be optimal by Cormode and Vesleý in 2020 \citecormode2020tight. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1))ε^-1 łog (ε n) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the \frac11 2 ε^-1 łog(ε n) space bound in the original GK sketch~\citegreenwald2001space. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(łog(1/ε) + łog łog (ε n)) per-element for the ordinary ε-approximate quantile estimation problem. Also, for the related weighted'' quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(łog(1/ε) + łog łog (ε W_n/w_\textrmmin )), achieving an improvement over the previous upper bound of \citeassadi2023generalizing.
more »
« less
This content will become publicly available on May 30, 2025
Online List Labeling with Predictions
A growing line of work shows how learned predictions can be used to break through worst-cast barriers to improve the running time of an algorithm. However, incorporating predictions into data structures with strong theoretical guarantees remains underdeveloped. This paper takes a step in this direction by showing that predictions can be leveraged in the fundamental online list labeling problem. In the problem, n items arrive over time and must be stored in sorted order in an array of size Θ(n). The array slot of an element is its label and the goal is to maintain sorted order while minimizing the total number of elements moved (i.e., relabeled). We design a new list labeling data structure and bound its performance in two models. In the worst-case learning-augmented model, we give guarantees in terms of the error in the predictions. Our data structure provides strong theoretical guarantees— it is optimal for any prediction error and guarantees the best-known worst-case bound even when the predictions are entirely erroneous. We also consider a stochastic error model and bound the performance in terms of the expectation and variance of the error. Finally, the theoretical results are demonstrated empirically. In particular, we show that our data structure performs well on numerous real datasets, including temporal datasets where predictions are constructed from elements that arrived in the past (as is typically done in a practical use case).
more »
« less
- Award ID(s):
- 1947789
- PAR ID:
- 10565735
- Publisher / Repository:
- Curran Associates Inc.
- Date Published:
- Format(s):
- Medium: X
- Location:
- New Orleans, LA, USA
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Estimating frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically bound the error of the estimated frequencies for any possible input. The work of Hsu et al.~(2019) introduced the idea of using machine learning to tailor sketching algorithms to the specific data distribution they are being run on. In particular, their learning-augmented frequency estimation algorithm uses a learned heavy-hitter oracle which predicts which elements will appear many times in the stream. We give a novel algorithm, which in some parameter regimes, already theoretically outperforms the learning based algorithm of Hsu et al. without the use of any predictions. Augmenting our algorithm with heavy-hitter predictions further reduces the error and improves upon the state of the art. Empirically, our algorithms achieve superior performance in all experiments compared to prior approaches.more » « less
-
Salakhutdinov, Ruslan; Kolter, Zico; Heller, Katherine; Weller, Adrian; Oliver, Nuria; Scarlett, Jonathan; Berkenkamp, Felix (Ed.)This paper leverages the framework of algorithms-with-predictions to design data structures for two fundamental dynamic graph problems: incremental topological ordering and cycle detection. In these problems, the input is a directed graph on n nodes, and the m edges arrive one by one. The data structure must maintain a topological ordering of the vertices at all times and detect if the newly inserted edge creates a cycle. The theoretically best worst-case algorithms for these problems have high update cost (polynomial in n and m). In practice, greedy heuristics (that recompute the solution from scratch each time) perform well but can have high update cost in the worst case. In this paper, we bridge this gap by leveraging predictions to design a learned new data structure for the problems. Our data structure guarantees consistency, robustness, and smoothness with respect to predictions--that is, it has the best possible running time under perfect predictions, never performs worse than the best-known worst-case methods, and its running time degrades smoothly with the prediction error. Moreover, we demonstrate empirically that predictions, learned from a very small training dataset, are sufficient to provide significant speed-ups on real datasets.more » « less
-
The indexing algorithms for the high-dimensional nearest neighbor search (NNS) with the best worst-case guarantees are based on the randomized Locality Sensitive Hashing (LSH), and its derivatives. In practice, many heuristic approaches exist to "learn" the best indexing method in order to speed-up NNS, crucially adapting to the structure of the given dataset. Oftentimes, these heuristics outperform the LSH-based algorithms on real datasets, but, almost always, come at the cost of losing the guarantees of either correctness or robust performance on adversarial queries, or apply to datasets with an assumed extra structure/model. In this paper, we design an NNS algorithm for the Hamming space that has worst-case guarantees essentially matching that of theoretical algorithms, while optimizing the hashing to the structure of the dataset (think instance-optimal algorithms) for performance on the minimum-performing query. We evaluate the algorithm's ability to optimize for a given dataset both theoretically and practically. On the theoretical side, we exhibit a natural setting (dataset model) where our algorithm is much better than the standard theoretical one. On the practical side, we run experiments that show that our algorithm has a 1.8x and 2.1x better recall on the worst-performing queries to the MNIST and ImageNet datasets. https://arxiv.org/abs/2108.05433more » « less
-
Guruswami, Venkatesan (Ed.)The problem of sorting with priced information was introduced by [Charikar, Fagin, Guruswami, Kleinberg, Raghavan, Sahai (CFGKRS), STOC 2000]. In this setting, different comparisons have different (potentially infinite) costs. The goal is to find a sorting algorithm with small competitive ratio, defined as the (worst-case) ratio of the algorithm’s cost to the cost of the cheapest proof of the sorted order. The simple case of bichromatic sorting posed by [CFGKRS] remains open: We are given two sets A and B of total size N, and the cost of an A-A comparison or a B-B comparison is higher than an A-B comparison. The goal is to sort A ∪ B. An Ω(log N) lower bound on competitive ratio follows from unit-cost sorting. Note that this is a generalization of the famous nuts and bolts problem, where A-A and B-B comparisons have infinite cost, and elements of A and B are guaranteed to alternate in the final sorted order. In this paper we give a randomized algorithm InversionSort with an almost-optimal w.h.p. competitive ratio of O(log³ N). This is the first algorithm for bichromatic sorting with a o(N) competitive ratio.more » « less