Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n. Given the sketch and a query item y, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y. Most work to date has focused on additive εn error approximation, culminating in the KLL sketch, which achieves optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient algorithms due to prior work store either \(O(\log(\varepsilon^2 n)/\varepsilon^2)\) or \(O(\log^3(\varepsilon n)/\varepsilon)\) universe items. We present a randomized sketch storing \(O(\log^{1.5}(\varepsilon n)/\varepsilon)\) items that can (1 ± ε)-approximate the rank of each universe item with high constant probability; this space bound is within an \(O(\sqrt{\log(\varepsilon n)})\) factor of optimal. Our algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments.
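To make the difference between the two error notions concrete, the following is a minimal sketch, not taken from the paper. The helper names (`exact_rank`, `within_additive`, `within_multiplicative`) and the example estimate are hypothetical; the code only checks the two guarantees as defined in the abstract above.

```python
from bisect import bisect_right

def exact_rank(sorted_items, y):
    """Exact rank of y: the number of stream items smaller than or equal to y."""
    return bisect_right(sorted_items, y)

def within_additive(estimate, true_rank, eps, n):
    """Additive guarantee: |estimate - rank| <= eps * n."""
    return abs(estimate - true_rank) <= eps * n

def within_multiplicative(estimate, true_rank, eps):
    """Multiplicative guarantee: |estimate - rank| <= eps * rank."""
    return abs(estimate - true_rank) <= eps * true_rank

# Illustrative stream: the integers 1..1000 (n = 1000), with eps = 0.01.
stream = sorted(range(1, 1001))
n = len(stream)
eps = 0.01

# Query near the low tail. The estimate below is a made-up value, not sketch output.
true_rank = exact_rank(stream, 5)   # 5 items are <= 5
estimate = true_rank + 8            # hypothetical estimate that is off by 8

print(within_additive(estimate, true_rank, eps, n))     # True: 8 <= 0.01 * 1000
print(within_multiplicative(estimate, true_rank, eps))  # False: 8 > 0.01 * 5
```

An error of 8 is acceptable under the additive guarantee but is larger than the true rank itself, which is why the multiplicative guarantee is the stronger one near the extremes of the distribution.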
Streaming Quantiles Algorithms with Small Space and Update Time
Approximating quantiles and distributions over streaming data has been studied for roughly two decades now. Recently, Karnin, Lang, and Liberty proposed the first asymptotically optimal algorithm for doing so. This manuscript complements their theoretical result by providing practical variants of their algorithm with improved constants. For a given sketch size, our techniques provably reduce the upper bound on the sketch error by a factor of two. These improvements are verified experimentally. Our modified quantile sketch also improves latency by reducing the worst-case update time from \(O(1/\varepsilon)\) down to \(O(\log(1/\varepsilon))\).
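For readers unfamiliar with compactor-based quantile sketches, the following is a minimal sketch of the general idea, under simplifying assumptions. It uses a fixed buffer capacity `k` at every level, whereas KLL uses level-dependent capacities, and it includes none of the improvements described in this manuscript. The class name `CompactorSketch` and its parameters are hypothetical.

```python
import random

class CompactorSketch:
    """Simplified compactor hierarchy: an item stored at level h has weight 2**h."""

    def __init__(self, k=64, seed=0):
        self.k = k                      # buffer capacity per level (assumed fixed)
        self.levels = [[]]              # levels[h] holds items of weight 2**h
        self.rng = random.Random(seed)

    def update(self, x):
        self.levels[0].append(x)
        h = 0
        # Compact any level that reached capacity; this may cascade upward.
        while len(self.levels[h]) >= self.k:
            self._compact(h)
            h += 1

    def _compact(self, h):
        buf = sorted(self.levels[h])
        # Compact an even number of items so that total weight is preserved.
        keep = [buf.pop()] if len(buf) % 2 else []
        offset = self.rng.randrange(2)  # random choice of odd or even positions
        if h + 1 == len(self.levels):
            self.levels.append([])
        self.levels[h + 1].extend(buf[offset::2])  # survivors double in weight
        self.levels[h] = keep

    def rank(self, y):
        """Estimated number of stream items <= y."""
        return sum((1 << h) * sum(1 for v in level if v <= y)
                   for h, level in enumerate(self.levels))

    def total_weight(self):
        return sum((1 << h) * len(level) for h, level in enumerate(self.levels))

    def size(self):
        return sum(len(level) for level in self.levels)

# Usage: feed a shuffled stream of 10,000 integers and query the median rank.
items = list(range(10_000))
random.Random(1).shuffle(items)
sketch = CompactorSketch(k=64)
for item in items:
    sketch.update(item)
print(sketch.size(), sketch.rank(4999))
```

Each compaction at level h changes the estimated rank of any query by at most 2^h, which is the source of the additive error that the choice of buffer capacities controls.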
 NSFPAR ID:
 10396505
 Date Published:
 Journal Name:
 Sensors
 Volume:
 22
 Issue:
 24
 ISSN:
 1424-8220
 Page Range / eLocation ID:
 9612
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We consider feature selection for applications in machine learning where the dimensionality of the data is so large that it exceeds the working memory of the (local) computing machine. Unfortunately, current large-scale sketching algorithms show a poor memory-accuracy trade-off in selecting features in high dimensions due to the irreversible collision and accumulation of the stochastic gradient noise in the sketched domain. Here, we develop a second-order feature selection algorithm, called BEAR, which avoids the extra collisions by efficiently storing the second-order stochastic gradients of the celebrated Broyden-Fletcher-Goldfarb-Shannon (BFGS) algorithm in Count Sketch, using a memory cost that grows sublinearly with the size of the feature vector. BEAR reveals an unexplored advantage of second-order optimization for memory-constrained high-dimensional gradient sketching. Our extensive experiments on several real-world data sets from genomics to language processing demonstrate that BEAR requires up to three orders of magnitude less memory space to achieve the same classification accuracy compared to the first-order sketching algorithms with a comparable run time. Our theoretical analysis further proves the global convergence of BEAR with O(1/𝑡) rate in 𝑡 iterations of the sketched algorithm.
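As background for the abstract above, the following is a minimal sketch of a generic Count Sketch, the data structure in which BEAR stores gradient information. It is not BEAR itself and involves no BFGS logic. The class name, the default sizes, and the simple linear hash functions are illustrative assumptions.

```python
import random
from statistics import median

class CountSketch:
    """Generic Count Sketch over integer indices (illustrative, not BEAR)."""

    P = (1 << 61) - 1  # a Mersenne prime used as the hash modulus

    def __init__(self, depth=5, width=256, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0.0] * width for _ in range(depth)]
        # Per row: (a, b) select the bucket, (c, d) select the sign.
        self.params = [tuple(rng.randrange(1, self.P) for _ in range(4))
                       for _ in range(depth)]

    def _bucket_and_sign(self, row, index):
        a, b, c, d = self.params[row]
        bucket = ((a * index + b) % self.P) % self.width
        sign = 1 if ((c * index + d) % self.P) & 1 else -1
        return bucket, sign

    def update(self, index, value):
        """Add `value` to coordinate `index` of the sketched vector."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, index)
            self.table[row][bucket] += sign * value

    def estimate(self, index):
        """Median over rows of the signed bucket contents."""
        estimates = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, index)
            estimates.append(sign * self.table[row][bucket])
        return median(estimates)

# Usage: accumulate two updates to one coordinate of a very large vector.
cs = CountSketch()
cs.update(7, 3.5)
cs.update(7, -1.5)
print(cs.estimate(7))  # 2.0, since no other coordinate collides with index 7
```

The table holds depth × width counters regardless of the vector's dimension, which is what allows memory to grow sublinearly with the feature count. When many coordinates are updated, colliding entries add signed noise to each row, and the median over rows limits its effect.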

We study two popular ways to sketch the shortest path distances of an input graph. The first is distance preservers, which are sparse subgraphs that agree with the distances of the original graph on a given set of demand pairs. Prior work on distance preservers has exploited only a simple structural property of shortest paths, called consistency, stating that one can break shortest path ties such that no two paths intersect, split apart, and then intersect again later. We prove that consistency alone is not enough to understand distance preservers, by showing both a lower bound on the power of consistency and a new general upper bound that polynomially surpasses it. Specifically, our new upper bound is that any p demand pairs in an n-node undirected unweighted graph have a distance preserver on \(O(n^{2/3} p^{2/3} + n p^{1/3})\) edges. We leave a conjecture that the right bound is \(O(n^{2/3} p^{2/3} + n)\) or better. The second part of this paper leverages these distance preservers in a new construction of additive spanners, which are subgraphs that preserve all pairwise distances up to an additive error function. We give improved error bounds for spanners with relatively few edges; for example, we prove that all graphs have spanners on \(O(n)\) edges with \(+O(n^{3/7+\varepsilon})\) error. Our construction can be viewed as an extension of the popular path-buying framework to clusters of larger radii.

An ε-approximate quantile sketch over a stream of n inputs approximates the rank of any query point q (that is, the number of input points less than q) up to an additive error of εn, generally with some probability of at least 1 − 1/poly(n), while consuming o(n) space. While the celebrated KLL sketch of Karnin, Lang, and Liberty achieves a provably optimal quantile approximation algorithm over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning’s t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem to attempt to achieve better approximations on real-world data sets than KLL while maintaining similar guarantees in the worst case.

In many applications, multiple parties have private data regarding the same set of users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only information for training the model, instead of the private data. However, it is challenging to ensure that the shared information maintains privacy while learning accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, where the server can obtain a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from local data parties. It builds a weighted grid as the synopsis of the global dataset based on the received information. Final centers are generated by running any k-means algorithm on the weighted grid. Our approach for grid weight estimation uses a novel, lightweight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in the setting with more than two data parties, we further propose a refined version of the weights estimation algorithm and a parameter tuning strategy to reduce the final k-means loss to be close to that in the central private setting. We provide theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm and show that our approach performs better both theoretically and empirically than the two baselines based on existing techniques.