Estimating the ε-approximate quantiles or ranks of a stream is a fundamental task in data monitoring. Given a stream x_1,..., x_n from a universe \mathcalU with total order, an additive-error quantile sketch \mathcalM allows us to approximate the rank of any query y\in \mathcalU up to additive ε n error. In 2001, Greenwald and Khanna gave a deterministic algorithm (GK sketch) that solves the ε-approximate quantiles estimation problem using O(ε^-1 łog(ε n)) space \citegreenwald2001space ; recently, this algorithm was shown to be optimal by Cormode and Vesleý in 2020 \citecormode2020tight. However, due to the intricacy of the GK sketch and its analysis, over-simplified versions of the algorithm are implemented in practical applications, often without any known theoretical guarantees. In fact, it has remained an open question whether the GK sketch can be simplified while maintaining the optimal space bound. In this paper, we resolve this open question by giving a simplified deterministic algorithm that stores at most (2 + o(1))ε^-1 łog (ε n) elements and solves the additive-error quantile estimation problem; as a side benefit, our algorithm achieves a smaller constant factor than the \frac11 2 ε^-1 łog(ε n) space bound in the original GK sketch~\citegreenwald2001space. Our algorithm features an easier analysis and still achieves the same optimal asymptotic space complexity as the original GK sketch. Lastly, our simplification enables an efficient data structure implementation, with a worst-case runtime of O(łog(1/ε) + łog łog (ε n)) per-element for the ordinary ε-approximate quantile estimation problem. Also, for the related weighted'' quantile estimation problem, we give efficient data structures for our simplified algorithm which guarantee a worst-case per-element runtime of O(łog(1/ε) + łog łog (ε W_n/w_\textrmmin )), achieving an improvement over the previous upper bound of \citeassadi2023generalizing.
more »
« less
Tensor Completion with Nearly Linear Samples Given Weak Side Information
Tensor completion exhibits an interesting computational-statistical gap in terms of the number of samples needed to perform tensor estimation. While there are only Θ(tn) degrees of freedom in a t-order tensor with n^t entries, the best known polynomial time algorithm requires O(n^t/2 ) samples in order to guarantee consistent estimation. In this paper, we show that weak side information is sufficient to reduce the sample complexity to O(n). The side information consists of a weight vector for each of the modes which is not orthogonal to any of the latent factors along that mode; this is significantly weaker than assuming noisy knowledge of the subspaces. We provide an algorithm that utilizes this side information to produce a consistent estimator with O(n^1+κ ) samples for any small constant κ > 0. We also provide experiments on both synthetic and real-world datasets that validate our theoretical insights.
more »
« less
- PAR ID:
- 10332706
- Date Published:
- Journal Name:
- Proceedings of the ACM on Measurement and Analysis of Computing Systems
- Volume:
- 6
- Issue:
- 2
- ISSN:
- 2476-1249
- Page Range / eLocation ID:
- 1 to 35
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We propose a new fast streaming algorithm for the tensor completion problem of imputing missing entries of a lowtubal-rank tensor using the tensor singular value decomposition (t-SVD) algebraic framework. We show the t-SVD is a specialization of the well-studied block-term decomposition for third-order tensors, and we present an algorithm under this model that can track changing free submodules from incomplete streaming 2-D data. The proposed algorithm uses principles from incremental gradient descent on the Grassmann manifold of subspaces to solve the tensor completion problem with linear complexity and constant memory in the number of time samples. We provide a local expected linear convergence result for our algorithm. Our empirical results are competitive in accuracy but much faster in compute time than state-of-the-art tensor completion algorithms on real applications to recover temporal chemo-sensing and MRI data under limited sampling.more » « less
-
We consider the problem of designing sublinear time algorithms for estimating the cost of minimum] metric traveling salesman (TSP) tour. Specifically, given access to a n × n distance matrix D that specifies pairwise distances between n points, the goal is to estimate the TSP cost by performing only sublinear (in the size of D) queries. For the closely related problem of estimating the weight of a metric minimum spanning tree (MST), it is known that for any epsilon > 0, there exists an O^~(n/epsilon^O(1))-time algorithm that returns a (1+epsilon)-approximate estimate of the MST cost. This result immediately implies an O^~(n/epsilon^O(1)) time algorithm to estimate the TSP cost to within a (2 + epsilon) factor for any epsilon > 0. However, no o(n^2)-time algorithms are known to approximate metric TSP to a factor that is strictly better than 2. On the other hand, there were also no known barriers that rule out existence of (1 + epsilon)-approximate estimation algorithms for metric TSP with O^~ (n) time for any fixed epsilon > 0. In this paper, we make progress on both algorithms and lower bounds for estimating metric TSP cost. On the algorithmic side, we first consider the graphic TSP problem where the metric D corresponds to shortest path distances in a connected unweighted undirected graph. We show that there exists an O^~(n) time algorithm that estimates the cost of graphic TSP to within a factor of (2 − epsilon_0) for some epsilon_0 > 0. This is the first sublinear cost estimation algorithm for graphic TSP that achieves an approximation factor less than 2. We also consider another well-studied special case of metric TSP, namely, (1, 2)-TSP where all distances are either 1 or 2, and give an O^~(n ^ 1.5) time algorithm to estimate optimal cost to within a factor of 1.625. Our estimation algorithms for graphic TSP as well as for (1, 2)-TSP naturally lend themselves to O^~(n) space streaming algorithms that give an 11/6-approximation for graphic TSP and a 1.625-approximation for (1, 2)-TSP. These results motivate the natural question if analogously to metric MST, for any epsilon > 0, (1 + epsilon)-approximate estimates can be obtained for graphic TSP and (1, 2)-TSP using O^~ (n) queries. We answer this question in the negative – there exists an epsilon_0 > 0, such that any algorithm that estimates the cost of graphic TSP ((1, 2)-TSP) to within a (1 + epsilon_0)-factor, necessarily requires (n^2) queries. This lower bound result highlights a sharp separation between the metric MST and metric TSP problems. Similarly to many classical approximation algorithms for TSP, our sublinear time estimation algorithms utilize subroutines for estimating the size of a maximum matching in the underlying graph. We show that this is not merely an artifact of our approach, and that for any epsilon > 0, any algorithm that estimates the cost of graphic TSP or (1, 2)-TSP to within a (1 + epsilon)-factor, can also be used to estimate the size of a maximum matching in a bipartite graph to within an epsilon n additive error. This connection allows us to translate known lower bounds for matching size estimation in various models to similar lower bounds for metric TSP cost estimation.more » « less
-
Abstract We present a randomized approach for wait-free locks with strong bounds on time and fairness in a context in which any process can be arbitrarily delayed. Our approach supports a tryLock operation that is given a set of locks, and code to run when all the locks are acquired. A tryLock operation may fail if there is contention on the locks, in which case the code is not run. Given an upper bound$$\kappa $$ known to the algorithm on the point contention of any lock, and an upper boundLon the number of locks in a tryLock’s set, a tryLock will succeed in acquiring its locks and running the code with probability at least$$1/(\kappa L)$$ . It is thus fair. Furthermore, if the maximum step complexity for the code in any lock isT, the operation will take$$O(\kappa ^2 L^2 T)$$ steps, regardless of whether it succeeds or fails. The operations are independent, thus if the tryLock is repeatedly retried on failure, it will succeed in$$O(\kappa ^3 L^3 T)$$ expected steps. If the algorithm does not know the bounds$$\kappa $$ andL, we present a variant that can guarantee a probability of at least$$1/\kappa L\log (\kappa L T)$$ of success. We assume an oblivious adversarial scheduler, which does not make decisions based on the operations, but can predetermine any schedule for the processes, which is unknown to our algorithm. Furthermore, to account for applications that change their future requests based on the results of previous tryLock operations, we strengthen the adversary by allowing decisions of the start times and lock sets of tryLock operations to be made adaptively, given the history of the execution so far.more » « less
-
Estimating the geometric median of a dataset is a robust counterpart to mean estimation, and is a fundamental problem in computational geometry. Recently, [HSU24] gave an (ε,δ)-differentially private algorithm obtaining an α-multiplicative approximation to the geometric median objective, 1n∑i∈[n]‖⋅−xi‖, given a dataset :={xi}i∈[n]⊂ℝd. Their algorithm requires n≳d‾‾√⋅1αε samples, which they prove is information-theoretically optimal. This result is surprising because its error scales with the \emph{effective radius} of (i.e., of a ball capturing most points), rather than the worst-case radius. We give an improved algorithm that obtains the same approximation quality, also using n≳d‾‾√⋅1αϵ samples, but in time O˜(nd+dα2). Our runtime is nearly-linear, plus the cost of the cheapest non-private first-order method due to [CLM+16]. To achieve our results, we use subsampling and geometric aggregation tools inspired by FriendlyCore [TCK+22] to speed up the "warm start" component of the [HSU24] algorithm, combined with a careful custom analysis of DP-SGD's sensitivity for the geometric median objective.more » « less
An official website of the United States government

