skip to main content

Title: Adaptive Estimation for Approximate k-Nearest-Neighbor Computations
Algorithms often carry out equally many computations for “easy” and “hard” problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate k-nearest-neighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We pro- pose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal out of algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
; ;
Award ID(s):
Publication Date:
Journal Name:
International Conference on Artificial Intelligence and Statistics
Sponsoring Org:
National Science Foundation
More Like this
  1. The K-nearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n data points with labels and a query point q, we want to assign a label to q based on the labels of the K-nearest points to the query. We study this problem in the k-machine model, a model for distributed large-scale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines and the goal is to compute an answer given a query point to a machine using a small number of communication rounds. Our main result is a randomized algorithm in the k-machine model that runs in O(log K) communication rounds with high success probability (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is small taking only O(k log K) messages. Our bounds are essentially the best possible for comparison-based algorithms. We also implemented our algorithm and show that it performs well in practice.
  2. Diffusion State Distance (DSD) is a data-dependent metric that compares data points using a data-driven diffusion process and provides a powerful tool for learning the underlying structure of high-dimensional data. While finding the exact nearest neighbors in the DSD metric is computationally expensive, in this paper, we propose a new random-walk based algorithm that empirically finds approximate k-nearest neighbors accurately in an efficient manner. Numerical results for real-world protein-protein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate k-nearest neighbors performs well when used to predict proteins’ functional labels.
  3. Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the l_2^2-distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^2-1))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.
  4. Braverman, Mark (Ed.)
    We develop approximation algorithms for set-selection problems with deterministic constraints, but random objective values, i.e., stochastic probing problems. When the goal is to maximize the objective, approximation algorithms for probing problems are well-studied. On the other hand, few techniques are known for minimizing the objective, especially in the adaptive setting, where information about the random objective is revealed during the set-selection process and allowed to influence it. For minimization problems in particular, incorporating adaptivity can have a considerable effect on performance. In this work, we seek approximation algorithms that compare well to the optimal adaptive policy. We develop new techniques for adaptive minimization, applying them to a few problems of interest. The core technique we develop here is an approximate reduction from an adaptive expectation minimization problem to a set of adaptive probability minimization problems which we call threshold problems. By providing near-optimal solutions to these threshold problems, we obtain bicriteria adaptive policies. We apply this method to obtain an adaptive approximation algorithm for the Min-Element problem, where the goal is to adaptively pick random variables to minimize the expected minimum value seen among them, subject to a knapsack constraint. This partially resolves an open problem raised in [Goel etmore »al., 2010]. We further consider three extensions on the Min-Element problem, where our objective is the sum of the smallest k element-weights, or the weight of the min-weight basis of a given matroid, or where the constraint is not given by a knapsack but by a matroid constraint. For all three of the variations we explore, we develop adaptive approximation algorithms for their corresponding threshold problems, and prove their near-optimality via coupling arguments.« less
  5. We present a new technique for efficiently removing almost all short cycles in a graph without unintentionally removing its triangles. Consequently, triangle finding problems do not become easy even in almost k-cycle free graphs, for any constant k≥ 4. Triangle finding is at the base of many conditional lower bounds in P, mainly for distance computation problems, and the existence of many 4- or 5-cycles in a worst-case instance had been the obstacle towards resolving major open questions. Hardness of approximation: Are there distance oracles with m1+o(1) preprocessing time and mo(1) query time that achieve a constant approximation? Existing algorithms with such desirable time bounds only achieve super-constant approximation factors, while only 3− factors were conditionally ruled out (Pătraşcu, Roditty, and Thorup; FOCS 2012). We prove that no O(1) approximations are possible, assuming the 3-SUM or APSP conjectures. In particular, we prove that k-approximations require Ω(m1+1/ck) time, which is tight up to the constant c. The lower bound holds even for the offline version where we are given the queries in advance, and extends to other problems such as dynamic shortest paths. The 4-Cycle problem: An infamous open question in fine-grained complexity is to establish any surprising consequences from amore »subquadratic or even linear-time algorithm for detecting a 4-cycle in a graph. This is arguably one of the simplest problems without a near-linear time algorithm nor a conditional lower bound. We prove that Ω(m1.1194) time is needed for k-cycle detection for all k≥ 4, unless we can detect a triangle in √n-degree graphs in O(n2−δ) time; a breakthrough that is not known to follow even from optimal matrix multiplication algorithms.« less