The Knearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n data points with labels and a query point q, we want to assign a label to q based on the labels of the Knearest points to the query. We study this problem in the kmachine model, a model for distributed largescale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines and the goal is to compute an answer given a query point to a machine using a small number of communication rounds. Our main result is a randomized algorithm in the kmachine model that runs in O(log K) communication rounds with high success probability (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is small taking only O(k log K) messages. Our bounds are essentially the best possible for comparisonbased algorithms. We also implemented our algorithm and show that it performs well in practice.
Adaptive Estimation for Approximate kNearestNeighbor Computations
Algorithms often carry out equally many computations for “easy” and “hard” problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate knearestneighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We pro pose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal out of algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
 Award ID(s):
 1816986
 Publication Date:
 NSFPAR ID:
 10100187
 Journal Name:
 International Conference on Artificial Intelligence and Statistics
 Sponsoring Org:
 National Science Foundation
More Like this


Diffusion State Distance (DSD) is a datadependent metric that compares data points using a datadriven diffusion process and provides a powerful tool for learning the underlying structure of highdimensional data. While finding the exact nearest neighbors in the DSD metric is computationally expensive, in this paper, we propose a new randomwalk based algorithm that empirically finds approximate knearest neighbors accurately in an efficient manner. Numerical results for realworld proteinprotein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate knearest neighbors performs well when used to predict proteins’ functional labels.

Given a data set of size n in d'dimensional Euclidean space, the kmeans problem asks for a set of k points (called centers) such that the sum of the l_2^2distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private kmeans clustering algorithms in both the central and local settings. In this work, we introduce a new locally private kmeans clustering algorithm that achieves nearoptimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^21))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.

Braverman, Mark (Ed.)We develop approximation algorithms for setselection problems with deterministic constraints, but random objective values, i.e., stochastic probing problems. When the goal is to maximize the objective, approximation algorithms for probing problems are wellstudied. On the other hand, few techniques are known for minimizing the objective, especially in the adaptive setting, where information about the random objective is revealed during the setselection process and allowed to influence it. For minimization problems in particular, incorporating adaptivity can have a considerable effect on performance. In this work, we seek approximation algorithms that compare well to the optimal adaptive policy. We develop new techniques for adaptive minimization, applying them to a few problems of interest. The core technique we develop here is an approximate reduction from an adaptive expectation minimization problem to a set of adaptive probability minimization problems which we call threshold problems. By providing nearoptimal solutions to these threshold problems, we obtain bicriteria adaptive policies. We apply this method to obtain an adaptive approximation algorithm for the MinElement problem, where the goal is to adaptively pick random variables to minimize the expected minimum value seen among them, subject to a knapsack constraint. This partially resolves an open problem raised in [Goel etmore »

We present a new technique for efficiently removing almost all short cycles in a graph without unintentionally removing its triangles. Consequently, triangle finding problems do not become easy even in almost kcycle free graphs, for any constant k≥ 4. Triangle finding is at the base of many conditional lower bounds in P, mainly for distance computation problems, and the existence of many 4 or 5cycles in a worstcase instance had been the obstacle towards resolving major open questions. Hardness of approximation: Are there distance oracles with m1+o(1) preprocessing time and mo(1) query time that achieve a constant approximation? Existing algorithms with such desirable time bounds only achieve superconstant approximation factors, while only 3− factors were conditionally ruled out (Pătraşcu, Roditty, and Thorup; FOCS 2012). We prove that no O(1) approximations are possible, assuming the 3SUM or APSP conjectures. In particular, we prove that kapproximations require Ω(m1+1/ck) time, which is tight up to the constant c. The lower bound holds even for the offline version where we are given the queries in advance, and extends to other problems such as dynamic shortest paths. The 4Cycle problem: An infamous open question in finegrained complexity is to establish any surprising consequences from amore »