The K-nearest neighbors problem is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n labeled data points and a query point q, we want to assign a label to q based on the labels of the K points nearest to the query. We study this problem in the k-machine model, a model for distributed large-scale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines, and the goal is to compute an answer for a query point given to a machine using a small number of communication rounds.
Our main result is a randomized algorithm in the k-machine model that runs in O(log K) communication rounds with high probability of success (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is also small, taking only O(k log K) messages. Our bounds are essentially the best possible for comparison-based algorithms. We also implemented our algorithm and show that it performs well in practice.
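The abstract does not spell out the O(log K)-round algorithm, but the setting can be illustrated with the naive one-round baseline it improves on: each machine sends its K locally nearest points to a coordinator (O(kK) messages rather than O(k log K)), and the coordinator merges them; the global K nearest are guaranteed to be among the candidates. A minimal sketch, with machines simulated as lists of 1-D points (all names here are illustrative, not the paper's):

```python
import heapq
import random

def local_k_nearest(points, q, K):
    # Each machine reports its K locally nearest points to the query q.
    return heapq.nsmallest(K, points, key=lambda p: abs(p - q))

def distributed_knn(machines, q, K):
    # Coordinator gathers every machine's local top-K and merges them;
    # the true K nearest points must appear among these k*K candidates.
    candidates = []
    for pts in machines:
        candidates.extend(local_k_nearest(pts, q, K))
    return heapq.nsmallest(K, candidates, key=lambda p: abs(p - q))

random.seed(0)
machines = [[random.uniform(0, 100) for _ in range(20)] for _ in range(4)]
nearest = distributed_knn(machines, 50.0, 3)
```

The merge is correct because any point among the global K nearest is necessarily among the K nearest on its own machine.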
Adaptive Estimation for Approximate k-Nearest-Neighbor Computations
Algorithms often carry out equally many computations for “easy” and “hard” problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate k-nearest-neighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the k nearest neighbors of a given query point. We propose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal among algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
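The abstract's adaptive-estimation idea can be illustrated, in a much-simplified form, with a two-stage sketch: cheap squared-distance estimates from a random subset of coordinates select a pool of O(k) candidates, and exact distances are computed only inside the pool. This is a hedged illustration, not the paper's algorithm; all parameter names are made up for the example.

```python
import numpy as np

def approx_k_nearest(X, q, k, sample_dims=8, pool_factor=4, seed=0):
    """Two-stage sketch: estimate distances from a few coordinates,
    then compute exact distances only for a small candidate pool."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    dims = rng.choice(d, size=min(sample_dims, d), replace=False)
    # Stage 1: unbiased estimate of each squared distance from sampled coords.
    est = ((X[:, dims] - q[dims]) ** 2).sum(axis=1) * (d / len(dims))
    pool = np.argsort(est)[: pool_factor * k]      # keep O(k) candidates
    # Stage 2: exact squared distances, but only within the pool.
    exact = ((X[pool] - q) ** 2).sum(axis=1)
    return pool[np.argsort(exact)[:k]]
```

The estimation budget here is fixed per point; the paper's contribution is to allocate it adaptively, spending more estimation effort on points whose estimates are close to the decision boundary.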
 Award ID(s):
 1816986
 NSFPAR ID:
 10100187
 Date Published:
 Journal Name:
 International Conference on Artificial Intelligence and Statistics
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Diffusion State Distance (DSD) is a data-dependent metric that compares data points using a data-driven diffusion process and provides a powerful tool for learning the underlying structure of high-dimensional data. While finding the exact nearest neighbors in the DSD metric is computationally expensive, in this paper we propose a new random-walk based algorithm that empirically finds approximate k-nearest neighbors accurately in an efficient manner. Numerical results for real-world protein-protein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate k-nearest neighbors performs well when used to predict proteins’ functional labels.
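A rough Monte-Carlo sketch of the underlying idea (not the paper's algorithm): approximate each node's expected-visit profile with truncated random walks, then compare profiles with an L1 distance, which is the quantity the DSD metric is built on. Function names and parameters here are illustrative assumptions.

```python
import random
from collections import Counter

def visit_profile(adj, start, walk_len=10, n_walks=200, seed=0):
    # Monte-Carlo estimate of expected visit counts within walk_len steps,
    # the quantity underlying the Diffusion State Distance (DSD).
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_walks):
        v = start
        for _ in range(walk_len):
            counts[v] += 1
            v = rng.choice(adj[v])
    return {u: c / n_walks for u, c in counts.items()}

def approx_dsd(p1, p2):
    # L1 distance between estimated visit profiles approximates DSD.
    keys = set(p1) | set(p2)
    return sum(abs(p1.get(u, 0.0) - p2.get(u, 0.0)) for u in keys)

def knn_dsd(adj, query, k, **kw):
    pq = visit_profile(adj, query, **kw)
    dists = {v: approx_dsd(pq, visit_profile(adj, v, **kw))
             for v in adj if v != query}
    return sorted(dists, key=dists.get)[:k]

# Two triangles joined by a bridge edge, as a toy interaction network.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
neighbors = knn_dsd(adj, 0, 2)
```

Computing all-pairs profiles this way is still quadratic in the number of nodes; the paper's algorithm is aimed at avoiding exactly that cost.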

Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the squared l_2 distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error while retaining constant multiplicative approximation factors and round complexity. Concretely, given any c > sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^2 - 1))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.
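The k-means objective that the additive and multiplicative errors above are measured against can be written down directly; a minimal, non-private sketch:

```python
import numpy as np

def kmeans_cost(X, centers):
    # Sum over all points of the squared l2 distance to the nearest center:
    # the objective that (locally private) k-means algorithms approximate.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())
```

An algorithm with multiplicative factor a and additive error b returns centers whose cost is at most a * OPT + b, so driving the additive term down matters most when the optimal cost itself is small.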

Bouamor, Houda; Pino, Juan; Bali, Kalika (Ed.)
Cross-encoder models, which jointly encode and score a query-item pair, are prohibitively expensive for direct k-nearest neighbor (k-NN) search. Consequently, k-NN search typically employs a fast approximate retrieval (e.g., using BM25 or dual-encoder vectors), followed by re-ranking with a cross-encoder; however, the retrieval approximation often has detrimental recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent work that employs a cross-encoder only, making search efficient using a relatively small number of anchor items and a CUR matrix factorization. While ANNCUR’s one-time selection of anchors tends to approximate the cross-encoder distances on average, doing so forfeits the capacity to accurately estimate distances to items near the query, leading to regret in the crucial end task: recall of top-k items. In this paper, we propose ADACUR, a method that adaptively, iteratively, and efficiently minimizes the approximation error for the practically important top-k neighbors. It does so by iteratively performing k-NN search using the anchors available so far, then adding these retrieved nearest neighbors to the anchor set for the next round. Empirically, on multiple datasets, in comparison to previous traditional and state-of-the-art methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed approach ADACUR consistently reduces recall error (by up to 70% in the important k = 1 setting) while using no more compute than its competitors.
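The adaptive loop described above can be sketched generically. The code below illustrates only the loop structure: the expensive cross-encoder is stood in by a black-box score_fn, and ANNCUR's CUR-based estimate is replaced by a crude nearest-anchor estimate over 1-D item features, so every name and modeling choice here is an assumption for the example, not the paper's method.

```python
import numpy as np

def adacur_style_search(items, query, score_fn, k, n_rounds=3, n_init=5, seed=0):
    """Sketch of the adaptive loop: score only anchor items exactly,
    estimate the rest from their nearest anchor, then promote the
    estimated top-k to anchors for the next round."""
    rng = np.random.default_rng(seed)
    n = len(items)
    anchors = set(rng.choice(n, size=min(n_init, n), replace=False).tolist())
    for _ in range(n_rounds):
        exact = {a: score_fn(query, items[a]) for a in anchors}
        a_idx = np.array(sorted(anchors))
        # Estimate each item's score by the score of its closest anchor
        # in item-feature space (a stand-in for the CUR-based estimate).
        est = np.array([exact[a_idx[np.argmin(np.abs(items[a_idx] - items[i]))]]
                        for i in range(n)])
        top = np.argsort(-est)[:k]
        anchors.update(top.tolist())    # refine anchors near the query
    exact = {a: score_fn(query, items[a]) for a in anchors}
    return sorted(exact, key=exact.get, reverse=True)[:k]
```

The point of the loop is that the exact-score budget concentrates around the query over rounds, which is exactly where a one-time anchor selection is weakest.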

We present a set of parallel algorithms for computing exact k-nearest neighbors in low dimensions. Many k-nearest neighbor algorithms use either a kd-tree or the Morton ordering of the point set; our algorithms combine these approaches using a data structure we call the zd-tree. We show that this combination is both theoretically efficient under common assumptions and fast in practice. For point sets of size n with bounded expansion constant and bounded ratio, the zd-tree can be built in O(n) work with O(n^ε) span for constant ε < 1, and searching for the k nearest neighbors of a point takes expected O(k log k) time. We benchmark our k-nearest neighbor algorithms against existing parallel k-nearest neighbor algorithms, showing that our implementations are generally faster than the state of the art as well as achieving a 75x speedup on 144 hyperthreads. Furthermore, the zd-tree supports parallel batch-dynamic insertions and deletions; to our knowledge, it is the first k-nearest neighbor data structure to support such updates. On point sets with bounded expansion constant and bounded ratio, a batch-dynamic update of size k requires O(k log n/k) work with O(k^ε + polylog(n)) span.
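The Morton (z-order) ordering that the zd-tree builds on is easy to show concretely: interleaving the bits of the coordinates produces a 1-D key under which spatially nearby points tend to be adjacent. A minimal 2-D sketch (not the paper's implementation):

```python
def interleave_bits(x, y, bits=16):
    # Morton (z-order) code: put bit i of x at position 2i and bit i of y
    # at position 2i + 1, so sorting by the code follows a Z-shaped curve.
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

points = [(5, 9), (100, 3), (4, 9), (5, 8)]
points.sort(key=lambda p: interleave_bits(*p))
```

Sorting by Morton code is what lets a tree be built over the point set with essentially one parallel sort, which is one source of the O(n) work bound quoted above.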