Algorithms often carry out equally many computations for “easy” and “hard” problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate knearestneighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We pro pose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal out of algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
more »
« less
RandomWalk Based Approximate kNearest Neighbors Algorithm for Diffusion State Distance
Diffusion State Distance (DSD) is a datadependent metric that compares data points using a datadriven diffusion process and provides a powerful tool for learning the underlying structure of highdimensional data. While finding the exact nearest neighbors in the DSD metric is computationally expensive, in this paper, we propose a new randomwalk based algorithm that empirically finds approximate knearest neighbors accurately in an efficient manner. Numerical results for realworld proteinprotein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate knearest neighbors performs well when used to predict proteins’ functional labels.
more »
« less
 NSFPAR ID:
 10346855
 Date Published:
 Journal Name:
 LargeScale Scientific Computing. LSSC 2021, Springer Lecture Notes in Computer Science
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


The Knearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n data points with labels and a query point q, we want to assign a label to q based on the labels of the Knearest points to the query. We study this problem in the kmachine model, a model for distributed largescale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines and the goal is to compute an answer given a query point to a machine using a small number of communication rounds. Our main result is a randomized algorithm in the kmachine model that runs in O(log K) communication rounds with high success probability (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is small taking only O(k log K) messages. Our bounds are essentially the best possible for comparisonbased algorithms. We also implemented our algorithm and show that it performs well in practice.more » « less

Given a collection of vectors, the approximate Knearestneighbor graph (KGraph for short) connects every vector to its approximate Knearestneighbors (KNN for short). KGraph plays an important role in high dimensional data visualization, semantic search, manifold learning, and machine learning. The vectors are typically vector representations of realworld objects (e.g., images and documents), which often come with a few structured attributes, such as timestamps and locations. In this paper, we study the allrange approximate Knearestneighbor graph (ARKGraph) problem. Specifically, given a collection of vectors, each associated with a numerical search key (e.g., a timestamp), we aim to build an index that takes a search key range as the query and returns the KGraph of vectors whose search keys are within the query range. ARKGraph can facilitate interactive high dimensional data visualization, data mining, etc. A key challenge of this problem is the huge index size. This is because, given n vectors, a bruteforce index stores a KGraph for every search key range, which results in O (K n 3 ) index size as there are O ( n 2 ) search key ranges and each KGraph takes O (K n ) space. We observe that the KNN of a vector in nearby ranges are often the same, which can be grouped together to save space. Based on this observation, we propose a series of novel techniques that reduce the index size significantly to just O (K n log n ) in the average case. Furthermore, we develop an efficient indexing algorithm that constructs the optimized ARKGraph index directly without exhaustively calculating the distance between every pair of vectors. To process a query, for each vector in the query range, we only need O (log log n + K log K) to restore its KNN in the query range from the optimized ARKGraph index. We conducted extensive experiments on realworld datasets. Experimental results show that our optimized ARKGraph index achieved a small index size, low query latency, and good scalability. Specifically, our approach was 1000x faster than the baseline method that builds a KGraph for all the vectors in the query range onthefly.more » « less

We present a set of parallel algorithms for computing exact knearest neighbors in low dimensions. Many knearest neighbor algorithms use either a kdtree or the Morton ordering of the point set; our algorithms combine these approaches using a data structure we call the zdtree. We show that this combination is both theoretically efficient under common assumptions, and fast in practice. For point sets of size n with bounded expansion constant and bounded ratio, the zdtree can be built in O(n) work with O(n^ε) span for constant ε < 1, and searching for the knearest neighbors of a point takes expected O(k log k) time. We benchmark our knearest neighbor algorithms against existing parallel knearest neighbor algorithms, showing that our implementations are generally faster than the state of the art as well as achieving 75x speedup on 144 hyperthreads. Furthermore, the zdtree supports parallel batchdynamic insertions and deletions; to our knowledge, it is the first knearest neighbor data structure to support such updates. On point sets with bounded expansion constant and bounded ratio, a batchdynamic update of size k requires O(k log n/k) work with O(k^ε + polylog(n)) span.more » « less

null (Ed.)To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constantQ transform (CQT) domain to build a Knearest neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a threedimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.more » « less