Algorithms often carry out equally many computations for “easy” and “hard” problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate k-nearest-neighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We propose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal among algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
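The idea of adaptively estimating distances can be illustrated with a small sketch: estimate each squared distance from a random sample of coordinates, maintain crude confidence intervals, and stop refining points that are provably far from the query. The elimination rule and the confidence width `c / sqrt(count)` below are illustrative assumptions, not the paper's actual bounds.

```python
import numpy as np

def adaptive_knn(points, query, k, batch=32, c=2.0, seed=0):
    """Illustrative adaptive k-NN sketch: estimate squared distances by
    sampling coordinates, eliminating points that are provably far away.
    The confidence width used here is a heuristic, not the paper's bound."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    active = np.arange(n)              # candidate indices still in play
    sums = np.zeros(n)                 # running sum of sampled squared diffs
    counts = np.zeros(n, dtype=int)    # number of coordinates sampled so far
    while len(active) > k and counts[active[0]] < d:
        idx = rng.choice(d, size=min(batch, d), replace=False)
        diffs = (points[active][:, idx] - query[idx]) ** 2
        sums[active] += diffs.sum(axis=1)
        counts[active] += len(idx)
        est = sums[active] / counts[active]     # mean per-coordinate gap
        width = c / np.sqrt(counts[active])     # crude confidence width
        # keep a point only if its lower bound beats the k-th upper bound
        kth_upper = np.sort(est + width)[k - 1]
        active = active[est - width <= kth_upper]
    # fall back to exact distances among the surviving O(k)-sized candidate set
    exact = ((points[active] - query) ** 2).sum(axis=1)
    return active[np.argsort(exact)[:k]]
```

On well-separated data most points are eliminated after one small batch of coordinates, so far fewer than n*d coordinate accesses are needed.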
Random-Walk Based Approximate k-Nearest Neighbors Algorithm for Diffusion State Distance
Diffusion State Distance (DSD) is a data-dependent metric that compares data points using a data-driven diffusion process and provides a powerful tool for learning the underlying structure of high-dimensional data. Finding exact nearest neighbors in the DSD metric is computationally expensive, so in this paper we propose a new random-walk-based algorithm that empirically finds approximate k-nearest neighbors accurately and efficiently. Numerical results for real-world protein–protein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate k-nearest neighbors performs well when used to predict proteins’ functional labels.
Publication Date:
NSF-PAR ID: 10346855
Journal Name: Large-Scale Scientific Computing (LSSC 2021), Springer Lecture Notes in Computer Science
Sponsoring Org: National Science Foundation
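A Monte-Carlo flavor of the random-walk idea can be sketched as follows: estimate each node's diffusion signature by simulating fixed-length random walks and counting visits, then rank neighbors by the L1 distance between signatures (the distance DSD is built on). The walk length, walk count, and the all-pairs signature computation are simplifying assumptions for illustration; the paper's algorithm is more efficient.

```python
import numpy as np

def dsd_signatures(adj, walk_len=10, n_walks=200, seed=0):
    """Monte-Carlo sketch of DSD-style features: the signature of node u is
    its empirical expected-visit-count vector, estimated by simulating
    fixed-length random walks starting from u. `adj` is an adjacency list."""
    rng = np.random.default_rng(seed)
    n = len(adj)
    sig = np.zeros((n, n))
    for u in range(n):
        for _ in range(n_walks):
            v = u
            for _ in range(walk_len):
                sig[u, v] += 1               # count the visit to v
                v = rng.choice(adj[v])       # step to a uniform random neighbor
            sig[u, v] += 1                   # count the final position
    return sig / n_walks

def approx_knn_dsd(adj, u, k, **kw):
    """Approximate k nearest neighbors of node u under the estimated
    diffusion signatures, using L1 distance as in DSD."""
    sig = dsd_signatures(adj, **kw)
    dist = np.abs(sig - sig[u]).sum(axis=1)
    order = np.argsort(dist)
    return [v for v in order if v != u][:k]
```

On a graph made of two triangles joined by a bridge edge, a node's nearest neighbors under this estimate are the other members of its own triangle, matching the intuition that DSD reflects diffusion structure rather than raw hop counts.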
More Like this


To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constant-Q transform (CQT) domain to build a K-nearest-neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a three-dimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.
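The graph-building step of this pipeline can be sketched with numpy alone: treat each frequency subband as a time series, use 1 − Pearson correlation as a dissimilarity, and connect each subband to its K most-correlated neighbors. (The paper then embeds this graph in three dimensions with Isomap, e.g. scikit-learn's implementation; that step is omitted here to keep the sketch dependency-free.)

```python
import numpy as np

def knn_graph_from_correlation(subbands, k):
    """Sketch of the K-NN graph construction: rows of `subbands` are
    per-subband time series (e.g. CQT magnitudes over time); each row is
    linked to its k nearest rows under the 1 - Pearson-correlation distance."""
    corr = np.corrcoef(subbands)      # pairwise Pearson correlations
    dist = 1.0 - corr                 # correlated subbands -> small distance
    np.fill_diagonal(dist, np.inf)    # exclude self-loops
    return np.argsort(dist, axis=1)[:, :k]
```

Octave-related subbands tend to be strongly correlated in musical audio, which is what lets the downstream Isomap embedding recover the helix.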

Many large multimedia applications require efficient processing of nearest neighbor queries. Often, multimedia data are represented as a collection of important high-dimensional feature vectors. Existing Locality Sensitive Hashing (LSH) techniques require users to find top-k similar feature vectors for each of the feature vectors that represent the query object. This leads to wasted and redundant work due to two main reasons: 1) not all feature vectors may contribute equally in finding the top-k similar multimedia objects, and 2) feature vectors are treated independently during query processing. Additionally, there is no theoretical guarantee on the returned multimedia results. In this work, we propose a practical and efficient indexing approach for finding top-k approximate nearest neighbors for multimedia data using LSH called mmLSH, which can provide theoretical guarantees on the returned multimedia results. Additionally, we present a buffer-conscious strategy to speed up the query processing. Experimental evaluation shows significant gains in performance time and accuracy for different real multimedia datasets when compared against state-of-the-art LSH techniques.
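For readers unfamiliar with the LSH machinery that mmLSH builds on, here is a generic random-hyperplane LSH sketch (this is the textbook scheme, not the paper's mmLSH): each vector is hashed to the sign pattern of a few random projections, so similar vectors tend to land in the same bucket, and a query only scans its own bucket instead of the whole dataset.

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(data, n_bits=8, seed=0):
    """Hash each row of `data` to the sign pattern of n_bits random
    projections (random-hyperplane LSH); rows sharing a pattern share a bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(n_bits, data.shape[1]))
    codes = data @ planes.T > 0
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[code.tobytes()].append(i)
    return planes, buckets

def query_topk(q, data, planes, buckets, k):
    """Scan only the query's bucket and rank those candidates exactly."""
    code = (planes @ q > 0).tobytes()
    cand = buckets.get(code, [])
    return sorted(cand, key=lambda i: np.linalg.norm(data[i] - q))[:k]
```

A single hash table can miss true neighbors; practical systems (including the LSH baselines the paper compares against) use several tables and union their candidate sets to trade memory for recall.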

Abstract. Similarity search is a fundamental building block for information retrieval on a variety of datasets. The notion of a neighbor is often based on binary considerations, such as the k nearest neighbors. However, considering that data is often organized as a manifold with low intrinsic dimension, the notion of a neighbor must recognize higher-order relationships to capture neighbors in all directions. Proximity graphs such as the Relative Neighborhood Graph (RNG) use trinary relationships which capture the notion of direction and have been successfully used in a number of applications. However, the current algorithms for computing the RNG, despite widespread use, are approximate and not scalable. This paper proposes a novel type of graph, the Generalized Relative Neighborhood Graph (GRNG), for use in a pivot layer that then guides the efficient and exact construction of the RNG of a set of exemplars. It also shows how to extend this to a multi-layer hierarchy which significantly improves over the state-of-the-art methods, which can only construct an approximate RNG.
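The trinary relationship that defines the RNG is easy to state as a reference implementation (this brute-force O(n³) check is for illustration only; it is exactly the kind of unscalable construction the paper's GRNG pivot hierarchy avoids): p and q are neighbors iff no third point r is closer to both of them than they are to each other.

```python
import numpy as np
from itertools import combinations

def relative_neighborhood_graph(points):
    """Brute-force RNG: edge (p, q) exists iff there is no point r with
    max(d(p, r), d(q, r)) < d(p, q), i.e. no point inside the 'lune' of p and q."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    n = len(points)
    edges = []
    for p, q in combinations(range(n), 2):
        blocked = any(max(d[p, r], d[q, r]) < d[p, q]
                      for r in range(n) if r not in (p, q))
        if not blocked:
            edges.append((p, q))
    return edges
```

For three collinear points the middle point blocks the long edge, which is how the lune condition encodes direction: each point keeps neighbors spread around it rather than just the k closest.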

The K-nearest-neighbors problem is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n data points with labels and a query point q, we want to assign a label to q based on the labels of the K nearest points to the query. We study this problem in the k-machine model, a model for distributed large-scale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines, and the goal is to compute the answer to a query point given to a machine using a small number of communication rounds. Our main result is a randomized algorithm in the k-machine model that runs in O(log K) communication rounds with high success probability (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is small, taking only O(k log K) messages. Our bounds are essentially the best possible for comparison-based algorithms. We also implemented our algorithm and show that it performs well in practice.
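The distributed setting can be illustrated with a naive one-round baseline (an assumption-laden sketch, not the paper's O(log K)-round algorithm): each machine reports its local K nearest points to the query, and a coordinator merges the k·K candidates, takes the global K nearest, and outputs a majority label.

```python
import numpy as np

def distributed_knn_label(shards, labels, q, K):
    """One-round baseline for K-NN classification in a k-machine setting:
    `shards` is a list of per-machine point arrays, `labels` the matching
    per-machine label arrays. Each machine sends its local top-K; the
    coordinator merges them and returns the majority label of the global top-K."""
    cand_d, cand_l = [], []
    for pts, labs in zip(shards, labels):       # one iteration per machine
        d = np.linalg.norm(pts - q, axis=1)
        top = np.argsort(d)[:K]                 # local K nearest to the query
        cand_d.extend(d[top])
        cand_l.extend(labs[top])
    order = np.argsort(cand_d)[:K]              # coordinator-side merge
    votes = [cand_l[i] for i in order]
    return max(set(votes), key=votes.count)
```

This baseline already uses only one round but sends k·K candidate points; the interest of the paper's algorithm is in doing better on message size while keeping the round count at O(log K).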