Algorithms often carry out equally many computations for “easy” and “hard” problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate knearestneighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the set of k nearest neighbors of a given query point. We pro pose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal out of algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
This content will become publicly available on March 18, 2023
RandomWalk Based Approximate kNearest Neighbors Algorithm for Diffusion State Distance
Diffusion State Distance (DSD) is a datadependent metric that compares data points using a datadriven diffusion process and provides a powerful tool for learning the underlying structure of highdimensional data. While finding the exact nearest neighbors in the DSD metric is computationally expensive, in this paper, we propose a new randomwalk based algorithm that empirically finds approximate knearest neighbors accurately in an efficient manner. Numerical results for realworld proteinprotein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate knearest neighbors performs well when used to predict proteins’ functional labels.
 Publication Date:
 NSFPAR ID:
 10346855
 Journal Name:
 LargeScale Scientific Computing. LSSC 2021, Springer Lecture Notes in Computer Science
 Sponsoring Org:
 National Science Foundation
More Like this


To explain the consonance of octaves, music psychologists represent pitch as a helix where azimuth and axial coordinate correspond to pitch class and pitch height respectively. This article addresses the problem of discovering this helical structure from unlabeled audio data. We measure Pearson correlations in the constantQ transform (CQT) domain to build a Knearest neighbor graph between frequency subbands. Then, we run the Isomap manifold learning algorithm to represent this graph in a threedimensional space in which straight lines approximate graph geodesics. Experiments on isolated musical notes demonstrate that the resulting manifold resembles a helix which makes a full turn at every octave. A circular shape is also found in English speech, but not in urban noise. We discuss the impact of various design choices on the visualization: instrumentarium, loudness mapping function, and number of neighbors K.

Many large multimedia applications require efficient processing of nearest neighbor queries. Often, multimedia data are represented as a collection of important highdimensional feature vectors. Existing Locality Sensitive Hashing (LSH) techniques require users to find topk similar feature vectors for each of the feature vectors that represent the query object. This leads to wasted and redundant work due to two main reasons: 1) not all feature vectors may contribute equally in finding the topk similar multimedia objects, and 2) feature vectors are treated independently during query processing. Additionally, there is no theoretical guarantee on the returned multimedia results. In this work, we propose a practical and efficient indexing approach for finding topk approximate nearest neighbors for multimedia data using LSH called mmLSH, which can provide theoretical guarantees on the returned multimedia results. Additionally, we present a bufferconscious strategy to speed up the query processing. Experimental evaluation shows significant gains in performance time and accuracy for different real multimedia datasets when compared against stateoftheart LSH techniques.

The Knearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n data points with labels and a query point q, we want to assign a label to q based on the labels of the Knearest points to the query. We study this problem in the kmachine model, a model for distributed largescale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines and the goal is to compute an answer given a query point to a machine using a small number of communication rounds. Our main result is a randomized algorithm in the kmachine model that runs in O(log K) communication rounds with high success probability (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is small taking only O(k log K) messages. Our bounds are essentially the best possible for comparisonbased algorithms. We also implemented our algorithm and show that it performs well in practice.

Mobile and IoT scenarios increasingly involve interactive and computation intensive contextual recognition. Existing optimizations typically resort to computation offloading or simplified ondevice processing. Instead, we observe that the same application is often invoked on multiple devices in close proximity. Moreover, the application instances often process similar contextual data that map to the same outcome. In this paper, we propose crossdevice approximate computation reuse, which minimizes redundant computation by harnessing the “equivalence” between different input values and reusing previously computed outputs with high confidence. We devise adaptive locality sensitive hashing (ALSH) and homogenized k nearest neighbors (HkNN). The former achieves scalable and constant lookup, while the latter provides highquality reuse and tunable accuracy guarantee. We further incorporate approximate reuse as a service, called FoggyCache, in the computation offloading runtime. Extensive evaluation shows that, when given 95% accuracy target, FoggyCache consistently harnesses over 90% of reuse opportunities, which translates to reduced computation latency and energy consumption by a factor of 3 to 10.