Self-supervised metric learning has been a successful approach for learning a distance from an unlabeled dataset. The resulting distance is broadly useful for improving various distance-based downstream tasks, even when no information from the downstream tasks is used in the metric learning stage. To gain insight into this approach, we develop a statistical framework to theoretically study how self-supervised metric learning can benefit downstream tasks in the context of multi-view data. Under this framework, we show that the target distance of metric learning satisfies several desired properties for the downstream tasks. On the other hand, our investigation suggests that the target distance can be further improved by moderating each direction's weights. In addition, our analysis precisely characterizes the improvement from self-supervised metric learning on four commonly used downstream tasks: sample identification, two-sample testing, k-means clustering, and k-nearest neighbor classification. When the distance is estimated from an unlabeled dataset, we establish an upper bound on the accuracy of the distance estimate and the number of samples sufficient for downstream task improvement. Finally, numerical experiments are presented to support the theoretical results in the paper.
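The role a learned metric plays in a downstream task such as k-nearest neighbor classification can be illustrated with a minimal sketch. Here the per-direction weight vector `w` stands in for a metric obtained in a self-supervised stage (the weights and all function names are hypothetical, not from the paper); moderating `w` changes how strongly each direction contributes to the distance.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_query, w, k=5):
    """k-NN classification under a direction-weighted Euclidean distance.

    w is a hypothetical per-direction weight vector standing in for a
    self-supervised learned metric; w = ones recovers plain Euclidean k-NN.
    """
    # Weighted squared distance: sum_j w_j * (x_train_j - x_query_j)^2
    d2 = ((X_train - x_query) ** 2 * w).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    # Majority vote among the k nearest training labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Down-weighting a noisy direction (small `w_j`) makes the vote depend only on informative directions, which is one way a better metric can improve the classifier.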
A Simple Algorithm for kNN Sampling in General Metrics
Finding the kth nearest neighbor to a query point is a ubiquitous operation in many types of metric computations, especially in unsupervised machine learning. In many such cases, the distance to the kth nearest sample point is used as an estimate of the local density of the sample. In this paper, we give an algorithm that takes a finite metric (P,d) and an integer k and produces a subset S ⊆ P with the property that for any q ∈ P, the distance from q to its second nearest point in S is a constant-factor approximation of the distance from q to its kth nearest point in P. Thus, the sample S may be used in lieu of P. In addition to being much smaller than P, distance queries on S only require finding the second nearest neighbor instead of the kth nearest neighbor. This is a significant improvement, especially because theoretical guarantees on kth nearest neighbor methods often require k to grow as a function of the input size n.
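The density estimate the abstract refers to can be sketched briefly. The snippet below is the standard kth-nearest-neighbor density estimator, not the paper's sampling algorithm; the 2-D ball-volume normalization is an illustrative assumption.

```python
import numpy as np

def knn_density(P, q, k):
    """Estimate local density at q from the distance to its kth nearest
    neighbor in P. Standard estimator; the normalization below assumes
    2-D Euclidean data and is illustrative only.
    """
    dists = np.sort(np.linalg.norm(P - q, axis=1))
    r_k = dists[k - 1]  # distance to the kth nearest point of P
    n = len(P)
    # density ~ k / (n * area of the ball of radius r_k), in 2-D
    return k / (n * np.pi * r_k ** 2)
```

The paper's contribution is that a much smaller sample S can replace P here, with the *second* nearest point of S playing the role of the kth nearest point of P up to a constant factor.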
 Award ID(s):
 2017980
 NSF-PAR ID:
 10211947
 Editor(s):
 Keil, Mark; Mondal, Debajyoti
 Date Published:
 Journal Name:
 Proceedings of the 32nd Canadian Conference on Computational Geometry
 Page Range / eLocation ID:
 345–351
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Algorithms often carry out equally many computations for “easy” and “hard” problem instances. In particular, algorithms for finding nearest neighbors typically have the same running time regardless of the particular problem instance. In this paper, we consider the approximate k-nearest-neighbor problem, which is the problem of finding a subset of O(k) points in a given set of points that contains the k nearest neighbors of a given query point. We propose an algorithm based on adaptively estimating the distances, and show that it is essentially optimal among algorithms that are only allowed to adaptively estimate distances. We then demonstrate both theoretically and experimentally that the algorithm can achieve significant speedups relative to the naive method.
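The idea of spending less work on easy instances by estimating distances before computing them exactly can be sketched in a simplified two-stage form. This is only a stand-in for the adaptive-estimation approach the abstract describes (the paper's algorithm refines its estimates adaptively rather than in one fixed stage); all names and the candidate-budget heuristic here are illustrative.

```python
import numpy as np

def adaptive_knn(X, q, k, sample_dims=32, keep=4):
    """Approximate k-NN via cheap distance estimates.

    Stage 1 estimates distances from a random subset of coordinates;
    stage 2 computes exact distances only for the shortlisted candidates.
    A simplified sketch, not the paper's (essentially optimal) algorithm.
    """
    n, d = X.shape
    rng = np.random.default_rng(0)
    dims = rng.choice(d, size=min(sample_dims, d), replace=False)
    # Stage 1: coarse estimates from sampled coordinates only.
    est = np.linalg.norm(X[:, dims] - q[dims], axis=1)
    candidates = np.argsort(est)[:max(keep * k, k)]
    # Stage 2: exact distances on the O(k) shortlisted candidates.
    exact = np.linalg.norm(X[candidates] - q, axis=1)
    return candidates[np.argsort(exact)[:k]]
```

On easy instances (well-separated neighbors) the coarse estimates already rank the true neighbors first, so the expensive exact computation touches only O(k) points instead of all n.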

We present a set of parallel algorithms for computing exact k-nearest neighbors in low dimensions. Many k-nearest neighbor algorithms use either a k-d tree or the Morton ordering of the point set; our algorithms combine these approaches using a data structure we call the zd-tree. We show that this combination is both theoretically efficient under common assumptions and fast in practice. For point sets of size n with bounded expansion constant and bounded ratio, the zd-tree can be built in O(n) work with O(n^ε) span for constant ε < 1, and searching for the k nearest neighbors of a point takes expected O(k log k) time. We benchmark our k-nearest neighbor algorithms against existing parallel k-nearest neighbor algorithms, showing that our implementations are generally faster than the state of the art as well as achieving a 75x speedup on 144 hyperthreads. Furthermore, the zd-tree supports parallel batch-dynamic insertions and deletions; to our knowledge, it is the first k-nearest neighbor data structure to support such updates. On point sets with bounded expansion constant and bounded ratio, a batch-dynamic update of size k requires O(k log n/k) work with O(k^ε + polylog(n)) span.
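The Morton ordering mentioned above is obtained by interleaving the bits of a point's coordinates, so that sorting by the resulting code groups spatially nearby points together. A minimal 2-D sketch (not the paper's implementation, which combines this ordering with a tree structure):

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (z-order) code of a 2-D grid point by bit interleaving.

    Sorting points by this code yields the Morton ordering that
    zd-tree-style structures exploit for spatial locality.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits in even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits in odd positions
    return code
```

Walking the sorted codes traverses the plane along a Z-shaped curve, which is why points that are close in space tend to be close in the ordering.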

Data-sensitive metrics adapt distances locally based on the density of data points, with the goal of aligning distances with some notion of similarity. In this paper, we give the first exact algorithm for computing a data-sensitive metric called the nearest neighbor metric. In fact, we prove the surprising result that a previously published 3-approximation is an exact algorithm. The nearest neighbor metric can be viewed as a special case of a density-based distance used in machine learning, or it can be seen as an example of a manifold metric. Previous computational research on such metrics despaired of computing exact distances on account of the apparent difficulty of minimizing over all continuous paths between a pair of points. We leverage the exact computation of the nearest neighbor metric to compute sparse spanners and persistent homology. We also explore the behavior of the metric built from point sets drawn from an underlying distribution, and consider the more general case of inputs that are finite collections of path-connected compact sets. The main results connect several classical theories, such as the conformal change of Riemannian metrics, the theory of positive definite functions of Schoenberg, and the screw function theory of Schoenberg and von Neumann. We also develop some novel proof techniques based on the combination of screw functions and Lipschitz extensions that may be of independent interest.

Diffusion State Distance (DSD) is a data-dependent metric that compares data points using a data-driven diffusion process and provides a powerful tool for learning the underlying structure of high-dimensional data. While finding the exact nearest neighbors in the DSD metric is computationally expensive, in this paper we propose a new random-walk based algorithm that empirically finds approximate k-nearest neighbors accurately in an efficient manner. Numerical results for real-world protein–protein interaction networks are presented to illustrate the efficiency and robustness of the proposed algorithm. The set of approximate k-nearest neighbors performs well when used to predict proteins' functional labels.
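The DSD metric itself can be sketched directly from its definition: each node is described by the expected number of visits a random walk starting there makes to every other node, and DSD is the L1 distance between these visit profiles. The finite-step truncation below is a simplification (the original definition takes the number of steps to infinity on an ergodic walk), and it is not the paper's approximate-neighbor algorithm, which avoids forming this matrix.

```python
import numpy as np

def dsd(A, n_steps=10):
    """Diffusion State Distance matrix from an adjacency matrix A.

    He[u] accumulates the expected visits to each node over n_steps of a
    simple random walk started at u; DSD(u, v) = ||He[u] - He[v]||_1.
    Finite-step sketch of the definition, not an efficient algorithm.
    """
    P = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    He = np.eye(len(A))                   # step 0: walker sits at its start node
    Pt = np.eye(len(A))
    for _ in range(n_steps):
        Pt = Pt @ P                       # t-step transition probabilities
        He += Pt
    # Pairwise L1 distances between rows of He.
    return np.abs(He[:, None, :] - He[None, :, :]).sum(axis=2)
```

Computing all pairwise visit profiles costs dense matrix products, which is exactly the expense that motivates the random-walk approximation the abstract proposes.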