Title: Nearest neighbor search under uncertainty
Nearest Neighbor Search (NNS) is a central task in knowledge representation, learning, and reasoning. There is a vast literature on efficient algorithms for constructing data structures and performing exact and approximate NNS. This paper studies NNS under Uncertainty (NNSU). Specifically, consider the setting in which an NNS algorithm has access only to a stochastic distance oracle that provides a noisy, unbiased estimate of the distance between any pair of points, rather than the exact distance. This models many situations of practical importance, including NNS based on human similarity judgements, physical measurements, or fast, randomized approximations to exact distances. A naive approach to NNSU could employ any standard NNS algorithm and, whenever it needs a pairwise distance, repeatedly query the stochastic oracle and average the results to reduce noise. The problem is that the number of repeated queries that suffices is unknown in advance; e.g., a point may be distant from all but one other point (crude distance estimates suffice) or it may be close to a large number of other points (accurate estimates are necessary). This paper shows how ideas from cover trees and multi-armed bandits can be leveraged to develop an NNSU algorithm that has optimal dependence on the dataset size and the (unknown) geometry of the dataset.
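The adaptive-sampling idea in the abstract can be illustrated with a small sketch. This is not the paper's cover-tree algorithm; it is a generic successive-elimination bandit routine for a single query point, written under assumptions that do not come from the abstract: the oracle noise is Gaussian with a known standard deviation `sigma`, and the names `make_noisy_oracle` and `nearest_neighbor_successive_elimination` are hypothetical. It shows how far-away candidates can be discarded after a few noisy queries while close candidates receive more.

```python
import math
import random


def make_noisy_oracle(points, sigma, rng):
    """Hypothetical stochastic oracle: true Euclidean distance plus zero-mean Gaussian noise."""
    def oracle(i, j):
        return math.dist(points[i], points[j]) + rng.gauss(0.0, sigma)
    return oracle


def nearest_neighbor_successive_elimination(oracle, query, candidates, sigma,
                                            delta=0.05, max_rounds=100_000):
    """Return (best_candidate, queries_per_candidate).

    Every round queries each still-active candidate once, then drops any
    candidate whose lower confidence bound lies above the smallest upper
    confidence bound. Assumes sub-Gaussian noise with known scale sigma.
    """
    n = len(candidates)
    active = list(candidates)
    sums = {c: 0.0 for c in candidates}
    counts = {c: 0 for c in candidates}
    for t in range(1, max_rounds + 1):
        for c in active:
            sums[c] += oracle(query, c)
            counts[c] += 1
        # Anytime confidence radius (union bound over candidates and rounds).
        radius = sigma * math.sqrt(2.0 * math.log(4.0 * n * t * t / delta) / t)
        means = {c: sums[c] / t for c in active}
        best_upper = min(means.values()) + radius
        active = [c for c in active if means[c] - radius <= best_upper]
        if len(active) == 1:
            break
    # All survivors have the same count, so comparing sums compares means.
    best = min(active, key=lambda c: sums[c])
    return best, counts


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 5.0), (7.0, 0.0)]
    oracle = make_noisy_oracle(pts, sigma=0.5, rng=rng)
    best, counts = nearest_neighbor_successive_elimination(
        oracle, query=0, candidates=[1, 2, 3, 4], sigma=0.5)
    print("estimated nearest neighbor:", best)
    print("oracle queries per candidate:", counts)
```

The number of queries spent on each candidate is decided by the data rather than fixed up front, which is the point the abstract makes against uniform averaging. The sketch still touches every candidate at least once, so it does not show how a search structure such as a cover tree would reduce the dependence on dataset size.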
Award ID(s):
2023239
NSF-PAR ID:
10533110
Publisher / Repository:
Uncertainty in Artificial Intelligence
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Czumaj, Artur ; Dawar, Anuj ; Merelli, Emanuela (Ed.)
    Consider a set P ⊆ ℝ^d of n points, and a convex body C provided via a separation oracle. The task at hand is to decide for each point of P if it is in C using the fewest number of oracle queries. We show that one can solve this problem in two and three dimensions using O(⬡_P log n) queries, where ⬡_P is the largest subset of points of P in convex position. In 2D, we provide an algorithm which efficiently generates these adaptive queries. Furthermore, we show that in two dimensions one can solve this problem using O(⊚(P,C) log² n) oracle queries, where ⊚(P,C) is a lower bound on the minimum number of queries that any algorithm for this specific instance requires. Finally, we consider other variations on the problem, such as using the fewest number of queries to decide if C contains all points of P. As an application of the above, we show that the discrete geometric median of a point set P in ℝ² can be computed in O(n log² n (log n log log n + ⬡(P))) expected time. 
  2. We give nearly matching upper and lower bounds on the oracle complexity of finding ϵ-stationary points (∥∇F(x)∥ ≤ ϵ) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent “recursive regularization” technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff.
  3. This paper presents a novel accelerated exact k-means algorithm called "Ball k-means", which uses a ball to describe each cluster and focuses on reducing point-centroid distance computations. It can exactly find the neighbor clusters of each cluster, so that distances are computed only between a point and the centroids of its neighbor clusters instead of all centroids. Moreover, each cluster can be divided into a "stable area" and an "active area", and the latter is further divided into exact "annular areas". The assignment of points in the "stable area" does not change, while the points in each "annular area" are adjusted within a few neighbor clusters. There are no upper or lower bounds in the whole process. Moreover, Ball k-means uses ball clusters and neighbor searching along with multiple novel stratagems for reducing centroid distance computations. In comparison with the current state-of-the-art accelerated exact bounded methods, the Yinyang algorithm and the Exponion algorithm, as well as other top-of-the-line tree-based and bounded methods, Ball k-means attains higher performance and performs fewer distance calculations, especially for large-k problems. Its faster speed, lack of extra parameters, and simpler design make "Ball k-means" an all-around replacement for naive k-means.
  4. Czumaj, Artur ; Xin, Qin (Ed.)
    Representation of Euclidean objects in a digital space has been a focus of research for over 30 years. Digital line segments are particularly important as other digital objects depend on their definition (e.g., digital convex objects or digital star-shaped objects). It may be desirable for the digital line segment systems to satisfy some nice properties that their Euclidean counterparts also satisfy. The system is a consistent digital line segment system (CDS) if it satisfies five properties, most notably the subsegment property (the intersection of any two digital line segments should be connected) and the prolongation property (any digital line segment should be able to be extended into a digital line). It is known that any CDS must have Ω(log n) Hausdorff distance to their Euclidean counterparts, where n is the number of grid points on a segment. In fact this lower bound even applies to consistent digital rays (CDR) where for a fixed p ∈ ℤ², we consider the digital segments from p to q for each q ∈ ℤ². In this paper, we consider families of weak consistent digital rays (WCDR) where we maintain four of the CDR properties but exclude the prolongation property. In this paper, we give a WCDR construction that has optimal Hausdorff distance to the exact constant. That is, we give a construction whose Hausdorff distance is 1.5 under the L_∞ metric, and we show that for every ε > 0, it is not possible to have a WCDR with Hausdorff distance at most 1.5 - ε. 
  5. Data-sensitive metrics adapt distances locally based on the density of data points with the goal of aligning distances and some notion of similarity. In this paper, we give the first exact algorithm for computing a data-sensitive metric called the nearest neighbor metric. In fact, we prove the surprising result that a previously published 3-approximation is an exact algorithm. The nearest neighbor metric can be viewed as a special case of a density-based distance used in machine learning, or it can be seen as an example of a manifold metric. Previous computational research on such metrics despaired of computing exact distances on account of the apparent difficulty of minimizing over all continuous paths between a pair of points. We leverage the exact computation of the nearest neighbor metric to compute sparse spanners and persistent homology. We also explore the behavior of the metric built from point sets drawn from an underlying distribution and consider the more general case of inputs that are finite collections of path-connected compact sets. The main results connect several classical theories such as the conformal change of Riemannian metrics, the theory of positive definite functions of Schoenberg, and screw function theory of Schoenberg and Von Neumann. We also develop some novel proof techniques based on the combination of screw functions and Lipschitz extensions that may be of independent interest.