skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Friday, December 13 until 2:00 AM ET on Saturday, December 14 due to maintenance. We apologize for the inconvenience.


Title: Efficient Distributed Algorithms for the K-Nearest Neighbors Problem
The K-nearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of n data points with labels and a query point q, we want to assign a label to q based on the labels of the K-nearest points to the query. We study this problem in the k-machine model, a model for distributed large-scale data. In this model, we assume that the n points are distributed (in a balanced fashion) among the k machines and the goal is to compute an answer given a query point to a machine using a small number of communication rounds. Our main result is a randomized algorithm in the k-machine model that runs in O(log K) communication rounds with high success probability (regardless of the number of machines k and the number of points n). The message complexity of the algorithm is small taking only O(k log K) messages. Our bounds are essentially the best possible for comparison-based algorithms. We also implemented our algorithm and show that it performs well in practice.  more » « less
Award ID(s):
1633720
PAR ID:
10197167
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
SPAA '20: Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures
Page Range / eLocation ID:
527 to 529
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Given a set $P$ of $n$ points in the plane, we consider the problem of computing the number of points of $P$ in a query unit disk (i.e., all query disks have the same radius). We show that the main techniques for simplex range searching in the plane can be adapted to this problem. For example, by adapting Matoušek's results, we can build a data structure of $O(n)$ space in $O(n^{1+\delta})$ time (for any $\delta>0$) so that each query can be answered in $O(\sqrt{n})$ time; alternatively, we can build a data structure of $O(n^2/\log^2 n)$ space with $O(n^{1+\delta})$ preprocessing time (for any $\delta>0$) and $O(\log n)$ query time. Our techniques lead to improvements for several other classical problems in computational geometry. 1. Given a set of $n$ unit disks and a set of $n$ points in the plane, the batched unit-disk range counting problem is to compute for each disk the number of points in it. Previous work [Katz and Sharir, 1997] solved the problem in $O(n^{4/3}\log n)$ time. We give a new algorithm of $O(n^{4/3})$ time, which is optimal as it matches an $\Omega(n^{4/3})$-time lower bound. For small $\chi$, where $\chi$ is the number of pairs of unit disks that intersect, we further improve the algorithm to $O(n^{2/3}\chi^{1/3}+n^{1+\delta})$ time, for any $\delta>0$. 2. The above result immediately leads to an $O(n^{4/3})$ time optimal algorithm for counting the intersecting pairs of circles for a set of $n$ unit circles in the plane. The previous best algorithms solve the problem in $O(n^{4/3}\log n)$ deterministic time [Katz and Sharir, 1997] or in $O(n^{4/3}\log^{2/3} n)$ expected time by a randomized algorithm [Agarwal, Pellegrini, and Sharir, 1993]. 3. Given a set $P$ of $n$ points in the plane and an integer $k$, the distance selection problem is to find the $k$-th smallest distance among all pairwise distances of $P$. The problem can be solved in $O(n^{4/3}\log^2 n)$ deterministic time [Katz and Sharir, 1997] or in $O(n\log n+n^{2/3}k^{1/3}\log^{5/3}n)$ expected time by a randomized algorithm [Chan, 2001]. Our new randomized algorithm runs in $O(n\log n +n^{2/3}k^{1/3}\log n)$ expected time. 4. Given a set $P$ of $n$ points in the plane, the discrete $2$-center problem is to compute two smallest congruent disks whose centers are in $P$ and whose union covers $P$. An $O(n^{4/3}\log^5 n)$-time algorithm was known [Agarwal, Sharir, and Welzl, 1998]. Our techniques yield a deterministic algorithm of $O(n^{4/3}\log^{10/3} n\cdot (\log\log n)^{O(1)})$ time and a randomized algorithm of $O(n^{4/3}\log^3 n\cdot (\log\log n)^{1/3})$ expected time. 
    more » « less
  2. Pankratov, Denis (Ed.)
    Given a set $P$ of $n$ points in the plane, and a parameter $k$, we present an algorithm, whose running time is $O(n^{3/2} \sqrt{k}\log^{3/2} n + kn\log^2 n)$, with high probability, that computes a subset $Q* \subseteq P$ of $k$ points, that minimizes the Hausdorff distance between the convex-hulls of $Q*$ and $P$. This is the first subquadratic algorithm for this problem if $k$ is small. 
    more » « less
  3. Given a set of pairwise disjoint polygonal obstacles in the plane, finding an obstacle-avoiding Euclidean shortest path between two points is a classical problem in computational geometry and has been studied extensively. Previously, Hershberger and Suri (in SIAM Journal on Computing , 1999) gave an algorithm of O(n log n ) time and O(n log n ) space, where n is the total number of vertices of all obstacles. Recently, by modifying Hershberger and Suri’s algorithm, Wang (in SODA’21) reduced the space to O(n) while the runtime of the algorithm is still O(n log n ). In this article, we present a new algorithm of O(n+h log h ) time and O(n) space, provided that a triangulation of the free space is given, where h is the number of obstacles. The algorithm is better than the previous work when h is relatively small. Our algorithm builds a shortest path map for a source point s so that given any query point t , the shortest path length from s to t can be computed in O (log n ) time and a shortest s - t path can be produced in additional time linear in the number of edges of the path. 
    more » « less
  4. Given a collection of vectors, the approximate K-nearest-neighbor graph (KGraph for short) connects every vector to its approximate K-nearest-neighbors (KNN for short). KGraph plays an important role in high dimensional data visualization, semantic search, manifold learning, and machine learning. The vectors are typically vector representations of real-world objects (e.g., images and documents), which often come with a few structured attributes, such as times-tamps and locations. In this paper, we study the all-range approximate K-nearest-neighbor graph (ARKGraph) problem. Specifically, given a collection of vectors, each associated with a numerical search key (e.g., a timestamp), we aim to build an index that takes a search key range as the query and returns the KGraph of vectors whose search keys are within the query range. ARKGraph can facilitate interactive high dimensional data visualization, data mining, etc. A key challenge of this problem is the huge index size. This is because, given n vectors, a brute-force index stores a KGraph for every search key range, which results in O (K n 3 ) index size as there are O ( n 2 ) search key ranges and each KGraph takes O (K n ) space. We observe that the KNN of a vector in nearby ranges are often the same, which can be grouped together to save space. Based on this observation, we propose a series of novel techniques that reduce the index size significantly to just O (K n log n ) in the average case. Furthermore, we develop an efficient indexing algorithm that constructs the optimized ARKGraph index directly without exhaustively calculating the distance between every pair of vectors. To process a query, for each vector in the query range, we only need O (log log n + K log K) to restore its KNN in the query range from the optimized ARKGraph index. We conducted extensive experiments on real-world datasets. Experimental results show that our optimized ARKGraph index achieved a small index size, low query latency, and good scalability. Specifically, our approach was 1000x faster than the baseline method that builds a KGraph for all the vectors in the query range on-the-fly. 
    more » « less
  5. We present new algorithms for computing many faces in arrangements of lines and segments. Given a set $S$ of $n$ lines (resp., segments) and a set $P$ of $m$ points in the plane, the problem is to compute the faces of the arrangements of $S$ that contain at least one point of $P$. For the line case, we give a deterministic algorithm of $O(m^{2/3}n^{2/3}\log^{2/3} (n/\sqrt{m})+(m+n)\log n)$ time. This improves the previously best deterministic algorithm [Agarwal, 1990] by a factor of $\log^{2.22}n$ and improves the previously best randomized algorithm [Agarwal, Matoušek, and Schwarzkopf, 1998] by a factor of $\log^{1/3}n$ in certain cases (e.g., when $m=\Theta(n)$). For the segment case, we present a deterministic algorithm of $O(n^{2/3}m^{2/3}\log n+\tau(n\alpha^2(n)+n\log m+m)\log n)$ time, where $\tau=\min\{\log m,\log (n/\sqrt{m})\}$ and $\alpha(n)$ is the inverse Ackermann function. This improves the previously best deterministic algorithm [Agarwal, 1990] by a factor of $\log^{2.11}n$ and improves the previously best randomized algorithm [Agarwal, Matoušek, and Schwarzkopf, 1998] by a factor of $\log n$ in certain cases (e.g., when $m=\Theta(n)$). We also give a randomized algorithm of $O(m^{2/3}K^{1/3}\log n+\tau(n\alpha(n)+n\log m+m)\log n\log K)$ expected time, where $K$ is the number of intersections of all segments of $S$. In addition, we consider the query version of the problem, that is, preprocess $S$ to compute the face of the arrangement of $S$ that contains any given query point. We present new results that improve the previous work for both the line and the segment cases. In particulary, for the line case, we build a data structure of $O(n\log n)$ space in $O(n\log n)$ randomized time, so that the face containing the query point can be obtained in $O(\sqrt{n\log n})$ time with high probability (more specifically, the query returns a binary search tree representing the face so that standard binary-search-based queries on the face can be handled in $O(\log n)$ time each and the face itself can be output explicitly in time linear in its size). 
    more » « less