Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

We consider the (1+ϵ)approximate nearest neighbor search problem: given a set X of n points in a ddimensional space, build a data structure that, given any query point y, finds a point x∈X whose distance to y is at most (1+ϵ)minx∈X ‖x−y‖ for an accuracy parameter ϵ∈(0,1). Our main result is a data structure that occupies only O(ϵ^−2 n log(n)log(1/ϵ)) bits of space, assuming all point coordinates are integers in the range {−n^O(1)…n^O(1)}, i.e., the coordinates have O(logn) bits of precision. This improves over the best previously known space bound of O(ϵ^−2 n log(n)^2), obtained via the randomized dimensionality reduction method of Johnson and Lindenstrauss (1984). We also consider the more general problem of estimating all distances from a collection of query points to all data points X, and provide almost tight upper and lower bounds for the space complexity of this problem.

We study the classic set cover problem from the perspective of sublinear algorithms. Given access to a collection of m sets over n elements in the query model, we show that sublinear algorithms derived from existing techniques have almost tight query complexities. On one hand, first we show an adaptation of the streaming algorithm presented in [17] to the sublinear query model, that returns an αapproximate cover using Õ(m(n/k)^1/(α–1) + nk) queries to the input, where k denotes the value of a minimum set cover. We then complement this upper bound by proving that for lower values of k, the required number of queries is , even for estimating the optimal cover size. Moreover, we prove that even checking whether a given collection of sets covers all the elements would require Ω(nk) queries. These two lower bounds provide strong evidence that the upper bound is almost tight for certain values of the parameter k. On the other hand, we show that this bound is not optimal for larger values of the parameter k, as there exists a (1 + ε)approximation algorithm with Õ(mn/kε^2) queries. We show that this bound is essentially tight for sufficiently small constant ε, by establishing amore »

Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there is a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of a neural network. In particular, we give conditional hardness results for these problems based on complexitytheoretic assumptions such as the Strong Exponential Time Hypothesis. Under these assumptions, we show that there are no algorithms that solve the aforementioned ERM problems to high accuracy in subquadratic time. We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many nonconvex learning tasks.

We introduce a new distancepreserving compact representation of multidimensional pointsets. Given n points in a ddimensional space where each coordinate is represented using B bits (i.e., dB bits per point), it produces a representation of size O( d log(d B/epsilon) +log n) bits per point from which one can approximate the distances up to a factor of 1 + epsilon. Our algorithm almost matches the recent bound of Indyk et al, 2017} while being much simpler. We compare our algorithm to Product Quantization (PQ) (Jegou et al, 2011) a state of the art heuristic metric compression method. We evaluate both algorithms on several data sets: SIFT, MNIST, New York City taxi time series and a synthetic onedimensional data set embedded in a highdimensional space. Our algorithm produces representations that are comparable to or better than those produced by PQ, while having provable guarantees on its performance.