skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Coresets for Classification - Simplified and Strengthened
We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $$(1\pm \epsilon)$$ relative error with $$\tilde O(d \cdot \mu_y(X)^2/\epsilon^2)$$ points, where $$\mu_y(X)$$ is a natural complexity measure of the data matrix $$X \in \mathbb{R}^{n \times d}$$ and label vector $$y \in \{-1,1\}^n$$, introduced in Munteanu et al. 2018. Our result is based on subsampling data points with probabilities proportional to their \textit{$$\ell_1$$ Lewis weights}. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.  more » « less
Award ID(s):
2046235
PAR ID:
10326700
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study the $$\ell_p$$ regression problem, which requires finding $$\mathbf{x}\in\mathbb R^{d}$$ that minimizes $$\|\mathbf{A}\mathbf{x}-\mathbf{b}\|_p$$ for a matrix $$\mathbf{A}\in\mathbb R^{n \times d}$$ and response vector $$\mathbf{b}\in\mathbb R^{n}$$. There has been recent interest in developing subsampling methods for this problem that can outperform standard techniques when $$n$$ is very large. However, all known subsampling approaches have run time that depends exponentially on $$p$$, typically, $$d^{\mathcal{O}(p)}$$, which can be prohibitively expensive. We improve on this work by showing that for a large class of common \emph{structured matrices}, such as combinations of low-rank matrices, sparse matrices, and Vandermonde matrices, there are subsampling based methods for $$\ell_p$$ regression that depend polynomially on $$p$$. For example, we give an algorithm for $$\ell_p$$ regression on Vandermonde matrices that runs in time $$\mathcal{O}(n\log^3 n+(dp^2)^{0.5+\omega}\cdot\text{polylog}\,n)$$, where $$\omega$$ is the exponent of matrix multiplication. The polynomial dependence on $$p$$ crucially allows our algorithms to extend naturally to efficient algorithms for $$\ell_\infty$$ regression, via approximation of $$\ell_\infty$$ by $$\ell_{\mathcal{O}(\log n)}$$. Of practical interest, we also develop a new subsampling algorithm for $$\ell_p$$ regression for arbitrary matrices, which is simpler than previous approaches for $$p \ge 4$$. 
    more » « less
  2. We study the problem of private vector mean estimation in the shuffle model of privacy where n users each have a unit vector v^{(i)} in R^d. We propose a new multi-message protocol that achieves the optimal error using O~(min(n*epsilon^2, d)) messages per user. Moreover, we show that any (unbiased) protocol that achieves optimal error requires each user to send Omega(min(n*epsilon^2,d)/log(n)) messages, demonstrating the optimality of our message complexity up to logarithmic factors. Additionally, we study the single-message setting and design a protocol that achieves mean squared error O(dn^{d/(d+2)} * epsilon^{-4/(d+2)}). Moreover, we show that any single-message protocol must incur mean squared error Omega(dn^{d/(d+2)}), showing that our protocol is optimal in the standard setting where epsilon = Theta(1). Finally, we study robustness to malicious users and show that malicious users can incur large additive error with a single shuffler. 
    more » « less
  3. We consider the problem of performing linear regression over a stream of d-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples (a_1,b_1), (a_2,b_2)..., with a_i drawn independently from a d-dimensional isotropic Gaussian, and where b_i = + \eta_i, for a fixed x in R^d with ||x||= 1 and with independent noise \eta_i drawn uniformly from the interval [-2^{-d/5},2^{-d/5}]. We show that any algorithm with at most d^2/4 bits of memory requires at least \Omega(d \log \log \frac{1}{\epsilon}) samples to approximate x to \ell_2 error \epsilon with probability of success at least 2/3, for \epsilon sufficiently small as a function of d. In contrast, for such \epsilon, x can be recovered to error \epsilon with probability 1-o(1) with memory O\left(d^2 \log(1/\epsilon)\right) using d examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization. 
    more » « less
  4. We propose a fast stochastic Hamilton Monte Carlo (HMC) method, for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by the recent advance in stochastic optimization. We show that, to achieve $$\epsilon$$ accuracy in 2-Wasserstein distance, our algorithm achieves $$\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}%\wedge\frac{\kappa^2L^{-2}d\sigma^2}{\epsilon^2} \big)$$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm. 
    more » « less
  5. We refine and generalize what is known about coresets for classification problems via the sensitivity sampling framework. Such coresets seek the smallest possible subsets of input data, so one can optimize a loss function on the coreset and ensure approximation guarantees with respect to the original data. Our analysis provides the first no dimensional coresets, so the size does not depend on the dimension. Moreover, our results are general, apply for distributional input and can use iid samples, so provide sample complexity bounds, and work for a variety of loss functions. A key tool we develop is a Radamacher complexity version of the main sensitivity sampling approach, which can be of independent interest. 
    more » « less