We propose a fast stochastic Hamilton Monte Carlo (HMC) method, for sampling from a smooth and strongly logconcave distribution. At the core of our proposed method is a variance reduction technique inspired by the recent advance in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2Wasserstein distance, our algorithm achieves $\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}%\wedge\frac{\kappa^2L^{2}d\sigma^2}{\epsilon^2}
\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the stateoftheart HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general logconcave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
more »
« less
Accelerating SGD for Highly IllConditioned HugeScale Online Matrix Completion
The matrix completion problem seeks to recover a $d\times d$ ground
truth matrix of low rank $r\ll d$ from observations of its individual
elements. Realworld matrix completion is often a hugescale optimization
problem, with $d$ so large that even the simplest fulldimension
vector operations with $O(d)$ time complexity become prohibitively
expensive. Stochastic gradient descent (SGD) is one of the few algorithms
capable of solving matrix completion on a huge scale, and can also
naturally handle streaming data over an evolving ground truth. Unfortunately,
SGD experiences a dramatic slowdown when the underlying ground truth
is illconditioned; it requires at least $O(\kappa\log(1/\epsilon))$
iterations to get $\epsilon$close to ground truth matrix with condition
number $\kappa$. In this paper, we propose a preconditioned version
of SGD that preserves all the favorable practical qualities of SGD
for hugescale online optimization while also making it agnostic to
$\kappa$. For a symmetric ground truth and the Root Mean Square Error
(RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$accuracy
in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence
rate as if the ground truth were perfectly conditioned with $\kappa=1$.
In our numerical experiments, we observe a similar acceleration for
illconditioned matrix completion under the 1bit crossentropy loss,
as well as pairwise losses such as the Bayesian Personalized Ranking (BPR) loss.
more »
« less
 Award ID(s):
 2047462
 NSFPAR ID:
 10394475
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We develop a general framework for finding approximatelyoptimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following. \begin{itemize} \item \textbf{Diagonal preconditioning.} We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix. \item \textbf{Structured linear systems.} We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time. \end{itemize} Our diagonal preconditioning results improve stateoftheart runtimes of $\Omega(d^{3.5})$ attained by generalpurpose semidefinite programming, and our solvers improve stateoftheart runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the current matrix multiplication constant. We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call \emph{matrixdictionary approximation SDPs}, which we leverage to solve an associated problem we call \emph{matrixdictionary recovery}.more » « less

null (Ed.)Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph structured linear systems, in the general setting, the bit complexity of solving an $n \times n$ linear system $Ax=b$ is $\tilde{O}(n^\omega)$, where $\omega < 2.372864$ is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly$(n)$ condition number. In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any $\omega > 2$. This speedup holds for any input matrix $A$ with $o(n^{\omega 1}/\log(\kappa(A)))$ nonzeros, where $\kappa(A)$ is the condition number of $A$. For poly$(n)$conditioned matrices with $\tilde{O}(n)$ nonzeros, and the current value of $\omega$, the bit complexity of our algorithm to solve to within any $1/\text{poly}(n)$ error is $O(n^{2.331645})$. Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC `06 `07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anticoncentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semirandom matrices.more » « less

The iteratively reweighted least squares method (IRLS) is a popular technique used in practice for solving regression problems. Various versions of this method have been proposed, but their theoretical analyses failed to capture the good practical performance. In this paper we propose a simple and natural version of IRLS for solving $\ell_\infty$ and $\ell_1$ regression, which provably converges to a $(1+\epsilon)$approximate solution in $O(m^{1/3}\log(1/\epsilon)/\epsilon^{2/3} + \log m/\epsilon^2)$ iterations, where $m$ is the number of rows of the input matrix. Interestingly, this running time is independent of the conditioning of the input, and the dominant term of the running time depends sublinearly in $\epsilon^{1}$, which is atypical for the optimization of nonsmooth functions. This improves upon the more complex algorithms of Chin et al. (ITCS '12), and Christiano et al. (STOC '11) by a factor of at least $1/\epsilon^2$, and yields a truly efficient natural algorithm for the slime mold dynamics (StraszakVishnoi, SODA '16, ITCS '16, ITCS '17).more » « less

An \ell _p oblivious subspace embedding is a distribution over r \times n matrices \Pi such that for any fixed n \times d matrix A , \[ \Pr _{\Pi }[\textrm {for all }x, \ \Vert Ax\Vert _p \le \Vert \Pi Ax\Vert _p \le \kappa \Vert Ax\Vert _p] \ge 9/10,\] where r is the dimension of the embedding, \kappa is the distortion of the embedding, and for an n dimensional vector y , \Vert y\Vert _p = (\sum _{i=1}^n y_i^p)^{1/p} is the \ell _p norm. Another important property is the sparsity of \Pi , that is, the maximum number of nonzero entries per column, as this determines the running time of computing \Pi A . While for p = 2 there are nearly optimal tradeoffs in terms of the dimension, distortion, and sparsity, for the important case of 1 \le p \lt 2 , much less was known. In this article, we obtain nearly optimal tradeoffs for \ell _1 oblivious subspace embeddings, as well as new tradeoffs for 1 \lt p \lt 2 . Our main results are as follows: (1) We show for every 1 \le p \lt 2 , any oblivious subspace embedding with dimension r has distortion \[ \kappa = \Omega \left(\frac{1}{\left(\frac{1}{d}\right)^{1 / p} \log ^{2 / p}r + \left(\frac{r}{n}\right)^{1 / p  1 / 2}}\right).\] When r = {\operatorname{poly}}(d) \ll n in applications, this gives a \kappa = \Omega (d^{1/p}\log ^{2/p} d) lower bound, and shows the oblivious subspace embedding of Sohler and Woodruff (STOC, 2011) for p = 1 is optimal up to {\operatorname{poly}}(\log (d)) factors. (2) We give sparse oblivious subspace embeddings for every 1 \le p \lt 2 . Importantly, for p = 1 , we achieve r = O(d \log d) , \kappa = O(d \log d) and s = O(\log d) nonzero entries per column. The best previous construction with s \le {\operatorname{poly}}(\log d) is due to Woodruff and Zhang (COLT, 2013), giving \kappa = \Omega (d^2 {\operatorname{poly}}(\log d)) or \kappa = \Omega (d^{3/2} \sqrt {\log n} \cdot {\operatorname{poly}}(\log d)) and r \ge d \cdot {\operatorname{poly}}(\log d) ; in contrast our r = O(d \log d) and \kappa = O(d \log d) are optimal up to {\operatorname{poly}}(\log (d)) factors even for dense matrices. We also give (1) \ell _p oblivious subspace embeddings with an expected 1+\varepsilon number of nonzero entries per column for arbitrarily small \varepsilon \gt 0 , and (2) the first oblivious subspace embeddings for 1 \le p \lt 2 with O(1) distortion and dimension independent of n . Oblivious subspace embeddings are crucial for distributed and streaming environments, as well as entrywise \ell _p lowrank approximation. Our results give improved algorithms for these applications.more » « less