Simple stochastic momentum methods are widely used in machine learning optimization, but their good practical performance is at odds with an absence of theoretical guarantees of acceleration in the literature. In this work, we aim to close the gap between theory and practice by showing that stochastic heavy ball momentum retains the fast linear rate of (deterministic) heavy ball momentum on quadratic optimization problems, at least when minibatching with a sufficiently large batch size. The algorithm we study can be interpreted as an accelerated randomized Kaczmarz algorithm with minibatching and heavy ball momentum. The analysis relies on carefully decomposing the momentum transition matrix, and using new spectral norm concentration bounds for products of independent random matrices. We provide numerical illustrations demonstrating that our bounds are reasonably sharp.
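As a concrete point of reference, here is a minimal NumPy sketch of the kind of method studied: minibatch randomized Kaczmarz with heavy ball momentum on a consistent linear system. The step size, momentum parameter, and batch size are illustrative choices, not the tuned values from the analysis.

```python
# Minimal sketch (not the paper's exact method): minibatch randomized
# Kaczmarz with heavy ball momentum. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star                        # consistent system

alpha, beta, batch = 1.0, 0.3, 64     # illustrative step size, momentum, batch
x_prev = x = np.zeros(d)
for k in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    Ab, bb = A[idx], b[idx]
    # Kaczmarz-style update: average of row projections over the minibatch
    grad = Ab.T @ ((Ab @ x - bb) / np.sum(Ab**2, axis=1)) / batch
    # Heavy ball step: gradient step plus momentum term
    x, x_prev = x - alpha * grad + beta * (x - x_prev), x

print("error:", np.linalg.norm(x - x_star))
```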
-
We study the GMRES algorithm applied to linear systems of equations involving a scaled and shifted matrix whose entries are independent complex Gaussians. When the right-hand side of this linear system is independent of this random matrix, the behavior of the GMRES residual error can be determined exactly. To handle cases where the right-hand side depends on the random matrix, we study the pseudospectra and numerical range of Ginibre matrices and prove a restricted version of Crouzeix’s conjecture.
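As a reproducibility aid (not from the paper), the sketch below runs full GMRES on a shifted, scaled complex Ginibre matrix with an independent right-hand side and records the residual history; the shift $z$, the matrix size, and the $1/\sqrt{n}$ normalization are illustrative assumptions.

```python
# Sketch of the basic experiment, assuming an illustrative shift z and the
# common 1/sqrt(n) scaling of a complex Ginibre matrix. Requires SciPy >= 1.12
# (older versions spell the `rtol` keyword as `tol`).
import numpy as np
from scipy.sparse.linalg import gmres

rng = np.random.default_rng(0)
n, z = 300, 2.0
# Scaled complex Ginibre matrix: i.i.d. complex Gaussian entries, variance 1/n
G = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
M = z * np.eye(n) - G
# Right-hand side drawn independently of M
b = rng.standard_normal(n) + 1j * rng.standard_normal(n)
b /= np.linalg.norm(b)

residuals = []
x, info = gmres(M, b, rtol=1e-12, restart=n, maxiter=1,
                callback=residuals.append, callback_type="pr_norm")
print(residuals[:5])   # relative residual after each GMRES iteration
```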
-
The kernel polynomial method (KPM) is a powerful numerical method for approximating spectral densities. Typical implementations of the KPM require an a priori estimate for an interval containing the support of the target spectral density, and while such estimates can be obtained by classical techniques, this incurs additional computational cost. We propose a spectrum-adaptive KPM based on the Lanczos algorithm without reorthogonalization, which allows the selection of KPM parameters to be deferred until after the expensive computation is finished. Theoretical results from numerical analysis are given to justify the suitability of the Lanczos algorithm for our approach, even in finite precision arithmetic. While conceptually simple, the paradigm of decoupling computation from approximation has a number of practical and pedagogical benefits, which we highlight with numerical examples.
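The decoupling paradigm can be illustrated with a short NumPy sketch (not the paper's implementation): run Lanczos without reorthogonalization once, and only afterwards pick the containing interval and compute Chebyshev moments, using the small tridiagonal matrix alone.

```python
# Sketch of deferring KPM parameter choices until after Lanczos has run.
# The test matrix, iteration counts, and moment computation are illustrative.
import numpy as np

def lanczos(A, v, m):
    """Plain Lanczos without reorthogonalization; returns diagonals of T."""
    n = len(v); alpha = np.zeros(m); beta = np.zeros(m - 1)
    q_prev, q = np.zeros(n), v / np.linalg.norm(v)
    for j in range(m):
        w = A @ q - (beta[j - 1] * q_prev if j > 0 else 0)
        alpha[j] = q @ w
        w -= alpha[j] * q
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            q_prev, q = q, w / beta[j]
    return alpha, beta

rng = np.random.default_rng(0)
n, m = 1000, 80
A = np.diag(rng.uniform(-1, 3, n))            # stand-in symmetric matrix
alpha, beta = lanczos(A, rng.standard_normal(n), m)

# Expensive part is done; now choose KPM parameters from T itself.
T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
theta = np.linalg.eigvalsh(T)
lo, hi = theta.min(), theta.max()             # adaptive interval estimate
c, h = (hi + lo) / 2, (hi - lo) / 2 * 1.01    # map spectrum into [-1, 1]
Ts = (T - c * np.eye(m)) / h
# Chebyshev moments mu_k = e1^T T_k(Ts) e1 via the three-term recurrence
e1 = np.eye(m)[0]
p_prev, p_curr = e1, Ts @ e1
mu = [e1 @ p_prev, e1 @ p_curr]
for _ in range(50):
    p_prev, p_curr = p_curr, 2 * Ts @ p_curr - p_prev
    mu.append(e1 @ p_curr)
print(np.round(mu[:5], 3))
```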
-
We describe a Lanczos-based algorithm for approximating the product of a rational matrix function with a vector. This algorithm, which we call the Lanczos method for optimal rational matrix function approximation (Lanczos-OR), returns the optimal approximation from a given Krylov subspace in a norm depending on the rational function’s denominator, and it can be computed using the information from a slightly larger Krylov subspace. We also provide a low-memory implementation which only requires storing a number of vectors proportional to the denominator degree of the rational function. Finally, we show that Lanczos-OR can be used to derive algorithms for computing other matrix functions, including the matrix sign function and quadrature-based rational function approximations. In many cases, it improves on the approximation quality of prior approaches, including the standard Lanczos method, with little additional computational overhead.
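For context, here is a sketch of the standard Lanczos baseline mentioned above (Lanczos-FA, not Lanczos-OR itself), which approximates $r(\mathbf{A})\mathbf{b}$ by $\|\mathbf{b}\|\,\mathbf{Q}\,r(\mathbf{T})\mathbf{e}_1$; the rational function and diagonal test matrix are arbitrary stand-ins.

```python
# Standard Lanczos approximation of r(A)b (the baseline, not Lanczos-OR).
# The rational function r and the test matrix are illustrative choices.
import numpy as np

def lanczos(A, b, m):
    n = len(b); Q = np.zeros((n, m)); alpha = np.zeros(m); beta = np.zeros(m - 1)
    Q[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        w = A @ Q[:, j] - (beta[j - 1] * Q[:, j - 1] if j > 0 else 0)
        alpha[j] = Q[:, j] @ w
        w -= alpha[j] * Q[:, j]
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # full reorthogonalization
        if j < m - 1:
            beta[j] = np.linalg.norm(w); Q[:, j + 1] = w / beta[j]
    return Q, alpha, beta

rng = np.random.default_rng(0)
n, m = 400, 40
A = np.diag(rng.uniform(0.1, 10, n))          # stand-in SPD matrix
b = rng.standard_normal(n)

r = lambda x: x / (x**2 + 1)                  # illustrative rational function
Q, alpha, beta = lanczos(A, b, m)
T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
evals, V = np.linalg.eigh(T)
approx = np.linalg.norm(b) * Q @ (V @ (r(evals) * V[0]))   # ||b|| Q r(T) e1
exact = r(np.diag(A)) * b                     # closed form since A is diagonal
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```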
-
We study the effects of rounding on the moments of random variables. Specifically, given a random variable $X$ and its rounded counterpart $\widetilde{X}$, we study $\mathbb{E}[\widetilde{X}^k]$ for non-negative integer $k$. We consider the case that the rounding function corresponds either to (i) rounding to the nearest point in some discrete set or (ii) rounding randomly to either the nearest larger or smaller point in this same set, with probabilities proportional to the distances to these points. In both cases, we show, under reasonable assumptions on the density function of $X$, how to compute a constant $C$ such that $|\mathbb{E}[X^k] - \mathbb{E}[\widetilde{X}^k]| \leq C\,\varepsilon$, provided the spacing $\varepsilon$ of the rounding points satisfies $\varepsilon \leq f(k)$, where $f$ is some fixed positive piecewise linear function. Refined bounds for the absolute moments are also given.
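Both rounding modes are easy to probe empirically. The Monte Carlo sketch below (an illustration, not the paper's computation) compares the $k$-th moment of a Gaussian $X$ against its rounded counterparts on a grid of spacing $\varepsilon$, for rounding to nearest and for stochastic rounding.

```python
# Monte Carlo comparison of moments under the two rounding modes described
# above. The distribution, grid spacing eps, and moment k are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(1.0, 0.5, 10**6)
eps, k = 0.1, 3

# (i) round to the nearest grid point
X_det = np.round(X / eps) * eps
# (ii) stochastic rounding: round up with probability proportional to the
#      distance from the lower grid point (so the result is unbiased for k=1)
lo = np.floor(X / eps) * eps
p_up = (X - lo) / eps
X_sto = lo + eps * (rng.random(X.size) < p_up)

for name, Xr in [("nearest", X_det), ("stochastic", X_sto)]:
    print(name, abs(np.mean(Xr**k) - np.mean(X**k)))
```
-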
The cumulative empirical spectral measure (CESM) $\Phi[\mathbf{A}] : \mathbb{R} \to [0,1]$ of an $n\times n$ symmetric matrix $\mathbf{A}$ is defined as the fraction of eigenvalues of $\mathbf{A}$ less than a given threshold, i.e., $\Phi[\mathbf{A}](x) := \sum_{i=1}^{n} \frac{1}{n} {\large\unicode{x1D7D9}}[ \lambda_i[\mathbf{A}]\leq x]$. Spectral sums $\operatorname{tr}(f[\mathbf{A}])$ can be computed as the Riemann–Stieltjes integral of $f$ against $\Phi[\mathbf{A}]$, so the task of estimating the CESM arises frequently in a number of applications, including machine learning. We present an error analysis for stochastic Lanczos quadrature (SLQ). We show that SLQ obtains an approximation to the CESM within a Wasserstein distance of $t \: | \lambda_{\text{max}}[\mathbf{A}] - \lambda_{\text{min}}[\mathbf{A}] |$ with probability at least $1-\eta$, by applying the Lanczos algorithm for $\lceil 12 t^{-1} + \frac{1}{2} \rceil$ iterations to $\lceil 4 ( n+2 )^{-1}t^{-2} \ln(2n\eta^{-1}) \rceil$ vectors sampled independently and uniformly from the unit sphere. We additionally provide (matrix-dependent) a posteriori error bounds for the Wasserstein and Kolmogorov–Smirnov distances between the output of this algorithm and the true CESM. The quality of our bounds is demonstrated using numerical experiments.
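A minimal sketch of the SLQ estimator analyzed above, under illustrative choices of test matrix, iteration count, and number of sample vectors: average, over independent random vectors, the Gaussian quadrature rules produced by the Lanczos tridiagonal matrices.

```python
# Sketch of stochastic Lanczos quadrature for the CESM. The test matrix and
# the parameters m (Lanczos iterations) and n_v (sample vectors) are
# illustrative, not the values prescribed by the bounds above.
import numpy as np

def lanczos_quadrature(A, v, m):
    """Gaussian quadrature nodes/weights for the measure induced by A and v."""
    n = len(v); alpha = np.zeros(m); beta = np.zeros(m - 1)
    q_prev, q = np.zeros(n), v / np.linalg.norm(v)
    for j in range(m):
        w = A @ q - (beta[j - 1] * q_prev if j > 0 else 0)
        alpha[j] = q @ w; w -= alpha[j] * q
        if j < m - 1:
            beta[j] = np.linalg.norm(w); q_prev, q = q, w / beta[j]
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    theta, V = np.linalg.eigh(T)
    return theta, V[0]**2             # nodes and weights

rng = np.random.default_rng(0)
n, m, n_v = 500, 30, 10
A = np.diag(rng.uniform(-1, 1, n))    # stand-in symmetric matrix

def cesm_estimate(x):
    est = 0.0
    for _ in range(n_v):
        # normalized Gaussian vectors are uniform on the unit sphere
        v = rng.standard_normal(n)
        theta, w = lanczos_quadrature(A, v, m)
        est += np.sum(w[theta <= x]) / n_v
    return est

print(cesm_estimate(0.0))             # should be near 0.5 for this spectrum
```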