skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Quasi-Monte Carlo Quasi-Newton in Variational Bayes
Many machine learning problems optimize an objective that must be measured with noise. The primary method is a first order stochastic gradient descent using one or more Monte Carlo (MC) samples at each step. There are settings where ill-conditioning makes second order methods such as limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) more effective. We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems. When MC sampling has a root mean squared error (RMSE) of O(n−1/2) then RQMC has an RMSE of o(n−1/2) that can be close to O(n−3/2) in favorable settings. We prove that improved sampling accuracy translates directly to improved optimization. In our empirical investigations for variational Bayes, using RQMC with stochastic quasi-Newton method greatly speeds up the optimization, and sometimes finds a better parameter value than MC does.  more » « less
Award ID(s):
1837931
PAR ID:
10346356
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Journal of machine learning research
Volume:
22
ISSN:
1532-4435
Page Range / eLocation ID:
1-23
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Alex Keller (Ed.)
    Quasi-Monte Carlo (QMC) points are a substitute for plain Monte Carlo (MC) points that greatly improve integration accuracy under mild assumptions on the problem. Because QMC can give errors that are o(1/n) as n → ∞, and randomized versions can attain root mean squared errors that are o(1/n), changing even one point can change the estimate by an amount much larger than the error would have been and worsen the convergence rate. As a result, certain practices that fit quite naturally and intuitively with MC points can be very detrimental to QMC performance. These include thinning, burn-in, and taking sample sizes such as powers of 10, when the QMC points were designed for different sample sizes. This article looks at the effects of a common practice in which one skips the first point of a Sobol’ sequence. The retained points ordinarily fail to be a digital net and when scrambling is applied, skipping over the first point can increase the numerical error by a factor proportional to √n where n is the number of function evaluations used. 
    more » « less
  2. We propose a fast stochastic Hamilton Monte Carlo (HMC) method, for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by the recent advance in stochastic optimization. We show that, to achieve $$\epsilon$$ accuracy in 2-Wasserstein distance, our algorithm achieves $$\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}%\wedge\frac{\kappa^2L^{-2}d\sigma^2}{\epsilon^2} \big)$$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm. 
    more » « less
  3. We study the problem of finding an 𝜀-first-order stationary point (FOSP) of a smooth function, given access only to gradient information. The best-known gradient query complexity for this task, assuming both the gradient and Hessian of the objective function are Lipschitz continuous, is O(𝜀−7/4). In this work, we propose a method with a gradient complexity of O(𝑑1/4𝜀−13/8), where 𝑑 is the problem dimension, leading to an improved complexity when 𝑑 = O(𝜀−1/2). To achieve this result, we design an optimization algorithm that, underneath, involves solving two online learning problems. Specifically, we first reformulate the task of finding a stationary point for a nonconvex problem as minimizing the regret in an online convex optimization problem, where the loss is determined by the gradient of the objective function. Then, we introduce a novel optimistic quasi-Newton method to solve this online learning problem, with the Hessian approximation update itself framed as an online learning problem in the space of matrices. Beyond improving the complexity bound for achieving an 𝜀-FOSP using a gradient oracle, our result provides the first guarantee suggesting that quasi-Newton methods can potentially outperform gradient descent-type methods in nonconvex settings. 
    more » « less
  4. We propose Pathfinder, a variational method for approximately sampling from differentiable probability densities. Starting from a random initialization, Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer. Pathfinder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the target distribution. We evaluate Pathfinder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance. Compared to ADVI and short dynamic HMC runs, Pathfinder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors. Importance resampling over multiple runs of Pathfinder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes. The Monte Carlo KL divergence estimates are embarrassingly parallelizable in the core Pathfinder algorithm, as are multiple runs in the resampling version, further increasing Pathfinder's speed advantage with multiple cores. 
    more » « less
  5. Radiative processes such as synchrotron radiation and Compton scattering play an important role in astrophysics. Radiative processes are fundamentally stochastic in nature, and the best tools currently used for resolving these processes computationally are Monte Carlo (MC) methods. These methods typically draw a large number of samples from a complex distribution such as the differential cross section for electron–photon scattering, and then use these samples to compute the radiation properties such as angular distribution, spectrum, and polarization. In this work, we propose a machine learning (ML) technique for efficient sampling from arbitrary known probability distributions that can be used to accelerate MC calculation of radiative processes in astrophysical scenarios. In particular, we apply our technique to inverse Compton radiation and find that our ML method can be up to an order of magnitude faster than traditional methods currently in use. 
    more » « less