We propose a fast stochastic Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by recent advances in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm achieves $\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
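As an illustration of the variance-reduction idea in this abstract, the sketch below runs an HMC-style sampler whose stochastic gradient uses an SVRG-style control variate: once per epoch a full gradient is computed at a snapshot point, and inner steps correct it with a small minibatch. This is a toy, not the paper's exact algorithm; the quadratic finite-sum potential, step size, and batch size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: pi(theta) ∝ exp(-U(theta)) with a finite-sum potential
# U(theta) = sum_i u_i(theta), where u_i(theta) = 0.5 * a_i * ||theta - x_i||^2.
n, d = 500, 2
X = rng.normal(size=(n, d))
a = rng.uniform(0.5, 1.5, size=n)

def grad_component(theta, i):
    """Gradient of the single component potential u_i."""
    return a[i] * (theta - X[i])

def grad_full(theta):
    """Full gradient of U: an O(n) sweep, done only once per epoch."""
    return (a[:, None] * (theta - X)).sum(axis=0)

def svr_hmc(epochs=40, inner=25, eta=0.05, batch=10):
    theta = np.zeros(d)
    samples = []
    for _ in range(epochs):
        snap, g_snap = theta.copy(), grad_full(theta)  # snapshot + full gradient
        for _ in range(inner):
            p = rng.normal(size=d)                     # resample momentum
            idx = rng.integers(n, size=batch)
            # SVRG-style control variate: unbiased, and its variance shrinks
            # as theta stays close to the snapshot point.
            g = g_snap + (n / batch) * sum(
                grad_component(theta, i) - grad_component(snap, i) for i in idx)
            p = p - 0.5 * eta * g                      # half-step momentum kick
            theta = theta + eta * p                    # full position step
            samples.append(theta.copy())
    return np.array(samples)

samples = svr_hmc()
burned = samples[len(samples) // 2:]
target_mean = (a[:, None] * X).sum(axis=0) / a.sum()
print("sample mean:", burned.mean(axis=0), "target mean:", target_mean)
```

Because this unadjusted chain mixes quickly on a strongly log-concave target, the post-burn-in sample mean lands close to the target mean even though each step touches only a small minibatch of components.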
Quasi-Monte Carlo Quasi-Newton in Variational Bayes
Many machine learning problems optimize an objective that must be measured with noise. The primary method is first-order stochastic gradient descent using one or more Monte Carlo (MC) samples at each step. There are settings where ill-conditioning makes second-order methods such as limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) more effective. We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems. When MC sampling has a root mean squared error (RMSE) of O(n^{-1/2}), then RQMC has an RMSE of o(n^{-1/2}) that can be close to O(n^{-3/2}) in favorable settings. We prove that improved sampling accuracy translates directly to improved optimization. In our empirical investigations for variational Bayes, using RQMC with a stochastic quasi-Newton method greatly speeds up the optimization, and sometimes finds a better parameter value than MC does.
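The RMSE improvement described above is easy to observe directly. The sketch below compares plain MC with RQMC (scrambled Sobol' points from `scipy.stats.qmc`, mapped to Gaussians via the inverse CDF) on a smooth integrand with a known expectation; the integrand, dimension, and sample size are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm, qmc

# Smooth integrand with a known Gaussian expectation:
# E[exp(0.1 * sum(Z))] = exp(0.005 * d) for Z ~ N(0, I_d).
d, n, reps = 4, 1024, 30
truth = np.exp(0.005 * d)
f = lambda z: np.exp(0.1 * z.sum(axis=1))

def mc_estimate(seed):
    u = np.random.default_rng(seed).random((n, d))
    return f(norm.ppf(u)).mean()

def rqmc_estimate(seed):
    # Scrambled Sobol' points: each point is marginally U(0,1)^d, so the
    # estimator stays unbiased, but the point set is far more even than IID.
    u = qmc.Sobol(d, scramble=True, seed=seed).random(n)
    return f(norm.ppf(u)).mean()

rmse = lambda est: np.sqrt(np.mean([(est(s) - truth) ** 2 for s in range(reps)]))
rmse_mc, rmse_rqmc = rmse(mc_estimate), rmse(rqmc_estimate)
print(f"MC RMSE:   {rmse_mc:.2e}")
print(f"RQMC RMSE: {rmse_rqmc:.2e}")
```

In a stochastic optimizer, each of these estimates would be a noisy objective or gradient evaluation, so the RMSE gap translates directly into less noisy search directions.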
 Award ID(s):
 1837931
 NSFPAR ID:
 10346356
 Date Published:
 Journal Name:
 Journal of machine learning research
 Volume:
 22
 ISSN:
 1532-4435
 Page Range / eLocation ID:
 1-23
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Alex Keller (Ed.)
Quasi-Monte Carlo (QMC) points are a substitute for plain Monte Carlo (MC) points that greatly improve integration accuracy under mild assumptions on the problem. Because QMC can give errors that are o(1/n) as n → ∞, and randomized versions can attain root mean squared errors that are o(1/n), changing even one point can change the estimate by an amount much larger than the error would have been and worsen the convergence rate. As a result, certain practices that fit quite naturally and intuitively with MC points can be very detrimental to QMC performance. These include thinning, burn-in, and taking sample sizes such as powers of 10, when the QMC points were designed for different sample sizes. This article looks at the effects of a common practice in which one skips the first point of a Sobol' sequence. The retained points ordinarily fail to be a digital net, and when scrambling is applied, skipping over the first point can increase the numerical error by a factor proportional to √n, where n is the number of function evaluations used.
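The effect described above can be reproduced in a few lines with `scipy.stats.qmc`: the unscrambled sequence starts at the zero vector (the usual motivation for skipping it), while with scrambling the first point is perfectly usable and dropping it measurably hurts accuracy. The integrand and dimensions below are illustrative assumptions, not the article's experiments.

```python
import numpy as np
from scipy.stats import qmc

d, m = 2, 8
n = 2 ** m          # Sobol' points are designed for power-of-2 sample sizes

# The first point of the unscrambled Sobol' sequence is exactly the zero
# vector, which is what tempts people to discard it (e.g. norm.ppf(0) = -inf).
first = qmc.Sobol(d, scramble=False).random(1)[0]
print("first Sobol' point:", first)

# With scrambling, compare the RMSE of using points 0..n-1 (a digital net)
# against points 1..n (the net property is lost).
truth = 0.25        # E[u1 * u2] for u uniform on [0,1]^2

def estimate(skip, seed):
    pts = qmc.Sobol(d, scramble=True, seed=seed).random(2 * n)
    pts = pts[1:n + 1] if skip else pts[:n]
    return np.prod(pts, axis=1).mean()

reps = 100
rmse = lambda skip: np.sqrt(np.mean([(estimate(skip, s) - truth) ** 2
                                     for s in range(reps)]))
rmse_keep, rmse_skip = rmse(False), rmse(True)
print(f"RMSE keeping the first point:  {rmse_keep:.2e}")
print(f"RMSE skipping the first point: {rmse_skip:.2e}")
```

Note that both runs use the same power-of-2 budget; the only difference is which n consecutive points are retained.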

We propose Pathfinder, a variational method for approximately sampling from differentiable probability densities. Starting from a random initialization, Pathfinder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer. Pathfinder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the target distribution. We evaluate Pathfinder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance. Compared to ADVI and short dynamic HMC runs, Pathfinder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors. Importance resampling over multiple runs of Pathfinder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes. The Monte Carlo KL divergence estimates are embarrassingly parallelizable in the core Pathfinder algorithm, as are multiple runs in the resampling version, further increasing Pathfinder's speed advantage with multiple cores.
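A heavily simplified sketch of the idea above: follow an optimization path toward the mode, build a normal approximation at each iterate, score each candidate by a Monte Carlo estimate of KL divergence to the target, and keep the best. To stay short, this toy uses plain gradient ascent and the exact Hessian of a 2-d Gaussian target, where the real Pathfinder uses L-BFGS and its inverse-Hessian estimates; all numerical settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: an unnormalized 2-d Gaussian log density with known mode.
mu = np.array([2.0, -1.0])
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
P = np.linalg.inv(Sigma)                      # precision matrix

def logp(x):                                  # log density up to a constant
    r = x - mu
    return -0.5 * r @ P @ r

def grad_logp(x):
    return -P @ (x - mu)

# 1) Follow an optimization path toward the mode, storing every iterate.
x = np.array([8.0, 8.0])
path = [x.copy()]
for _ in range(60):
    x = x + 0.2 * grad_logp(x)                # plain gradient ascent on log p
    path.append(x.copy())

# 2) At each iterate, form a normal approximation q_k = N(x_k, cov), where
#    cov is the negative inverse Hessian of log p (here exactly Sigma), and
#    estimate KL(q_k || p) up to a constant by Monte Carlo.
cov = Sigma
L = np.linalg.cholesky(cov)

def kl_hat(mean, n=400):
    z = rng.normal(size=(n, 2))
    xs = mean + z @ L.T                       # draws from q_k
    log_q = -0.5 * (z ** 2).sum(axis=1) - np.log(np.diag(L)).sum()
    log_p = np.array([logp(v) for v in xs])
    return (log_q - log_p).mean()             # KL up to an additive constant

# 3) Return the candidate with the lowest estimated KL divergence.
kls = [kl_hat(xk) for xk in path]
best = path[int(np.argmin(kls))]
print("selected mean:", best, "true mode:", mu)
```

On this log-concave toy the selected candidate sits near the mode; the interesting behavior in the paper comes from non-Gaussian posteriors, where an intermediate point on the path can beat the optimum.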

Langevin Monte Carlo (LMC) and its stochastic gradient versions are powerful algorithms for sampling from complex high-dimensional distributions. To sample from a distribution with density π(θ) ∝ exp(−U(θ)), LMC iteratively generates the next sample by taking a step in the gradient direction ∇U with added Gaussian perturbations. Expectations w.r.t. the target distribution π are estimated by averaging over LMC samples. In ordinary Monte Carlo, it is well known that the estimation error can be substantially reduced by replacing independent random samples by quasi-random samples like low-discrepancy sequences. In this work, we show that the estimation error of LMC can also be reduced by using quasi-random samples. Specifically, we propose to use completely uniformly distributed (CUD) sequences with a certain low-discrepancy property to generate the Gaussian perturbations. Under smoothness and convexity conditions, we prove that LMC with a low-discrepancy CUD sequence achieves smaller error than standard LMC. The theoretical analysis is supported by compelling numerical experiments, which demonstrate the effectiveness of our approach.
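A minimal LMC sampler makes the construction above concrete. The driving uniforms are isolated in a `driver` function, which here returns IID uniforms; the paper's proposal is to substitute a CUD low-discrepancy sequence at exactly that point, with the Gaussian perturbations obtained through the inverse CDF. The 1-d standard normal target, step size, and chain length are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def driver(n_steps, seed=0):
    # IID uniforms; the paper's method would substitute a completely
    # uniformly distributed (CUD) low-discrepancy sequence here.
    return np.random.default_rng(seed).random(n_steps)

def lmc(n_steps=20000, h=0.1, seed=0):
    """Unadjusted LMC for pi(theta) ∝ exp(-U(theta)) with U(theta) = theta^2/2."""
    u = driver(n_steps, seed)
    xi = norm.ppf(u)                 # Gaussian perturbations via inverse CDF
    theta, out = 0.0, np.empty(n_steps)
    for t in range(n_steps):
        grad_U = theta               # ∇U(θ) = θ for the standard normal target
        theta = theta - h * grad_U + np.sqrt(2 * h) * xi[t]
        out[t] = theta
    return out

samples = lmc()
print("mean:", samples.mean(), "var:", samples.var())
```

For this target the chain's long-run mean is near 0 and its variance near 1 (the discretization inflates it slightly, to 2h/(2h − h²)); averaging over the chain is exactly the kind of expectation whose error the CUD construction reduces.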

Belkin, M.; Kpotufe, S. (Eds.)
Langevin algorithms are gradient descent methods with additive noise. They have been used for decades in Markov Chain Monte Carlo (MCMC) sampling, optimization, and learning. Their convergence properties for unconstrained nonconvex optimization and learning problems have been studied widely in the last few years. Other work has examined projected Langevin algorithms for sampling from log-concave distributions restricted to convex compact sets. For learning and optimization, log-concave distributions correspond to convex losses. In this paper, we analyze the case of nonconvex losses with compact convex constraint sets and IID external data variables. We term the resulting method the projected stochastic gradient Langevin algorithm (PSGLA). We show the algorithm achieves a deviation of O(T^{-1/4} (log T)^{1/2}) from its target distribution in 1-Wasserstein distance. For optimization and learning, we show that the algorithm achieves ε-suboptimal solutions, on average, provided that it is run for a time that is polynomial in ε and slightly super-exponential in the problem dimension.
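The PSGLA update described above is just a Langevin step on a stochastic gradient followed by a Euclidean projection back onto the constraint set. The sketch below uses a convex toy problem so the behavior is easy to check: the loss f(θ) = E[½‖θ − x‖²] with x ~ N((2, 0), I), constrained to the unit ball; the constrained minimizer (1, 0), step size, and inverse temperature β are illustrative assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(2)

def project(theta):
    """Euclidean projection onto the unit ball."""
    return theta / max(1.0, np.linalg.norm(theta))

def psgla(n_steps=2000, h=0.01, beta=50.0):
    theta = np.zeros(2)
    iterates = []
    for _ in range(n_steps):
        x = rng.normal(loc=[2.0, 0.0])            # fresh IID data point
        grad = theta - x                          # stochastic gradient of the loss
        noise = np.sqrt(2 * h / beta) * rng.normal(size=2)
        theta = project(theta - h * grad + noise) # Langevin step, then project
        iterates.append(theta)
    return np.array(iterates)

iters = psgla()
tail = iters[-500:]
print("average of late iterates:", tail.mean(axis=0))
```

Every iterate stays inside the constraint set by construction, and at low temperature (large β) the late iterates concentrate near the constrained minimizer on the boundary; the paper's analysis covers the much harder nonconvex case.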