skip to main content

This content will become publicly available on July 7, 2022

Title: Algorithm and Hardware Co-Design for FPGA Acceleration of Hamiltonian Monte Carlo Based No-U-Turn Sampler
Monte Carlo (MC) methods are widely used in many research areas such as physical simulation, statistical analysis, and machine learning. Application of MC methods requires drawing fast mixing samples from a given probability distribution. Among existing sampling methods, the Hamiltonian Monte Carlo (HMC) utilizes gradient information during Hamiltonian simulation and can produce fast mixing samples at the highest efficiency. However, without carefully chosen simulation parameters for a specific problem, HMC generally suffers from simulation locality and computation waste. As a result, the No-U-Turn Sampler (NUTS) has been proposed to automatically tune these parameters during simulation and is the current state-of-the-art sampling algorithm. However, application of NUTS requires frequent gradient calculation of a given distribution and high-volume vector processing, especially for large-scale problems, leading to drawing an expensively large number of samples and a desire of hardware acceleration. While some hardware acceleration works have been proposed for traditional Markov Chain Monte Carlo (MCMC) and HMC methods, there is no existing work targeting hardware acceleration of the NUTS algorithm. In this paper, we present the first NUTS accelerator on FPGA while addressing the high complexity of this state-of-the-art algorithm. Our hardware and algorithm co-optimizations include an incremental resampling technique which leads to more » a more memory efficient architecture and pipeline optimization for multi-chain sampling to maximize the throughput. We also explore three levels of parallelism in the NUTS accelerator to further boost performance. Compared with optimized C++ NUTS package: RSTAN, our NUTS accelerator can reach a maximum speedup of 50.6X and an energy improvement of 189.7X. « less
Award ID(s):
Publication Date:
Journal Name:
The 32nd IEEE International Conference on Application-specific Systems, Architectures and Processors
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a fast stochastic Hamilton Monte Carlo (HMC) method, for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by the recent advance in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm achieves $\tilde O\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}%\wedge\frac{\kappa^2L^{-2}d\sigma^2}{\epsilon^2} \big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm for sampling from smooth and general log-concave distributions, and prove the corresponding gradient complexity as well. Experiments onmore »both synthetic and real data demonstrate the superior performance of our algorithm.« less
  2. Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) methods have been widely used to sample from certain probability distributions, incorporating (kernel) density derivatives and/or given datasets. Instead of exploring new samples from kernel spaces, this piece of work proposed a novel SGHMC sampler, namely Spectral Hamiltonian Monte Carlo (SpHMC), that produces the high dimensional sparse representations of given datasets through sparse sensing and SGHMC. Inspired by compressed sensing, we assume all given samples are low-dimensional measurements of certain high-dimensional sparse vectors, while a continuous probability distribution exists in such high-dimensional space. Specifically, given a dictionary for sparse coding, SpHMC first derives amore »novel likelihood evaluator of the probability distribution from the loss function of LASSO, then samples from the high-dimensional distribution using stochastic Langevin dynamics with derivatives of the logarithm likelihood and Metropolis–Hastings sampling. In addition, new samples in low-dimensional measuring spaces can be regenerated using the sampled high-dimensional vectors and the dictionary. Extensive experiments have been conducted to evaluate the proposed algorithm using real-world datasets. The performance comparisons on three real-world applications demonstrate the superior performance of SpHMC beyond baseline methods.« less
  3. Stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) are two popular Markov Chain Monte Carlo (MCMC) algorithms for Bayesian inference that can scale to large datasets, allowing to sample from the posterior distribution of the parameters of a statistical model given the input data and the prior distribution over the model parameters. However, these algorithms do not apply to the decentralized learning setting, when a network of agents are working collaboratively to learn the parameters of a statistical model without sharing their individual data due to privacy reasons or communication constraints. We study two algorithms: Decentralizedmore »SGLD (DE-SGLD) and Decentralized SGHMC (DE-SGHMC) which are adaptations of SGLD and SGHMC methods that allow scaleable Bayesian inference in the decentralized setting for large datasets. We show that when the posterior distribution is strongly log-concave and smooth, the iterates of these algorithms converge linearly to a neighborhood of the target distribution in the 2-Wasserstein distance if their parameters are selected appropriately. We illustrate the efficiency of our algorithms on decentralized Bayesian linear regression and Bayesian logistic regression problems« less
  4. We give a algorithm for exact sampling from the Bingham distribution p(x)∝exp(x⊤Ax) on the sphere Sd−1 with expected runtime of poly(d,λmax(A)−λmin(A)). The algorithm is based on rejection sampling, where the proposal distribution is a polynomial approximation of the pdf, and can be sampled from by explicitly evaluating integrals of polynomials over the sphere. Our algorithm gives exact samples, assuming exact computation of an inverse function of a polynomial. This is in contrast with Markov Chain Monte Carlo algorithms, which are not known to enjoy rapid mixing on this problem, and only give approximate samples. As a direct application, we usemore »this to sample from the posterior distribution of a rank-1 matrix inference problem in polynomial time.« less
  5. Differential privacy has been widely adopted to release continuous- and scalar-valued information on a database without compromising the privacy of individual data records in it. The problem of querying binary- and matrix-valued information on a database in a differentially private manner has rarely been studied. However, binary- and matrix-valued data are ubiquitous in real-world applications, whose privacy concerns may arise under a variety of circumstances. In this paper, we devise an exclusive or (XOR) mechanism that perturbs binary- and matrix-valued query result by conducting an XOR operation on the query result with calibrated noises attributed to a matrix-valued Bernoulli distribution.more »We first rigorously analyze the privacy and utility guarantee of the proposed XOR mechanism. Then, to generate the parameters in the matrix-valued Bernoulli distribution, we develop a heuristic approach to minimize the expected square query error rate under ϵ -differential privacy constraint. Additionally, to address the intractability of calculating the probability density function (PDF) of this distribution and efficiently generate samples from it, we adapt an Exact Hamiltonian Monte Carlo based sampling scheme. Finally, we experimentally demonstrate the efficacy of the XOR mechanism by considering binary data classification and social network analysis, all in a differentially private manner. Experiment results show that the XOR mechanism notably outperforms other state-of-the-art differentially private methods in terms of utility (such as classification accuracy and F 1 score), and even achieves comparable utility to the non-private mechanisms.« less