skip to main content


Title: Stein Variational Gradient Descent With Matrix-Valued Kernels
Stein variational gradient descent (SVGD) is a particle-based inference algorithm that leverages gradient information for efficient approximate inference. In this work, we enhance SVGD by leveraging preconditioning matrices, such as the Hessian and Fisher information matrix, to incorporate geometric information into SVGD updates. We achieve this by presenting a generalization of SVGD that replaces the scalar-valued kernels in vanilla SVGD with more general matrix-valued kernels. This yields a significant extension of SVGD, and more importantly, allows us to flexibly incorporate various preconditioning matrices to accelerate the exploration in the probability landscape. Empirical results show that our method outperforms vanilla SVGD and a variety of baseline approaches over a range of real-world Bayesian inference tasks.  more » « less
Award ID(s):
1846421
NSF-PAR ID:
10172992
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Conference on Neural Information Processing Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sampling-based inference and learning techniques, especially Bayesian inference, provide an essential approach to handling uncertainty in machine learning (ML). As these techniques are increasingly used in daily life, it becomes essential to safeguard the ML systems with various trustworthy-related constraints, such as fairness, safety, interpretability. Mathematically, enforcing these constraints in probabilistic inference can be cast into sampling from intractable distributions subject to general nonlinear constraints, for which practical efficient algorithms are still largely missing. In this work, we propose a family of constrained sampling algorithms which generalize Langevin Dynamics (LD) and Stein Variational Gradient Descent (SVGD) to incorporate a moment constraint specified by a general nonlinear function. By exploiting the gradient flow structure of LD and SVGD, we derive two types of algorithms for handling constraints, including a primal-dual gradient approach and the constraint controlled gradient descent approach. We investigate the continuous-time mean-field limit of these algorithms and show that they have O(1/t) convergence under mild conditions. Moreover, the LD variant converges linearly assuming that a log Sobolev like inequality holds. Various numerical experiments are conducted to demonstrate the efficiency of our algorithms in trustworthy settings. 
    more » « less
  2. Gradient-based approximate inference methods, such as Stein variational gradient descent (SVGD), provide simple and general-purpose inference engines for differentiable continuous distributions. However, existing forms of SVGD cannot be directly applied to discrete distributions. In this work, we fill this gap by proposing a simple yet general framework that transforms discrete distributions to equivalent piecewise continuous distributions, on which the gradient-free SVGD is applied to perform efficient approximate inference. The empirical results show that our method outperforms traditional algorithms such as Gibbs sampling and discontinuous Hamiltonian Monte Carlo on various challenging benchmarks of discrete graphical models. We demonstrate that our method provides a promising tool for learning ensembles of binarized neural network (BNN), outperforming other widely used ensemble methods on learning binarized AlexNet on CIFAR-10 dataset. In addition, such transform can be straightforwardly employed in gradient-free kernelized Stein discrepancy to perform goodness-of-fit (GOF) test on discrete distributions. Our proposed method outperforms existing GOF test methods for intractable discrete distributions. 
    more » « less
  3. We propose unbalanced versions of the quantum mechanical version of optimal mass transport that is based on the Lindblad equation describing open quantum systems. One of them is a natural interpolation framework between matrices and matrix-valued measures via a quantum mechanical formulation of Fisher-Rao information and the matricial Wasserstein distance, and the second is an interpolation between Wasserstein distance and Frobenius norm. We also give analogous results for the matrix-valued density measures, i.e., we add a spatial dependency on the density matrices. This might extend the applications of the framework to interpolating matrix-valued densities/images with unequal masses. 
    more » « less
  4. Finding diverse and representative Pareto solutions from the Pareto front is a key challenge in multi-objective optimization (MOO). In this work, we propose a novel gradient-based algorithm for profiling Pareto front by using Stein variational gradient descent (SVGD). We also provide a counterpart of our method based on Langevin dynamics. Our methods iteratively update a set of points in a parallel fashion to push them towards the Pareto front using multiple gradient descent, while encouraging the diversity between the particles by using the repulsive force mechanism in SVGD, or diffusion noise in Langevin dynamics. Compared with existing gradient-based methods that require predefined preference functions, our method can work efficiently in high dimensional problems, and can obtain more diverse solutions evenly distributed in the Pareto front. Moreover, our methods are theoretically guaranteed to converge to the Pareto front. We demonstrate the effectiveness of our method, especially the SVGD algorithm, through extensive experiments, showing its superiority over existing gradient-based algorithms. 
    more » « less
  5. This paper focuses on the mutual information and minimum mean-squared error (MMSE) as a function a matrix- valued signal-to-noise ratio (SNR) for a linear Gaussian channel with arbitrary input distribution. As shown by Lamarca, the mutual-information is a concave function of a positive semi- definite matrix, which we call the matrix SNR. This implies that the mapping from the matrix SNR to the MMSE matrix is decreasing monotone. Building upon these functional properties, we start to construct a unifying framework that provides a bridge between classical information-theoretic inequalities, such as the entropy power inequality, and interpolation techniques used in statistical physics and random matrix theory. This framework provides new insight into the structure of phase transitions in coding theory and compressed sensing. In particular, it is shown that the parallel combination of linear channels with freely-independent matrices can be characterized succinctly via free convolution. 
    more » « less