Title: Training Discrete Deep Generative Models via Gapped Straight-Through Estimator
While deep generative models have succeeded in image processing, natural language processing, and reinforcement learning, training models that involve discrete random variables remains challenging because gradient estimates for such variables have high variance. Most variance reduction approaches rely on Monte Carlo sampling, which requires time-consuming resampling and multiple function evaluations. We propose a Gapped Straight-Through (GST) estimator that reduces the variance without incurring resampling overhead. The estimator is inspired by the essential properties of Straight-Through Gumbel-Softmax: we identify these properties and show via an ablation study that they are essential. Experiments demonstrate that the proposed GST estimator outperforms strong baselines on two discrete deep generative modeling tasks, MNIST-VAE and ListOps.
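For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of Straight-Through Gumbel-Softmax in plain Python. It is not the GST estimator itself, whose construction is not given in this record. The function names, the temperature parameter `tau`, and the example logits are all assumptions chosen for illustration. The forward pass emits a hard one-hot sample, while the backward pass uses the gradient of the relaxed (softmax) sample, so one draw yields a gradient without resampling.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def st_gumbel_softmax_forward(logits, tau, rng):
    """Draw one sample; return (hard one-hot sample, relaxed sample)."""
    noise = []
    for _ in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noise.append(-math.log(-math.log(u)))  # Gumbel(0, 1) noise
    perturbed = [(l + g) / tau for l, g in zip(logits, noise)]
    soft = softmax(perturbed)
    winner = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft


def st_gumbel_softmax_backward(soft, tau, grad_out):
    """Straight-through backward pass.

    The hard sample is treated as if it were the relaxed one, so the gradient
    with respect to the logits is the softmax Jacobian (transposed) applied to
    grad_out, scaled by 1 / tau.
    """
    dot = sum(s * g for s, g in zip(soft, grad_out))
    return [s * (g - dot) / tau for s, g in zip(soft, grad_out)]


# Illustrative usage with made-up logits and a made-up upstream gradient.
rng = random.Random(0)
hard, soft = st_gumbel_softmax_forward([1.0, 0.5, -0.5], tau=0.5, rng=rng)
grad_logits = st_gumbel_softmax_backward(soft, tau=0.5, grad_out=[0.3, -1.2, 2.0])
```

In an automatic-differentiation framework the same effect is usually obtained by returning the relaxed sample plus a gradient-blocked copy of the difference between the hard and relaxed samples. The manual backward pass here only makes that substitution explicit.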
Award ID(s): 1919452
NSF-PAR ID: 10380235
Author(s) / Creator(s):
Editor(s): Chaudhuri, Kamalika and
Date Published:
Journal Name: Proceedings of Machine Learning Research
Volume: 162
ISSN: 2640-3498
Page Range / eLocation ID: 6059-6073
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Injecting discrete logical constraints into neural network learning is one of the main challenges in neuro-symbolic AI. We find that a straight-through estimator, a method introduced to train binary neural networks, can be applied effectively to incorporate logical constraints into neural network learning. More specifically, we design a systematic way to represent discrete logical constraints as a loss function; minimizing this loss using gradient descent via a straight-through estimator updates the neural network's weights in the direction that makes the binarized outputs satisfy the logical constraints. The experimental results show that by leveraging GPUs and batch training, this method scales significantly better than existing neuro-symbolic methods that require heavy symbolic computation for computing gradients. We also demonstrate that our method applies to different types of neural networks, such as MLP, CNN, and GNN, allowing them to learn with fewer or no labeled data by learning directly from known constraints.
  3. Variance estimation is an important aspect of statistical inference, especially for dependent data. Resampling methods are well suited to this problem since they do not require restrictive distributional assumptions. In this paper, we develop a novel resampling method in the jackknife family called the stationary jackknife. It can be used to estimate the variance of a statistic when the observations come from a general stationary sequence. Unlike the moving block jackknife, the stationary jackknife computes each jackknife replication by deleting a variable-length block whose length has a truncated geometric distribution. Under appropriate assumptions, we show that the stationary jackknife variance estimator is consistent for the sample mean and, more generally, for a class of nonlinear statistics. Further, simulations show that the stationary jackknife provides reasonable variance estimation for a wider range of expected block lengths than the moving block jackknife.
  4. We present a new backpropagation-based training algorithm for discrete-time spiking neural networks (SNNs). Inspired by recent deep learning algorithms for binarized neural networks, binary activation with a straight-through gradient estimator is used to model the leaky integrate-fire spiking neuron, overcoming the difficulty of training SNNs with backpropagation. Two SNN training algorithms are proposed: (1) SNN with discontinuous integration, which is suitable for rate-coded input spikes, and (2) SNN with continuous integration, which is more general and can handle input spikes with temporal information. Neuromorphic hardware designed in 28 nm CMOS exploits the spike sparsity and demonstrates high classification accuracy (>98% on MNIST) and low energy (51.4–773 nJ/image).
  5. Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood for the training data, we fit the score function, obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it's unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood, which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper and show a tight connection between the statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated, i.e. the Poincaré, log-Sobolev and isoperimetric constants, quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant (even for simple families of distributions like exponential families with rich enough sufficient statistics), score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.
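Several of the entries above rely on a straight-through estimator for binarized outputs. The toy sketch below shows the mechanism for the logical-constraint setting of the first entry: a hard threshold in the forward pass and an identity gradient in the backward pass. The clause `b1 OR b2`, the loss `(1 - b1) * (1 - b2)`, the starting weights, and the learning rate are all assumptions made for illustration, not the loss construction used in that work.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real weight to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(p):
    """Forward pass of the straight-through estimator: a hard threshold."""
    return 1.0 if p >= 0.5 else 0.0


def clause_loss(b1, b2):
    """Hypothetical loss for the clause (b1 OR b2): 1 if both are 0, else 0."""
    return (1.0 - b1) * (1.0 - b2)


def ste_step(w1, w2, lr):
    """One gradient-descent step on the clause loss through the binarization."""
    p1, p2 = sigmoid(w1), sigmoid(w2)
    b1, b2 = binarize(p1), binarize(p2)
    # Gradient of the loss with respect to the binarized outputs.
    g_b1, g_b2 = -(1.0 - b2), -(1.0 - b1)
    # Straight-through: pretend d(b)/d(p) = 1, then chain through the sigmoid.
    g_w1 = g_b1 * p1 * (1.0 - p1)
    g_w2 = g_b2 * p2 * (1.0 - p2)
    return w1 - lr * g_w1, w2 - lr * g_w2


def current_loss(w1, w2):
    """Clause loss evaluated on the binarized outputs for the given weights."""
    return clause_loss(binarize(sigmoid(w1)), binarize(sigmoid(w2)))


# Both outputs start below the threshold, so the clause is violated.
w1, w2 = -1.0, -2.0
steps = 0
while current_loss(w1, w2) > 0 and steps < 1000:
    w1, w2 = ste_step(w1, w2, lr=1.0)
    steps += 1
```

Because the threshold has zero gradient almost everywhere, ordinary backpropagation would leave the weights unchanged here. The identity surrogate is what lets the loss on the binarized outputs move them.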