Abstract Quantized or lowbit neural networks are attractive due to their inference efficiency. However, training deep neural networks with quantized activations involves minimizing a discontinuous and piecewise constant loss function. Such a loss function has zero gradient almost everywhere (a.e.), which makes the conventional gradientbased algorithms inapplicable. To this end, we study a novel class of biased firstorder oracle, termed coarse gradient, for overcoming the vanished gradient issue. A coarse gradient is generated by replacing the a.e. zero derivative of quantized (i.e., staircase) ReLU activation composited in the chain rule with some heuristic proxy derivative called straightthrough estimator (STE). Although having been widely used in training quantized networks empirically, fundamental questions like when and why the ad hoc STE trick works, still lack theoretical understanding. In this paper, we propose a class of STEs with certain monotonicity and consider their applications to the training of a twolinearlayer network with quantized activation functions for nonlinear multicategory classification. We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum, which leads to a perfect classification. Lastly, we present experimental results on synthetic data as well as MNIST dataset to verify our theoretical findings and demonstrate the effectiveness of our proposed STEs.
more »
« less
Understanding straightthrough estimator in training activation quantized neural nets
Training activation quantized neural networks involves minimizing a piecewise
constant function whose gradient vanishes almost everywhere, which is undesirable
for the standard backpropagation or chain rule. An empirical way around
this issue is to use a straightthrough estimator (STE) (Bengio et al., 2013) in
the backward pass only, so that the “gradient” through the modified chain rule
becomes nontrivial. Since this unusual “gradient” is certainly not the gradient of
loss function, the following question arises: why searching in its negative direction
minimizes the training loss? In this paper, we provide the theoretical justification
of the concept of STE by answering this question. We consider the problem of
learning a twolinearlayer network with binarized ReLU activation and Gaussian
input data. We shall refer to the unusual “gradient” given by the STEmodifed
chain rule as coarse gradient. The choice of STE is not unique. We prove that
if the STE is properly chosen, the expected coarse gradient correlates positively
with the population gradient (not available for the training), and its negation is a
descent direction for minimizing the population loss. We further show the associated
coarse gradient descent algorithm converges to a critical point of the population
loss minimization problem. Moreover, we show that a poor choice of STE
leads to instability of the training algorithm near certain local minima, which is
verified with CIFAR10 experiments.
more »
« less
 Award ID(s):
 1632935
 NSFPAR ID:
 10112576
 Date Published:
 Journal Name:
 International Conference on Learning Representations
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Quantized deep neural networks (QDNNs) are attractive due to their much lower memory storage and faster inference speed than their regular fullprecision counterparts. To maintain the same performance level especially at low bitwidths, QDNNs must be retrained. Their training involves piecewise constant activation functions and discrete weights; hence, mathematical challenges arise. We introduce the notion of coarse gradient and propose the blended coarse gradient descent (BCGD) algorithm, for training fully quantized neural networks. Coarse gradient is generally not a gradient of any function but an artificial ascent direction. The weight update of BCGD goes by coarse gradient correction of a weighted average of the fullprecision weights and their quantization (the socalled blending), which yields sufficient descent in the objective value and thus accelerates the training. Our experiments demonstrate that this simple blending technique is very effective for quantization at extremely low bitwidth such as binarization. In full quantization of ResNet18 for ImageNet classification task, BCGD gives 64.36% top1 accuracy with binary weights across all layers and 4bit adaptive activation. If the weights in the first and last layers are kept in full precision, this number increases to 65.46%. As theoretical justification, we show convergence analysis of coarse gradient descent for a twolinearlayer neural network model with Gaussian input data and prove that the expected coarse gradient correlates positively with the underlying true gradient.more » « less

Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. Binarized activation offers an additional computational saving for inference. Due to vanishing gradient issue in training networks with binarized activation, coarse gradient (a.k.a. straight through estimator) is adopted in practice. In this paper, we study the problem of coarse gradient descent (CGD) learning of a one hidden layer convolutional neural network (CNN) with binarized activation function and sparse weights. It is known that when the input data is Gaussian distributed, nooverlap one hidden layer CNN with ReLU activation and general weight can be learned by GD in polynomial time at high probability in regression problems with ground truth. We propose a relaxed variable splitting method integrating thresholding and coarse gradient descent. The sparsity in network weight is realized through thresholding during the CGD training process. We prove that under thresholding of L1, L0, and transformedL1 penalties, nooverlap binary activation CNN can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel sparsifying operation. We found explicit error estimates of sparse weights from the true weights.more » « less

Injecting discrete logical constraints into neural network learning is one of the main challenges in neurosymbolic AI. We find that a straightthroughestimator, a method introduced to train binary neural networks, could effectively be applied to incorporate logical constraints into neural network learning. More specifically, we design a systematic way to represent discrete logical constraints as a loss function; minimizing this loss using gradient descent via a straightthroughestimator updates the neural network’s weights in the direction that the binarized outputs satisfy the logical constraints. The experimental results show that by leveraging GPUs and batch training, this method scales significantly better than existing neurosymbolic methods that require heavy symbolic computation for computing gradients. Also, we demonstrate that our method applies to different types of neural networks, such as MLP, CNN, and GNN, making them learn with no or fewer labeled data by learning directly from known constraints.more » « less

Injecting discrete logical constraints into neural network learning is one of the main challenges in neurosymbolic AI. We find that a straightthroughestimator, a method introduced to train binary neural networks, could effectively be applied to incorporate logical constraints into neural network learning. More specifically, we design a systematic way to represent discrete logical constraints as a loss function; minimizing this loss using gradient descent via a straightthroughestimator updates the neural network's weights in the direction that the binarized outputs satisfy the logical constraints. The experimental results show that by leveraging GPUs and batch training, this method scales significantly better than existing neurosymbolic methods that require heavy symbolic computation for computing gradients. Also, we demonstrate that our method applies to different types of neural networks, such as MLP, CNN, and GNN, making them learn with no or fewer labeled data by learning directly from known constraints.more » « less