Convergence of a Relaxed Variable Splitting Coarse Gradient Descent Method for Learning Sparse Weight Binarized Activation Neural Network
Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. Binarized activation offers an additional computational saving for inference. Due to the vanishing gradient issue in training networks with binarized activation, the coarse gradient (a.k.a. the straight-through estimator) is adopted in practice. In this paper, we study the problem of coarse gradient descent (CGD) learning of a one-hidden-layer convolutional neural network (CNN) with binarized activation function and sparse weights. It is known that when the input data is Gaussian distributed, a no-overlap one-hidden-layer CNN with ReLU activation and general weights can be learned by GD in polynomial time with high probability in regression problems with a ground truth. We propose a relaxed variable splitting method integrating thresholding and coarse gradient descent. The sparsity in the network weights is realized through thresholding during the CGD training process. We prove that under thresholding of L1, L0, and transformed-L1 penalties, the no-overlap binarized activation CNN can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel sparsifying operation. We also derive explicit error estimates of the sparse weights from the true weights.
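As a rough illustration of the scheme described above, the following minimal Python sketch combines a coarse-gradient (straight-through estimator) step on the filter with a thresholding step on an auxiliary sparse variable, in the spirit of the relaxed variable splitting method. All names, the identity surrogate used in the STE backward pass, and the hyperparameters are illustrative assumptions rather than the paper's actual implementation; the soft and hard thresholdings shown correspond to the L1 and L0 penalties, and the closed-form transformed-L1 thresholding is omitted for brevity.

import numpy as np

def binarize(z):
    # binarized activation sigma(z) = 1_{z > 0}
    return (z > 0).astype(float)

def coarse_grad(w, X_patches, y):
    # Coarse gradient of the squared loss w.r.t. the shared filter w.
    # The true gradient vanishes a.e. because sigma is piecewise constant, so the
    # backward pass replaces sigma' with the derivative of a surrogate
    # (here the identity, an assumed choice of STE).
    z = X_patches @ w                    # pre-activations, shape (n, k)
    pred = binarize(z).mean(axis=1)      # no-overlap CNN output, shape (n,)
    resid = pred - y                     # residual of the regression loss
    return (resid[:, None] * X_patches.mean(axis=1)).mean(axis=0)

def soft_thresh(u, t):
    # proximal map of the L1 penalty
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def hard_thresh(u, t):
    # proximal map of the L0 penalty
    return u * (np.abs(u) > np.sqrt(2.0 * t))

def rvsm_cgd(X_patches, y, d, lam=1e-2, beta=1.0, eta=0.1, iters=500, prox=soft_thresh):
    # Relaxed variable splitting: descend on w with the coarse gradient plus the
    # quadratic coupling term, then threshold the auxiliary variable u.
    w = np.random.randn(d) / np.sqrt(d)
    u = w.copy()
    for _ in range(iters):
        w = w - eta * (coarse_grad(w, X_patches, y) + beta * (w - u))
        u = prox(w, lam / beta)          # swap in hard_thresh for the L0 penalty
    return w, u                          # u is the sparse weight estimate

# toy usage with Gaussian patches and a planted sparse teacher filter
rng = np.random.default_rng(0)
n, k, d = 2000, 4, 16
X = rng.standard_normal((n, k, d))                 # k non-overlapping patches per sample
w_star = np.zeros(d); w_star[:3] = [1.0, -0.5, 0.25]
y = binarize(X @ w_star).mean(axis=1)              # labels from the true network
w, u = rvsm_cgd(X, y, d)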
 Award ID(s):
 1854434
 Publication Date:
 NSF-PAR ID:
 10158859
 Journal Name:
 Frontiers in Applied Mathematics and Statistics
 Volume:
 6
 Issue:
 13
 Page Range or eLocation-ID:
 111
 ISSN:
 2297-4687
 Sponsoring Org:
 National Science Foundation