Sparsification of neural networks is one of the effective complexity reduction methods
to improve efficiency and generalizability. Binarized activation offers an additional
computational saving for inference. Due to vanishing gradient issue in training networks
with binarized activation, coarse gradient (a.k.a. straight through estimator) is adopted in
practice. In this paper, we study the problem of coarse gradient descent (CGD) learning of
a one hidden layer convolutional neural network (CNN) with binarized activation function
and sparse weights. It is known that when the input data is Gaussian distributed,
no-overlap one hidden layer CNN with ReLU activation and general weight can be
learned by GD in polynomial time at high probability in regression problems with ground
truth. We propose a relaxed variable splitting method integrating thresholding and coarse
gradient descent. The sparsity in network weight is realized through thresholding during
the CGD training process. We prove that under thresholding of L1, L0, and transformed-L1
penalties, no-overlap binary activation CNN can be learned with high probability, and the
iterative weights converge to a global limit which is a transformation of the true weight
under a novel sparsifying operation. We found explicit error estimates of sparse weights
from the true weights.
more »
« less
Convergence of a Relaxed Variable Splitting Method for Learning Sparse Neural Networks via L1, L0, and transformed-L1 Penalties
Sparsification of neural networks is one of the effective complexity
reduction methods to improve efficiency and generalizability. We consider
the problem of learning a one hidden layer convolutional neural network with
ReLU activation function via gradient descent under sparsity promoting penalties.
It is known that when the input data is Gaussian distributed, no-overlap
networks (without penalties) in regression problems with ground truth can be
learned in polynomial time at high probability. We propose a relaxed variable
splitting method integrating thresholding and gradient descent to overcome the
non-smoothness in the loss function. The sparsity in network weight is realized
during the optimization (training) process. We prove that under L1, L0, and
transformed-L1 penalties, no-overlap networks can be learned with high probability,
and the iterative weights converge to a global limit which is a transformation
of the true weight under a novel thresholding operation. Numerical experiments
confirm theoretical findings, and compare the accuracy and sparsity
trade-off among the penalties.
more »
« less
- NSF-PAR ID:
- 10158867
- Date Published:
- Journal Name:
- Intelligent Systems Conference (IntelliSys)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Convolutional neural networks (CNN) have been hugely successful recently with superior accuracy and performance in various imaging applications, such as classification, object detection, and segmentation. However, a highly accurate CNN model requires millions of parameters to be trained and utilized. Even to increase its performance slightly would require significantly more parameters due to adding more layers and/or increasing the number of filters per layer. Apparently, many of these weight parameters turn out to be redundant and extraneous, so the original, dense model can be replaced by its compressed version attained by imposing inter- and intra-group sparsity onto the layer weights during training. In this paper, we propose a nonconvex family of sparse group lasso that blends nonconvex regularization (e.g., transformed L1, L1 - L2, and L0) that induces sparsity onto the individual weights and L2,1 regularization onto the output channels of a layer. We apply variable splitting onto the proposed regularization to develop an algorithm that consists of two steps per iteration: gradient descent and thresholding. Numerical experiments are demonstrated on various CNN architectures showcasing the effectiveness of the nonconvex family of sparse group lasso in network sparsification and test accuracy on par with the current state of the art.more » « less
-
In the last decade, convolutional neural networks (CNNs) have evolved to become the dominant models for various computer vision tasks, but they cannot be deployed in low-memory devices due to its high memory requirement and computational cost. One popular, straightforward approach to compressing CNNs is network slimming, which imposes an L1 penalty on the channel-associated scaling factors in the batch normalization layers during training. In this way, channels with low scaling factors are identified to be insignificant and are pruned in the models. In this paper, we propose replacing the L1 penalty with the Lp and transformed L1 (TL1) penalties since these nonconvex penalties outperformed L1 in yielding sparser satisfactory solutions in various compressed sensing problems. In our numerical experiments, we demonstrate network slimming with Lp and TL1 penalties on VGGNet and Densenet trained on CIFAR 10/100. The results demonstrate that the nonconvex penalties compress CNNs better than L1. In addition, TL1 preserves the model accuracy after channel pruning, L1/2 and L3/4 yield compressed models with similar accuracies as L1 after retraining.more » « less
-
Recurrent neural networks (RNNs) have been successfully used on a wide range of sequential data problems. A well known difficulty in using RNNs is the vanishing or exploding gradient problem. Recently, there have been several different RNN architectures that try to mitigate this issue by maintaining an orthogonal or unitary recurrent weight matrix. One such architecture is the scaled Cayley orthogonal recurrent neural network (scoRNN) which parameterizes the orthogonal recurrent weight matrix through a scaled Cayley transform. This parametrization contains a diagonal scaling matrix consisting of positive or negative one entries that can not be optimized by gradient descent. Thus the scaling matrix is fixed before training and a hyperparameter is introduced to tune the matrix for each particular task. In this paper, we develop a unitary RNN architecture based on a complex scaled Cayley transform. Unlike the real orthogonal case, the transformation uses a diagonal scaling matrix consisting of entries on the complex unit circle which can be optimized using gradient descent and no longer requires the tuning of a hyperparameter. We also provide an analysis of a potential issue of the modReLU activiation function which is used in our work and several other unitary RNNs. In the experiments conducted, the scaled Cayley unitary recurrent neural network (scuRNN) achieves comparable or better results than scoRNN and other unitary RNNs without fixing the scaling matrix.more » « less
-
Tabacu, Lucia (Ed.)Convolutional neural networks (CNN) have been hugely successful recently with superior accuracy and performance in various imaging applications, such as classification, object detection, and segmentation. However, a highly accurate CNN model requires millions of parameters to be trained and utilized. Even to increase its performance slightly would require significantly more parameters due to adding more layers and/or increasing the number of filters per layer. Apparently, many of these weight parameters turn out to be redundant and extraneous, so the original, dense model can be replaced by its compressed version attained by imposing inter- and intra-group sparsity onto the layer weights during training. In this paper, we propose a nonconvex family of sparse group lasso that blends nonconvex regularization (e.g., transformed ℓ1, ℓ1 − ℓ2, and ℓ0) that induces sparsity onto the individual weights and ℓ2,1 regularization onto the output channels of a layer. We apply variable splitting onto the proposed regularization to develop an algorithm that consists of two steps per iteration: gradient descent and thresholding. Numerical experiments are demonstrated on various CNN architectures showcasing the effectiveness of the nonconvex family of sparse group lasso in network sparsification and test accuracy on par with the current state of the art.more » « less