Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. We consider the problem of learning a one-hidden-layer convolutional neural network with ReLU activation function via gradient descent under sparsity-promoting penalties. It is known that when the input data is Gaussian distributed, no-overlap networks (without penalties) in regression problems with ground truth can be learned in polynomial time with high probability. We propose a relaxed variable splitting method integrating thresholding and gradient descent to overcome the non-smoothness in the loss function. The sparsity in the network weight is realized during the optimization (training) process. We prove that under L1, L0, and transformed-L1 penalties, no-overlap networks can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel thresholding operation. Numerical experiments confirm the theoretical findings and compare the accuracy and sparsity tradeoff among the penalties.
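As a minimal sketch of the thresholding-plus-gradient-descent idea above: the L1 and L0 penalties admit closed-form proximal (thresholding) maps, and a relaxed-variable-splitting step alternates a gradient step on a smoothed loss with a thresholding update on a split variable. The function names, the toy quadratic coupling term, and the parameters `lr`, `beta`, `lam` are illustrative assumptions, not the paper's exact algorithm; the transformed-L1 penalty also has a closed-form threshold, omitted here for brevity.

```python
import numpy as np

def soft_threshold(x, lam):
    # Proximal operator of the L1 penalty lam*||x||_1 (soft thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    # Proximal operator of the L0 penalty lam*||x||_0 (hard thresholding):
    # entries with |x_i| <= sqrt(2*lam) are set to zero.
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

def rvsm_step(w, u, grad_loss, lr=0.1, beta=1.0, lam=0.05):
    # One relaxed-variable-splitting style step (illustrative): gradient
    # descent on the loss augmented by a quadratic coupling beta/2*||w - u||^2,
    # then a thresholding update that keeps the split variable u sparse.
    w = w - lr * (grad_loss(w) + beta * (w - u))
    u = soft_threshold(w, lam / beta)
    return w, u
```

With `beta` large, `u` tracks `w` closely while staying sparse, which is how the sparsity is realized during training rather than by post-hoc pruning.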
Sparse and smooth signal estimation: Convexification of L0 formulations
Signal estimation problems with smoothness and sparsity priors can be naturally modeled as quadratic optimization with L0-"norm" constraints. Since such problems are non-convex and hard to solve, the standard approach is, instead, to tackle their convex surrogates based on L1-norm relaxations. In this paper, we propose new iterative (convex) conic quadratic relaxations that exploit not only the L0-"norm" terms, but also the fitness and smoothness functions. The iterative convexification approach substantially closes the gap between the L0-"norm" and its L1 surrogate. These stronger relaxations lead to significantly better estimators than L1-norm approaches and also allow one to utilize affine sparsity priors. In addition, the parameters of the model and the resulting estimators are easily interpretable. Experiments with a tailored Lagrangian decomposition method indicate that the proposed iterative convex relaxations yield solutions within 1% of the exact L0 approach, and can tackle instances with up to 100,000 variables in under one minute.
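For context, the standard L1-norm surrogate that the paper's conic relaxations improve upon can be sketched in a few lines: estimate a signal from noisy samples under a first-difference smoothness prior and an L1 sparsity term, solved by ISTA (proximal gradient). This is the baseline convex relaxation, not the paper's method; the parameters `lam`, `mu` and the difference-matrix smoothness model are illustrative assumptions.

```python
import numpy as np

def l1_smooth_estimate(y, lam=0.1, mu=1.0, iters=500):
    """L1-surrogate baseline for smooth + sparse signal estimation:
    minimize 0.5*||x - y||^2 + 0.5*mu*||D x||^2 + lam*||x||_1,
    where D is the first-difference operator (smoothness prior).
    Solved by ISTA: a gradient step on the smooth terms, then soft thresholding."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)              # (n-1) x n first-difference matrix
    L = 1.0 + mu * np.linalg.norm(D.T @ D, 2)   # Lipschitz constant of the smooth part
    x = np.zeros(n)
    for _ in range(iters):
        grad = (x - y) + mu * (D.T @ (D @ x))   # gradient of fitness + smoothness
        z = x - grad / L                        # forward (gradient) step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return x
```

The L0-"norm" counterpart replaces the soft threshold with a hard one but loses convexity; quantifying and closing the resulting relaxation gap is what the iterative convexification in the paper addresses.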
 Editors:
 Mirrokni, V.
 Award ID(s):
 1818700
 Publication Date:
 NSF-PAR ID:
 10289594
 Journal Name:
 Journal of Machine Learning Research
 Volume:
 22
 Issue:
 52
 Page Range or eLocation ID:
 1–43
 ISSN:
 1532-4435
 Sponsoring Org:
 National Science Foundation
More Like this


We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. We show that neural networks with rectified linear units act as convex regularizers, where simple solutions are encouraged via extreme points of a certain convex set. For one-dimensional regression and classification, as well as rank-one data matrices, we prove that finite two-layer ReLU networks with norm regularization yield linear spline interpolation. We characterize the classification decision regions in terms of a closed-form kernel matrix and minimum L1-norm solutions. This is in contrast to the Neural Tangent Kernel, which is unable to explain neural network predictions with finitely many neurons. Our convex geometric description also provides intuitive explanations of hidden neurons as autoencoders. In higher dimensions, we show that the training problem for two-layer networks can be cast as a finite-dimensional convex optimization problem with infinitely many constraints. We then provide a family of convex relaxations to approximate the solution, and a cutting-plane algorithm to improve the relaxations. We derive conditions for the exactness of the relaxations and provide simple closed-form formulas for the optimal neural network weights in certain cases. We also …

Sparsification of neural networks is one of the effective complexity reduction methods to improve efficiency and generalizability. Binarized activation offers an additional computational saving for inference. Due to the vanishing gradient issue in training networks with binarized activation, the coarse gradient (a.k.a. straight-through estimator) is adopted in practice. In this paper, we study the problem of coarse gradient descent (CGD) learning of a one-hidden-layer convolutional neural network (CNN) with binarized activation function and sparse weights. It is known that when the input data is Gaussian distributed, a no-overlap one-hidden-layer CNN with ReLU activation and general weight can be learned by GD in polynomial time with high probability in regression problems with ground truth. We propose a relaxed variable splitting method integrating thresholding and coarse gradient descent. The sparsity in the network weight is realized through thresholding during the CGD training process. We prove that under thresholding of L1, L0, and transformed-L1 penalties, a no-overlap binary activation CNN can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weight under a novel sparsifying operation. We also obtain explicit error estimates of the sparse weights from the true weights.
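The coarse gradient (straight-through estimator) mentioned above can be sketched for a single binarized neuron: the forward pass uses the step activation, whose derivative is zero almost everywhere, so the backward pass substitutes a surrogate derivative. This is a generic illustration of the STE idea, not the paper's exact CGD algorithm; the clipped-identity surrogate and the names `cgd_step`, `lr` are assumptions.

```python
import numpy as np

def binary_act(z):
    # Binarized activation used in the forward pass: 1 if z > 0 else 0.
    return (z > 0).astype(float)

def ste_grad(z):
    # Straight-through (coarse) gradient: the true derivative of the step
    # function is 0 almost everywhere, so backprop substitutes a surrogate,
    # here the derivative of the clipped identity (1 on |z| <= 1, else 0).
    return (np.abs(z) <= 1).astype(float)

def cgd_step(w, x, y, lr=0.1):
    # One coarse-gradient-descent step for a single binarized neuron
    # f(x) = binary_act(w . x) with squared loss 0.5*(f(x) - y)^2.
    z = x @ w
    err = binary_act(z) - y
    coarse_grad = err * ste_grad(z) * x   # chain rule with the surrogate derivative
    return w - lr * coarse_grad
```

Note that when `|z| > 1` the surrogate derivative vanishes and the weight is left unchanged, which is what makes the coarse gradient biased yet usable in practice.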

We develop a convex analytic approach to analyze finite-width two-layer ReLU networks. We first prove that an optimal solution to the regularized training problem can be characterized as extreme points of a convex set, where simple solutions are encouraged via its convex geometrical properties. We then leverage this characterization to show that an optimal set of parameters yields linear spline interpolation for regression problems involving one-dimensional or rank-one data. We also characterize the classification decision regions in terms of a kernel matrix and minimum L1-norm solutions. This is in contrast to the Neural Tangent Kernel, which is unable to explain predictions of finite-width networks. Our convex geometric characterization also provides intuitive explanations of hidden neurons as autoencoders. In higher dimensions, we show that the training problem can be cast as a finite-dimensional convex problem with infinitely many constraints. Then, we apply certain convex relaxations and introduce a cutting-plane algorithm to globally optimize the network. We further analyze the exactness of the relaxations to provide conditions for the convergence to a global optimum. Our analysis also shows that optimal network parameters can also be characterized as interpretable closed-form formulas in some practically relevant special cases.
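The 1D linear-spline claim above has a concrete reading: any linear spline with knots at the data points is exactly representable by a two-layer ReLU network, since each knot contributes one ReLU unit whose weight is the change in slope. The sketch below builds such a network for given data and checks it against piecewise-linear interpolation; it constructs *a* spline-interpolating network, not the minimum-norm one the theorem characterizes, and the function name is illustrative.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def spline_as_relu_net(xs, ys):
    """Express the linear spline interpolant of (xs, ys) on [xs[0], xs[-1]] as a
    two-layer ReLU network f(t) = ys[0] + sum_i c_i * relu(t - xs[i]),
    with one hidden unit per knot and weight c_i equal to the slope change."""
    slopes = np.diff(ys) / np.diff(xs)   # slope on each interval
    c = np.diff(slopes, prepend=0.0)     # slope changes = second-layer weights
    def f(t):
        t = np.atleast_1d(t)
        return ys[0] + np.sum(c[None, :] * relu(t[:, None] - xs[:-1][None, :]), axis=1)
    return f
```

The convex-geometric result says norm regularization selects precisely such sparse, knots-at-the-data solutions among all interpolating networks.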

Abstract We study the low-rank phase retrieval problem, where our goal is to recover a $d_1\times d_2$ low-rank matrix from a series of phaseless linear measurements. This is a fourth-order inverse problem, as we are trying to recover factors of a matrix that have been observed, indirectly, through some quadratic measurements. We propose a solution to this problem using the recently introduced technique of anchored regression. This approach uses two different types of convex relaxations: we replace the quadratic equality constraints for the phaseless measurements by a search over a polytope and enforce the rank constraint through nuclear norm regularization. The result is a convex program in the space of $d_1 \times d_2$ matrices. We analyze two specific scenarios. In the first, the target matrix is rank-$1$, and the observations are structured to correspond to a phaseless blind deconvolution. In the second, the target matrix has general rank, and we observe the magnitudes of the inner products against a series of independent Gaussian random matrices. In each of these problems, we show that anchored regression returns an accurate estimate from a near-optimal number of measurements given that we have access to an anchor matrix of sufficient quality. We also …
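The two relaxations described above can be made concrete with a small NumPy sketch of the Gaussian measurement model, the polytope constraint, and the anchored objective. Solving the resulting convex program requires an actual solver and is omitted; the function names and the regularization weight `tau` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def phaseless_measurements(M, m, rng=rng):
    # Phaseless linear measurements y_i = |<G_i, M>| against m independent
    # Gaussian random matrices G_i (the second scenario in the abstract).
    G = rng.standard_normal((m,) + M.shape)
    y = np.abs(np.tensordot(G, M, axes=([1, 2], [0, 1])))
    return G, y

def in_polytope(X, G, y, tol=1e-9):
    # Relaxation 1: the quadratic equalities |<G_i, X>| = y_i are replaced by
    # the polytope |<G_i, X>| <= y_i, which contains the true matrix.
    return bool(np.all(np.abs(np.tensordot(G, X, axes=([1, 2], [0, 1]))) <= y + tol))

def anchored_objective(X, anchor, tau):
    # Relaxation 2: anchored regression maximizes alignment <anchor, X> over
    # the polytope, with a nuclear-norm penalty enforcing low rank.
    return np.sum(anchor * X) - tau * np.sum(np.linalg.svd(X, compute_uv=False))
```

Maximizing the anchored objective pushes the iterate toward the boundary of the polytope, where the inequality constraints become the original phaseless equalities; the anchor plays the role of an initial guess correlated with the truth.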