A wide variety of activation functions have been proposed for neural networks. The Rectified Linear Unit (ReLU) is especially popular today. There are many practical reasons that motivate the use of the ReLU. This paper provides new theoretical characterizations that support the use of the ReLU, its variants such as the leaky ReLU, as well as other activation functions, in the case of univariate, single-hidden layer feedforward neural networks. Our results also explain the importance of commonly used strategies in the design and training of neural networks, such as “weight decay” and “path-norm” regularization, and provide a new justification for the use of “skip connections” in network architectures. These new insights are obtained through the lens of spline theory. In particular, we show how neural network training problems are related to infinite-dimensional optimizations posed over Banach spaces of functions whose solutions are well known to be fractional and polynomial splines, where the particular Banach space (which controls the order of the spline) depends on the choice of activation function.
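The equivalence between the “weight decay” and “path-norm” regularizers mentioned above can be checked numerically: each ReLU unit v·relu(w·x) is invariant under the rescaling (w, v) → (c·w, v/c), and minimizing the squared-weight penalty over this rescaling yields the path-norm |w||v| by the AM-GM inequality. A minimal sketch with hypothetical weight values, not code from the paper:

```python
import numpy as np

# For one ReLU unit f(x) = v * relu(w * x), the function is unchanged by the
# rescaling (w, v) -> (c*w, v/c) for any c > 0. Minimizing the weight-decay
# penalty ((c*w)^2 + (v/c)^2) / 2 over c gives |w| * |v| (AM-GM), which is
# exactly the path-norm of the unit.
w, v = 3.0, 0.5                           # hypothetical weights
cs = np.linspace(0.1, 10.0, 100001)       # grid over rescaling factors
weight_decay = ((cs * w) ** 2 + (v / cs) ** 2) / 2.0
path_norm = abs(w) * abs(v)

print(weight_decay.min())  # approximately 1.5
print(path_norm)           # 1.5
```

Because of this rescaling invariance, weight decay on a single-hidden-layer network implicitly penalizes the path-norm of each unit.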
Banach Space Representer Theorems for Neural Networks and Ridge Splines
We develop a variational framework to understand the properties of the functions learned by neural networks fit to data. We propose and study a family of continuous-domain linear inverse problems with total variation-like regularization in the Radon domain subject to data fitting constraints. We derive a representer theorem showing that finite-width, single-hidden layer neural networks are solutions to these inverse problems. We draw on many techniques from variational spline theory, and so we propose the notion of polynomial ridge splines, which correspond to single-hidden layer neural networks with truncated power functions as the activation function. The representer theorem is reminiscent of the classical reproducing kernel Hilbert space representer theorem, but we show that the neural network problem is posed over a non-Hilbertian Banach space. While the learning problems are posed in the continuous domain, similar to kernel methods, the problems can be recast as finite-dimensional neural network training problems. These neural network training problems have regularizers related to the well-known weight-decay and path-norm regularizers. Thus, our result gives insight into the functional characteristics of trained neural networks and into the design of neural network regularizers. We also show that these regularizers promote neural network solutions with desirable generalization properties.
Keywords: neural networks, splines, inverse problems, regularization, sparsity
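The ridge-spline view is easy to verify numerically in the univariate case: each ReLU unit relu(w·x + b) is a first-order truncated power function with a knot at -b/w, so a finite-width, single-hidden-layer network is a continuous piecewise-linear function, i.e. a linear spline. A minimal sketch with hypothetical weights (not from the paper):

```python
import numpy as np

# A single-hidden-layer ReLU network on a 1-D input: each unit
# relu(w_k * x + b_k) has a knot at -b_k / w_k, so the whole network is a
# linear spline with knots at those locations.
w = np.array([1.0, -2.0, 0.5])   # input weights (hypothetical)
b = np.array([0.0, 1.0, -2.0])   # biases; knots at -b/w = 0, 0.5, 4
v = np.array([1.0, 0.5, -1.0])   # output weights

def net(x):
    return np.maximum(np.outer(x, w) + b, 0.0) @ v

# Linear interpolation through the network's values at its knots (plus the
# interval endpoints) reproduces the network exactly.
nodes = np.array([-1.0, 0.0, 0.5, 4.0, 5.0])
xs = np.linspace(-1.0, 5.0, 201)
print(np.allclose(net(xs), np.interp(xs, nodes, net(nodes))))  # True
```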
 Award ID(s):
 2023239
 NSF-PAR ID:
 10533127
 Publisher / Repository:
 Journal of Machine Learning Research
 Date Published:
 Journal Name:
 Journal of Machine Learning Research
 ISSN:
 1532-4435
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We develop a convex analytic framework for ReLU neural networks which elucidates the inner workings of hidden neurons and their function space characteristics. We show that neural networks with rectified linear units act as convex regularizers, where simple solutions are encouraged via extreme points of a certain convex set. For one-dimensional regression and classification, as well as rank-one data matrices, we prove that finite two-layer ReLU networks with norm regularization yield linear spline interpolation. We characterize the classification decision regions in terms of a closed-form kernel matrix and minimum L1-norm solutions. This is in contrast to the Neural Tangent Kernel, which is unable to explain neural network predictions with finitely many neurons. Our convex geometric description also provides intuitive explanations of hidden neurons as autoencoders. In higher dimensions, we show that the training problem for two-layer networks can be cast as a finite-dimensional convex optimization problem with infinitely many constraints. We then provide a family of convex relaxations to approximate the solution, and a cutting-plane algorithm to improve the relaxations. We derive conditions for the exactness of the relaxations and provide simple closed-form formulas for the optimal neural network weights in certain cases. We also establish a connection to ℓ0-ℓ1 equivalence for neural networks, analogous to minimal cardinality solutions in compressed sensing. Extensive experimental results show that the proposed approach yields interpretable and accurate models.
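The linear-spline claim above has a simple converse that is easy to check: any connect-the-dots interpolant of 1-D data is realized exactly by a small two-layer ReLU network with a skip/affine term plus one unit per interior knot. A minimal sketch with hypothetical data points, not the paper's construction:

```python
import numpy as np

# Data to interpolate (hypothetical values)
xk = np.array([0.0, 1.0, 2.0, 3.0])
yk = np.array([0.0, 1.0, 0.0, 2.0])
m = np.diff(yk) / np.diff(xk)          # slope of each linear segment

def relu_net(x):
    # Two-layer ReLU network realizing the piecewise-linear interpolant:
    # an affine (skip) term plus one ReLU unit per interior knot, whose
    # output weight is the change in slope at that knot.
    out = yk[0] + m[0] * (x - xk[0])
    for knot, dslope in zip(xk[1:-1], np.diff(m)):
        out = out + dslope * np.maximum(x - knot, 0.0)
    return out

x = np.linspace(0.0, 3.0, 301)
print(np.allclose(relu_net(x), np.interp(x, xk, yk)))  # True
```

This is one way to see why regularized two-layer ReLU networks can coincide with linear spline interpolation: the spline itself is already in the hypothesis class.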

Sparsity of a learning solution is a desirable feature in machine learning. Certain reproducing kernel Banach spaces (RKBSs) are appropriate hypothesis spaces for sparse learning methods. The goal of this paper is to understand what kind of RKBSs can promote sparsity of learning solutions. We consider two typical learning models in an RKBS: the minimum norm interpolation (MNI) problem and the regularization problem. We first establish an explicit representer theorem for solutions of these problems, which represents the extreme points of the solution set by a linear combination of the extreme points of the subdifferential set of the norm function, which is data-dependent. We then propose sufficient conditions on the RKBS that can transform the explicit representation of the solutions to a sparse kernel representation having fewer terms than the number of observed data. Under the proposed sufficient conditions, we investigate the role of the regularization parameter in the sparsity of the regularized solutions. We further show that two specific RKBSs, the sequence space l_1(N) and the measure space, admit sparse representer theorems for both the MNI and regularization models.
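The sparsity phenomenon the abstract describes, a solution with fewer terms than observed data, already shows up in the simplest ℓ1 minimum norm interpolation problem, min ||c||_1 subject to A c = y, which becomes a linear program after the split c = p - q with p, q ≥ 0. A minimal sketch with hypothetical random data, not code from the paper:

```python
import numpy as np
from scipy.optimize import linprog

# Minimum l1-norm interpolation: min ||c||_1  s.t.  A c = y.
# A vertex (basic) solution of the equivalent LP has at most as many
# nonzeros as there are data constraints.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8))   # 3 observations, 8 coefficients
y = rng.standard_normal(3)
n = A.shape[1]

res = linprog(c=np.ones(2 * n),
              A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=[(0, None)] * (2 * n),
              method='highs-ds')  # dual simplex returns a vertex solution
coeff = res.x[:n] - res.x[n:]

print(np.allclose(A @ coeff, y))          # True: the data are interpolated
print(int(np.sum(np.abs(coeff) > 1e-8)))  # at most 3 nonzero coefficients
```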

There has been significant recent interest in the use of deep learning for regularizing imaging inverse problems. Most work in the area has focused on regularization imposed implicitly by convolutional neural networks (CNNs) pretrained for image reconstruction. In this work, we follow an alternative line of work based on learning explicit regularization functionals that promote preferred solutions. We develop the Explicit Learned Deep Equilibrium Regularizer (ELDER) method for learning explicit regularization functionals that minimize a mean-squared error (MSE) metric. ELDER is based on a regularization functional parameterized by a CNN and a deep equilibrium learning (DEQ) method for training the functional to be MSE-optimal at the fixed points of the reconstruction algorithm. The explicit regularizer enables ELDER to directly inherit fundamental convergence results from optimization theory. On the other hand, DEQ training enables ELDER to improve over existing explicit regularizers without prohibitive memory complexity during training. We use ELDER to train several approaches to parameterizing explicit regularizers and test their performance on three distinct imaging inverse problems. Our results show that ELDER can greatly improve the quality of explicit regularizers compared to existing methods, and show that learning explicit regularizers does not compromise performance relative to methods based on implicit regularization.
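The "fixed points of the reconstruction algorithm" idea above can be sketched with an explicit regularizer simple enough to have a closed-form fixed point. ELDER parameterizes R with a CNN; here, as a stand-in, R(x) = (λ/2)||x||² makes the gradient-step iteration converge to the regularized least-squares solution, illustrating the convergence guarantee an explicit regularizer inherits. Hypothetical toy data, not the paper's method:

```python
import numpy as np

# Reconstruction as a fixed-point iteration x <- x - step * grad F(x) for
# F(x) = 0.5 * ||A x - y||^2 + R(x) with the explicit regularizer
# R(x) = (lam / 2) * ||x||^2. The fixed point is (A^T A + lam I)^{-1} A^T y.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))   # hypothetical forward operator
y = rng.standard_normal(20)         # hypothetical measurements
lam, step = 0.5, 0.01

x = np.zeros(10)
for _ in range(5000):
    x = x - step * (A.T @ (A @ x - y) + lam * x)

x_star = np.linalg.solve(A.T @ A + lam * np.eye(10), A.T @ y)
print(np.allclose(x, x_star))  # True: iteration reached the fixed point
```

DEQ training differentiates through exactly such a fixed point, rather than through the unrolled iterations, which is what keeps the memory cost of training bounded.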

We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For univariate regression, we show that the solution of training a width-n shallow ReLU network is within n^(-1/2) of the function which fits the training data and whose difference from the initial function has the smallest 2-norm of the second derivative weighted by a curvature penalty that depends on the probability distribution used to initialize the network parameters. We compute the curvature penalty function explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. For stochastic gradient descent we obtain the same implicit bias result, and we obtain similar results for other activation functions. For multivariate regression we show an analogous result, whereby the second derivative is replaced by the Radon transform of a fractional Laplacian. For initialization schemes that yield a constant penalty function, the solutions are polyharmonic splines. Moreover, we show that the training trajectories are captured by trajectories of smoothing splines with decreasing regularization strength.
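The limiting solution claimed above for a constant curvature penalty, the natural cubic spline interpolant, is the function that fits the data while minimizing the 2-norm of the second derivative; its defining properties are easy to check directly. A minimal sketch with hypothetical data points:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# The natural cubic spline interpolant passes through the data and has zero
# second derivative at both endpoints; among all interpolants it minimizes
# the L2 norm of the second derivative.
x = np.array([0.0, 1.0, 2.5, 4.0])   # hypothetical training inputs
y = np.array([0.0, 2.0, 1.0, 3.0])   # hypothetical training targets
spline = CubicSpline(x, y, bc_type='natural')

print(np.allclose(spline(x), y))                # True: interpolates the data
print(np.allclose(spline(x[[0, -1]], 2), 0.0))  # True: zero end curvature
```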