skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics
Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function represented. Partially, this is due to symmetries inherent within the NN parameterization, allowing multiple different parameter settings to result in an identical output function, resulting in both an unclear relationship and redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work, showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight scale symmetry enables us to prove these results more simply and enables us to prove new results and gain new insights while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings, and alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks. Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2 .  more » « less
Award ID(s):
1707400
PAR ID:
10380471
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Frontiers in Artificial Intelligence
Volume:
5
ISSN:
2624-8212
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Formal certification of Neural Networks (NNs) is crucial for ensuring their safety, fairness, and robustness. Unfortunately, on the one hand, sound and complete certification algorithms of ReLU-based NNs do not scale to large-scale NNs. On the other hand, incomplete certification algorithms are easier to compute, but they result in loose bounds that deteriorate with the depth of NN, which diminishes their effectiveness. In this paper, we ask the following question; can we replace the ReLU activation function with one that opens the door to incomplete certification algorithms that are easy to compute but can produce tight bounds on the NN's outputs? We introduce DeepBern-Nets, a class of NNs with activation functions based on Bernstein polynomials instead of the commonly used ReLU activation. Bernstein polynomials are smooth and differentiable functions with desirable properties such as the so-called range enclosure and subdivision properties. We design a novel Interval Bound Propagation (IBP) algorithm, called Bern-IBP, to efficiently compute tight bounds on DeepBern-Nets outputs. Our approach leverages the properties of Bernstein polynomials to improve the tractability of neural network certification tasks while maintaining the accuracy of the trained networks. We conduct experiments in adversarial robustness and reachability analysis settings to assess the effectiveness of the approach. Our proposed framework achieves high certified accuracy for adversarially-trained NNs, which is often a challenging task for certifiers of ReLU-based NNs. This work establishes Bernstein polynomial activation as a promising alternative for improving NN certification tasks across various NNs applications. 
    more » « less
  2. We study the theory of neural network (NN) from the lens of classical nonparametric regression problems with a focus on NN's ability to adaptively estimate functions with heterogeneous smoothness -- a property of functions in Besov or Bounded Variation (BV) classes. Existing work on this problem requires tuning the NN architecture based on the function spaces and sample sizes. We consider a "Parallel NN" variant of deep ReLU networks and show that the standard weight decay is equivalent to promoting the ℓp-sparsity (0<1) of the coefficient vector of an end-to-end learned function bases, i.e., a dictionary. Using this equivalence, we further establish that by tuning only the weight decay, such Parallel NN achieves an estimation error arbitrarily close to the minimax rates for both the Besov and BV classes. Notably, it gets exponentially closer to minimax optimal as the NN gets deeper. Our research sheds new lights on why depth matters and how NNs are more powerful than kernel methods. 
    more » « less
  3. Abstract We develop a general theory of flows in the space of Riemannian metrics induced by neural network (NN) gradient descent. This is motivated in part by recent advances in approximating Calabi–Yau metrics with NNs and is enabled by recent advances in understanding flows in the space of NNs. We derive the corresponding metric flow equations, which are governed by a metric neural tangent kernel (NTK), a complicated, non-local object that evolves in time. However, many architectures admit an infinite-width limit in which the kernel becomes fixed and the dynamics simplify. Additional assumptions can induce locality in the flow, which allows for the realization of Perelman’s formulation of Ricci flow that was used to resolve the 3d Poincaré conjecture. We demonstrate that such fixed kernel regimes lead to poor learning of numerical Calabi–Yau metrics, as is expected since the associated NNs do not learn features. Conversely, we demonstrate that well-learned numerical metrics at finite-width exhibit an evolving metric-NTK, associated with feature learning. Our theory of NN metric flows therefore explains why NNs are better at learning Calabi–Yau metrics than fixed kernel methods, such as the Ricci flow. 
    more » « less
  4. We study the generalization of two-layer ReLU neural networks in a univariate nonparametric regression problem with noisy labels. This is a problem where kernels (\emph{e.g.} NTK) are provably sub-optimal and benign overfitting does not happen, thus disqualifying existing theory for interpolating (0-loss, global optimal) solutions. We present a new theory of generalization for local minima that gradient descent with a constant learning rate can \emph{stably} converge to. We show that gradient descent with a fixed learning rate η can only find local minima that represent smooth functions with a certain weighted \emph{first order total variation} bounded by 1/η−1/2+O˜(σ+MSE‾‾‾‾‾√) where σ is the label noise level, MSE is short for mean squared error against the ground truth, and O˜(⋅) hides a logarithmic factor. Under mild assumptions, we also prove a nearly-optimal MSE bound of O˜(n−4/5) within the strict interior of the support of the n data points. Our theoretical results are validated by extensive simulation that demonstrates large learning rate training induces sparse linear spline fits. To the best of our knowledge, we are the first to obtain generalization bound via minima stability in the non-interpolation case and the first to show ReLU NNs without regularization can achieve near-optimal rates in nonparametric regression. 
    more » « less
  5. We consider networks, trained via stochastic gradient descent to minimize L2 loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the datapoints, of the squared L2 of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width,depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two layer ReLU networks trained on one-dimensional data, and two layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models. 
    more » « less