Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks
We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that "fits" a training data set as accurately as possible, as quantified by the training error, and study the following question: \emph{does a low training error guarantee that the norm of the output layer (the outer norm) is small?} We answer this question affirmatively for the case of non-negative output weights. Using a simple covering number argument, we establish that under quite mild distributional assumptions on the input/label pairs, any such network achieving a small training error on polynomially many data points necessarily has a well-controlled outer norm. Notably, our results (a) have a sample complexity that is polynomial in d, (b) are independent of the number of hidden units (which can potentially be very large), (c) are oblivious to the training algorithm, and (d) require quite mild assumptions on the data (in particular, the input vector X ∈ ℝ^d need not have independent coordinates). We then leverage our bounds to establish generalization guarantees for such networks through the \emph{fat-shattering dimension}, a scale-sensitive measure of the complexity of the class to which the network architectures we investigate belong. Notably, our generalization bounds also have good sample complexity (polynomial in d with a low degree), and are in fact near-linear for some important cases of interest.
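For concreteness, here is a minimal numpy sketch of the objects the abstract refers to: a width-m two-layer ReLU network with non-negative output weights, its training error, and the norm of its output layer (the outer norm). The dimensions, the squared loss, and the choice of the ℓ1 norm below are illustrative assumptions, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical sizes; the paper's results are stated to be independent of the width m.
d, m, n = 10, 500, 200

X = rng.standard_normal((n, d))              # training inputs
Y = rng.standard_normal(n)                   # training labels (placeholder)
W = rng.standard_normal((m, d))              # inner (hidden-layer) weights
a = np.abs(rng.standard_normal(m)) / m       # non-negative output weights

def network(X, W, a):
    # f(x) = sum_j a_j * relu(<w_j, x>) with a_j >= 0
    return relu(X @ W.T) @ a

train_error = np.mean((network(X, W, a) - Y) ** 2)   # empirical (training) error
outer_norm = np.linalg.norm(a, 1)                    # norm of the output layer (the "outer norm")
print(train_error, outer_norm)
```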
Award ID(s): 2022448
NSF-PAR ID: 10279503
Date Published:
Journal Name: International Symposium on Information Theory
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


Generalization error bounds are essential to understanding machine learning algorithms. This paper presents novel expected generalization error upper bounds based on the average joint distribution between the output hypothesis and each input training sample. Multiple generalization error upper bounds based on different information measures are provided, including Wasserstein distance, total variation distance, KL divergence, and Jensen-Shannon divergence. Due to the convexity of the information measures, the proposed bounds in terms of Wasserstein distance and total variation distance are shown to be tighter than their counterparts based on individual samples in the literature. An example is provided to demonstrate the tightness of the proposed generalization error bounds.
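As a rough illustration (my own toy example, not the paper's construction) of why averaging the joint distributions helps: the listed information measures are convex in their arguments, so the distance from the averaged joint distribution to a fixed reference is never larger than the average of the individual distances. The numpy check below verifies this for total variation distance on random discrete distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    """Total variation distance between two discrete distributions given as flat arrays."""
    return 0.5 * np.abs(p - q).sum()

# Toy discrete joint distributions P_{W,Z_i} on a 4 x 4 alphabet, i = 1..n (purely illustrative).
n, k = 5, 4
joints = rng.dirichlet(np.ones(k * k), size=n).reshape(n, k, k)
avg_joint = joints.mean(axis=0)

# Fixed reference distribution: the product of the marginals of the averaged joint.
p_w = avg_joint.sum(axis=1)
p_z = avg_joint.sum(axis=0)
reference = np.outer(p_w, p_z)

lhs = tv(avg_joint.ravel(), reference.ravel())                       # distance of the averaged joint
rhs = np.mean([tv(j.ravel(), reference.ravel()) for j in joints])    # average of individual distances
assert lhs <= rhs + 1e-12   # convexity: the averaged-joint quantity is never larger
print(lhs, rhs)
```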

A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when during training the network behaves as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the corresponding minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized networks can induce rich implicit biases that are not RKHS norms. Building on an observation by \citet{chizat2018note}, we show how the \textbf{\textit{scale of the initialization}} controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a family of simple depth-D linear networks that exhibit an interesting and meaningful transition between the kernel and rich regimes, and highlight an interesting role for the \emph{width} of the models. We further demonstrate this transition empirically for matrix factorization and multilayer nonlinear networks.
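A small sketch of the initialization-scale effect, using a depth-2 diagonal linear model as an assumed stand-in (not necessarily the exact family analyzed in the paper): gradient descent from a large-scale initialization tends toward a minimum-ℓ2-like ("kernel"/lazy) interpolator, while a small-scale initialization tends toward a sparser, minimum-ℓ1-like ("rich") interpolator.

```python
import numpy as np

def train_diag_linear(X, y, alpha, lr=5e-3, steps=50_000):
    """Gradient descent on the squared loss for f(x) = <u*u - v*v, x>, initialized at u = v = alpha."""
    d = X.shape[1]
    u = np.full(d, alpha)
    v = np.full(d, alpha)
    for _ in range(steps):
        beta = u * u - v * v                            # effective linear predictor
        g = X.T @ (X @ beta - y) / len(y)               # gradient of the loss w.r.t. beta
        u, v = u - lr * 2 * u * g, v + lr * 2 * v * g   # chain rule through u*u and -v*v
    return u * u - v * v

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))
beta_star = np.zeros(50)
beta_star[:3] = 1.0                                     # sparse ground truth
y = X @ beta_star

for alpha in (1e-3, 1.0):
    beta = train_diag_linear(X, y, alpha)
    print(alpha, np.linalg.norm(beta, 1), np.linalg.norm(beta, 2))
# Expected trend: small alpha yields an interpolator with smaller l1 norm ("rich" regime),
# large alpha one with smaller l2 norm ("kernel" regime).
```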

Various approaches have been developed to upper bound the generalization error of a supervised learning algorithm. However, existing bounds are often loose and lack guarantees. As a result, they may fail to characterize the exact generalization ability of a learning algorithm. Our main contribution is an exact characterization of the expected generalization error of the well-known Gibbs algorithm (a.k.a. Gibbs posterior) using the symmetrized KL information between the input training samples and the output hypothesis. Our result can be applied to tighten existing expected generalization error and PAC-Bayesian bounds. Our approach is versatile, as it also characterizes the generalization error of the Gibbs algorithm with a data-dependent regularizer and that of the Gibbs algorithm in the asymptotic regime, where it converges to the empirical risk minimization algorithm. Of particular relevance, our results highlight the role the symmetrized KL information plays in controlling the generalization error of the Gibbs algorithm.
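In symbols, one plausible reading of the abstract (with γ the inverse temperature of the Gibbs posterior, π a prior, L_S the empirical risk on the training set S, and I_SKL the symmetrized KL information; the precise statement and its conditions are in the paper):
\[
P^{\gamma}_{W\mid S}(w) \;=\; \frac{\pi(w)\, e^{-\gamma L_S(w)}}{\mathbb{E}_{\pi}\!\bigl[e^{-\gamma L_S(W)}\bigr]},
\qquad
\overline{\mathrm{gen}}\bigl(\mu, P^{\gamma}_{W\mid S}\bigr) \;=\; \frac{I_{\mathrm{SKL}}(S;W)}{\gamma},
\]
where \(I_{\mathrm{SKL}}(S;W) = D\bigl(P_{S,W}\,\|\,P_S\otimes P_W\bigr) + D\bigl(P_S\otimes P_W\,\|\,P_{S,W}\bigr)\).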

Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop margin-based generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance which captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large-scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature. The code is available at https://github.com/chingyaoc/kVMargin.
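A rough numpy sketch of one ingredient, the entropic-regularized optimal transport cost between the features of two independently sampled training subsets (plain Sinkhorn iterations). The feature matrix, the regularization, and the square-root normalization of the margin below are assumptions made for illustration; the paper's exact cost and normalization are in the linked repository.

```python
import numpy as np

def sinkhorn_cost(X, Y, reg=0.1, n_iter=300):
    """Entropic-regularized OT cost between uniform empirical measures on the rows of X and Y."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2   # pairwise squared distances
    eps = reg * C.mean()                        # regularization scaled to the cost magnitude
    K = np.exp(-C / eps)
    a = np.full(len(X), 1.0 / len(X))
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]             # approximate optimal transport plan
    return float(np.sum(P * C))

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 32))       # stand-in for learned features of training points
idx = rng.permutation(200)
cost = sinkhorn_cost(features[idx[:100]], features[idx[100:]])

raw_margin = 0.7                                # stand-in for a per-example classification margin
normalized_margin = raw_margin / np.sqrt(cost)  # margin normalized by the transport cost
print(cost, normalized_margin)
```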