We consider the problem of finding a two-layer neural
network with sigmoid, rectified linear unit (ReLU), or binary step
activation functions that “fits” a training data set as accurately as
possible as quantified by the training error; and study the following
question: does a low training error guarantee that the norm of the
output layer (outer norm) itself is small? We answer affirmatively
this question for the case of non-negative output weights. Using a
simple covering number argument, we establish that under quite
mild distributional assumptions on the input/label pairs; any such
network achieving a small training error on polynomially many
data necessarily has a well-controlled outer norm. Notably, our
results (a) have a polynomial (in d) sample complexity, (b) are
independent of the number of hidden units (which can potentially
be very high), (c) are oblivious to the training algorithm; and
(d) require quite mild assumptions on the data (in particular the
input vector X ∈ Rd need not have independent coordinates). We
then leverage our bounds to establish generalization guarantees for
such networks through fat-shattering dimension, a scale-sensitive
measure of the complexity class that the network architectures we
investigate belong to. Notably, our generalization bounds also have
good sample complexity (polynomials in d with a low degree), and
are in fact near-linear for some important cases of interest.
more »
« less
Self-Regularity of Non-Negative Output Weightsfor Overparameterized Two-Layer Neural Networks
We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that "fits" a training data set as accurately as possible as quantified by the training error; and study the following question: \emph{does a low training error guarantee that the norm of the output layer (outer norm) itself is small?} We answer affirmatively this question for the case of non-negative output weights. Using a simple covering number argument, we establish that under quite mild distributional assumptions on the input/label pairs; any such network achieving a small training error on polynomially many data necessarily has a well-controlled outer norm. Notably, our results (a) have a polynomial (in d) sample complexity, (b) are independent of the number of hidden units (which can potentially be very high), (c) are oblivious to the training algorithm; and (d) require quite mild assumptions on the data (in particular the input vector X∈ℝd need not have independent coordinates). We then leverage our bounds to establish generalization guarantees for such networks through \emph{fat-shattering dimension}, a scale-sensitive measure of the complexity class that the network architectures we investigate belong to. Notably, our generalization bounds also have good sample complexity (polynomials in d with a low degree), and are in fact near-linear for some important cases of interest.
more »
« less
- Award ID(s):
- 2022448
- NSF-PAR ID:
- 10279503
- Date Published:
- Journal Name:
- International Symposium on Information Theory
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
In lifelong learning, tasks (or classes) to be learned arrive sequentially over time in arbitrary order. During training, knowledge from previous tasks can be captured and transferred to subsequent ones to improve sample efficiency. We consider the setting where all target tasks can be represented in the span of a small number of unknown linear or nonlinear features of the input data. We propose a lifelong learning algorithm that maintains and refines the internal feature representation. We prove that for any desired accuracy on all tasks, the dimension of the representation remains close to that of the underlying representation. The resulting sample complexity improves significantly on existing bounds. In the setting of linear features, our algorithm is provably efficient and the sample complexity for input dimension d, m tasks with k features up to error ϵ is O~(dk1.5/ϵ+km/ϵ). We also prove a matching lower bound for any lifelong learning algorithm that uses a single task learner as a black box. We complement our analysis with an empirical study, including a heuristic lifelong learning algorithm for deep neural networks. Our method performs favorably on challenging realistic image datasets compared to state-of-the-art continual learning methods.more » « less
-
null (Ed.)We investigate the approximability of the following optimization problem. The input is an n× n matrix A=(Aij) with real entries and an origin-symmetric convex body K⊂ ℝn that is given by a membership oracle. The task is to compute (or approximate) the maximum of the quadratic form ∑i=1n∑j=1n Aij xixj=⟨ x,Ax⟩ as x ranges over K. This is a rich and expressive family of optimization problems; for different choices of matrices A and convex bodies K it includes a diverse range of optimization problems like max-cut, Grothendieck/non-commutative Grothendieck inequalities, small set expansion and more. While the literature studied these special cases using case-specific reasoning, here we develop a general methodology for treatment of the approximability and inapproximability aspects of these questions. The underlying geometry of K plays a critical role; we show under commonly used complexity assumptions that polytime constant-approximability necessitates that K has type-2 constant that grows slowly with n. However, we show that even when the type-2 constant is bounded, this problem sometimes exhibits strong hardness of approximation. Thus, even within the realm of type-2 bodies, the approximability landscape is nuanced and subtle. However, the link that we establish between optimization and geometry of Banach spaces allows us to devise a generic algorithmic approach to the above problem. We associate to each convex body a new (higher dimensional) auxiliary set that is not convex, but is approximately convex when K has a bounded type-2 constant. If our auxiliary set has an approximate separation oracle, then we design an approximation algorithm for the original quadratic optimization problem, using an approximate version of the ellipsoid method. Even though our hardness result implies that such an oracle does not exist in general, this new question can be solved in specific cases of interest by implementing a range of classical tools from functional analysis, most notably the deep factorization theory of linear operators. Beyond encompassing the scenarios in the literature for which constant-factor approximation algorithms were found, our generic framework implies that that for convex sets with bounded type-2 constant, constant factor approximability is preserved under the following basic operations: (a) Subspaces, (b) Quotients, (c) Minkowski Sums, (d) Complex Interpolation. This yields a rich family of new examples where constant factor approximations are possible, which were beyond the reach of previous methods. We also show (under commonly used complexity assumptions) that for symmetric norms and unitarily invariant matrix norms the type-2 constant nearly characterizes the approximability of quadratic maximization.more » « less
-
Braverman, Mark (Ed.)A longstanding open problem in coding theory is to determine the best (asymptotic) rate R₂(δ) of binary codes with minimum constant (relative) distance δ. An existential lower bound was given by Gilbert and Varshamov in the 1950s. On the impossibility side, in the 1970s McEliece, Rodemich, Rumsey and Welch (MRRW) proved an upper bound by analyzing Delsarte’s linear programs. To date these results remain the best known lower and upper bounds on R₂(δ) with no improvement even for the important class of linear codes. Asymptotically, these bounds differ by an exponential factor in the blocklength. In this work, we introduce a new hierarchy of linear programs (LPs) that converges to the true size A^{Lin}₂(n,d) of an optimum linear binary code (in fact, over any finite field) of a given blocklength n and distance d. This hierarchy has several notable features: 1) It is a natural generalization of the Delsarte LPs used in the first MRRW bound. 2) It is a hierarchy of linear programs rather than semi-definite programs potentially making it more amenable to theoretical analysis. 3) It is complete in the sense that the optimum code size can be retrieved from level O(n²). 4) It provides an answer in the form of a hierarchy (in larger dimensional spaces) to the question of how to cut Delsarte’s LP polytopes to approximate the true size of linear codes. We obtain our hierarchy by generalizing the Krawtchouk polynomials and MacWilliams inequalities to a suitable "higher-order" version taking into account interactions of 𝓁 words. Our method also generalizes to translation schemes under mild assumptions.more » « less
-
Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape? In this work, we provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than n (sample size) neurons when the activation is smooth. First, we prove that as long as the width m>=2n=d (where d is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no local-min or saddle points. Nevertheless, it is not clear whether gradient descent can stay in this nice region. Third, we consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer. It is expected that projected gradient methods converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results show that projected gradient methods on this constrained formulation significantly outperform SGD for training narrow neural nets.more » « less