<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ON FEATURE LEARNING IN NEURAL NETWORKS WITH GLOBAL CONVERGENCE GUARANTEES</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>05/09/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10329479</idno>
					<idno type="doi"></idno>
					<title level='j'>International Conference on Learning Representations (ICLR)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhengdao Chen</author><author>Eric Vanden-Eijnden</author><author>Joan Bruna</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study the optimization of wide neural networks (NNs) via gradient flow (GF) in setups that allow feature learning while admitting non-asymptotic global convergence guarantees. First, for wide shallow NNs under the mean-field scaling and with a general class of activation functions, we prove that when the input dimension is no less than the size of the training set, the training loss converges to zero at a linear rate under GF. Building upon this analysis, we study a model of wide multi-layer NNs whose second-to-last layer is trained via GF, for which we also prove a linear-rate convergence of the training loss to zero, but regardless of the input dimension. We also show empirically that, unlike in the Neural Tangent Kernel (NTK) regime, our multi-layer model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The training of neural networks (NNs) is typically a non-convex optimization problem, but remarkably, simple algorithms like gradient descent (GD) or its variants can usually succeed in finding solutions with low training losses. To understand this phenomenon, a promising idea is to focus on NNs with large widths (a.k.a. under over-parameterization), for which we can derive infinite-width limits under suitable ways to scale the parameters by the widths. For example, under a "1/√width" scaling of the weights, the GD dynamics of wide NNs can be approximated by the linearized dynamics around initialization, and as the widths tend to infinity, we obtain the Neural Tangent Kernel (NTK) limit of NNs, where the solution obtained by GD coincides with a kernel method <ref type="bibr">[37]</ref>. Importantly, theoretical guarantees for optimization and generalization can be obtained for wide NNs under this scaling <ref type="bibr">[19,</ref><ref type="bibr">5]</ref>. Nonetheless, it was pointed out that this NTK analysis relies on a form of lazy training that excludes the learning of features or representations <ref type="bibr">[16,</ref><ref type="bibr">72]</ref>, which is a crucial ingredient to the success of deep learning, and is therefore not adequate for explaining the success of NNs <ref type="bibr">[27,</ref><ref type="bibr">41]</ref>. Meanwhile, for shallow (i.e., one-hidden-layer) NNs, if we choose a "1/width" scaling, we can derive an alternative mean-field (MF) limit as the widths tend to infinity. 
Under this scaling, feature learning occurs even in the infinite-width limit, and the training dynamics can be described by the Wasserstein gradient flow of a probability measure on the space of the parameters, which converges to a global minimizer of the loss function under certain conditions <ref type="bibr">[59,</ref><ref type="bibr">50,</ref><ref type="bibr">63,</ref><ref type="bibr">14]</ref>. Generalization guarantees have also been proved for learning with shallow NNs under the MF scaling by identifying a corresponding function space <ref type="bibr">[6,</ref><ref type="bibr">48]</ref>. However, currently there are three limitations to this model of over-parameterized NNs. First, the global convergence guarantees for shallow NNs only hold in the infinite-width limit (i.e., they are asymptotic). While <ref type="bibr">[12]</ref> studies the deviation between finite-width NNs and their infinite-width limits during training, the analysis is done only asymptotically to the next order in width. Second, a convergence rate has yet to be established except under special assumptions or with modifications to the GD algorithm <ref type="bibr">[13,</ref><ref type="bibr">34,</ref><ref type="bibr">43,</ref><ref type="bibr">38]</ref>. Third, while several works have proposed to extend the MF formulation to deep (i.e., multi-layer) NNs <ref type="bibr">[4,</ref><ref type="bibr">62,</ref><ref type="bibr">52,</ref><ref type="bibr">23,</ref><ref type="bibr">57]</ref>, there is less consensus on what the right model should be than for the shallow case. In summary, we still lack a model for the GD optimization of shallow and multi-layer NNs that goes beyond lazy training while admitting fast global convergence.</p><p>In this work, we study the optimization of both shallow NNs under the MF scaling and a type of partially-trained multi-layer NNs, and obtain theoretical guarantees of linear-rate global convergence.</p></div>
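To make the contrast between the two scalings concrete, here is a small numerical sketch (our own illustration, not code from the paper; the 1/√d input normalization, tanh activation, and Rademacher-style output weights are assumptions for the example): under the NTK scaling the output is of order 1 at initialization, while under the MF scaling it is of order 1/√m.

```python
import numpy as np

# Illustrative contrast of the two scalings: the same random shallow network
# read out with a 1/sqrt(m) (NTK) versus a 1/m (mean-field) output factor.
rng = np.random.default_rng(0)
m, d = 10_000, 20
W = rng.standard_normal((m, d))          # first-layer weights, O(1) entries
c = rng.choice([-1.0, 1.0], size=m)      # O(1) second-layer weights
x = rng.standard_normal(d)

h = np.tanh(W @ x / np.sqrt(d))          # hidden-layer activations

f_ntk = c @ h / np.sqrt(m)               # NTK scaling: output is O(1) at init
f_mf = c @ h / m                         # MF scaling: output is O(1/sqrt(m))
```

Under the MF scaling the output nearly vanishes at initialization, which is why the hidden-layer weights must move appreciably for the function to fit the data — the starting point of the feature-learning analysis below.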
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">SUMMARY OF MAIN CONTRIBUTIONS</head><p>We consider the scenario of training NN models to fit a training set of n data points in dimension d, where the model parameters are optimized by gradient flow (GF, which is the continuous-time limit of GD) with respect to the squared loss. Allowing most choices of the activation function, we prove that:</p><p>1. For a shallow NN, if the hidden layer is sufficiently wide and the input data are linearly independent (requiring n &#8804; d), then with high probability, the training loss converges to zero at a linear rate.</p><p>2. For a multi-layer NN where we only train the second-to-last layer, if the hidden layers are both sufficiently wide, then with high probability, the training loss converges to zero at a linear rate. Unlike for shallow NNs, here we no longer need the requirement on input dimension, demonstrating a benefit of jointly having depth and width.</p><p>We also run numerical experiments to demonstrate that our model exhibits feature learning and can achieve better generalization performance than its NTK counterpart.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">RELATED WORKS</head><p>Over-parameterized NNs, NTK and lazy training. Many recent works have studied the optimization landscape of NNs and the benefits of over-parameterization <ref type="bibr">[24,</ref><ref type="bibr">67,</ref><ref type="bibr">60,</ref><ref type="bibr">66,</ref><ref type="bibr">64,</ref><ref type="bibr">54,</ref><ref type="bibr">65,</ref><ref type="bibr">42,</ref><ref type="bibr">3,</ref><ref type="bibr">40,</ref><ref type="bibr">76,</ref><ref type="bibr">15,</ref><ref type="bibr">75,</ref><ref type="bibr">17,</ref><ref type="bibr">39,</ref><ref type="bibr">9,</ref><ref type="bibr">32,</ref><ref type="bibr">20,</ref><ref type="bibr">31,</ref><ref type="bibr">61,</ref><ref type="bibr">10]</ref>. One influential idea is the Neural Tangent Kernel (NTK) <ref type="bibr">[37]</ref>, which characterizes the behavior of GD on the infinite-width limit of NNs under a particular scaling of the parameters (e.g. for shallow NNs, replacing 1/m with 1/ &#8730; m in (1)). In particular, when the network width is polynomially large in the size of the training set, the training loss converges to a global minimum at a linear rate under GD <ref type="bibr">[19,</ref><ref type="bibr">5,</ref><ref type="bibr">55]</ref>. Nonetheless, in the NTK limit, due to a relatively large scaling of the parameters at initialization, the hidden-layer features do not move significantly <ref type="bibr">[37,</ref><ref type="bibr">19,</ref><ref type="bibr">16]</ref>. For this reason, the NTK scaling has been called the lazy-training regime, as opposed to a feature-learning or rich regime <ref type="bibr">[16,</ref><ref type="bibr">72,</ref><ref type="bibr">26]</ref>. Several works have investigated the differences between the two regimes both in theory <ref type="bibr">[28,</ref><ref type="bibr">29,</ref><ref type="bibr">69,</ref><ref type="bibr">47]</ref> and in practice <ref type="bibr">[27,</ref><ref type="bibr">41]</ref>. 
In addition, several works have generalized the NTK analysis by considering higher-order Taylor approximations of the GD dynamics or finite-width corrections to the NTK <ref type="bibr">[2,</ref><ref type="bibr">35,</ref><ref type="bibr">7,</ref><ref type="bibr">33]</ref>.</p><p>Mean-field theory of NNs. An alternative path has been taken to study shallow NNs in the mean-field scaling (as in (<ref type="formula">1</ref>)), where the infinite-width limit is analogous to the thermodynamic or hydrodynamic limit of interacting particle systems <ref type="bibr">[59,</ref><ref type="bibr">50,</ref><ref type="bibr">63,</ref><ref type="bibr">14,</ref><ref type="bibr">49,</ref><ref type="bibr">70]</ref>. Thanks to the exchangeability of the parameters, the neural network is equivalently characterized by a probability measure on the space of its parameters, and the training can then be described by a Wasserstein gradient flow on this probability measure, which, in the infinite-width limit, converges to global minima under mild conditions. Regarding convergence rate, ref. <ref type="bibr">[71]</ref> proves that if we train a shallow NN to fit a Lipschitz target function under the population loss, the convergence rate cannot beat the curse of dimensionality. In contrast, we will study the setting of empirical risk minimization, where there are finitely many training data. Ref. <ref type="bibr">[34]</ref> shows that mean-field Langevin dynamics on shallow NNs can converge exponentially to global minimizers in over-regularized scenarios, but we focus on GF without entropic regularization. Besides the question of optimization, shallow NNs under this scaling represent functions in the Barron space <ref type="bibr">[48]</ref> or variation-norm function space <ref type="bibr">[6]</ref>, which provide theoretical guarantees on generalization as well as fluctuation in training <ref type="bibr">[12]</ref>. 
Several works have proposed different mean-field limits of wide multi-layer NNs and proved convergence guarantees <ref type="bibr">[4,</ref><ref type="bibr">62,</ref><ref type="bibr">51,</ref><ref type="bibr">52,</ref><ref type="bibr">23,</ref><ref type="bibr">57,</ref><ref type="bibr">22]</ref>, but questions remain. First, due to the presence of different symmetries in a multi-layer network compared to a shallow network <ref type="bibr">[57]</ref>, the limiting object in the infinite-width limit is often quite complicated. Second, it has been pointed out that under the MF scaling of a multi-layer network, an i.i.d. initialization of the weights would lead to a collapse of the diversity of neurons in the middle layers, diminishing the effect of having large widths <ref type="bibr">[23]</ref>. In addition, while another line of work develops MF models of residual networks <ref type="bibr">[8,</ref><ref type="bibr">46,</ref><ref type="bibr">21,</ref><ref type="bibr">36]</ref>, we are interested in multi-layer NN models with a large width in every layer.</p><p>Feature learning in deep NNs. Ref. <ref type="bibr">[1]</ref> demonstrates the importance of hierarchical learning by proving the existence of concept classes that can be learned efficiently by a deep NN with quadratic activations but not by non-hierarchical models. Ref. <ref type="bibr">[11]</ref> studies the optimization landscape and generalization properties of a hierarchical model that is similar to ours in spirit, where an untrained embedding of the input is passed into a trainable shallow model, and proves an improvement in sample complexity in learning polynomials by having neural network outputs as the embedding. However, the trainable models they consider are not shallow NNs but their linearized and quadratic-Taylor approximations, and furthermore the convergence rate of the training is not known. Ref. 
<ref type="bibr">[73]</ref> proposes a novel parameterization under which there exists an infinite-width limit of deep NNs that exhibits feature learning, but the properties of its training dynamics are not yet well understood. Our multi-layer NN models adopt an equivalent scaling (see Appendix C), and our focus is on proving non-asymptotic convergence guarantees for its partial training under GF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PROBLEM SETUP</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">MODEL</head><p>We summarize our notations in Appendix A. Let Ω ⊆ R^d denote the input space, and let x = [x_1, ..., x_d]^⊺ ∈ Ω denote a generic input data vector. A shallow NN model under the MF scaling can be written as:</p><p>f(x) = (1/m) ∑_{i=1}^m c_i σ((1/√d) w_i^⊺ x), (1)</p><p>where m is the width, W ∈ R^{m×d} (with rows w_1, ..., w_m) and c = [c_1, ..., c_m] ∈ R^m are the first- and second-layer weight parameters of the model, and σ : R → R is the activation function. For simplicity of presentation, we neglect the bias terms. In this paper, we study a more general type of models with the following form:</p><p>f(x) = (1/m) ∑_{i=1}^m c_i σ(h_i(x)), (2)</p><p>∀i ∈ [m]: h_i(x) = (1/√D) ∑_{j=1}^D W_{ij} φ_j(x), (3)</p><p>where W ∈ R^{m×D} and c = [c_1, ..., c_m] ∈ R^m are parameters of the model, and φ_1, ..., φ_D are a set of functions from Ω to R that we call the embedding. Each of h_1, ..., h_m is a function from Ω to R, and we will refer to them as the (hidden-layer) feature maps or activations. For simplicity, we write Φ(x) = [φ_1(x), ..., φ_D(x)]^⊺.</p><p>We consider two types of the embedding, Φ, as described below:</p><p>Fixed embedding: D is fixed and Φ is deterministic. In the simplest example, we set D = d and φ_j(x) = x_j, ∀j ∈ [D], and recover the shallow NN model in <ref type="bibr">(1)</ref>. More generally, our definition includes cases where Φ is a deterministic transformation of an input vector in Ω into an embedding vector in R^D. This can be understood as input pre-processing or feature engineering.</p><p>High-dimensional random embedding: D is large and Φ is random. For instance, we can sample each z_j i.i.d. in R^d and set φ_j(x) = σ((1/√d) z_j^⊺ x), equivalent to setting φ_1, ..., φ_D as the hidden-layer activations of a shallow NN with randomly-initialized first-layer weights. 
Then, the model becomes</p><p>f(x) = (1/m) ∑_{i=1}^m c_i σ(h_i(x)), (4)</p><p>∀i ∈ [m]: h_i(x) = (1/√D) ∑_{j=1}^D W_{ij} σ((1/√d) z_j^⊺ x). (5)</p><p>Thus, we obtain a 3-layer feed-forward NN whose first-layer weights are random and fixed, and we call it a partially-trained 3-layer (P-3L) NN. Note that the scaling in this model is different from both the NTK scaling (1/√m instead of 1/m in (4)) and the MF scaling for multi-layer NNs adopted in <ref type="bibr">[4,</ref><ref type="bibr">62,</ref><ref type="bibr">51,</ref><ref type="bibr">57,</ref><ref type="bibr">23]</ref> (1/D instead of 1/√D in (<ref type="formula">5</ref>)). We show in Appendix B that when σ is homogeneous, this scaling is consistent with the Xavier initialization of neural network parameters up to a reparameterization <ref type="bibr">[30,</ref><ref type="bibr">56]</ref>. We also show in Appendix C that in certain cases this scaling is equivalent to the maximum-update parameterization proposed in <ref type="bibr">[73]</ref>. Numerical experiments that compare different scalings are described in Section 4.</p></div>
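The P-3L NN just described can be sketched in a few lines. This is a hypothetical implementation for illustration only (the tanh activation and the sizes are our choices, not the paper's), following the 1/√d, 1/√D and 1/m scalings of (4) and (5):

```python
import numpy as np

def p3l_forward(x, z, W, c, sigma=np.tanh):
    """P-3L NN sketch: fixed random first layer z (D, d), trainable middle
    layer W (m, D), fixed output weights c (m,). Scalings follow (4)-(5):
    1/sqrt(d) and 1/sqrt(D) inside, 1/m at the output."""
    phi = sigma(z @ x / np.sqrt(x.shape[0]))     # embedding Phi(x), shape (D,)
    h = W @ phi / np.sqrt(phi.shape[0])          # pre-activations h_i(x), (m,)
    return c @ sigma(h) / c.shape[0]             # output f(x)

rng = np.random.default_rng(0)
d, D, m = 20, 512, 512
z = rng.standard_normal((D, d))                  # random, fixed during training
W = rng.standard_normal((m, D))                  # the only trained layer
c = rng.choice([-1.0, 1.0], size=m)              # scaled Rademacher signs
x = rng.standard_normal(d)
f_x = p3l_forward(x, z, W, c)
```

Only `W` would be updated during training; `z` and `c` stay fixed, matching the partial-training setup analyzed below.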
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">TRAINING WITH GRADIENT FLOW</head><p>Consider the scenario of supervised least-squares regression, where we are given a set of n training data points together with their target values, (x_1, y_1), ..., (x_n, y_n) ∈ Ω × R. We fit our models by minimizing the empirical squared loss:</p><p>To do so, we first initialize the parameters randomly by sampling each c_i and W_ij i.i.d. from probability measures π_c ∈ P(R) and π_w ∈ P(R^D), respectively, and then perform GD on W. For simplicity of analysis, we leave c untrained, and further assume that</p><p>Assumption 1. π_c = (1/2)δ_ĉ(dc) + (1/2)δ_{−ĉ}(dc) for some ĉ &gt; 0 independent from m, which is the law of a scaled Rademacher random variable.</p><p>If σ is Lipschitz, it is differentiable almost everywhere, and we write σ′(x) to denote the derivative of σ when it is differentiable at x and 0 otherwise. When σ is differentiable at</p><p>and the gradient of the loss function with respect to W_ij is given by</p><p>Thus, we can perform GD updates on W according to the following rule: ∀i ∈ [m] and ∀j ∈ [D],</p><p>where δ &gt; 0 is the step size. As we discuss in Appendix B, this is consistent with the standard GD update rule for Xavier-initialized NNs. In the limit of infinitesimal step size (δ → 0), the evolution of the parameters during training is described by the GF equation: if we use the superscript t ≥ 0 to denote time elapsed during training, the time-derivative of the parameters is given by</p><p>where f^t denotes the output function and h^t_1, ..., h^t_m denote the hidden-layer feature maps determined by the parameters at time t. 
Then, induced by the evolution of W^t, each h^t_i evolves according to</p><p>where we define a kernel function</p><p>Accordingly, the output function evolves as</p><p>Thus, the loss function satisfies</p><p>where we define the (symmetric and positive semi-definite) Gram matrix G ∈ R^{n×n}, and use λ_min(G) to denote its least eigenvalue. We will also use G_min = min_a G_aa and G_max = max_a G_aa to denote the minimum and maximum diagonal entries of G, respectively. Since G is positive semi-definite, we see that the time-derivative dL_t/dt ≤ 0, which means that the loss value is indeed non-increasing along the GF trajectory.</p><p>Feature learning. Compared to the NTK scaling of neural networks, the crucial difference is the 1/m factor in (2), instead of 1/√m. It is known that under the NTK scaling, due to the 1/√m factor, the movement of the feature maps, h_1, ..., h_m, is only of order O(1/√m) while the function value changes by an amount of order Ω(1). While this greatly simplifies the convergence analysis, it also implies that the hidden-layer representations are not being learned. In contrast, with the 1/m factor,</p><p>which implies that the average movement of the feature maps is on the same order as the change in function value, and thus the hidden-layer representations as well as the NTK undergo nontrivial movement during training. In Appendix C, we further justify the occurrence of feature learning using the framework developed in <ref type="bibr">[73]</ref>.</p></div>
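A minimal discrete-time sketch of this training procedure may help fix ideas (our illustration, not the paper's code; the averaged squared-loss normalization, the learning rate, and the tanh activation are assumptions). It performs GD on the middle-layer weights only, with the learning rate scaled by m as discussed around (9):

```python
import numpy as np

def train_gd(X, y, W, c, phi, lr=0.1, steps=300, sigma=np.tanh,
             dsigma=lambda u: 1.0 - np.tanh(u) ** 2):
    """GD on the middle-layer weights W only, for the model
    f(x) = (1/m) sum_i c_i sigma(h_i(x)),  h_i(x) = w_i . Phi(x) / sqrt(D).
    The averaged squared-loss normalization is an illustrative choice."""
    m, D = W.shape
    P = phi(X)                                   # (n, D) embedded inputs
    losses = []
    for _ in range(steps):
        H = P @ W.T / np.sqrt(D)                 # (n, m) pre-activations h_i(x_a)
        f = sigma(H) @ c / m                     # (n,) network outputs
        r = f - y                                # residuals
        losses.append(0.5 * np.mean(r ** 2))
        # dL/dW_ij = (1/n) sum_a r_a (c_i / m) sigma'(H_ai) P_aj / sqrt(D)
        grad = (dsigma(H) * np.outer(r, c)).T @ P / (len(y) * m * np.sqrt(D))
        W = W - lr * m * grad                    # learning rate scaled by m, cf. (9)
    return W, losses

# Shallow case (Phi = identity) with n <= d, as in Corollary 1 below.
rng = np.random.default_rng(0)
n, d, m = 5, 8, 200
X = rng.standard_normal((n, d))
y = rng.uniform(-0.5, 0.5, size=n)
W0 = rng.standard_normal((m, d))
c = rng.choice([-1.0, 1.0], size=m)
_, losses = train_gd(X, y, W0, c, phi=lambda A: A)
```

The m factor in the update compensates for the 1/m output scaling, so the per-step movement of each weight is of order one and the loss decreases along the trajectory.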
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">CONVERGENCE ANALYSIS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">MODELS WITH A FIXED EMBEDDING</head><p>To prove that the training loss converges to zero, we need a lower bound on the absolute value of dL_t/dt. Indeed, if G is positive definite, which depends on Φ and the training data, we can establish one in the following way. First, as a simple case, if we use an activation function whose derivative's absolute value is uniformly bounded from below by a constant K_{σ′} &gt; 0, such as linear, cubic or (smoothed) Leaky ReLU activations, we can derive a Polyak-Lojasiewicz (PL) condition <ref type="bibr">[58,</ref><ref type="bibr">45]</ref> from (<ref type="formula">14</ref>) directly, which implies L_t ≤ L_0 exp(−2ĉ²λ_min(G)(K_{σ′})²t), indicating that the training loss decays to 0 at a linear rate.</p><p>For more general choices of the activation function, a challenge is to guarantee that, heuristically speaking, for each a ∈ [n], σ′(h_i(x_a)) does not become near zero for too many i ∈ [m] before the loss vanishes. To facilitate a finer-grained analysis, we need the following mild assumption on σ:</p><p>Assumption 2. σ is Lipschitz with Lipschitz constant L_σ, and there exists an open interval I = (I_l, I_r) ⊆ R on which σ is differentiable and |σ′| is lower-bounded by some K_{σ′} &gt; 0.</p><p>Intuitively, I is an active region of σ, within which the derivative has a magnitude bounded away from zero. This assumption is satisfied by the majority of activation functions in practice, including smooth ones such as tanh and sigmoid as well as non-smooth ones such as ReLU. Then, under the following initialization scheme, we prove a general result for models with a fixed embedding.</p><p>Assumption 3. π_w is the D-dimensional standard Gaussian distribution, i.e., each W_ij is sampled independently from a standard Gaussian distribution. 
Theorem 1 (Fixed embedding). Suppose that Assumptions 1, 2 and 3 are satisfied, and λ_min(G) &gt; 0. Then ∃ĉ_0, r and C &gt; 0 such that, if ĉ ≥ ĉ_0 and m is sufficiently large,</p><p>then with probability at least 1 − δ, it holds that ∀t ≥ 0, L_t ≤ L_0 e^{−rĉ²λ_min(G)t}.</p><p>Here, ĉ_0, r and C depend on I, G_min, G_max, ‖y‖, L_σ and K_{σ′} (but not on m, n, d, D, δ, or λ_min(G)).</p><p>The result is proved in Appendix E, and below we briefly describe the intuition. A key to the proof is to guarantee that enough neurons remain in the active region throughout training. Specifically, with respect to each training data point (i.e. for each a ∈ [n]), we can keep track of the proportion of neurons (among all i ∈ [m]) for which h^t_i(x_a) ∈ I. We show that if the proportion is large enough at initialization (shown by Lemma 3 in Appendix E.2 under Assumption 3), then it cannot drop dramatically without a simultaneous decrease of the loss value, as long as the c_i's are not too small in absolute value. This property of the dynamics is formalized in the following lemma:</p><p>Lemma 1. Consider the dynamics of L_t and (h^t_i(x_a))_{i∈[m], a∈[n]} governed by <ref type="bibr">(11)</ref> and <ref type="bibr">(14)</ref>. Assume that λ_min(G) &gt; 0, and ∀i ∈ [m], |c_i| = ĉ &gt; 0. Under Assumption 2, define, ∀t ≥ 0,</p><p>and η̄_0 = min(·) (18)</p><p>Then ∀t ≥ 0, there is</p><p>where κ(λ_1, λ_2) = 9λ_2(I_r − I_l) / (2λ_1 K_{σ′}).</p></div>
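The initial proportion of active neurons that the proof tracks is easy to estimate numerically. The following sketch (illustrative only; the interval I = (−1, 1), the sizes, and the order-1 embedding vector are our choices) measures the fraction of neurons whose pre-activation h_i(x_a) falls in I under the Gaussian initialization of Assumption 3:

```python
import numpy as np

# Fraction of neurons whose pre-activation h_i(x_a) lies in an active region
# I = (I_l, I_r) at initialization. Since W_ij ~ N(0, 1) i.i.d., each
# h_i(x_a) = w_i . Phi(x_a) / sqrt(D) is Gaussian with variance ~ 1.
rng = np.random.default_rng(1)
m, D = 4096, 256
I_l, I_r = -1.0, 1.0
phi_x = rng.standard_normal(D)            # stands in for Phi(x_a)
W = rng.standard_normal((m, D))           # Assumption 3 initialization
h = W @ phi_x / np.sqrt(D)                # h_i(x_a) ~ N(0, |Phi(x_a)|^2 / D)
frac_active = np.mean((h > I_l) & (h < I_r))
```

For a standard Gaussian pre-activation, roughly 68% of neurons start inside (−1, 1), which is the kind of order-one initial proportion the lemma needs.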
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1">EXAMPLE: SHALLOW NEURAL NETWORKS WHEN n ≤ d</head><p>In the case of shallow NNs under the MF scaling, G = G^(0) ∈ R^{n×n}, where G^(0)_{ab} = (1/d) x_a^⊺ x_b. Thus, G^(0) is positive definite if and only if the training data set {x_1, ..., x_n} ⊆ R^d consists of linearly independent vectors, which is possible (and expected if the training data are sampled independently from some non-degenerate distribution) when n ≤ d. In that case, Theorem 1 implies</p><p>Corollary 1 (Shallow NN with n ≤ d). Suppose that Assumptions 1, 2 and 3 are satisfied. If the training data are linearly independent vectors, then under GF <ref type="bibr">(10)</ref> on the first-layer weights of the shallow NN, the training loss converges to zero at a linear rate.</p><p>While the assumption that n ≤ d is restrictive, we note that existing convergence-rate guarantees for the GD-type training of shallow NNs in the MF scaling need strong additional assumptions <ref type="bibr">[38]</ref>, modifications to the GD algorithm <ref type="bibr">[34,</ref><ref type="bibr">13,</ref><ref type="bibr">53]</ref>, or restrictions to certain special tasks <ref type="bibr">[43]</ref>.</p></div>
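A quick numerical check of this dichotomy (illustrative; the 1/d normalization of G^(0) is an assumption of the sketch) shows that the Gram matrix of i.i.d. Gaussian data is generically positive definite when n ≤ d and necessarily singular when n &gt; d:

```python
import numpy as np

# Minimum eigenvalue of the Gram matrix G^(0) for i.i.d. Gaussian data.
rng = np.random.default_rng(0)
d = 20

def gram_min_eig(n):
    X = rng.standard_normal((n, d))
    G0 = X @ X.T / d                      # G^(0)_{ab} = x_a . x_b / d
    return np.linalg.eigvalsh(G0)[0]      # eigvalsh returns ascending order

lam_underdetermined = gram_min_eig(10)    # n <= d: generically positive definite
lam_overdetermined = gram_min_eig(40)     # n > d: rank <= d < n, hence singular
```

With n = 40 &gt; d = 20, the rank of G^(0) is at most d, so its smallest eigenvalue is zero (up to floating-point noise), which is exactly why the high-dimensional embedding of the next section is needed.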
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">MODELS WITH A HIGH-DIMENSIONAL RANDOM EMBEDDING</head><p>A clear limitation of Corollary 1 is that it is only applicable when n ≤ d, since otherwise the Gram matrix G^(0) cannot be positive definite. This motivates us to consider the use of a high-dimensional embedding Φ to lift the effective input dimension. In particular, we focus on the scenario where D is large and Φ is random. While the Gram matrix G in this case is also random, we only need that it concentrates around a deterministic and positive definite limit as D tends to infinity:</p><p>Condition 1 (Concentration of G around a positive definite matrix). There exists a (deterministic) positive definite matrix Ḡ ∈ R^{n×n} with least eigenvalue λ_min(Ḡ) &gt; 0 such that ∀δ, u &gt; 0,</p><p>Condition 1 is sufficient for us to apply Lemma 1 and obtain the following global convergence guarantee, which extends Theorem 1 to models with a high-dimensional random embedding. The proof is given in Appendix F.</p><p>Theorem 2 (High-dimensional random embedding). Under Assumptions 1, 2, 3 and Condition 1, ∃ĉ_0, r and C &gt; 0 such that, if ĉ ≥ ĉ_0 and m is sufficiently large,</p><p>then with probability at least 1 − δ, it holds that ∀t ≥ 0, L_t ≤ L_0 e^{−rĉ²λ_min(Ḡ)t}. (21)</p><p>Here, ĉ_0, r and C depend on I, Ḡ_min, Ḡ_max, ‖y‖, L_σ and K_{σ′} (but not m, n, d, D, δ, or λ_min(Ḡ)).</p></div>
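Condition-1-style concentration can also be observed numerically. In this sketch (our illustration; the tanh random features standing in for the embedding, the sizes, and the trial count are all assumptions), the fluctuation of an off-diagonal entry of the random Gram matrix shrinks as D grows:

```python
import numpy as np

# Empirical fluctuation of one off-diagonal entry of the random Gram matrix
# built from D random features, for two values of D: the standard deviation
# across redraws of the embedding shrinks as D grows.
rng = np.random.default_rng(0)
n, d = 5, 10
X = rng.standard_normal((n, d))

def gram_entry_std(D, trials=200):
    vals = []
    for _ in range(trials):
        Z = rng.standard_normal((D, d))
        Phi = np.tanh(Z @ X.T / np.sqrt(d))   # (D, n) random features
        G = Phi.T @ Phi / D                   # empirical Gram matrix
        vals.append(G[0, 1])
    return np.std(vals)

std_small_D, std_large_D = gram_entry_std(50), gram_entry_std(800)
```

Each entry of G is an average of D i.i.d. terms, so its fluctuation around the limiting entry decays like 1/√D — the mechanism behind verifying Condition 1 by standard concentration arguments.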
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">EXAMPLE: PARTIALLY-TRAINED THREE-LAYER NEURAL NETWORKS (P-3L NNS)</head><p>Consider the P-3L NN model defined in <ref type="bibr">(4)</ref>. In this case, the Gram matrix is G^(1), defined by</p><p>If z_1, ..., z_D are sampled i.i.d. from a probability measure π_z on R^d and fixed during training, then the limiting Gram matrix, denoted by Ḡ^(1) ∈ R^{n×n}, is given by</p><p>Thus, for the convergence result, the assumption we need on the limiting Gram matrix is</p><p>Assumption 4. π_z is sub-Gaussian and the matrix Ḡ^(1), which depends on the choice of σ and the training set, is positive definite with λ_min(Ḡ^(1)) &gt; 0 and (Ḡ^(1))_max &lt; ∞.</p><p>This assumption also plays an important role in the NTK analysis, and it is satisfied if, for example, π_z is the d-dimensional standard Gaussian distribution, no two data points are parallel, and σ is either the ReLU function <ref type="bibr">[19]</ref> or analytic and not a polynomial <ref type="bibr">[18]</ref>. When Assumption 4 is satisfied, as long as σ is Lipschitz, we can use standard concentration techniques to verify Condition 1. Thus, Theorem 2 implies that</p><p>Theorem 3 (P-3L NN). Under Assumptions 1, 2, 3 and 4, ∃ĉ_0, r, C_1 and C_2 &gt; 0 such that, if ĉ ≥ ĉ_0 and m and D are sufficiently large,</p><p>then with probability at least 1 − δ, it holds that ∀t ≥ 0,</p><p>Here, ĉ_0, r, C_1 and C_2 depend on I, (Ḡ^(1))_min, (Ḡ^(1))_max, ‖y‖, K_{σ′} as well as the sub-Gaussian norm of π_z (but not on m, n, d, D, δ or λ_min(Ḡ^(1))).</p><p>The proof is given in Appendix G. Compared to Corollary 1 for shallow NNs, a highlight of Theorem 3 is that the requirement of n ≤ d is no longer needed. 
This demonstrates an advantage of the high-dimensional random embedding realized by the first hidden layer in the P-3L NN, thus illustrating a benefit of having both depth and width in NNs from the viewpoint of optimization. Compared to the NTK result <ref type="bibr">[18]</ref>, our analysis assumes the same level of over-parameterization, but crucially allows feature learning to occur, which we discuss in Section 2.2 and support empirically in Section 4.3. Furthermore, by using a multi-layer NN with random and fixed weights as the high-dimensional random embedding, we extend the P-3L NN to a partially-trained L-layer NN model in Appendix H, for which similar convergence results can be proved for training its second-to-last layer via GF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">NUMERICAL EXPERIMENTS</head><p>Additional results and details of the experiments are provided in Appendix I.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">EXPERIMENT 1: CONVERGENCE OF TRAINING WHEN FITTING RANDOM DATA</head><p>We train shallow NNs to fit a randomly labeled data set {(x_1, y_1), ..., (x_n, y_n)} with d = 20. Specifically, we sample each x_a i.i.d. with every entry sampled independently from a standard Gaussian distribution, and each y_a i.i.d. uniformly on [−1/2, 1/2] and independently from the x_a's. We see from Figure <ref type="figure">1</ref> that the convergence happens at a nearly linear rate when n = 20 and 40, and the rate decreases as n becomes larger. This is consistent with our theoretical result (Corollary 1), and interestingly also echoes a prior result that the convergence rate of optimizing a shallow NN under the population loss can suffer from the curse of dimensionality <ref type="bibr">[71]</ref>, which implies a worsening of the convergence rate as the number of data points increases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">EXPERIMENT 2: BENEFIT OF INPUT EMBEDDING</head><p>We consider a model defined by (<ref type="formula">2</ref>) and (<ref type="formula">3</ref>) with d = 30 and Φ(x) = vec(xx^⊺) ∈ R^{d²}, which we call a shallow NN augmented with a quadratic embedding. We compare this model against a plain shallow NN (without the extra embedding), both with m = 8192, to fit a series of training sets of various sizes where the target y is given by another shallow NN augmented with the quadratic embedding with m = 5. We see from Figure <ref type="figure">2</ref> that the augmented shallow NN achieves lower test error given the same number of training samples, demonstrating the benefit of a good embedding.</p></div>
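The quadratic embedding used here is straightforward to implement; a minimal sketch (hypothetical code, not the authors'):

```python
import numpy as np

def quadratic_embedding(x):
    """Phi(x) = vec(x x^T) in R^{d^2}, the augmentation used in this experiment
    (a minimal sketch for illustration)."""
    return np.outer(x, x).ravel()

phi = quadratic_embedding(np.array([1.0, 2.0, 3.0]))  # length d^2 = 9
```

The embedded vector lists every pairwise product x_i x_j, so a shallow NN on top of it can represent quadratic structure in the input directly, which is what gives the augmented model its sample-efficiency advantage on this target class.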
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">EXPERIMENT 3: FEATURE LEARNING VS. LAZY TRAINING</head><p>We consider the P-3L NN model defined in (<ref type="formula">4</ref>) and (<ref type="formula">5</ref>) with D = m (i.e. both hidden layers having the same width), and compare it with 3-layer NN models under the NTK and MF scalings, as defined in Table <ref type="table">1</ref> based on prior literature <ref type="bibr">[37,</ref><ref type="bibr">4,</ref><ref type="bibr">52,</ref><ref type="bibr">62,</ref><ref type="bibr">23]</ref>, which undergo partial training in the same fashion. We adopt the data set used in <ref type="bibr">[69]</ref> (more details in Appendix I.3), and train the models by minimizing the unregularized squared loss for varying n's and m's.</p><p>First, we see from the top-left plot in Figure <ref type="figure">4</ref> that, consistently across different m, the training loss converges at a linear rate for the model under our scaling, which is coherent with Theorem 3. Second, we see from the second row that feature learning occurs in the model under our scaling but negligibly in the model under the NTK scaling, as expected <ref type="bibr">[16]</ref>. Note also that under the MF scaling, the feature maps h_1(x), ..., h_m(x) concentrate near 0 at initialization due to the small scaling, but gain diversity during training. Third, we see from Figure <ref type="figure">3</ref> that our model yields the smallest test errors out of all three, and in addition, as n grows the test error decreases faster under the MF scaling than under the NTK scaling, both indicating an advantage of feature learning compared to lazy training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSIONS AND LIMITATIONS</head><p>We consider a general type of models that includes shallow and partially-trained multi-layer NNs, which exhibit feature learning when trained via GF, and prove non-asymptotic global convergence guarantees that accommodate a general class of activation functions. For a randomly-initialized shallow NN in the MF scaling that is wide enough, we prove that by performing GF on the input-layer weights, the training loss converges to zero at a linear rate if the number of training data does not exceed the input dimension. For a randomly-initialized multi-layer NN with large widths, we prove that by performing GF on the weights in the second-to-last layer, the same result holds except there is no requirement on the input dimension. We also perform numerical experiments to demonstrate the advantage of feature learning in our partially-trained multi-layer NNs relative to their counterparts under the NTK scaling.</p><p>[Figure caption] Each column corresponds to a different scaling of the P-3L NN model, as described in Table <ref type="table">1</ref>. Row 1: Evolution of training loss (solid curve) and test error (dashed curve) during training. Row 2: Distribution of the hidden-layer feature map (pre-activation) associated with two particular input data points. Each dot represents a different i (i.e., neuron in the second hidden layer), and the x- and y-coordinates equal h_i(x_1) and h_i(x_2), respectively, where x_1 is an input from the training set and x_2 is an input from the test set.</p><p>Our work focuses on the optimization rather than the approximation or generalization properties of NNs, which are also crucial to understand. In addition, as our current theoretical results on global convergence neglect the bias terms and assume that the last-layer weights are untrained, a more general version is left for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A ADDITIONAL NOTATIONS</head><p>• For a positive integer n, we let [n] denote the set {1, ..., n}.</p><p>• We use i, j (as subscripts) to index the neurons in the hidden layers, a, b (as subscripts or superscripts) to index different training data points, and t (as a superscript) to denote the training time / time parameter in gradient flow.</p><p>• We write a for 1 </p><p>∀i ∈ [m]:</p><p>with weight parameters θ, which are initialized according to Xavier initialization, meaning that we sample each θ^(2)_ij i.i.d. from N(0, 1/(2m)) and each θ^(3)_i i.i.d. from N(0, 1/(m+1)). If m ≫ d, both N(0, 1/(m+d)) and N(0, 1/(m+1)) can be approximated by N(0, 1/m). Then, up to this approximation, by redefining W_ij = √(2m) θ^(2)_ij and z_jk = √m θ^(1)_jk, we can write, ∀i ∈ [m]:</p><p>and note that c_i, W_ij and z_jk are all initialized i.i.d. of order O(1). In addition, if σ is homogeneous, this is then equivalent to (4) and (<ref type="formula">5</ref>) when D = m.</p><p>this is equivalent to updating W_ij according to</p><p>which justifies the m factor on the right-hand side of (9).</p></div>
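The homogeneity argument above can be checked numerically with ReLU: under the N(0, 1/m) approximation, the Xavier-initialized network coincides with the network in the rescaled O(1) parameters up to an overall mean-field-type prefactor 1/(m√(2m)), obtained by pulling each scale factor through the homogeneous activation. The shapes and normalizations below are illustrative assumptions for the sketch, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 64, 8  # D = m; m >> d is not needed for the exact identity checked here

# Xavier-style parameters, using the N(0, 1/m) approximation for the first and
# last layers and the N(0, 1/(2m)) variance for the middle layer, as above.
theta1 = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))
theta2 = rng.normal(0.0, np.sqrt(1.0 / (2 * m)), size=(m, m))
theta3 = rng.normal(0.0, np.sqrt(1.0 / m), size=m)

relu = lambda u: np.maximum(u, 0.0)

def f_xavier(x):
    return theta3 @ relu(theta2 @ relu(theta1 @ x))

# Redefined O(1) parameters: c = sqrt(m)*theta3, W = sqrt(2m)*theta2,
# z = sqrt(m)*theta1, all with entries of order O(1).
c, W, z = np.sqrt(m) * theta3, np.sqrt(2 * m) * theta2, np.sqrt(m) * theta1

def f_rescaled(x):
    # Homogeneity of relu lets each of the factors sqrt(m), sqrt(2m), sqrt(m)
    # be pulled outside, leaving the overall prefactor 1/(m*sqrt(2m)).
    return (c @ relu(W @ relu(z @ x))) / (m * np.sqrt(2 * m))

x = rng.normal(size=d)
print(f_xavier(x), f_rescaled(x))  # identical up to floating-point error
```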
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C RELATIONSHIP TO THE MAXIMUM-UPDATE PARAMETERIZATION AND FEATURE LEARNING</head><p>Consider the partially-trained L-layer NN model defined in Section H in the case where D = m ≫ d.</p><p>In the framework of abc-parameterization introduced in <ref type="bibr">[73]</ref>, our model corresponds to setting</p><p>Furthermore, as we explain in Appendix B, the appropriate learning rate scales linearly with m (as in (<ref type="formula">9</ref>)), which corresponds to having c = -1.</p><p>Meanwhile, the maximum-update (µP) parameterization <ref type="bibr">[73]</ref> is characterized by setting</p><p>Recall the symmetry of abc-parameterizations derived in <ref type="bibr">[73]</ref>, which states that one obtains a different but equivalent abc-parameterization by setting</p><p>Since our parameterization can be obtained from the maximum-update parameterization by applying the transformation above with θ = 1/2, the two are equivalent in function space. In particular, for our parameterization, the r parameter defined in <ref type="bibr">[73]</ref> can be computed as</p><p>Hence, according to <ref type="bibr">[73]</ref>, our parameterization exhibits feature learning.</p></div>
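For the reader's convenience, the symmetry of abc-parameterizations invoked above can be sketched as follows. This is recalled from memory of the statement in [73] (weights W^l = m^{-a_l} w^l, initialization variance m^{-2b_l}, learning rate η m^{-c}); the exact conventions there should be consulted.

```latex
% abc-parameterization (notation of [73]):
%   W^l = m^{-a_l} w^l, \qquad w^l_{\mathrm{init}} \sim \mathcal{N}(0, m^{-2b_l}),
%   \qquad \text{learning rate } \eta\, m^{-c}.
% Symmetry: for any $\theta \in \mathbb{R}$, the learned function is unchanged under
a_l \mapsto a_l + \theta, \qquad b_l \mapsto b_l - \theta, \qquad c \mapsto c - 2\theta,
% which is the transformation applied above with $\theta = \tfrac{1}{2}$.
```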
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D PROOF OF LEMMA 1</head><p>Since we assume that G is positive definite and</p><p>we can derive from (<ref type="formula">14</ref>) that</p><p>Since I is an open interval, ∃ξ > 0 such that we can find a subinterval I_0 ⊆ I whose distance to the boundaries of I (if I is bounded on either side) is no less than ξ, i.e.,</p><p>In particular, we can choose ξ = (1/3)(I_r - I_l) and I_0 = (I_l + ξ, I_r - ξ). Then there is</p><p>and so</p><p>(40) Thus, we have</p><p>there is</p><p>Therefore,</p><p>Since ∀ξ ∈ R, there is</p><p>we derive that, ∀a ∈</p><p>where we set C_1 = λ_max(G) (λ_min(G))^(-1/2) ξ^(-1) > 0 for simplicity. As a consequence,</p><p>Define</p><p>Note that at t = 0, there is η_0 = min_{a∈[n]}</p><p>. Then, on one hand, we know from (<ref type="formula">40</ref>) and (<ref type="formula">47</ref>) that ∀t ≥ 0,</p><p>Hence, (<ref type="formula">41</ref>) implies that</p><p>On the other hand, by the definition of η_t,</p><p>where we set G) for simplicity. Therefore, when η_t > 0,</p><p>which implies that</p><p>and so ∀t ≥ 0, 2</p></div>
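The quantitative mechanism behind Lemma 1's linear rate — a positive-definite Gram matrix forces the residual to contract exponentially under gradient flow — can be illustrated on a toy linear-dynamics example. This is a generic sketch of the standard argument, not the lemma's actual setting (where the rate also involves the quantity η_t controlled above).

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy gradient flow on the residual: d(u)/dt = -G (u - y) with G positive
# definite. In the eigenbasis of G each residual component decays like
# exp(-lambda_i t), so ||u_t - y|| <= exp(-lambda_min(G) t) ||u_0 - y||,
# i.e. the squared loss decays at the linear rate 2*lambda_min(G).
n = 5
A = rng.normal(size=(n, n))
G = A @ A.T + 0.5 * np.eye(n)        # positive definite Gram-like matrix
lam_min = np.linalg.eigvalsh(G)[0]   # smallest eigenvalue, >= 0.5

y = rng.normal(size=n)
u = rng.normal(size=n)
r0 = np.linalg.norm(u - y)

# Forward-Euler discretization of the flow; the step is small enough that the
# discrete iteration decays at least as fast as the continuous-time bound.
h, n_steps = 1e-3, 5000
T = h * n_steps
for _ in range(n_steps):
    u -= h * (G @ (u - y))

bound = r0 * np.exp(-lam_min * T)
print(np.linalg.norm(u - y), bound)  # residual sits below the exponential bound
```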
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E PROOF OF THEOREM 1</head><p>To apply Lemma 1, we need two additional lemmas, which we will prove in Appendix E.1 and E.2. The first one guarantees that the loss value at initialization, L_0, is upper-bounded with high probability:</p><p>Lemma 2. ∀δ > 0, if m ≥ Ω(ĉ² log(nδ^(-1)) G_max / ‖y‖²), then with probability at least 1 - δ, there is</p><p>The second one proves that η_0 is lower-bounded with high probability, which heuristically says that there is indeed a nontrivial proportion of neurons in the central part of the active region of σ, for every a ∈ [n]:</p><p>2(K(I, G_min, G_max))², then with probability at least 1 - δ, there is</p><p>where</p><p>is a positive number that depends on I, λ_1 and λ_2.</p><p>With these two lemmas, we deduce that ∀δ > 0, if m ≥ Ω((1 + ĉ²/‖y‖²) log(nδ^(-1))), then with probability at least 1 - δ, there is ∀t ≥ 0, 2</p><p>where K(I, G_min, G_max) is defined as in Lemma 3. Therefore, if our choice of ĉ satisfies</p><p>then there is ∀t ≥ 0,</p><p>in which case (<ref type="formula">50</ref>) gives</p><p>which will allow us to finally conclude that</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>E.1 PROOF OF LEMMA 2</head><p>where M_SG > 0 is some absolute constant. Thus, by Hoeffding's inequality <ref type="bibr">[68]</ref>, ∀a ∈ [n], ∀r > 0,</p><p>where K is some absolute constant. Hence, by union bound,</p><p>then with probability at least 1 - δ, there is</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>E.2 PROOF OF LEMMA 3</head><p>Since each W^0_ij is sampled i.i.d. from N(0, 1), we know that ∀a ∈ [n], independently for each i ∈ [m], h^0_i(x_a) follows a Gaussian distribution with mean 0 and variance G_aa. 
Therefore,</p><p>Hence, by Hoeffding's inequality, ∀a ∈ [n], ∀r > 0,</p><p>∀a ∈ [n], choosing r = (1/2) π(I_0; G_aa), we then get</p><p>and so by union bound</p><p>then we have G_ab - Ḡ^(1)_ab ≥ u with probability at least 1 - δ. If we choose u = u′/n and δ′ = δ/n², then we get, if</p><p>Hence, with probability at least 1 - δ, we have</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>H GENERALIZATION TO DEEPER MODELS</head><p>By setting Φ to be the activations of the second-to-last hidden layer of a multi-layer NN, we can obtain generalizations of the P-3L NN to deeper architectures. For example, in the feed-forward case, we can obtain the following partially-trained L-layer NN:</p><p>where W^(1), ..., W^(L-3) ∈ R^(D×D) and z_1, ..., z_D ∈ R^d are sampled randomly and fixed. This model can be written in the form of (<ref type="formula">4</ref>) and (<ref type="formula">5</ref>) with φ_j(x) = σ(h^(L-2)_j(x)). The corresponding Gram matrix is recursively defined and also appears in the NTK analysis <ref type="bibr">[18]</ref>. In particular, the results in <ref type="bibr">[18]</ref> imply that if σ is analytic and not a polynomial, then Condition 1 holds, and hence similar global convergence results can be obtained as corollaries of Theorem 2.</p></div>
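As a rough illustration of the recursively defined Gram matrix mentioned above, one recursion step of the form G^(l)_ab = E[σ(u_a) σ(u_b)], with (u_1, ..., u_n) ~ N(0, G^(l-1)), can be estimated layer by layer via Monte Carlo. The base case and normalization below are assumptions made for the sketch; the precise recursion follows the conventions of [18].

```python
import numpy as np

rng = np.random.default_rng(3)

def next_gram(G, sigma, n_samples=200_000):
    """One Monte Carlo step of the Gram recursion.

    Estimates G'_ab = E[sigma(u_a) sigma(u_b)] where the vector (u_1, ..., u_n)
    is jointly Gaussian with mean 0 and covariance G."""
    u = rng.multivariate_normal(np.zeros(G.shape[0]), G, size=n_samples)
    s = sigma(u)
    return s.T @ s / n_samples   # empirical second-moment matrix (PSD)

# Assumed base case: G^(0)_ab = <x_a, x_b> with inputs of roughly unit norm.
n, d = 3, 20
X = rng.normal(size=(n, d)) / np.sqrt(d)
G0 = X @ X.T
G1 = next_gram(G0, np.tanh)   # Gram matrix one layer deeper
print(G1)
```

One sanity check on the sketch: with the identity activation, a recursion step should reproduce the input Gram matrix up to Monte Carlo error, since the second moment of N(0, G) is G itself.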
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I FURTHER DETAILS OF THE NUMERICAL EXPERIMENTS</head><p>In our models, (c_i)_{i∈[m]} is sampled i.i.d. from the Rademacher distribution µ_c = (1/2)δ_1 + (1/2)δ_{-1}, (z_j)_{j∈[D]} is sampled i.i.d. from N(0, I_d), and (W_ij)_{i∈[m], j∈[D]} is initialized by sampling i.i.d. from N(0, 1). In the model under the NTK scaling, we additionally symmetrize the model at initialization according to the strategy used in <ref type="bibr">[16]</ref> to ensure that the function value at initialization does not blow up when the width is large. We train the models using 50000 steps of (full-batch) GD with step size δ = 1. When the test error is computed, we use a test set of size 500 generated by sampling i.i.d. from the same distribution as the training set.</p><p>The experiments are run with NVIDIA GPUs (1080 Ti and Titan RTX).</p></div></body>
		</text>
</TEI>
