<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Quantifying the Benefit of Using Differentiable Learning over Tangent Kernels</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>July 2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10286827</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno type="issn">2640-3498</idno>
<biblScope unit="volume">139</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Eran Malach</author><author>Pritish Kamath</author><author>Emmanuel Abbe</author><author>Nathan Srebro</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study the relative power of learning with gradient descent on differentiable models, such as neural networks, versus using the corresponding tangent kernels. We show that under certain conditions, gradient descent achieves small error only if a related tangent kernel method achieves a non-trivial advantage over random guessing (a.k.a. weak learning), though this advantage might be very small even when gradient descent can achieve arbitrarily high accuracy. Complementing this, we show that without these conditions, gradient descent can in fact learn with small error even when no kernel method, in particular using the tangent kernel, can achieve a non-trivial advantage over random guessing.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>A recent line of research seeks to understand Neural Networks through their kernel approximation, as given by the Neural Tangent Kernel (NTK, <ref type="bibr">Jacot et al., 2018)</ref>. The premise of the approach is that in certain regimes, the dynamics of training a neural network are essentially the same as those of its first order Taylor expansion at initialization, which in turn is captured by the Tangent Kernel at initialization. It is then possible to obtain convergence, global optimality and even generalization guarantees by studying the behaviour of training using the Tangent Kernel (e.g. <ref type="bibr">Li and Liang, 2018;</ref><ref type="bibr">Du et al., 2019b;</ref><ref type="bibr">Chizat et al., 2019b;</ref><ref type="bibr">Zou et al., 2020;</ref><ref type="bibr">Allen-Zhu et al., 2019;</ref><ref type="bibr">Arora et al., 2019b;</ref><ref type="bibr">Du et al., 2019a, and many others)</ref>. Some have also suggested using the Tangent Kernel directly for training (e.g. <ref type="bibr">Arora et al., 2019a)</ref>, even suggesting it can sometimes outperform training by GD on the actual network <ref type="bibr">(Geiger et al., 2020)</ref>.</p><p>Can all the success of deep learning be explained using the NTK? This would imply that we can replace training by gradient descent on a non-convex model with a (potentially simpler to train, and certainly better understood) kernel method. Is anything learnable using gradient descent on a neural network or other differentiable model also learnable using a kernel method (i.e. 
using a kernelized linear model)?</p><p>This question was directly addressed by multiple authors, who showed examples where neural networks trained with gradient descent (or some variant thereof) provably outperform the best that can possibly be done using the tangent kernel, or any linear or kernel method, under different settings and assumptions <ref type="bibr">(Yehudai and Shamir, 2019;</ref><ref type="bibr">Allen-Zhu and Li, 2019, 2020;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Daniely and Malach, 2020;</ref><ref type="bibr">Ghorbani et al., 2019b, 2020)</ref>. However, in these examples, while training the model with gradient descent performs better than using the NTK, the error of the NTK is still much better than baseline, with a significant "edge" over random guessing or over using a constant, or null, predictor. That is, the NTK at least allows for "weak learning". In fact, in some of the constructions, the process of leveraging the "edge" of a linear or kernel learner and amplifying it is fairly explicit (e.g. <ref type="bibr">Allen-Zhu and Li, 2019, 2020)</ref>. The question we ask in this paper is:</p><p>Can gradient descent training on the actual deep (non-convex) model only amplify (or "boost") the edge of the NTK, with the NTK required to have an edge in order to allow for learning in the first place? Or can differentiable learning succeed even when the NTK, or any other kernel, does not have a non-trivial edge? Can gradient descent succeed at "strong learning" only when the NTK achieves "weak learning"? Or is it possible for gradient descent to succeed even when the NTK is not able to achieve any significant edge? 
</p><table><head>Table <ref type="table">1</ref>: Our results at a glance: what does learnability with Gradient Descent imply, and not imply, about the edge over null prediction that the Neural Tangent Kernel (NTK), or some other kernel, must have? The results provide a complete picture (up to polynomial differences) of how large an edge is ensured in each scenario.</head><row role="label"><cell/><cell>NTK at same initialization</cell><cell>NTK at alternate randomized initialization</cell><cell>NTK of arbitrary model, or even an arbitrary kernel</cell></row><row><cell>GD with unbiased initialization (∀x f_θ0(x) = 0) ensures small error</cell><cell>NTK edge ≥ 1/poly (Thm. 1); NTK edge can be &lt; 1/poly while GD reaches 0 loss (Separation 1)</cell><cell cols="2">Edge with any kernel can be &lt; 1/poly while GD reaches 0 loss (Separation 2)</cell></row><row><cell>GD with arbitrary init. ensures small error; kernel (or alternate init.) can depend on input dist. D_X</cell><cell>NTK edge can be = 0 while GD reaches arb. low loss (Separation 3)</cell><cell>NTK edge ≥ 1/poly (Thm. 2); NTK edge can be &lt; 1/poly while GD reaches 0 loss (Separation 2)</cell><cell>Edge can be &lt; 1/poly while GD reaches 0 loss (Separation 2)</cell></row><row><cell>Dist.-indep. kernels</cell><cell cols="3">Edge with any kernel can be &lt; 1/exp while GD reaches arb. low loss (Separation 4)</cell></row></table><p>Our Contributions. The answer turns out to be subtle, and as we shall see, relies crucially on two important considerations: the unbiasedness of the initialization, and whether we can rely on input distribution knowledge for the initialization. We start, in Section 3, by providing our own example of a learning problem where Gradient Descent outperforms the NTK, and indeed any kernel method. But unlike the previous recent separating examples, where the NTK enjoys a considerable edge (constant edge, or even near zero error, but with a slower rate than GD), we show that the edge of the NTK, and indeed of any (poly-sized) kernel, can be arbitrarily close to zero while the edge of GD can be arbitrarily large. 
The edge of the NTK in this example nonetheless vanishes at a slow, i.e. polynomial, rate, and this leads us to ask whether the NTK must have at least a polynomial edge whenever GD succeeds.</p><p>In Theorem 1 of Section 4.1 we show that when using an unbiased initialization, that is, one where the output of the model is 0 at initialization, the NTK must indeed have a non-trivial edge (polynomial in the accuracy and scale of the model) in order for gradient descent to succeed.</p><p>The requirement that the initialization is unbiased turns out to be essential: in Section 5 we show an example where gradient descent on a model succeeds, from a biased initialization, even though no (reasonably sized) kernel method (let alone the NTK) can ensure a non-trivial edge.</p><p>But all is not lost with biased initialization. In Theorem 2 of Section 4.2 we show that, at least for the square loss, if gradient descent succeeds from any (possibly biased) initialization, then we can construct some alternate random "initialization" such that the Tangent Kernel at this random initialization (i.e. in expectation over the randomness) does ensure a non-trivial edge. Importantly, this random initialization must depend on knowledge of the input distribution. Again, our example in Section 5 shows that this distribution-dependence is essential. This also implies a separation between problems learnable by gradient descent using an unbiased vs. an arbitrary initialization, emphasizing that the initialization being unbiased should not be taken for granted.</p><p>This subtle answer that we uncover is mapped in Table <ref type="table">1</ref>.</p></div>
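The NTK premise recalled above, that training a differentiable model can closely track training of its first order Taylor expansion at initialization, can be sketched numerically. The toy model, data and step sizes below are hypothetical illustrations (ours, not the paper's), using only numpy:

```python
import numpy as np

def run_gd(predict, grad, theta0, xs, ys, eta, T):
    """Plain gradient descent on the square loss for a differentiable predictor."""
    theta = theta0.copy()
    for _ in range(T):
        g = np.zeros_like(theta)
        for x, y in zip(xs, ys):
            g += (predict(theta, x) - y) * grad(theta, x)  # l'(f(x), y) * grad f
        theta -= eta * g / len(xs)
    return theta

# A scalar model f_theta(x) = tanh(<theta, x>) and its linearization at theta0 = 0.
f = lambda th, x: np.tanh(th @ x)
df = lambda th, x: (1.0 - np.tanh(th @ x) ** 2) * x
theta0 = np.zeros(3)
f_lin = lambda th, x: f(theta0, x) + (th - theta0) @ df(theta0, x)
df_lin = lambda th, x: df(theta0, x)

xs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
ys = [0.5, -0.5]
theta_model = run_gd(f, df, theta0, xs, ys, eta=0.05, T=5)
theta_tangent = run_gd(f_lin, df_lin, theta0, xs, ys, eta=0.05, T=5)
# With a few small steps near the initialization, the two trajectories
# nearly coincide; the separations in this paper concern regimes where
# this approximation fails to capture what gradient descent can do.
```

This only illustrates the "lazy" regime in which the approximation holds; the paper's point is precisely that outside such regimes the two can behave very differently.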
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Differentiable Learning and Tangent Kernels</head><p>We consider learning a predictor f : X → R over an input space X, so as to minimize its population loss L_D(f) := E_(x,y)∼D[ℓ(f(x), y)] with respect to a source distribution D over X × Y, where ℓ : R × Y → R_≥0 is a loss function. We denote by ℓ′ and ℓ″ the derivatives of ℓ w.r.t. its first argument. Some of our guarantees apply to any smooth loss, i.e. with |ℓ″| := sup_ŷ,y |ℓ″(ŷ, y)| &lt; ∞, while others are specific to the square loss ℓ_sq(ŷ, y) = ½(ŷ − y)²<ref type="foot">foot_1</ref> with binary labels Y = {−1, 1}. Our lower bounds and separation results are all stated for the square loss with binary labels (they could be adapted to other loss functions, but in order to demonstrate a separation, it is sufficient to provide a specific example). To understand whether a predictor gives any benefit, we compare the error of a predictor to that of "null prediction", i.e. predicting 0 on all inputs. For the square loss with binary labels, the error of null prediction is L_D(0) = 0.5, and so we refer to the improvement 0.5 − L_D(f) as the "edge" of the predictor f. That is, if L_D(f) = 0.5 − γ, we say that f has an edge of γ (over null prediction).</p><p>Differentiable Learning. We study learning by (approximate) gradient descent on a differentiable model, such as a neural network or other parametric model. More formally, a differentiable model of size p is a mapping f : R^p × X → R, denoted f_θ(x), where θ ∈ R^p are the model parameters, or "weights", of the model, and the mapping is differentiable w.r.t. θ. 
We will control the scale of the model through a bound<ref type="foot">foot_0</ref> C_f on the norm of the gradients and function values, C_f² ≥ sup_θ,x (f_θ(x)² + ‖∇_θ f_θ(x)‖²).</p><p>For a differentiable model f_θ(x) and an initialization θ_0 ∈ R^p, gradient accuracy τ, and stepsize<ref type="foot">foot_2</ref> η, τ-approximate gradient descent training is given by iterates of the form: <formula n="1">θ^(t+1) = θ^(t) − η g_t,</formula> where the gradient estimates g_t satisfy <formula n="2">‖g_t − ∇_θ L_D(f_θ^(t))‖_2 ≤ τ for all t.</formula></p><p>The gradient estimates g_t can be obtained by computing the gradient on an empirical sample of m training examples drawn from D, in which case we can ensure accuracy τ ∝ C_f/√m. And so, we can think of C_f²/τ² as capturing the sample size, and when we refer to the relative accuracy τ/C_f being polynomial, one might think of the sample size used to estimate the gradients being polynomial. But here we only assume the gradient estimates g_t are good approximations of the true gradients of the population loss, and do not specifically refer to sampling. <ref type="foot">4</ref> We say that τ-approximate gradient descent with model f_θ, initialization θ_0, gradient accuracy τ and T steps ensures error ε on a distribution D if, with any gradient estimates satisfying (2), the gradient descent iterates (1) lead to an iterate θ^(T) with population loss L_D(f_θ^(T)) ≤ ε.</p><p>The Tangent Kernel. Consider the first order Taylor expansion of a differentiable model about some θ* ∈ R^p: <formula n="3">f_θ(x) ≈ f_θ*(x) + ⟨θ − θ*, ∇_θ f_θ*(x)⟩,</formula> viewed as a linear predictor <formula n="4">h_w(x) = ⟨w, φ_θ*(x)⟩</formula> with <formula n="5">φ_θ*(x) := (f_θ*(x), ∇_θ f_θ*(x)),</formula> which corresponds to a linear model with the feature map as in (<ref type="formula">5</ref>), and thus can also be represented using the kernel: <formula n="6">NTK_{f_θ*}(x, x′) := ⟨φ_θ*(x), φ_θ*(x′)⟩.</formula></p><p>In some regimes or model limits (e.g. 
<ref type="bibr">Daniely et al., 2016;</ref><ref type="bibr">Jacot et al., 2018;</ref><ref type="bibr">Chizat et al., 2019a)</ref>, the approximation (3) about the initialization remains valid throughout training, and so gradient descent on the actual model f_θ(x) can be approximated by gradient descent on the linear model (4), i.e. a kernelized linear model specified by the tangent kernel NTK_{f_θ*} at initialization. One can then consider analyzing differentiable learning with the model f using this NTK approximation, or even replacing gradient descent on f with just using the NTK. How powerful can such an approximation be relative to the full power of differentiable learning? Can we expect that anything learnable with differentiable learning is also learnable under the NTK approximation?</p><p>To understand the power of a kernelized linear model with some kernel K, we should consider predictors realizable with bounded norm in the corresponding feature map (i.e. norm balls in the corresponding Reproducing Kernel Hilbert Space): <formula>F(K, B) := {f : ‖f‖_K ≤ B},</formula> where K(x, x′) = ⟨φ(x), φ(x′)⟩ and ‖f‖_K is the K-RKHS-norm<ref type="foot">foot_4</ref> of f. Predictors in F(K, B) can be learned to within error ε with O(B²/ε²) samples using O(B²/ε²) steps of gradient descent on the kernelized linear model. And since any predictor learned in this way will be in F(K, B), showing that there is no low-error predictor in F(K, B) establishes the limits of what can be learned using K. We thus study the norm ball of the Tangent Kernel, which, slightly overloading notation, we denote: <formula>NTK_{f_θ*}(B) := F(NTK_{f_θ*}, B).</formula></p><p>Learning Problems. For a fixed source distribution D, there is always a trivial procedure for "learning" it, where a good predictor specific to D is hard-coded into the "procedure". E.g., we can use a kernel with a single feature corresponding to this hard-coded predictor. 
Instead, in order to capture learning as inferring from data, and to be able to state when this is not possible using some class of methods, e.g. using kernel methods, we refer to a "learning problem" as a family P of source distributions over X × Y, and say that a learning procedure learns the problem to within error ε if for any distribution D ∈ P the method learns a predictor with population error at most ε.</p><p>We consider learning both when the input distribution, i.e. the marginal distribution D_X over X, is known (or fixed) and when it is unknown. Distribution-dependent learning refers to learning when the input distribution is either fixed (i.e. all distributions D ∈ P have the same marginal D_X), or when the model or kernel is allowed to be chosen based on the input distribution D_X, as might be the case when using unlabeled examples to construct a kernel. In distribution-independent learning, we seek a single model, or kernel, that ensures small error for all source distributions in P, even though they might have different marginals D_X.</p><p>Remark. Typically, we would have a probabilistic quantifier (e.g. in expectation, or with high probability) over sampling the training set, but we define differentiable learning, and learning using a kernel, without any probabilistic quantifier: for differentiable learning, we require that for any D ∈ P, gradient descent yields error at most ε using any sequence of gradient estimates satisfying (2) (this happens with high probability if we use m = O(B²/τ²) samples to obtain the gradient estimates, but we do not get into this in our results). For kernel learning, we require that for any D ∈ P there exists a predictor in F(K, B) with error at most ε, as these are the predictors that can be learned (with high probability) using the kernel method (but we do not get into the exact learning procedure).</p></div>
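As a concrete reading of the definitions in this section, the following sketch (numpy only; the helper names are ours, not the paper's) implements τ-approximate gradient descent over a finite "population", with gradient estimates perturbed within the allowed tolerance, together with the tangent kernel at an initialization:

```python
import numpy as np

def tangent_kernel(grad_f, theta0, x1, x2):
    """Tangent kernel at theta0 (unbiased case): inner product of model gradients."""
    return float(np.dot(grad_f(theta0, x1), grad_f(theta0, x2)))

def approx_gd(f, grad_f, theta0, data, eta, T, tau, rng):
    """tau-approximate GD on the square loss:
    theta_{t+1} = theta_t - eta * g_t, with ||g_t - grad L(theta_t)||_2 <= tau."""
    theta = theta0.copy()
    for _ in range(T):
        grad = np.zeros_like(theta)
        for x, y in data:
            grad += (f(theta, x) - y) * grad_f(theta, x)  # l'(f(x), y) * grad f
        grad /= len(data)
        direction = rng.normal(size=theta.shape)
        # any estimate within the tolerance is allowed; here, random slack of tau/2
        g_t = grad + 0.5 * tau * direction / np.linalg.norm(direction)
        theta = theta - eta * g_t
    return theta
```

For example, for a linear model f_θ(x) = ⟨θ, x⟩ the gradient is x itself, so the tangent kernel is the plain inner product ⟨x, x′⟩, in line with the scaled linear kernel that appears for the model of Section 3.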
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Gradient Descent Outperforms the NTK</head><p>In this section, we exhibit a simple example of a learning problem for which (i) approximate gradient descent on a differentiable model of size p ensures arbitrarily small (in fact zero) error, whereas (ii) the tangent kernel for this model, or in fact any reasonably sized kernel, cannot ensure error better than 0.5 − γ, for arbitrarily small γ, where recall that 0.5 is the error of null prediction, and γ depends polynomially on the parameters of the model and of gradient descent.</p><p>Several authors have already demonstrated a range of examples where gradient descent ensures smaller error than can be ensured by any reasonably sized kernel, including the tangent kernel <ref type="bibr">(Yehudai and Shamir, 2019;</ref><ref type="bibr">Ghorbani et al., 2019a,b;</ref><ref type="bibr">Allen-Zhu and Li, 2019, 2020;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Daniely and Malach, 2020)</ref>. In Section 6 and Table <ref type="table">2</ref> we survey these papers in detail and summarize the separations they establish. Here, we provide a concrete, self-contained and simple example for completeness, so that we can explicitly quantify the edge a kernel method might have. Our emphasis is on showing that the error for kernel methods is not just worse than that of gradient descent, but in fact not much better than null prediction; this is in contrast to prior separations, where the error possible using a kernel is either some constant between the error of null prediction and zero, or, more frequently, close to zero, but not as close as the error attained by gradient descent (see Section 6 for details).</p><p>In our example, we also pay attention to whether the model's predictions at initialization are zero, a property that, as we will later see, plays a crucial role in our analysis.</p><p>The Learning Problem. 
We consider the problem of learning k-sparse parities over n biased bits, i.e. where the marginal distribution of each bit (i.e. coordinate) is non-uniform. In order to easily obtain lower bounds for any kernel (not just the tangent kernel), we let the input distribution be a mixture of independent biased bits and a uniform distribution, so that we can argue that no kernel can do well on the uniform component, and hence bound its error also on the mixture. The key is that when k is moderate, up to k = O(log n), due to the bits being biased, a linear predictor based on all the bits in the parity has enough of an edge to be detected. This does give the tangent kernel a small edge over a trivial predictor, but this is the best that can be done with the tangent kernel. However, in a differentiable model, once this linear predictor is picked up, its support can be leveraged to output a parity of these bits, instead of their sum, thereby obtaining zero error.</p><p>Formally, for integers 2 ≤ k ≤ n and for α ∈ (0, 1), we consider the "biased sparse parities" problem P_bsp[n, k, α] over X = {−1, 1}^n and Y = {−1, 1}, and learning w.r.t. the square loss. The problem consists of (n choose k) distributions over X × Y, each corresponding to a subset I ⊆ [n] with |I| = k. The distribution D_I is defined by the following sampling procedure:</p><p>▷ Let D_0 be the uniform distribution over {−1, 1}^n and let D_1 be the product distribution over {−1, 1}^n with E[x_i] = 1/2. Sample<ref type="foot">foot_5</ref> x ∼ D_X := (1 − α)D_0 + αD_1. ▷ Set y ← χ_I(x) := ∏_{i∈I} x_i, the parity over the subset I.</p><p>The differentiable model. 
To learn P_bsp[n, k, α], we construct a differentiable model that is a combination of a linear predictor ⟨θ, x⟩ and a "selected-parity", which outputs the parity of the subset of bits indicated by the (essential) support of θ, that is, ∏_{i : |θ_i| ≥ ν} x_i (for a suitable threshold ν). Importantly, both use the same weights θ (see Figure <ref type="figure">2</ref>). The weak edge of the linear predictor allows gradient descent to learn a parameter vector θ such that {i | |θ_i| ≥ ν} = I, and once this is learned, the "selected-parity" kicks in and outputs a perfect predictor.</p><p>Formally, we construct a differentiable model f_θ(x) with θ ∈ R^n (i.e. size p = n) that behaves as follows: <formula n="8">f_θ(x) ≈ 2⟨θ, x⟩ about θ = 0, while f_θ(x) = ∏_{i : θ_i ≥ 3/n} x_i once θ_i ≥ 3/n for some i (within the relevant regime of θ),</formula> where ≈ stands for the first order approximation of f_θ about θ = 0. The behaviour (8) is the only property we need in order to show that approximate gradient descent learns P_bsp[n, k, α]: as formalized in Claim 3.1, a single step of gradient descent starting at θ^(0) = 0 will leave θ^(1)_i ∈ [−2/n, 2/n] for i ∉ I, while increasing θ^(1)_i ≥ 3/n for i ∈ I, thus yielding the correct parity. In what follows we show how to implement f_θ as a continuous differentiable model with scale C_f = O(n), and furthermore how this can be implemented as a feed-forward neural network with piece-wise quadratic sigmoidal activations. The model f_θ, illustrated in Figure <ref type="figure">2</ref>, is defined below (where θ • x := (θ_1 x_1, . . . , θ_n x_n)) in terms of fixed maps ξ, S, H and G. We visualize ξ, which is a (coordinate-wise) soft implementation of the sign(•) function, in Figure <ref type="figure">1</ref>.</p><p>The intuition for G is as follows. In the relevant regime of θ, i.e. θ ∈ ([−2/n, 2/n] ∪ [3/n, 5/n])^n, we have S(s) = 1{s ≠ 0} and H(s) = ∏_{i : s_i ≠ 0} s_i, where s = ξ(θ • x). 
Thus,</p><p>when θ_i ∈ [−2/n, 2/n] for all i, we have S(ξ(θ • x)) = 0 and hence G(θ • x) = 0 for all x ∈ {−1, 1}^n. On the other hand, when θ_i ∈ [3/n, 5/n] for some i, we have S(ξ(θ • x)) = 1 and hence G(θ • x) = ∏_{i : θ_i ≥ 3/n} x_i − ⟨θ, x⟩ for all x ∈ {−1, 1}^n. This gives us (8), by noting that σ(z) ≈ 2z at z ≈ 0 (first order approximation) and σ(z) = sign(z) when |z| ≥ 1. Furthermore, we show how to implement S, H (and hence G) as a neural network using the activation σ. This is based on observing that we can compute squares in a bounded range by exploiting the quadratic part of the nonlinearity, i.e., z² = 2r²(σ(z/2r) + σ(−z/2r)) for z ∈ [−r, +r]. The n-term products in (<ref type="formula">11</ref>) and (<ref type="formula">12</ref>) can be implemented with subnetworks of depth O(log n) and width O(n) that recursively compute products of pairs, via ab = ¼((a + b)² − (a − b)²); since all factors lie in [−1, 1], we only need to compute squares in the range [−2, 2]. There is a slight issue in computing the final product, because the factor H(ξ(θ • x)) − ⟨θ, x⟩ can be unbounded. However, ⟨θ, x⟩ ≤ 5 in the relevant regime of θ and so we can correctly compute the final product in this regime; the product is not implemented correctly outside the relevant regime, but that does not affect the learning algorithm. The overall network thus has depth O(log n) and O(n) edges.</p><p>We can also calculate that for any i, the gradient of f_θ at θ = 0 satisfies ∂f_θ(x)/∂θ_i = 2x_i, matching the first order approximation in (8). The following Claim (proved in Appendix B.1) formalizes how a single step of gradient descent is sufficient for θ to be bounded away from zero on I, and hence for the network to output the correct labels.</p><p>Claim 3.1. 
For any n, k and α ∈ (0, 1), and any D_I ∈ P_bsp[n, k, α], τ-approximate gradient descent on the model f of size p = n and C_f = O(n) described above, with initialization θ_0 = 0 (at which ∀x f_{θ_0}(x) = 0), accuracy τ ≤ α/2^k, step size η = 2^k/(αn) and T = 1 step, ensures L_{D_I}(f_{θ^(T)}) = 0. In particular, if k ≤ log n then an accuracy of τ ≤ α/n is sufficient.</p><p>Figure <ref type="figure">2</ref>: Schematic diagram for the construction of f_θ, as used in Claim 3.1, with the only trainable parameters being θ_1, θ_2, . . . , θ_n. The sub-network G is a fixed module implementing the "selected-parity" function.</p><p>On the other hand, the next claim (proof in Appendix C.1.1) establishes that the tangent kernel at initialization only has a small edge over the null prediction.</p><p>Claim 3.2. The tangent kernel of the model f at θ_0 = 0 is the scaled linear kernel NTK_{f_{θ_0}}(x, x′) = 4⟨x, x′⟩, and for any 2 ≤ k ≤ n, α ∈ (0, 1) and D_I ∈ P_bsp[n, k, α], the error with this kernel, inf_{h ∈ NTK_{f_{θ_0}}(B)} L_{D_I}(h), remains within an edge polynomial in Bτ of the null prediction error 1/2, for all B ≥ 0, where τ = α/2^k as in Claim 3.1. On the other hand, there exists h ∈ NTK_{f_{θ_0}}(n) whose edge over null prediction is polynomial in τ.</p><p>We already see that gradient descent can succeed where the tangent kernel at initialization can only ensure an arbitrarily small edge over the error achieved by null prediction, L_D(0) = 1/2: Separation 1. For any γ &gt; 0, there exists a source distribution (with binary labels and w.r.t. 
the square loss), such that ▷ [Gradient Descent] using a differentiable model with p = 2 parameters, T = 1 steps, and relative accuracy τ/C_f = Θ(γ), τ-approximate gradient descent can ensure zero square loss; but ▷ [Tangent Kernel] the tangent kernel of the model at initialization does not ensure square loss lower than 1/2 − γ.</p><p>Proof. Apply Claims 3.1 and 3.2 to the sole source distribution in P_bsp[2, 2, α] for a suitable α.</p><p>Even though the tangent kernel at the initialization used by gradient descent might have an arbitrarily small edge, one might ask whether better error can be ensured by the tangent kernel at some other "initialization" θ, or perhaps by the tangent kernel of some other model, or perhaps even by some other kind of kernel<ref type="foot">foot_6</ref>. The following claim shows that this is not the case (proof in Appendix C.2.1).</p><p>Claim 3.3. For all α &lt; 1/2, k, p, B, n, and any kernel K corresponding to a p-dimensional feature map, there exists D_I ∈ P_bsp[n, k, α] on which K ensures only a small edge, where L_{D_I} is w.r.t. the square loss; note that |P_bsp[n, k, α]| = (n choose k). And so, setting k = Θ(log n) and α = 1/poly(n), we see that gradient descent with polynomial parameters can learn P_bsp[n, k, α], while no tangent kernel of a polynomial-sized model (i.e. with p = poly(n)) can ensure better than an arbitrarily small polynomial edge: Separation 2. For any sequence γ_n = 1/poly(n), there exists a sequence of learning problems P_n with fixed input distributions (with binary labels and w.r.t. the square loss), such that ▷ [Gradient Descent] for each n, using a differentiable model with p = n parameters, realizable by a neural network of depth O(log n) with O(n) edges (where some edges have trainable weights and some do not), T = 1 steps, polynomial relative accuracy τ/C_f = O(γ_n/n²), and initialization θ_0 s.t. 
&#8704; x f &#952; 0 (x) = 0, &#964; -approximate gradient descent learns P n to zero loss; but &#8882; [Poly-Sized Kernels] no sequence of kernels K n corresponding to feature maps of dimension poly(n) (and hence no tangent kernel of a poly-sized differentiable models) can allow learning P n to square loss better than 1 2&#947; n for all n; and &#8882; [Arbitrary Kernel, Poly Norm] no sequence of kernels K n of any dimension can allow learning P n to square loss better than 1 2&#947; n for all n using predictors in <ref type="formula">1</ref>), and so for large enough n, the error lower bound in Claim 3.3 would be &gt; 1 2&#947; n .</p><p>Empirical Demonstration in Two-Layer Networks. While for ease of analysis we presented a fairly specific model, with many fixed (non-trainable) weights and only few trainable weights, we expect the same behaviour occurs also in more natural, but harder to analyze models. To verify this, we trained a two-layer fully-connected ReLU network on the source distribution D &#945; analyzed above, for n = 128 and k = 7. We observe that indeed when &#945; &gt; 0, and thus a linear predictor has at least some edge, gradient descent training succeeds in learning the sparse parity, while the best predictor in the Tangent Kernel cannot get error much better than 0.5. See Figure <ref type="figure">3</ref> for details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Ensuring the Tangent Kernel has an Edge</head><p>In the example of Section 3 we saw that the Tangent Kernel at initialization cannot ensure small error: even though Gradient Descent finds a perfect predictor, the tangent kernel only has an arbitrarily small edge over the null prediction. But this edge is not zero: the second part of Claim 3.2 tells us that, at least in the example of Section 3, the tangent kernel will have an edge, and this edge is polynomial (even linear) in the accuracy required of gradient descent. Is this always the case? We now turn to asking whether, whenever gradient descent succeeds in ensuring small error, the Tangent Kernel must have an edge that is at least polynomial in the gradient descent parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">With Unbiased Initialization</head><p>We begin by considering situations where gradient descent succeeds starting at an initialization yielding zero predictions on all inputs, i.e. such that: <formula n="14">∀x: f_{θ_0}(x) = 0.</formula></p><p>Following <ref type="bibr">Chizat et al. (2019a)</ref>, we refer to θ_0 satisfying (<ref type="formula">14</ref>) as unbiased initializations. With an unbiased initialization, the Taylor expansion (3) does not have the zero-order, or "bias", term; there is no need for the first coordinate in the feature vector φ_θ0, and the Tangent Kernel becomes NTK_{f_{θ_0}}(x, x′) = ⟨∇_θ f_{θ_0}(x), ∇_θ f_{θ_0}(x′)⟩, simplifying the Neural Tangent Kernel analysis. <ref type="bibr">Chizat et al. (2019a)</ref>, <ref type="bibr">Woodworth et al. (2020)</ref> and others discuss how to construct generic unbiased initializations, namely by including pairs of units that cancel each other out at initialization, but quickly diverge during training and allow for meaningful learning. The initialization used in Claim 3.1 of Section 3 also satisfies (14).</p><p>In Theorem 1 below, we show that if gradient descent improves over the loss at initialization, then the Tangent Kernel must also have an edge over initialization, which for an unbiased initialization means an edge over the null prediction.</p><p>We can also consider a relaxation of the unbiasedness assumption (<ref type="formula">14</ref>), and instead require only that the output of the network at initialization is very close to zero. This would be the case with some random initialization schemes which initialize the network randomly, but with small magnitude. As long as the deviation from unbiasedness is smaller than the edge guaranteed by Theorem 1, the Theorem would still ensure the tangent kernel has an edge over null prediction.</p><p>Theorem 1. 
For any smooth loss function ℓ, differentiable model f, initialization θ_0, and source distribution D, if τ-accurate gradient descent for T steps ensures error ε &lt; L_D(f_{θ_0}), then the Tangent Kernel at θ_0 has an edge over the initialization: <formula n="15">inf_{h ∈ NTK_{f_{θ_0}}(B)} L_D(h) ≤ L_D(f_{θ_0}) − τ²/(2|ℓ″| C_f²)</formula> for an appropriate norm bound B. In particular, if θ_0 is an unbiased initialization and ε &lt; L_D(0), then <formula n="16">inf_{h ∈ NTK_{f_{θ_0}}(B)} L_D(h) ≤ L_D(0) − τ²/(2|ℓ″| C_f²).</formula></p><p>Proof. At a high level, we argue that if gradient descent is going to move at all, the gradient at initialization must be substantially non-zero, and hence move the model in some direction that improves over initialization. But since the tangent kernel approximates the true model close to the initialization, this improvement translates to an improvement over initialization also in the Tangent Kernel. If the initialization is unbiased, the initialization is at the null predictor, and any such improvement is an improvement over null prediction.</p><p>Let us first consider the general (not unbiased) case, and establish (<ref type="formula">15</ref>). We must have that ‖∇_θ L_D(f_{θ_0})‖_2 ≥ τ, since otherwise g_t can be 0 at every iteration of gradient descent, forcing gradient descent to return θ^(T) = θ_0 with L_D(f_{θ_0}) &gt; ε. For any α &gt; 0, denoting w* = −α ∇_w L_D(h_0)/‖∇_w L_D(h_0)‖, we have by smoothness that L_D(h_{w*}) ≤ L_D(h_0) − α‖∇_w L_D(h_0)‖ + (|ℓ″| C_f²/2)α²; since ∇_w L_D(h_0) = ∇_θ L_D(f_{θ_0}) has norm at least τ, choosing α = τ/(|ℓ″| C_f²) establishes (15). For (16), ‖∇_w L_D(h_0)‖ ≥ τ and the proof proceeds as above, noting that f_{θ_0} = h_0 = 0.</p><p>Separation 1 establishes that Theorem 1 is polynomially tight: it shows that despite gradient descent succeeding with an unbiased initialization, the tangent kernel at initialization has an edge of at most O(τ/C_f). This leaves a quadratic gap versus the guarantee of Ω(τ²/C_f²) in the Theorem (recalling |ℓ″| = 1 for the square loss), but establishes the polynomial relationship.</p><p>Theorem 1 is a strong guarantee, in that it holds for each source distribution separately. 
This immediately implies that if &#964; -approximate gradient descent on a differentiable model f with unbiased initialization &#952; 0 learns some family of distributions P to within small error, or even just to within some edge over the null prediction, then the Tangent Kernel at &#952; 0 also has an edge of at least &#964; 2 /(2|&#8467; &#8242;&#8242; |C 2 f ) over the null prediction. Note that this edge is polynomial in the algorithm parameters &#964; and C f , as is the required norm B. And so, if, for increasing problem sizes n, gradient descent allows for "polynomial learnability", in the sense of learning with polynomially increasing complexity, then the Tangent Kernel at least allows for "weak learning" with a polynomial edge (in all the parameters, and hence in the problem size n). Separation 2 shows that this relationship is tight, in that there are indeed problems where strong learning is possible with gradient descent with polynomial complexity, but the tangent kernel, or indeed any kernel of polynomial complexity, can only ensure polynomially weak learning.</p><p>Theorem 1 relies on the initialization being unbiased, or at the very least the predictions at initialization not being worse than the null predictions. Can we always ensure the initialization satisfies such a condition? And what happens if it does not? Might it then be possible for Gradient Descent to succeed even though the Tangent Kernel does not have an edge over the null prediction?</p><p>The example in Section 5 shows that, indeed, some learning problems might be learnable by gradient descent, but only using an initialization that is not unbiased. That is, we should not expect initializations to always be unbiased, nor can we expect to always be able to "fix" the initialization to be unbiased. And furthermore, the example shows that with an initialization that is not unbiased, the Tangent Kernel at initialization might not have a significant edge over the null predictor. 
But before seeing this example, let us ask: What can we ensure when the initialization is not unbiased?</p></div>
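The paired-unit construction for generic unbiased initializations mentioned above (attributed in the text to Chizat et al. (2019a) and Woodworth et al. (2020)) can be sketched numerically. The two-layer ReLU model, the sizes, and the random seed below are illustrative assumptions, not the paper's construction; the point is only that paired units with negated output weights give exactly zero predictions while leaving the tangent-kernel features nontrivial:

```python
import numpy as np

rng = np.random.default_rng(0)
n, width = 5, 8  # input dimension and number of unit pairs (hypothetical sizes)

# Pair each hidden unit with a clone whose output weight is negated, so the
# two units cancel exactly at initialization.
W0 = rng.normal(size=(width, n))
W = np.vstack([W0, W0])                                # paired units share weights
a = np.concatenate([np.ones(width), -np.ones(width)])  # +1 / -1 output weights

def f(x):
    """Two-layer ReLU model; unbiased at this initialization."""
    return a @ np.maximum(W @ x, 0.0)

xs = rng.normal(size=(100, n))
# Unbiased: the model predicts (numerically) zero on every input ...
assert max(abs(f(x)) for x in xs) < 1e-12
# ... yet the tangent-kernel features (here, the gradients w.r.t. the output
# weights a, namely relu(W @ x)) do not cancel, so the NTK is nontrivial.
assert np.linalg.norm(np.maximum(xs @ W.T, 0.0)) > 0
```

During training the paired units receive different gradient updates (their output weights have opposite signs), so they quickly diverge and allow meaningful learning, as the text describes.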
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">With Distribution Dependence</head><p>In Theorem 2 we show that, at least for the square loss, for an arbitrary initialization, even though the Tangent Kernel at initialization might not have an edge, we can construct a distribution W, which we can think of as an alternate "random initialization", such that the Tangent Kernel of the model at a random point &#952; &#8764; W drawn from this distribution does have an edge over null prediction. But the catch is that this distribution depends not only on the true initialization &#952; 0 , but also on the input distribution D X . That is, the Tangent Kernel for which we are establishing an edge is distribution dependent. In Section 5 we will see that this distribution-dependence is unavoidable.</p><p>Theorem 2. For a differentiable model f , initialization &#952; 0 , stepsize &#951;, number of iterations T , and the square loss, given an input distribution D X , there exists a distribution W (that depends on D X ), supported on T + 1 parameter vectors in R p , such that for any source distribution D that agrees with D X and for which &#964; -approximate gradient descent ensures error</p><p>where</p><p>And so, the Average Tangent Kernel K = E &#952;&#8764;W NTK f &#952; has rank at most (T + 1)(p + 1) and inf</p><p>Proof. We first define a sequence of iterates &#952;t which correspond to gradient descent on L D(f &#952; ), where D is defined as a distribution with marginal D X and y = 0 with probability one. These iterates are given by:</p><p>Let W be the uniform distribution over { &#952;0 , . . . , &#952;T }. At a high level, if the Tangent Kernel at all of these iterates does not have an edge, then gradient descent will not deviate from them, in the same way that in Theorem 1 the Tangent Kernel had to have an edge in order for gradient descent to move. 
But then gradient descent leads to f &#952;T &#8712; NTK f &#952;T , and so any edge gradient descent has is also obtained in this Tangent Kernel. In a sense, we are decomposing the gradient into two components: the gradient if all labels were zero, and the deviation from this gradient due to the labels not being zero. The second component is non-zero only when the tangent kernel has an edge. If the initialization is unbiased, the first component is zero, and so if the tangent kernel doesn't have an edge, gradient descent will not move at all. But if the initialization is not unbiased, the first component (corresponding to zero labels) might be non-zero even if the tangent kernel does not have an edge, and so gradient descent can explore additional iterates &#952;t , and an edge at any one of these could be sufficient for gradient descent to succeed. The key is that this sequence of iterates depends only on the input distribution D X , but not on the labels (since it is defined as the behaviour on zero labels).</p><p>Denote v (t) := E (x,y)&#8764;D y &#8226; &#8711; &#952; f &#952;(t) (x) , and let w (t) = -&#945;v (t) / v (t) , with &#945; to be set later s.t.</p><p>. Assume for contradiction that (18) does not hold (i.e. the Tangent Kernels do not have an edge on average), and so:</p><p>On the other hand, we have for every t:</p><p>C f we get that for every t:</p><p>We will now argue inductively that &#952;(t) is a valid choice for the iterate &#952; (t) , i.e. satisfying (1). For iteration t = 0, &#952;(0) = &#952; (0) = &#952; 0 . If up to iteration t we can have &#952;(t) = &#952; (t) then:</p><p>Now, from ( <ref type="formula">23</ref>), and since &#964; &#8805; (T + 1) <ref type="formula">2</ref>), and so &#952; (t+1) = &#952;(t+1) satisfies (1). And by induction, GD can return f &#952; (T ) = f &#952;(T ) . Since f &#952;(T ) &#8712; NTK f &#952;(T ) (B) we get:</p><p>This leads to a contradiction, thereby completing the proof of Theorem 2. 
For the average Kernel, note that it corresponds to a (T + 1)(p + 1) dimensional feature space which is the concatenation of the feature spaces of each NTK f &#952;(t) , and consider a predictor which is the concatenation of the best predictors for each kernel scaled by 1/(T + 1).</p></div>
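The label-independent iterate sequence at the heart of this proof can be sketched numerically. The toy tanh model, sizes, stepsize, and seed below are illustrative assumptions, not the paper's construction; the sketch shows how the path is computed from the input marginal D_X alone (square loss with all labels forced to zero), and how the Average Tangent Kernel over the T+1 iterates is formed:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, eta = 3, 5, 0.1
xs = rng.normal(size=(200, n))     # draws from the input marginal D_X only

# A toy differentiable model (an assumption for illustration, not the paper's f).
def f(theta, x):
    return np.tanh(theta @ x)

def grad_f(theta, x):
    return (1.0 - np.tanh(theta @ x) ** 2) * x

# Gradient descent on the square loss with labels forced to y = 0:
# the gradient is E_x[ f_theta(x) * grad_f_theta(x) ], so the iterates
# depend only on D_X, never on the labels -- the label-independent path
# defining the distribution W in the proof.
theta = 0.5 * rng.normal(size=n)   # an arbitrary (possibly biased) initialization
path = [theta.copy()]
for _ in range(T):
    g = np.mean([f(theta, x) * grad_f(theta, x) for x in xs], axis=0)
    theta = theta - eta * g
    path.append(theta.copy())

# W = uniform over the T+1 iterates; the Average Tangent Kernel is the mean
# of the tangent kernels along this path.
def avg_ntk(x, xp):
    return float(np.mean([grad_f(t, x) @ grad_f(t, xp) for t in path]))
```

Since each tangent kernel in the average corresponds to a p-dimensional feature map (plus the bias coordinate), concatenating the T+1 feature maps bounds the rank of the average kernel, matching the (T+1)(p+1) rank bound stated in the theorem.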
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">No Edge with the Tangent Kernel</head><p>As promised, we will now present an example establishing: 1. Some learning problems are learnable by gradient descent, but only with a biased initialization. 2. Theorem 1 cannot be extended to biased initializations: even if gradient descent, with a biased initialization, ensures small error, the Tangent Kernel at initialization might not have a significant edge. 3. Theorem 2 cannot be made distribution-independent: even if gradient descent, with a biased initialization, ensures small error on a learning problem, there might not be any distribution-independent random "initialization" that yields a significant edge in a Tangent Kernel at a random draw from this "initialization".</p><p>The Learning Problem. This time we will consider learning parities of any cardinality (not necessarily sparse), but we will "leak" the support of the parity through the input distribution D X . For an integer n and &#945; &#8712; (0, 1), we consider the "leaky parities" problem P lp [n, &#945;], consisting of source distributions D I indexed by subsets I &#8838; [n], each a mixture of two components: the first component D 1 I has labels y = &#967; I (x) corresponding to the parity of the bits in I, with a uniform marginal distribution over x. The second component has uniform random labels y independent of the inputs (i.e. the labels in this component are entirely uninformative), but the input is deterministic and set to x := x I where x I i = 1 for i &#8712; I and -1 for i &#8713; I. Learning this problem is very easy: all a learner has to do is examine the marginal of each input coordinate x i . Those in the support I will have E D I [x i ] = &#945; while for i &#8713; I we have E D I [x i ] = -&#945;. The marginals over x i thus completely leak the optimal predictor. 
But as we shall see, while gradient descent can leverage the marginals in this way, kernel methods (where the kernel is pre-determined and does not depend on the input distribution) fundamentally cannot.</p><p>The differentiable model. To learn P lp [n, &#945;], we construct a differentiable model that is very similar to the one in Section 3, being a combination of a linear predictor and a "selected-parity", but where the linear component has a bias term, and is thus equal to -1 (rather than zero) at initialization. Formally, we construct f &#952; with &#952; &#8712; R n (i.e. size p = n) that behaves as follows</p><p>where &#8776; stands for the first order approximation of f &#952; about &#952; = 0, and 1 &#8712; R n is the all-1s vector. Again, ( <ref type="formula">27</ref>) is the only property we need to show that approximate gradient descent learns P lp [n, &#945;]: at initialization, we output -1 for all examples, meaning the derivative of the loss is non-negative, and we generally want to try to increase the output of the model. How much we increase each coordinate &#952; i will depend on E[x i + 5 3 &#945;], which is 8 3 &#945; for i &#8712; I but only 2 3 &#945; for i &#8713; I, and the first step of gradient descent will separate the coordinates inside the support I from those outside it, thus effectively identifying I. With an appropriate stepsize, the step will push &#952; into the regime covered by the second option in ( <ref type="formula">27</ref>), and the model will output the desired parity<ref type="foot">foot_7</ref> . 
The key ingredient here is that all the coordinates are uncorrelated with the labels, and thus also with the residuals at an unbiased initialization (so the model of Section 3 would have zero gradient at initialization); by artificially making all the residuals at initialization non-positive, we create a correlation between each coordinate and the residual, which moves &#952; in a way that depends on the marginal D X (as in the proof of Theorem 2).</p><p>As with the model of Section 3, we can implement (27) as a continuous differentiable model with scale C f = O(n), through a slight modification of the architecture described in Equations ( <ref type="formula">9</ref>)-( <ref type="formula">13</ref>):</p><p>where S, H and &#958; are as defined in (11), ( <ref type="formula">12</ref>) and ( <ref type="formula">13</ref>). The first term in (28) is -1 at &#952; (0) = 0 and satisfies 0) , 1 = 0. The second term is essentially the same as (9), the only difference being that the linear term &#952;, x is replaced with &#952;, x + 5 3 &#945;1 . Similar to the construction in Section 3, we get that (27) holds and that</p><p>). Furthermore, we can implement f &#952; using a similar feed-forward network of depth O(log n) and O(n) edges.</p><p>We prove the following claims in Appendix B.2 and Appendix C.1.2 respectively (cf. Claims 3.1 and 3.2).</p><p>Claim 5.1. For any n and &#945; &#8712; (0, 1), for any D I &#8712; P lp [n, &#945;], gradient descent on the model f of size p = n with C 2 f = O(n 2 ) described above, with initialization &#952; 0 = 0, accuracy &#964; = 4 3 &#945;, step size &#951; = 1 and T = 1 step ensures error L D I (f &#952; (T ) ) &#8804; &#945;.</p><p>Claim 5.2. 
The tangent kernel of the model f at &#952; 0 = 0 is the scaled linear kernel endowed with a bias term, NTK f &#952; 0 (x, x &#8242; ) = (1 + 100 9 &#945; 2 ) + 4 x, x &#8242; , and for any n &#8805; 2, &#945; &#8712; (0, 1), and any D I &#8712; P lp [n, &#945;], the error with this kernel is inf h&#8712;NTK f &#952; 0</p><p>We already see a contrast with the unbiased case, where Theorem 1 ensures the tangent kernel has at least a polynomial edge: even when gradient descent succeeds, if the initialization is not unbiased, the tangent kernel at initialization might not have any edge at all, and so we cannot hope to remove the requirement of the initialization being unbiased from Theorem 1: Separation 3. For any &#949; &gt; 0, there exists a source distribution (with binary labels and w.r.t. the square loss), such that &#8882; [Gradient Descent] Using a differentiable model with p = 2 parameters, T = 1 steps, and accuracy &#964; /C f = &#920;(&#949;), &#964; -approximate gradient descent ensures square loss of &#949;, but &#8882; [Tangent Kernel] The tangent kernel of the model at initialization cannot ensure square loss lower than 1 2 (which is the loss of null prediction).</p><p>Proof. Apply Claims 5.1 and 5.2 to the sole source distribution in</p><p>But what about the tangent kernel at a different "initialization", or of some other model, or even some other kernel altogether? The following Claim (proved in Appendix C.2.2) shows that no kernel can get a significant edge over the zero predictor unless the number of features and the norm are both exponentially large in n. In order to contrast with guarantee (18) of Theorem 2, we state the Claim also for learning using a random kernel, i.e. in expectation over an arbitrary distribution of kernels.</p><p>Claim 5.3. For all &#945; &#8712; (0, 1), p, B, n, and any randomized kernel K with rank(K) = p almost surely (i.e. 
a distribution over kernels, each corresponding to a p-dimensional feature map), there exists</p><p>where L D I is w.r.t. the square loss, and note that</p><p>In particular, we see that in contrast to the distribution-dependent situation of Theorem 2, where we could ensure a polynomial edge for the tangent kernel at a specified randomized "initialization", this is not possible in general, and we cannot hope to remove the dependence on the input distribution in Theorem 2.</p><p>Separation 4. For any sequence &#949; n = 1/ poly(n), there exists a sequence of learning problems P n (with binary labels and w.r.t. the square loss), such that &#8882; [Gradient Descent] for each n, using a differentiable model with p = n parameters, realizable by a neural network of depth O(log n) and O(n) edges (where some edges have trainable weights and some do not), T = 1 steps, and accuracy &#964; /C f = O(&#949; n /n), &#964; -approximate gradient descent learns P n to square loss of &#949; n , but &#8882; [Poly-Sized Kernels] no sequence of (randomized) kernels K n corresponding to feature maps of dimension poly(n) (and hence no tangent kernel of a poly-sized differentiable model, even with randomized "initialization") can allow learning P n (in expectation) to square loss smaller than 1 2 -2 -&#8486;(n) for all n; and &#8882; [Arbitrary Kernel, Poly Norm] no sequence of (randomized) kernels K n of any dimension can allow learning P n (in expectation) to square loss smaller than 1 2 -2 -&#8486;(n) for all n using predictors in</p><p>Proof. Apply Claims 5.1 and 5.3 to</p><p>Since we are working over a domain X = {-1, +1} n of cardinality 2 n , we can always ensure an edge of (1 -&#945;)p/2 n+1 using indicator features on p of the 2 n inputs, thus allowing memorization of this tiny fraction of the inputs. 
An edge of 2 -&#920;(n) is thus a "trivial" edge attained by this kind of memorization, and Claim 5.3 and Separation 4 establish that the tangent kernel, or indeed any other kernel, cannot be significantly better.</p><p>We can also conclude that even though we saw that our learning problem is learnable with gradient descent, we cannot hope to learn it with an unbiased initialization, since this would imply the existence of a kernel with a polynomial edge. We thus get the following separation between learnability with gradient descent using biased versus unbiased initialization (importantly, only for distribution independent learning).</p><p>Separation 5. For any sequence &#949; n = 1/ poly(n), there exists a sequence of learning problems P n (with binary labels and w.r.t. the square loss), such that &#8882; [GD with biased initialization] for each n, using a differentiable model with p = n parameters, realizable by a neural network of depth O(log n) and O(n) edges (where some edges have trainable weights and some do not), T = 1 step, accuracy &#964; /C f = O(&#949; n /n), and some (not unbiased) initialization, &#964; -approximate gradient descent learns P n to square loss &#949; n , but &#8882; [GD with unbiased initialization] It is not possible to ensure error better than 1 2 on P n , for all n, using &#964; -approximate gradient descent starting with an unbiased initialization (as in Eq. ( <ref type="formula">14</ref>)), and any differentiable model of size p = poly(n) and accuracy &#964; /C f = 1/ poly(n), and any number of steps T .</p><p>Proof. Consider P n = P lp [n, &#945; = &#949; n ]. Claim 5.1 ensures learnability with a biased initialization. 
If P n were also learnable by &#964; -approximate gradient descent with an unbiased initialization, using a model f of size p n = poly(n) and scale C f satisfying &#964; /C f = 1/ poly(n), to within any error better than 0.5, then Theorem 1 would imply that the tangent kernel at initialization ensures error &#8804; 1 2 -1 poly(n) . This is a fixed kernel that depends only on the model f &#952; , and corresponds to a feature map of dimension p n + 1. Since p n = poly(n), this would contradict Claim 5.3.</p></div>
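The "leaky parities" construction above can be sketched numerically. The support set, the sizes, the seed, and the choice of mixture weight for the leak component (taken to be &#945;, which matches the stated marginals E[x i ] = &#177;&#945;) are illustrative assumptions; the sketch shows how the coordinate means leak the support even though every individual coordinate is uncorrelated with the label:

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha = 8, 0.25
I = [1, 3, 4]                      # a hypothetical support set for illustration

def sample(m):
    """Draw m examples from the "leaky parities" distribution D_I:
    w.p. 1-alpha a uniform input with parity label chi_I(x); w.p. alpha the
    fixed leak point x^I (+1 on I, -1 off I) with a uniform random label."""
    x_leak = np.where(np.isin(np.arange(n), I), 1.0, -1.0)
    X, y = np.empty((m, n)), np.empty(m)
    for j in range(m):
        if rng.random() < alpha:                 # uninformative leak component
            X[j] = x_leak
            y[j] = rng.choice([-1.0, 1.0])
        else:                                    # parity component
            X[j] = rng.choice([-1.0, 1.0], size=n)
            y[j] = np.prod(X[j, I])
    return X, y

X, y = sample(20000)
# The per-coordinate means leak the support: E[x_i] = +alpha on I, -alpha off I.
recovered = sorted(np.flatnonzero(X.mean(axis=0) > 0).tolist())
```

A learner that simply thresholds the empirical coordinate means at zero recovers the support I, which is exactly the first gradient step of the (biased-initialization) model described in the text; a fixed kernel, by contrast, cannot adapt to these marginals.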
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Survey of Separation Results</head><p>In Sections 3 and 5 we described learning problems for which gradient descent learning succeeds, but where the NTK, or indeed any other (reasonably sized) kernel, has either a very small edge, or even no edge at all. Our emphasis was on establishing not only that the NTK or kernel methods get worse error than gradient descent (as in previous separation results), but on bounding how close to the null prediction they must be (i.e. how small an edge they must have). Here we survey such prior separation results, understanding the relationships between them, and how they relate to the new separations from this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Survey Table. Table 2 summarizes prior results</head><p>showing separations between the error that can be ensured using gradient descent versus using kernel methods, as well as our Separations 1, 2, 3 and 4. All problems are over X = R n or {&#177;1} n with scalar labels (Y = R or {&#177;1}) and w.r.t. the squared loss, except Allen-Zhu and <ref type="bibr">Li (2019)</ref> which uses multiple outputs (y is a vector), and Daniely and Malach (2020) which is w.r.t. the hinge loss.</p><p>The middle set of green columns indicates what error can be ensured by running the indicated type of gradient descent variant on the indicated type of model. An error of "&#8594; 0" indicates that the learning problem only depends on n, and that for every n, gradient descent on a fixed model (that depends on n but not &#949;) can make the error &#949; arbitrarily small using poly(n/&#949;) samples and/or steps. An error of "&#8804; &#949;" indicates that the learning problem also depends on &#949;: for all n and &#949; &gt; 0, there exists a learning problem on which gradient descent ensures error &#8804; &#949; (but kernel methods incur larger error). The annotations 0/ &#8776; /&#215; under column "I" indicate: 0 if the initialization used is unbiased, or could easily be made unbiased; &#8776; if it is nearly unbiased (error at initialization close to null prediction); or &#215; if the initialization is far from null.</p><p>The table entry for our separations reads: mix of parities and input distribution leaking the support (mixture ratio depends on &#949;). See Section 6 for table description and notation, and Appendix A for further details.</p><p>Column I: 0 indicates unbiased initialization, &#8776; almost unbiased, &#215; not unbiased. Column D: "-" indicates a lower bound for a specific source distribution, "y" for a learning problem with fixed input distribution DX , "x" for a learning problem with variable input distributions.</p><p>(1) There is a discrepancy between the lower bound of <ref type="bibr">Yehudai and Shamir (2019)</ref> and the upper bounds on learning a Single ReLU with gradient descent (e.g. <ref type="bibr">Soltanolkotabi, 2017;</ref><ref type="bibr">Mei et al., 2018)</ref> in that the upper bounds are only without a bias (i.e. when b = 0) but the lower bound relies crucially on the bias b. Yehudai and Shamir thus do not establish a separation between differentiable learning and kernel methods. See Appendix A for discussion on possible remedies. (2) The gradient descent error &#949;(&#961;) and kernel error &#949; &#8242; (&#961;) depend on &#961; and on the specific source distribution (i.e. the quadratic target or the covariances of the classes), or rather a sequence of source distributions as n &#8594; &#8734;. But <ref type="bibr">Ghorbani et al. (2019a)</ref> show that for non-degenerate distribution sequences, and any &#961; &lt; 1, the limit &#949;(&#961;) of the gradient descent error (as n &#8594; &#8734;) will be strictly lower than a lower bound &#949; &#8242; (&#961;) on the error of the NTK. (3) <ref type="bibr">Li et al. (2020)</ref> analyze a form of GD where large coordinates of the gradient are set to zero. (4) Allen-Zhu and Li (2020) use non-standard but simple regularization. (5) <ref type="bibr">Daniely and Malach (2020)</ref> use random initialization, but units are paired with complementary initialization, as suggested by <ref type="bibr">Chizat et al. (2019a)</ref>, so as to ensure an unbiased initialization.</p><p>The right set of pink columns shows the lower bound on the error for the kernel or class of kernels indicated, or under other restrictions ("dim" is the number of features or rank of the kernel, "norm &#8804; B" means restricting to F(K, B) as in ( <ref type="formula">6</ref>), NTK is the tangent kernel of the model used for gradient descent unless otherwise specified). 
In all cases, the error is normalized so that the error of null prediction is L(0) = 1 /2. For our separations, the error is given in terms of the edge (if any) over null prediction. In all prior separations, the error lower bound is bounded away from the null prediction (the edge is at least 0.1), and the actual error lower bound is indicated. The notation &#8805; &#949; &#8242; (c) &gt; 0 and &#8805; &#949; &#8242; (&#961;) &gt; &#949;(&#961;) indicates that for any choice of c or &#961;, there is some &#949; &#8242; (c) &gt; 0 or &#949; &#8242; (&#961;) &gt; &#949;(&#961;) which lower bounds the error.</p><p>The last column indicates the nature of the learning problem used to establish the lower bound, and thus whether the separation is distribution dependent or independent (in the sense introduced in Section 2): "-" indicates that the separation is for a specific source distribution (or a sequence of source distributions with increasing input dimension n). This is the strongest form of separation, implying also a distribution dependent separation, and is only possible by restricting to a specific kernel or kernel class. The notation "y" indicates the lower bound is for a learning problem P where the input distribution D X is fixed, and the source distributions in P vary only in the labeling function y(x) (which is deterministic in all such problems considered in the table). This yields a "distribution dependent" separation (i.e. a separation that holds even if the kernel is allowed to depend on the input distribution). The notation "x" indicates that the lower bound is for a learning problem P where different source distributions D &#8712; P have varying input distributions D X , but we seek a kernel that would work well for all distributions in the learning problem. This yields a weaker "distribution independent" separation, as it only implies a separation if the kernel choice is not allowed to depend on the input distribution D X . 
It should be noted that these differences are due to differences in how the lower bound is established, and not to differences in the limitations or generality of the gradient descent guarantees (in all settings considered, the gradient descent guarantees are under severe restrictions on the input distribution, but the model used for training is independent of the specific input distribution meeting these restrictions).</p><p>Discussion. As can be seen from Table <ref type="table">2</ref>, in all the prior separation results, the lower bound on the error of kernel methods is bounded away from L(0), i.e. the upper bound on the edge is O(1). In fact, in all prior separations except for <ref type="bibr">Daniely and Malach (2020)</ref>, the kernel error goes to zero when the error of gradient descent goes to zero; it just goes to zero more slowly (Daniely and Malach establish a lower bound on the error that is bounded away from zero, but it still corresponds to a constant edge). In contrast, we exhibit a situation where the NTK, or any kernel method, has only a vanishing (or zero) edge, and study how quickly this edge goes to zero. In order to obtain a separation where the edge of the NTK at initialization is zero, and the edge of any kernel is exponentially small, we construct a learning problem where the input distribution is not fixed, and where gradient descent learning is possible only with a biased initialization-this is necessary, since otherwise Theorems 1 and 2 would tell us the edge must be at least polynomial. It is interesting to note that none of the prior constructions are of this nature: the only prior construction that relies on variable input distributions is that of <ref type="bibr">Daniely and Malach (2020)</ref>, but they use an unbiased initialization.</p><p>The different separation results differ in the type of kernels for which they establish lower bounds. 
Many of the separation results (including our Separations 2 and 4) establish lower bounds for any kernel corresponding to a poly-dimensional feature space, and thus for the NTK of any poly-sized model. When considering the square loss, such lower bounds also imply lower bounds for any kernel (of any dimensionality) using poly-norm predictors, as in our Claims 3.3 and 5.3 (also see Lemma 5 in <ref type="bibr">Kamath et al., 2020)</ref>. Since <ref type="bibr">Daniely and Malach (2020)</ref> considered the hinge loss, they required that both the dimension and the norm be polynomial (see <ref type="bibr">Kamath et al. (2020)</ref> for a discussion of kernel lower bounds for the squared norm versus the hinge loss; in yet-unpublished followup work, we obtain lower bounds on the hinge loss also without a norm bound).</p><p>As discussed in Section 2, in order to establish lower bounds with respect to any kernel, one necessarily needs to consider an entire learning problem (with multiple different possible labelings) rather than a specific source distribution. <ref type="bibr">Ghorbani et al. (2019a,b)</ref> take a different approach: they restrict the kernels being considered to NTKs of a specific model, NTKs of a class of models, or general rotationally invariant kernels, and so are able to establish separations for specific source distributions (between differentiable learning and using one of these specific kernels). These separations are more similar to our Separations 1 and 3, which are specific to the NTK of the model being used (theirs are more general), but their separations hold only for large input dimensions n, while ours hold already for dimension n = 2.</p><p>A deficiency of our Separations 3 and 4, shared also by <ref type="bibr">Li et al. (2020)</ref>; Allen-Zhu and Li (2019); Daniely and Malach (2020) and <ref type="bibr">Ghorbani et al. (2019a)</ref>, is that we do not demonstrate a fixed learning problem where the gradient descent error goes to zero. 
Instead, for every &#949; &gt; 0, we construct a different problem where the gradient descent error is &#8804; &#949; (while the kernel error is high; close to that of the null predictor). <ref type="bibr">Ghorbani et al. (2019b)</ref> and Allen-Zhu and Li (2020), as well as our Separations 1 and 2, do use learning problems where gradient descent can get arbitrarily small error. It would be interesting to strengthen Separations 3 and 4 so that gradient descent could converge to zero error (with a fixed learning problem and model).</p><p>The separation results also differ in the form of gradient descent they analyze. <ref type="bibr">Ghorbani et al. (2019a)</ref> and <ref type="bibr">Daniely and Malach (2020)</ref> analyze gradient descent or gradient flow on the population, which is the limit of GD/SGD when the number of samples and/or iterations goes to infinity (although how quickly these should increase is not analyzed). But most analyses are for batch or stochastic gradient descent (which is what would be used in practice) with polynomial samples and iterations <ref type="bibr">(Soltanolkotabi, 2017;</ref><ref type="bibr">Mei et al., 2018;</ref><ref type="bibr">Allen-Zhu and Li, 2019)</ref>, perhaps with slight modifications or regularization <ref type="bibr">(Li et al., 2020;</ref><ref type="bibr">Allen-Zhu and Li, 2020)</ref>. Our analysis is for &#964; -approximate gradient descent with polynomial accuracy &#964; , which subsumes batch (or mini-batch) gradient descent with polynomially many samples (or batch size). But our analysis is perhaps odd and unnatural in that it relies on a single step with a large stepsize, rather than allowing the number of steps to increase and the stepsize to decrease. It should be possible, though technically much more complicated, to extend our analysis to cover also gradient descent with any stepsize smaller than the one we use, and thus also gradient flow. 
This would establish strong separation also based on a more restrictive definition of differentiable learning, which requires robustness with respect to the stepsize.</p><p>Finally, the models we use are chosen to be simple to analyze, but they are perhaps not "natural". We do show how they can be implemented with a sigmoidal network, but this is a network with a very specific architecture, or alternatively a fully connected network with a very specific initialization and where some of the weights are fixed rather than trained. Showing similarly strong separations with more natural architectures and random initialization would be desirable. The empirical demonstration in Figure <ref type="figure">3</ref> indicates this should be possible (though perhaps technically involved). <ref type="bibr">Allen-Zhu and Li (2019, 2020)</ref> also use a specialized residual architecture (though perhaps not as specific as ours), while the others <ref type="bibr">(Soltanolkotabi, 2017;</ref><ref type="bibr">Mei et al., 2018;</ref><ref type="bibr">Ghorbani et al., 2019a;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Daniely and Malach, 2020)</ref> use fairly benign networks and initialization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion and Discussion</head><p>With the study of Neural Tangent Kernels increasing in popularity, both as an analysis approach and as a methodology, it is important to understand the limits of the relationship between the Tangent Kernel approximation and the true differentiable model. Furthermore, the notion of "gradual" learning of deep models, in which we learn progressively more complex models, or more layers, so that the success of deep learning rests on being able to make progress even with simpler, e.g. linear, models, is an appealing approach to understanding deep learning. Indeed, when we first asked ourselves whether the tangent kernel must always have an edge in order for gradient descent to succeed, and sought to quantify how large this edge must be, we were guided also by this "gradual" view of deep learning. We were surprised to discover that, in fact, with a biased initialization, deep learning can succeed even without the tangent kernel having a significant edge.</p><p>Our results also highlight the importance of the distinction between distribution dependent and independent learning, and between biased and unbiased initialization. The gap between distribution dependent and independent learning relates to kernel (i.e. linear) methods inherently not being able to leverage information in D X : success or failure is based on whether y|x is well represented by a predictor in F(K, B), and has little to do with the marginal over x. In contrast, gradient descent on a non-linear model could behave very differently depending on D X , as we also see empirically in the experiment in Figure <ref type="figure">3</ref>. Perhaps even more surprising is the role of the bias of the initialization. 
It might seem like a benign property, and that we should always be able to initialize with zero, or nearly-zero, predictions, or at least at a θ_0 that is not much worse than null, or perhaps correct for the bias as in <ref type="bibr">Chizat et al. (2019a)</ref>. But we show that, at least for distribution-independent learning, this is not a benign property at all: for some problems we must use biased initialization (Separation 5). This observation may be of independent interest, beyond the role it plays in understanding the Neural Tangent Kernel.</p><p>The learning problems and models we used to demonstrate the separation results are artificially extreme, so as to push the separation to the limit and allow easy analytical study. But we believe they do capture ways in which gradient descent is more powerful than kernel methods. In the example of Section 3, gradient descent starts by selecting a few "simple features" (the k &#8810; n relevant coordinates), based on simple correlations. But unlike kernel methods, gradient descent is then able to use these features in more complex ways. We see this happening also empirically in Figure <ref type="figure">3</ref> with a straightforward two-layer ReLU network, where gradient descent succeeds in learning the complex parity function once there is enough signal to easily identify the few relevant coordinates.</p><p>In the example of Section 5, we see how gradient descent is also able to pick up on structure in the input (unlabeled data) distribution, in a way that kernel methods fundamentally cannot.</p></div>
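The correlation-based selection of the relevant coordinates described above can be made concrete with a small sketch. The following is our own illustrative computation (not the paper's exact construction; n, I and the bias p are arbitrary choices): for a sparse parity over a biased product distribution, the relevant coordinates carry a strictly larger first-order correlation with the label than the irrelevant ones, which is exactly the signal a first gradient step can pick up. Under the uniform distribution (mean 0 coordinates) all of these correlations vanish.

```python
# Illustrative sketch: exact population correlations E[x_j * y] between each
# coordinate and a sparse-parity label, under an i.i.d. biased distribution.
from itertools import product

n, I = 8, {0, 1, 2}          # k = 3 relevant coordinates (hypothetical choice)
p = 0.75                     # P(x_i = +1); coordinate mean mu = 2p - 1 = 0.5
mu = 2 * p - 1

def prob(x):                 # weight of x under the i.i.d. biased distribution
    w = 1.0
    for xi in x:
        w *= p if xi == 1 else 1 - p
    return w

corr = [0.0] * n             # exact E[x_j * y], by enumerating all 2^n inputs
for x in product((-1, 1), repeat=n):
    y = 1
    for i in I:
        y *= x[i]            # parity label on the relevant coordinates
    for j in range(n):
        corr[j] += prob(x) * x[j] * y

# Relevant coordinates: E[x_j y] = mu^(k-1) = 0.25;
# irrelevant coordinates: E[x_j y] = mu^(k+1) = 0.0625.
print([round(c, 4) for c in corr])
```

The gap mu^(k-1) versus mu^(k+1) is what makes the relevant coordinates identifiable from a single (even approximate) gradient; at mu = 0 both quantities are zero and no linear feature correlates with the label.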
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Prior work on Separations between Differentiable Learning and Kernels</head><p>We provide more details on the prior work presented in Table <ref type="table">2</ref>, demonstrating learning tasks where gradient-descent based learning can ensure smaller error than any reasonably sized kernel method.</p><p>In the summary below, whenever a poly(&#8226;) term appears under "Learnability with Gradient Descent", it is always for a particular poly(&#8226;) that has been explicitly or implicitly specified in the corresponding reference. On the other hand, every poly(&#8226;) term that appears under "Non-learnability with any Kernel Method" is meant to hold for any poly(&#8226;).</p><p>1. <ref type="bibr">Yehudai and Shamir (2019)</ref>; <ref type="bibr">Ghorbani et al. (2019b)</ref> : Class of ReLU neurons over Gaussian inputs &#8882; Learning Problem: With X = R^n and Y = R, a source distribution is given by D_X = N(0, I_n) and y = [⟨w, x⟩ + b]_+. To make the result comparable to others, in terms of the value of the loss, we consider a normalization of w and b such that the loss of the null predictor is L(0) = 1/2. In the original formulation of <ref type="bibr">Yehudai and Shamir (2019)</ref>, w, b are chosen s.t. ‖w‖, |b| ≤ O(n^4), so that the value y is bounded by O(n^4), and the loss of the null predictor is O(n^8) (due to the square loss). Therefore, in order to have L(0) = Θ(1), the values of w, b are scaled down by 1/n^4, and the lower bound on the loss is Ω(1/n^8) (and not Ω(1) as stated in the paper). &#8882; Learnability with Gradient Descent: Soltanolkotabi (2017); <ref type="bibr">Mei et al. (2018)</ref> showed that, in the case b = 0, projected batch gradient descent over the ReLU model with poly(n/ε) samples and zero initialization (i.e., unbiased initialization with w^(0) = 0) can ensure square loss at most ε within O(poly(n) log(1/ε)) steps.
Importantly, this holds for any ε &gt; 0. In the analysis of this positive result, initializing from zero is important, since in this case the first step of gradient descent already captures the "direction" of the target neuron. &#8882; Non-learnability with any Kernel Method: <ref type="bibr">Yehudai and Shamir (2019)</ref> show that any randomized feature map of dimension at most 2^(c_1 n), with coefficients of the predictors bounded by 2^(c_1 n), for some universal constant c_1, must incur a square loss of at least Ω(1/n^8); this is a scaled version of their stated result, which was for ‖w‖ = n^3 and |b| ≤ 6n^4 + 1. Subsequently, <ref type="bibr">Kamath et al. (2020)</ref> showed that the same bound can be obtained without the restriction on the coefficients of the predictors, leading to the lower bound as presented in Table <ref type="table">2</ref>. Both of these lower bounds crucially rely on a non-zero bias term b. Because the upper bounds are specific to the case b = 0, where there is no bias term, while the lower bound relies on a bias term b ≠ 0, the analysis of Yehudai and Shamir does not establish a separation between differentiable learning and kernel methods. It was communicated to us by Gilad Yehudai that ongoing work with Ohad Shamir and Gal Vardi indicates that gradient descent also succeeds in the presence of a non-zero bias term, which would yield a separation. Alternatively, in our own as-yet unpublished work, we observed that it is also possible to obtain a lower bound for ReLU without the bias term, which also yields a separation. <ref type="bibr">Ghorbani et al. (2019b)</ref> show that for any constant c_2, any kernel method that is either (i) a rotationally invariant kernel using at most n^(c_2) samples or (ii) the NTK of a depth-2 ReLU net with at most n^(c_2) units must incur a square loss of at least ε′(c_2), a positive constant that depends only on c_2.
This result holds even for a fixed source distribution D, that is, a fixed choice of w and D_X = N(0, I_n), and it holds with probability approaching 1 as the dimension n and the number of features grow to infinity.</p><p>2. <ref type="bibr">Ghorbani et al. (2019a)</ref> : (i) Convex Quadratic Functions, (ii) Mixture of Gaussians (Binary Labels) &#8882; Learning Problems: Two learning problems are considered here, both w.r.t. the square loss.</p><p>(i) With X = R^n and Y = R, a source distribution is given by Gaussian input marginals D_X = N(0, I_n) and y = b_0 + x^T B x for B ⪰ 0.</p><p>(ii) With X = R^n and Y = {-1, 1}, a source distribution is specified by (Σ^(1), Σ^(-1)), where y ∼ U({-1, 1}) and x | y ∼ N(0, Σ^(y)). &#8882; Learnability with Gradient Descent: For a depth-2 network with quadratic activations, with N units and any random initialization that is absolutely continuous w.r.t. the Lebesgue measure (in particular, the initialization can be chosen to be nearly unbiased), it is shown that in the asymptotic regime as n, N → ∞ with ρ = N/n &lt; 1, gradient flow w.r.t. the population loss ensures square loss approaching ε(ρ), a closed-form expression that also depends on the specific source distribution (i.e. the quadratic target or the covariances of the classes), or rather on the sequence of source distributions as n → ∞. The requirement on the initialization is needed since there is a measure-zero set of initializations that converge to a saddle point. Depending on the problem, the saddle points can be escaped also from an unbiased initialization.
&#8882; Non-learnability with any Kernel Method: It is shown, in the same asymptotic regime of n, N → ∞ with ρ = N/n &lt; 1, that the NTK at initialization of the same model incurs a square loss of ε′(ρ), a closed-form expression that also depends on the sequence of source distributions as n → ∞. It is shown that for non-degenerate distribution sequences, and any ρ &lt; 1, ε′(ρ) &gt; ε(ρ).</p><p>3. <ref type="bibr">Li et al. (2020)</ref> : A depth-2 net, with absolute value activations and restricted weights, over Gaussian inputs &#8882; Learning Problem: With X = R^n and Y = R, a source distribution is given by D_X = N(0, I_n) and y = Σ_{i=1}^n a_i |⟨w_i, x⟩|, which is a two-layer network with absolute value activations, with the additional constraints that the w_i's are orthonormal and a_i ∈ [1/(κn), κ/n] for some κ &gt; 1. &#8882; Learnability with Gradient Descent: For a depth-2 ReLU network with poly(n) units and a random normal initialization (that is not unbiased), a type of truncated batch gradient descent (where large coordinates of the gradient are set to zero) using poly(n) samples is shown to ensure square loss at most ε := 1/n^(1+δ) for a constant δ ∈ (0, 0.01) (that depends on κ) within poly(n) steps. Notably, this procedure is not shown to ensure an arbitrarily small error. &#8882; Non-learnability with any Kernel Method: It is shown that any kernel method with at most poly(n) features (for any poly(n)) must incur a square loss of at least Ω(1/n) ≥ ε^(1-δ/2). 4. Allen-Zhu and Li (2019) : Sparse Variable Selection + Parity (implementable as a depth-2 ResNet)</p><p>We can interpret this as Wx selecting a k-sized subset of the input coordinates, and G as computing the parity of that subset, in each of the k coordinates.
&#8882; Learnability with Gradient Descent: For an overparameterized depth-2 ResNet with poly(n/ε) units and a random normal initialization (that has a high variance, leading to a biased initialization), stochastic gradient descent is shown to ensure square loss ε ≤ α^3.9 within poly(n, 1/ε) steps. Notably, it is not shown to ensure an arbitrarily small error. &#8882; Non-learnability with any Kernel Method: It is shown that any kernel method with at most poly(n) features (for any poly(n)) must incur a square loss of at least Ω(ε^0.52) ≥ Ω(α^2).</p><p>Remark: Note that the learning problem here depends on the choice of ε. That is, for every n and ε &gt; 0, there is a learning problem where stochastic gradient descent ensures square loss ≤ ε whereas no kernel method of dimension at most poly(n) can achieve a loss smaller than ε^0.52.</p></div>
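The "first step captures the direction" mechanism in item 1's positive result can be checked numerically. The following sketch is our own illustration (dimension, sample size, and seed are arbitrary choices, and we estimate the population quantity by sampling): for y = [⟨w*, x⟩]_+ with standard Gaussian inputs, E[y x] = w*/2, so the first gradient step from zero initialization already moves along the target direction w*.

```python
# Numerical sketch: the correlation E[y x] for a bias-free ReLU neuron over
# Gaussian inputs points in the direction of the target weight vector w*.
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 200_000                        # illustrative dimension / sample size
w_star = rng.standard_normal(n)
w_star /= np.linalg.norm(w_star)          # unit-norm target neuron

x = rng.standard_normal((m, n))           # x ~ N(0, I_n)
y = np.maximum(x @ w_star, 0.0)           # y = [<w*, x>]_+  (b = 0 case)

g = (y[:, None] * x).mean(axis=0)         # estimate of E[y x] (= w*/2)
cos = g @ w_star / np.linalg.norm(g)
print(f"cosine with target direction: {cos:.4f}")   # close to 1
```

This is why zero initialization is helpful there: the very first (approximate) gradient already reveals the direction of the target neuron, after which the remaining problem is essentially one-dimensional.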
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Allen-Zhu and Li (2020) :</head><p>A depth-L net with quadratic activations &#8882; Learning Problem: With X = R^n and Y = R, a source distribution is given by D_X = N(0, I_n) and y = N_θ(x), where N_θ is a depth-L network with quadratic activations that has additional constraints on the presence of skip connections, as well as constraints on the weights, referred to as the "information gap". L can be taken to be any parameter that is o(log log n). &#8882; Learnability with Gradient Descent: For an overparameterized depth-L net with quadratic activations, with the same layer structure as N_θ, with poly(n) units and zero initialization (unbiased), regularized stochastic gradient descent (with a non-standard regularizer) is shown to ensure square loss ε within poly(n/ε) steps. Importantly, this holds for any ε &gt; 0. &#8882; Non-learnability with any Kernel Method: It is shown that any kernel method of dimension at most poly(n) (for any poly(n)) must incur a square loss of at least Ω(1/n^0.01).</p><p>6. Daniely and Malach (2020) : Sparse parities over a mixture of uniform and leaky marginals &#8882; Learning Problem: With X = {-1, 1}^n and Y = {-1, 1}, the learning problem is similar to the P_lp[n, α] that we consider in Section 5 with α = 1/2. There are two differences: (i) the label under D_I^(1) is given by the parity, instead of being random, and (ii) the sparsity of the parity is fixed to be k, which can be taken to be k = Θ(n^(1/10)). &#8882; Learnability with Gradient Descent: For a particular ε = 1/poly(n), a depth-2 net with ReLU6 activation with poly(n) units and an unbiased initialization (obtained by pairing the units with complementary initialization, as suggested by <ref type="bibr">Chizat et al.
(2019a)</ref>), a regularized gradient descent (with standard ℓ_2 regularization) over the population loss is shown to ensure hinge loss ε within poly(n) steps. Notably, it is not shown to ensure an arbitrarily small error. &#8882; Non-learnability with any Kernel Method: It is shown that any kernel method of dimension at most poly(n) and norm at most poly(n) must incur a hinge loss of at least 1/3 (recall that the hinge loss is ℓ(ŷ, y) := max{1 - ŷy, 0}, so that L_D(0) = 1 for any distribution D with ±1 labels). </p><p>B.1 Proof from Section 3 Proof of Claim 3.1. We only rely on Equation <ref type="formula">8</ref>, which gives us that at θ^(0) = 0, we have for all</p><p>and hence ∇_θ L_{D_I}(f_{θ^(0)}) is given by</p><p>Thus, for all j ∈ I we have</p><p>and for all j ∉ I we have:</p><p>With step size η = 2^k/(αn), we have that after the first gradient step, (i) for all j ∈ I, θ_j^(1)</p><p>From Equation <ref type="formula">8</ref> we get f_{θ^(1)}(x) = ∏_{i∈I} x_i, thereby concluding the proof.</p><p>B.2 Proof from Section 5 Claim 5.1. For any n and α ∈ (0, 1), and any D_I ∈ P_lp[n, α], gradient descent on the model f of size p = n with C_f^2 = O(n^2) described above, with initialization θ_0 = 0, accuracy τ = (4/3)α, step size η = 1 and T = 1 step, ensures error L_{D_I}(f_{θ^(T)}) ≤ α.</p><p>Proof of Claim 5.1. We only rely on Equation <ref type="formula">27</ref>, which gives us that at θ^(0) = 0, we have for all x ∈ {-1, 1}^n that</p><p>and hence ∇_θ L_{D_I}(f_{θ^(0)}) is given by</p><p>where we get that E_{(x,y)∼D^(0)}</p><p>With gradient accuracy τ = (4/3)α, we get that any valid gradient estimate g must satisfy</p><p>From Equation <ref type="formula">27</ref>, we get that f_{θ^(1)}(x) = ∏_{i∈I} x_i, and it is easy to see that L_{D_I}(f_{θ^(1)}) = α.</p></div>
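The proofs above only require a gradient oracle that is accurate to within some tolerance τ. The robustness mechanism can be illustrated with a toy computation (the gradient values, sizes, and thresholds below are hypothetical placeholders, not those of Equations 8 or 27): as long as the true gradient separates the relevant coordinates from the rest by a margin larger than 2τ, any τ-accurate estimate preserves the separation, so a single approximate-gradient step still identifies the correct coordinates.

```python
# Toy sketch with hypothetical gradient values: if the true population
# gradient is -c on the relevant coordinates I and 0 elsewhere, then ANY
# estimate g with ||g - grad||_inf <= tau < c/2 still separates I from its
# complement after one step (eta = 1) from theta0 = 0.
n, I, c, tau = 12, {2, 5, 7}, 1.0, 0.3           # requires tau < c/2

grad = [-c if j in I else 0.0 for j in range(n)]

# Adversarial (worst-case) tau-perturbation of every coordinate:
g = [gj + (tau if gj < 0 else -tau) for gj in grad]

theta = [-gj for gj in g]                        # one gradient step
recovered = {j for j in range(n) if theta[j] > c / 2}
print(sorted(recovered))                         # -> [2, 5, 7]
```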
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Lower Bounds on Kernel Methods</head><p>C.1 Lower Bounds for Tangent Kernels C.1.1 NTK Edge in Section 3 Claim 3.2. The tangent kernel of the model f at θ_0 = 0 is the scaled linear kernel NTK_{f_{θ_0}}(x, x′) = 4⟨x, x′⟩, and for any 2 ≤ k ≤ n, α ∈ (0, 1) and D_I ∈ P_bsp[n, k, α], the error with this kernel is inf_{h∈NTK_{f_{θ_0}}}</p><p>for all B ≥ 0, where τ = α/2^k as in Claim 3.1. On the other hand, ∃h ∈ NTK_{f_{θ_0}}(n) s.t.</p><p>Proof. It follows from (<ref type="formula">30</ref>) and (<ref type="formula">31</ref>) that the tangent feature map is φ_{θ_0}(x) = 2x, establishing that the NTK is the (scaled) standard linear kernel, and that predictors in the NTK are just linear in x. For uniform inputs (i.e. from D_0), no linear predictor h_w(x) = ⟨w, x⟩ is correlated with the parity of k ≥ 2 bits, and so the expected square loss of any such predictor on D_0 will be at least 1/2, yielding L_{D_I}(h_w) ≥ (1 - α)·(1/2) = 1/2 - α/2. To obtain an upper bound on the error (i.e. a lower bound on the edge), consider a linear predictor h(x) = c·Z where Z = Σ_{i∈I} x_i. By calculating E[Z^2] = k + α(k^2 - k)/4 and E[Zy] = αk/2^(k-1), and verifying that the optimal scaling is c = E[Zy]/E[Z^2], we get that with this value of c, L_{D_I}(h) = 1/2 - γ, where</p><p>The actual optimal linear predictor is of the form h(x) = c Σ_{i∈I} x_i + b Σ_{i∉I} x_i, and has a slightly better edge.</p><p>C.1.2 NTK Edge in Section 5</p><p>Claim 5.2.
The tangent kernel of the model f at θ_0 = 0 is the scaled linear kernel endowed with a bias term, NTK_{f_{θ_0}}(x, x′) = (1 + (100/9)α^2) + 4⟨x, x′⟩, and for any n ≥ 2, α ∈ (0, 1), and any D_I ∈ P_lp[n, α], the error with this kernel is inf_{h∈NTK_{f_{θ_0}}(B)} L_{D_I}(h) = L_{D_I}(0) = 1/2 for any B ≥ 0.</p><p>Proof. This follows from noting that for any h_{w,b} ∈ NTK_{f_{θ_0}}(B) of the form h_{w,b}(x) = ⟨w, x⟩ + b, it holds that E_{(x,y)∼D_I}[h_{w,b}(x)·y] = 0. Therefore, the optimal scaling of h_{w,b} is 0, which incurs loss L_{D_I}(0) = 1/2.</p></div>
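Both claims rest on the classical fact that, under uniform inputs, a parity of k ≥ 2 bits is uncorrelated with every affine function, so predictors in a linear tangent kernel have no edge there. A brute-force verification on a small instance (the sizes and the index set here are illustrative choices, not the paper's parameters):

```python
# Brute-force check: a k >= 2 parity has zero correlation with the constant
# and with every coordinate under the uniform distribution on {-1, 1}^n,
# hence with every affine (linear NTK) predictor.
from itertools import product

n, I = 6, {0, 3}                       # a parity on k = 2 coordinates
pts = list(product((-1, 1), repeat=n))

def parity(x):
    y = 1
    for i in I:
        y *= x[i]
    return y

# E[y] and E[x_j * y] over the uniform distribution:
mean_y = sum(parity(x) for x in pts) / len(pts)
corrs = [sum(x[j] * parity(x) for x in pts) / len(pts) for j in range(n)]

print(mean_y, corrs)                   # -> 0.0 and all-zero correlations
```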
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Lower Bounds for General Kernels</head><p>In order to prove lower bounds against general kernels, we recall the lower bounds, proved by <ref type="bibr">Kamath et al. (2020)</ref> via probabilistic variants of dimensional and margin complexity, on the dimension or norm of a (probabilistic) feature map needed to represent certain hypothesis classes. Consider a "fixed-marginal" learning problem P over X × Y with Y ⊆ R, where each D ∈ P corresponds to a hypothesis in a class H ⊆ (X → Y). Namely, sampling (x, y) ∼ D_h (for any h ∈ H) is equivalent to sampling x ∼ D_X (for a fixed marginal D_X) and setting y = h(x). We say that P is "orthogonal" if E_{x∼D_X}[h(x)h′(x)] = 0 for all distinct h, h′ ∈ H, and we say P is "normalized" if E_{x∼D_X}[h(x)^2] = 1 for all h ∈ H. Theorem 3. <ref type="bibr">[Kamath et al. (2020)</ref>] Consider a fixed-marginal, orthogonal and normalized learning problem P. For any probabilistic kernel K corresponding to p-dimensional feature maps, if for all D ∈ P it holds for ℓ_sq that, for any choice of η &gt; 0, to get that</p><p>We set η = γ/2, so that we can take p′ ≤ B^2·O(1/γ^2). From Part (i), we now have that p′ ≥ 2(γ - η)·|P| = γ·|P|.</p><p>Putting everything together, we get that B^2 ≥ Ω(γ^3·|P|).</p><p>We now use Theorem 3 to prove lower bounds for Sections 3 and 5. The key idea is that, while the learning problems considered are either not orthogonal or do not have fixed marginals, they contain a large component that is a fixed-marginal, orthogonal learning problem. Combining with (35), we get</p><p>This completes the proof by noting that L_{D_I}(0) = 1/2 for all D_I ∈ P.</p><p>Remark 4.
One can also directly bound the optimal edge of a kernel with feature map φ_p = (φ_1, . . . , φ_p) when the input distribution has i.i.d. biased coordinates, by decomposing the edge into a low-degree contribution (of degree at most d) and a high-degree contribution. The second contribution can then be upper-bounded with techniques similar to those of the previous section, with a bound that is multiplicative in p but exponential in d log(k/n), due to the high-degree contribution. Choosing d appropriately, such as d = log(n), then gives a vanishing edge (for both contributions, and in total). The mixture distribution D_I allows instead for the more direct argument based on <ref type="bibr">Kamath et al. (2020)</ref>. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The quantity C_f mixes the scale of the gradients and of the function values. We use a single quantity in order to minimize notation. Separating these would allow for more consistent scaling.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>ReLU networks do not fit in this framework because (1) they are not differentiable; and (2) their gradients are not bounded. The first issue is not significant, since we never rely on a bound on the second derivatives of the model (only of the loss), and so we can either approximate the ReLU with an arbitrarily sharp smooth approximation, or use some relaxed notion of a "gradient", as is actually done informally when discussing gradients of ReLU networks. Our results do rely on a bound on the gradient norms, which for ReLU networks depends on the scale of the weights θ (formally, it is sufficient to consider the model restricted to some large enough norm ball of weights, which we will never exit). By truncating the model outside the scale of weights we would need to consider (in the analysis only), it is sufficient to take the supremum in the definition of C_f only over weight settings explored by gradient descent, and so for ReLU networks C_f would scale with this scale of θ.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>The stepsize does not play an important role in our analysis. We can also allow variable or adaptive stepsize sequences; for simplicity of presentation, we stick with a fixed stepsize.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>In some regimes, one should in fact distinguish the setting of large sample sets and approximate population gradients, in view of the universality result for SGD proved in Abbe and Sandon (2020a,b). We focus here on the approximate setting, which better reflects the noisier regime; see the discussion in Abbe and Sandon (2020a) for parities.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>The direct definition (6) is sufficient for our purposes, and so we refrain from getting into the definition of an RKHS and the RKHS norm. See, e.g., <ref type="bibr">Sch&#246;lkopf and Smola (2002)</ref> for a definition and discussion. In (7) we take the norm of f to be infinite if f is not in the RKHS.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>The uniform component in the mixture is used to facilitate establishing a lower bound for kernels by directly relying on Lemma 5 in <ref type="bibr">Kamath et al. (2020)</ref>. One could also directly lower bound the kernel error of a biased input distribution without the uniform component, by going beyond a standard application of <ref type="bibr">Kamath et al. (2020)</ref>; see Remark 4.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>For each instance D_I ∈ P_bsp[n, k, α], there is of course always a point θ at which the tangent kernel allows for prediction matching that of Gradient Descent, namely the iterate θ^(T) reached by Gradient Descent. But to learn using a kernel, the kernel should be chosen without knowing the instance, i.e. without knowing I.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7"><p>The reason for the offset 5/3 is to ensure all coordinates will be non-negative after the first step, which simplifies identification of the two regimes in (27) when implementing the model.</p></note>
		</body>
		</text>
</TEI>
