<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Deformed semicircle law and concentration of nonlinear random matrices for ultra-wide neural networks</title></titleStmt>
			<publicationStmt>
				<publisher>Project Euclid</publisher>
				<date>04/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10529665</idno>
					<idno type="doi">10.1214/23-AAP2010</idno>
					<title level='j'>The Annals of Applied Probability</title>
<idno>1050-5164</idno>
<biblScope unit="volume">34</biblScope>
<biblScope unit="issue">2</biblScope>					

					<author>Zhichao Wang</author><author>Yizhe Zhu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this paper, we investigate a two-layer fully connected neural network of the form f (X) = 1 √ d 1 a σ (W X), where X ∈ d 0 ×n is a deterministic data matrix, W ∈ R d 1 ×d 0 and a ∈ R d 1 are random Gaussian weights, and σ is a nonlinear activation function. We study the limiting spectral distributions of two empirical kernel matrices associated with f (X): the empirical conjugate kernel (CK) and neural tangent kernel (NTK), beyond the linear-width regime (d 1 n). We focus on the ultra-wide regime, where the width d 1 of the first layer is much larger than the sample size n. Under appropriate assumptions on X and σ , a deformed semicircle law emerges as d 1 /n → ∞ and n → ∞. We first prove this limiting law for generalized sample covariance matrices with some dependency. To specify it for our neural network model, we provide a nonlinear Hanson-Wright inequality suitable for neural networks with random weights and Lipschitz activation functions. We also demonstrate nonasymptotic concentrations of the empirical CK and NTK around their limiting kernels in the spectral norm, along with lower bounds on their smallest eigenvalues. As an application, we show that random feature regression induced by the empirical kernel achieves the same asymptotic performance as its limiting kernel regression under the ultra-wide regime. This allows us to calculate the asymptotic training and test errors for random feature regression using the corresponding kernel regression.
CONTENTS]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>1. Introduction. Nowadays, deep neural networks have become one of the leading models in machine learning, and many theoretical results have been established to understand the training and generalization of neural networks. Among them, two kernel matrices are prominent in deep learning theory: conjugate kernel (CK) <ref type="bibr">[24,</ref><ref type="bibr">27,</ref><ref type="bibr">44,</ref><ref type="bibr">54,</ref><ref type="bibr">58,</ref><ref type="bibr">69,</ref><ref type="bibr">71,</ref><ref type="bibr">74,</ref><ref type="bibr">82]</ref> and neural tangent kernel (NTK) <ref type="bibr">[4,</ref><ref type="bibr">28,</ref><ref type="bibr">40]</ref>. The CK matrix defined in <ref type="bibr">(5)</ref>, which has been exploited to study the generalization of random feature regression, is the Gram matrix of the output of the last hidden layer on the training dataset. The NTK matrix, defined in <ref type="bibr">(7)</ref>, is the Gram matrix of the Jacobian of the neural network with respect to training parameters, characterizing the performance of a wide neural network through gradient flows. Both are related to the kernel machine and help us explore the generalization and training process of the neural network.</p><p>We are interested in the behaviors of CK and NTK matrices at random initialization. A recent line of work has proved that these two random kernel matrices will converge to their expectations when the width of the network becomes infinitely wide <ref type="bibr">[7,</ref><ref type="bibr">40]</ref>. Although CK and NTK are usually referred to as these expected kernels in literature, we will always call CK and NTK the empirical kernel matrices in this paper, with a slight abuse of terminology.</p><p>In this paper, we study the random CK and NTK matrices of a two-layer fully connected neural network with input data X &#8712; R d 0 &#215;n , given by f : R d 0 &#215;n &#8594; R n such that <ref type="bibr">(1)</ref> f (X)</p><p>where W &#8712; R d 1 &#215;d 0 is the weight matrix for the first layer, a &#8712; R d 1 are the second layer weights, and &#963; is a nonlinear activation function applied to the matrix W X elementwisely. We assume that all entries of a and W are independently identically distributed by the standard Gaussian N (0, 1). We will always view the input data X as a deterministic matrix (independent of the random weights in a and W ) with certain assumptions. In terms of random matrix theory, we study the difference between these two kernel matrices (CK and NTK) and their expectations with respect to random weights, showing both asymptotic and nonasymptotic behaviors of these differences as the width of the first hidden layer d 1 is growing faster than the number of samples n. As an extension of <ref type="bibr">[29]</ref>, we prove that when n/d 1 &#8594; 0, the centered CK and NTK with appropriate normalization have the limiting eigenvalue distribution given by a deformed semicircle law, determined by the training data spectrum and the nonlinear activation function. To prove this global law, we further set up a limiting law theorem for centered sample covariance matrices with dependent structures and a nonlinear version of the Hanson-Wright inequality. These two results are very general, which makes them potentially applicable to different scenarios beyond our neural network model. For the nonasymptotic analysis, we establish concentration inequalities between the random kernel matrices and their expectations. As a byproduct, we provide lower bounds of the smallest eigenvalues of CK and NTK, which are essential for the global convergence of gradient-based optimization methods when training a wide neural network <ref type="bibr">[59,</ref><ref type="bibr">60,</ref><ref type="bibr">63]</ref>. Because of the nonasymptotic results for kernel matrices, we can also describe how close the performances of the random feature regression and the limiting kernel regression are with a general dataset, which allows us to compute the limiting training error and generalization error for the random feature regression via its corresponding kernel regression in the ultra-wide regime.</p><p>1.1. Nonlinear random matrix theory in neural networks. Recently, the limiting spectra of CK and NTK at random initialization have received increasing attention from a random matrix theory perspective. Most of the papers focus on the linear-width regime d 1 &#8733; n, using both the moment method and Stieltjes transforms. Based on moment methods, <ref type="bibr">[67]</ref> first computed the limiting law of the CK for two-layer neural networks with centered nonlinear activation functions, which is further described as a deformed Marchenko-Pastur law in <ref type="bibr">[64]</ref>. This result has been extended to sub-Gaussian weights and input data with real analytic activation functions by <ref type="bibr">[19]</ref>, even for multiple layers with some special activation functions. Later, <ref type="bibr">[2]</ref> generalized their results by adding a random bias vector in pre-activation and a more general input data matrix. Similar results for the two-layer model with a random bias vector and random input data were analyzed in <ref type="bibr">[68]</ref> by cumulant expansion. In parallel, by Stieltjes transform, <ref type="bibr">[52]</ref> investigated the CK of a one-hidden-layer network with general deterministic input data and Lipschitz activation functions via some deterministic equivalent. <ref type="bibr">[49]</ref> further developed a deterministic equivalent for the Fourier feature map. With the help of the Gaussian equivalent technique and operator-valued free probability theory, the limiting spectrum of NTK with one hidden layer has been analyzed in <ref type="bibr">[3]</ref>. Then the limiting spectra of CK and NTK of a multi-layer neural network with general deterministic input data have been fully characterized in <ref type="bibr">[29]</ref>, where the limiting spectrum of CK is given by the propagation of the Marchenko-Pastur map through the network, while the NTK is approximated by the linear combination of CK's of each hidden layer. <ref type="bibr">[29]</ref> illustrated that the pairwise approximate orthogonality assumption on the input data is preserved in all hidden layers. Such a property is useful to approximate the expected CK and NTK. We refer to <ref type="bibr">[32]</ref> as a summary of the recent development in nonlinear random matrix theory.</p><p>Most of the results in nonlinear random matrix theory focus on the case when d 1 is proportional to n as n &#8594; &#8734;. We build a random matrix result for both CK and NTK under the ultra-wide regime, where d 1 /n &#8594; &#8734; and n &#8594; &#8734;. As an intrinsic interest of this regime, this exhibits the connection between wide (or overparameterized) neural networks and kernel learning induced by limiting kernels of CK and NTK. In this article, we will follow general assumptions on the input data and activation function in <ref type="bibr">[29]</ref> and study the limiting spectra of the centered and normalized CK matrix</p><p>where Y := &#963; (W X). Similar results for the NTK can be obtained as well. To complete the proofs, we establish a nonlinear version of the Hanson-Wright inequality, which has previously appeared in <ref type="bibr">[49,</ref><ref type="bibr">52]</ref>. This nonlinear version is a generalization of the original Hanson-Wright inequality <ref type="bibr">[1,</ref><ref type="bibr">36,</ref><ref type="bibr">72]</ref>, and may have various applications in statistics, machine learning, and other areas. In addition, we also derive a deformed semicircle law for normalized sample covariance matrices without independence in columns. This result is of independent interest in random matrix theory as well.</p><p>1.2. General sample covariance matrices. We observe that the random matrix Y &#8712; R d 1 &#215;n defined above has independent and identically distributed rows. Hence, Y Y is a generalized sample covariance matrix. We first inspect a more general sample covariance matrix Y whose rows are independent copies of some random vector y &#8712; R n . Assuming n and d 1 both go to infinity but n/d 1 &#8594; 0, we aim to study the limiting empirical eigenvalue distribution of centered Wishart matrices in the form of <ref type="bibr">(2)</ref> with certain conditions on y. This regime is also related to the ultra-high-dimensional setting in statistics <ref type="bibr">[70]</ref>.</p><p>This regime has been studied for decades starting in <ref type="bibr">[14]</ref>, where Y has i.i.d. entries and E[Y Y ] = d 1 Id. In this setting, by the moment method, one can obtain the semicircle law.</p><p>This normalized model also arises in quantum theory with respect to random induced states (see <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">26]</ref>). The largest eigenvalue of such a normalized sample covariance matrix has been considered in <ref type="bibr">[22]</ref>. Subsequently, <ref type="bibr">[21,</ref><ref type="bibr">45,</ref><ref type="bibr">70,</ref><ref type="bibr">87]</ref> analyzed the fluctuations for the linear spectral statistics of this model and applied this result to hypothesis testing for the covariance matrix. A spiked model for sample covariance matrices in this regime was recently studied in <ref type="bibr">[30]</ref>. This kind of semicircle law also appears in many other random matrix models. For instance, <ref type="bibr">[42]</ref> showed this limiting law for normalized sample correlation matrices. Also, the semicircle law for centered sample covariance matrices has already been applied in machine learning: <ref type="bibr">[31]</ref> controlled the generalization error of shallow neural networks with quadratic activation functions by the moments of this limiting semicircle law; <ref type="bibr">[35]</ref> derived a semicircle law of the fluctuation matrix between stochastic batch Hessian and the deterministic empirical Hessian of deep neural networks.</p><p>For general sample covariance, <ref type="bibr">[80]</ref> considered the form Y = BXA<ref type="foot">foot_0</ref>/2 with deterministic A and B, where X consists of i.i.d. entries with mean zero and variance one. The same result has been proved in <ref type="bibr">[16]</ref> by generalized Stein's method. Unlike previous results, <ref type="bibr">[85]</ref> tackled the general case, only assuming Y has independent rows with some deterministic covariance n . Though this is similar to our model in Section 4, we will consider more general assumptions on each row of Y , which can be directly verified in our neural network models.</p><p>1.3. Infinite-width kernels and the smallest eigenvalues of empirical kernels. Besides the above asymptotic spectral fluctuation of (2), we provide nonasymptotic concentrations of (2) in spectral norm and a corresponding result for the NTK. In the infinite-width networks, where d 1 &#8594; &#8734; and n are fixed, both CK and NTK will converge to their expected kernels. This has been investigated in <ref type="bibr">[27,</ref><ref type="bibr">44,</ref><ref type="bibr">54,</ref><ref type="bibr">74]</ref> for the CK and <ref type="bibr">[4,</ref><ref type="bibr">7,</ref><ref type="bibr">28,</ref><ref type="bibr">40,</ref><ref type="bibr">47]</ref> for the NTK. Such kernels are also called infinite-width kernels in literature. In this current work, we present the precise probability bounds for concentrations of CK and NTK around their infinite-width kernels, where the difference is of order &#8730; n/d 1 . Our results permit more general activation functions and input data X only with pairwise approximate orthogonality, albeit similar concentrations have been applied in <ref type="bibr">[3,</ref><ref type="bibr">10,</ref><ref type="bibr">39,</ref><ref type="bibr">57,</ref><ref type="bibr">76]</ref>.</p><p>A corollary of our concentration is the explicit lower bounds of the smallest eigenvalues of the CK and the NTK. Such extreme eigenvalues of the NTK have been utilized to prove the global convergence of gradient descent algorithms of wide neural networks since the NTK governs the gradient flow in the training process, see, for example, <ref type="bibr">[6,</ref><ref type="bibr">23,</ref><ref type="bibr">28,</ref><ref type="bibr">59,</ref><ref type="bibr">60,</ref><ref type="bibr">63,</ref><ref type="bibr">76,</ref><ref type="bibr">83]</ref>. The smallest eigenvalue of NTK is also crucial for proving generalization bounds and memorization capacity in <ref type="bibr">[6,</ref><ref type="bibr">57]</ref>. Analogous to Theorem 3.1 in <ref type="bibr">[57]</ref>, our lower bounds are given by the Hermite coefficients of the activation function &#963; . Besides, the lower bound of NTK for multi-layer ReLU networks is analyzed in <ref type="bibr">[61]</ref>.</p><p>1.4. Random feature regression and limiting kernel regression. Another byproduct of our concentration results is to measure the difference of performance between random feature regression with respect to 1</p><p>Y and corresponding kernel regression when d 1 /n &#8594; &#8734;. Random feature regression can be viewed as the linear regression of the last hidden layer, and its performance has been studied in, for instance, <ref type="bibr">[33,</ref><ref type="bibr">38,</ref><ref type="bibr">49,</ref><ref type="bibr">50,</ref><ref type="bibr">52,</ref><ref type="bibr">53,</ref><ref type="bibr">55,</ref><ref type="bibr">56,</ref><ref type="bibr">67]</ref> under the linear-width regime. 1 In this regime, the CK matrix 1 d 1 Y Y is not concentrated around its expectation</p><p>under the spectral norm, where w is the standard normal random vector in R d 0 . But the limiting spectrum of CK is exploited to characterize the asymptotic performance and double descent phenomenon of random feature regression when n, d 0 , d 1 &#8594; &#8734; proportionally. Several works have also utilized this regime to depict the performance of the ultra-wide random network by letting d 1 /n &#8594; &#968; &#8712; (0, &#8734;) first, getting the asymptotic performance and then taking &#968; &#8594; &#8734; (see <ref type="bibr">[56,</ref><ref type="bibr">86]</ref>). However, there is still a difference between this sequential limit and the ultra-wide regime. Before these results, random feature regression has already attracted significant attention in that it is a random approximation of the Reproducing Kernel Hilbert Space (RKHS) defined by population kernel function K : R</p><p>when width d 1 is sufficiently large <ref type="bibr">[11,</ref><ref type="bibr">12,</ref><ref type="bibr">71,</ref><ref type="bibr">73]</ref>. We point out that Theorem 9 of <ref type="bibr">[10]</ref> has the same order &#8730; n/d 1 of the approximation as ours, despite only for random Fourier features. In our work, the concentration between empirical kernel induced by 1  d 1 Y Y and the population kernel matrix K defined in (4) for X leads to the control of the differences of training/test errors between random feature regression and kernel regression, which were previously concerned by <ref type="bibr">[10,</ref><ref type="bibr">41,</ref><ref type="bibr">55,</ref><ref type="bibr">57]</ref> in different cases. Specifically, <ref type="bibr">[41]</ref> obtained the same kind of estimation but considered random features sampled from Gaussian processes. Our results explicitly show how large width d 1 should be so that the random feature regression gets the same asymptotic performance as kernel regression <ref type="bibr">[55]</ref>. With these estimations, we can take the limiting test error of the kernel regression to predict the limiting test error of random feature regression as n/d 1 &#8594; 0 and d 0 , n &#8594; &#8734;. We refer <ref type="bibr">[46,</ref><ref type="bibr">47,</ref><ref type="bibr">51,</ref><ref type="bibr">55]</ref>, <ref type="bibr">[18]</ref>, Section 4.3, and references therein for more details in high-dimensional kernel ridge/ridgeless regressions. We emphasize that the optimal prediction error of random feature regression in linear-width regime is actually achieved in the ultra-wide regime, which boils down to the limiting kernel regression, see <ref type="bibr">[53,</ref><ref type="bibr">55,</ref><ref type="bibr">56,</ref><ref type="bibr">86]</ref>. This is one of the motivations for studying the ultra-wide regime and the limiting kernel ridge regression.</p><p>In the end, we would like to mention the idea of spectral-norm approximation for the expected kernel , which helps us describe the asymptotic behavior of limiting kernel regression. For specific activation &#963; , kernel has an explicit formula, see <ref type="bibr">[48,</ref><ref type="bibr">49,</ref><ref type="bibr">52]</ref>, whereas generally, it can be expanded in terms of the Hermite expansion of &#963; <ref type="bibr">[29,</ref><ref type="bibr">56,</ref><ref type="bibr">67]</ref>. Thanks to pairwise approximate orthogonality introduced in <ref type="bibr">[29]</ref>, Definition 3.1, we can approximate in the spectral norm for general deterministic data X. This pairwise approximate orthogonality defines how orthogonal is within different input vectors of X. With certain i.i.d. assumption on X, <ref type="bibr">[47]</ref> and <ref type="bibr">[18]</ref>, Section 4.3, where the scaling d 0 &#8733; n &#945; , for &#945; &#8712; (0, 1], determined which degree of the polynomial kernel is sufficient to approximate . Instead, our theory leverages the approximate orthogonality among general datasets X to obtain a similar approximation. Our analysis presumably indicates that the weaker orthogonality X has, the higher degree of the polynomial kernel we need to approximate the kernel .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.5.">Preliminaries.</head><p>Notation. We use tr(A) = 1 n i A ii as the normalized trace of a matrix A &#8712; R n&#215;n and Tr(A) = i A ii . Denote vectors by lowercase boldface. A is the spectral norm for matrix A, A F denotes the Frobenius norm, and x is the 2 -norm of any vector x. A B is the Hadamard product of two matrices, that is,</p><p>and Var w [&#8226;] be the expectation and variance only with respect to random vector w. Given any vector v, diag(v) is a diagonal matrix where the main diagonal elements are given by v. &#955; min (A) is the smallest eigenvalue of any Hermitian matrix A.</p><p>Before stating our main results, we describe our model with assumptions. We first consider the output of the first hidden layer and empirical conjugate kernel (CK):</p><p>(5)</p><p>Observe that the rows of matrix Y are independent and identically distributed since only W is random and X is deterministic. Let the ith row of Y be y i , for 1 &#8804; i &#8804; d 1 . Then, we obtain a sample covariance matrix,</p><p>which is the sum of d 1 independent rank-one random matrices in R n&#215;n . Let the second moment of any row y i be (3). Later on, we will approximate based on the assumptions of input data X.</p><p>Next, we define the empirical neural tangent kernel (NTK) for (1), denoted by H &#8712; R n&#215;n . From Section 3.3 in <ref type="bibr">[29]</ref>, the (i, j )th entry of H can be explicitly written as</p><p>where w r is the rth row of weight matrix W , x i is the ith column of matrix X, and a r is rth entry of the output layer a. In the matrix form, H can be written by</p><p>where the &#945;th column of S is given by diag</p><p>We introduce the following assumptions for the random weights, nonlinear activation function &#963; , and input data. These assumptions are basically carried on from <ref type="bibr">[29]</ref>.</p><p>ASSUMPTION 1.1. The entries of W and a are i.i.d. and distributed by N (0, 1). ASSUMPTION 1.2. Activation function &#963; (x) is a Lipschitz function with the Lipschitz constant &#955; &#963; &#8712; (0, &#8734;). Assume that &#963; is centered and normalized with respect to &#958; &#8764; N (0, 1) such that</p><p>Furthermore, &#963; satisfies either of the following:</p><p>for some constants a, b, c &#8712; R such that (9) holds.</p><p>Analogously to <ref type="bibr">[39]</ref>, our Assumption 1.2 permits &#963; to be the commonly used activation functions, including ReLU, Sigmoid, and Tanh, although we have to center and normalize the activation functions to guarantee <ref type="bibr">(9)</ref>. Such normalized activation functions exclude some trivial spike in the limiting spectra of CK and NTK <ref type="bibr">[19,</ref><ref type="bibr">29]</ref>. The foregoing assumptions ensure our nonlinear Hanson-Wright inequality in the proof. As a future direction, going beyond Gaussian weights and Lipschitz activation functions may involve different types of concentration inequalities.</p><p>Next, we present the conditions of the deterministic input data X and the asymptotic regime for our main results. Define the following (&#949;, B)-orthonormal property for our data matrix X. DEFINITION 1.3. For given any &#949;, B &gt; 0, matrix X is (&#949;, B)-orthonormal if for any distinct columns x &#945; , x &#179; in X, we have</p><p>and also</p><p>The empirical spectral distribution &#956;0 of X X converges weakly to a fixed and nondegenerate probability distribution &#956; 0 = &#181; 0 on [0, &#8734;).</p><p>In above (b), the (&#949; n , B)-orthonormal property with n&#949; 4 n = o(1) is a quantitative version of pairwise approximate orthogonality for the column vectors of the data matrix X &#8712; R d 0 &#215;n . When d 0 n, it holds, with high probability, for many random X with independent columns x &#945; , including the anisotropic Gaussian vectors x &#945; &#8764; N (0, ) with tr( ) = 1 and 1/n, vectors generated by Gaussian mixture models, and vectors satisfying the log-Sobolev inequality or convex Lipschitz concentration property. See <ref type="bibr">[29]</ref>, Section 3.1, for more details. Specifically, when x &#945; 's are independently sampled from the unit sphere S d 0 -1 , X is (&#949; n , B)orthonormal with high probability where &#949; n = O( log(n) n ) and B = O(1) as n d 0 . In this case, for any &gt; 2, we have n&#949; n &#8594; 0. In our theory, we always treat X as a deterministic matrix. However, our results also work for random input X independent of weights W and a by conditioning on the high probability event that X satisfies (&#949; n , B)-orthonormal property. Unlike data vectors with independent entries, our assumption is promising to analyze realworld datasets <ref type="bibr">[53]</ref> and establish some n-dependent deterministic equivalents like <ref type="bibr">[49]</ref>.</p><p>The following Hermite polynomials are crucial to the approximation of in our analysis.</p><p>DEFINITION 1.5 (Normalized Hermite polynomials). The rth normalized Hermite polynomial is given by</p><p>Here {h r } &#8734; r=0 form an orthonormal basis of L 2 (R, ), where denotes the standard Gaussian distribution. For &#963; 1 , &#963; 2 &#8712; L 2 (R, ), the inner product is defined by</p><p>Every function &#963; &#8712; L 2 (R, ) can be expanded as a Hermite polynomial expansion</p><p>where &#950; r (&#963; ) is the rth Hermite coefficient defined by</p><p>In the following statements and proofs, we denote &#958; &#8764; N (0, 1). Then for any k &#8712; N, we have</p><p>We define the innerproduct kernel random matrix f k (X X) &#8712; R n&#215;n by applying f k entrywise to X X. Define a deterministic matrix <ref type="bibr">(11)</ref> 0</p><p>where the &#945;th</p><p>We will employ 0 as an approximation of the population covariance in (3) in the spectral norm when n&#949; 4 n &#8594; 0. For any n &#215; n Hermitian matrix A n with eigenvalues &#955; 1 , . . . , &#955; n , the empirical spectral distribution of A is defined by</p><p>We write lim spec(A n ) = &#956; if &#956; A n &#8594; &#956; weakly as n &#8594; &#8734;. The main tool we use to study the limiting spectral distribution of a matrix sequence is the Stieltjes transform defined as follows.</p><p>DEFINITION 1.6 (Stieltjes transform). Let &#956; be a probability measure on R. The Stieltjes transform of &#956; is a function s(z) defined on the upper half plane C + by</p><p>For any n &#215; n Hermitian matrix A n , the Stieltjes transform of the empirical spectral distribution of A n can be written as tr(A nz Id) -1 . We call (A nz Id) -1 the resolvent of A n .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Main results.</head><p>2.1. Spectra of the centered CK and NTK. Our first result is a deformed semicircle law for the CK matrix. Denote by &#956;0 = (1b &#963; ) 2 + b 2 &#963; &#956; 0 the distribution of (1b 2 &#963; ) + b 2 &#963; &#955; with &#955; sampled from the distribution &#956; 0 . The limiting law of our centered and normalized CK matrix is depicted by &#956; s &#956;0 , where &#956; s is the standard semicircle law and the notation is the free multiplicative convolution in free harmonic analysis. For full descriptions of free independence and free multiplicative convolution, see <ref type="bibr">[62]</ref>, <ref type="bibr">Lecture 18,</ref><ref type="bibr">and [5]</ref>, Section 5.3.3. The free multiplicative convolution was first introduced in <ref type="bibr">[79]</ref>, which later has many applications for products of asymptotic free random matrices. The main tool for computing free multiplicative convolution is the S-transform, invented by <ref type="bibr">[79]</ref>. S-transform was recently utilized to study the dynamical isometry of deep neural networks <ref type="bibr">[25,</ref><ref type="bibr">37,</ref><ref type="bibr">65,</ref><ref type="bibr">66,</ref><ref type="bibr">84]</ref>. Some basic properties and intriguing examples for free multiplicative convolution with &#956; s can also be found in <ref type="bibr">[15]</ref>, Theorems 1.2, 1.3. THEOREM 2.1 (Limiting spectral distribution for the conjugate kernel). Suppose Assumptions 1.1, 1.2, and 1.4 of the input matrix X hold, the empirical eigenvalue distribution of</p><p>also converges weakly to the probability measure &#956; almost surely, whose Stieltjes transform m(z) is defined by</p><p>for each z &#8712; C + , where &#179;(z) &#8712; C + is the unique solution to</p><p>Suppose that we additionally have b &#963; = 0, that is, E[&#963; (&#958; )] = 0. In this case, our Theorem 2.1 shows that the limiting spectral distribution of (2) is the semicircle law, and from <ref type="bibr">(13)</ref>, the deterministic data matrix X does not have an effect on the limiting spectrum. See Figure <ref type="figure">1</ref> for a cosine-type &#963; with b &#963; = 0. The only effect of the nonlinearity in &#956; is the coefficient b &#963; in the deformation &#956;0 .  Appendix B with different activation functions. The red curves are implemented by the selfconsistent equations ( <ref type="formula">15</ref>) and ( <ref type="formula">16</ref>) in Theorem 2.1. In Section 4, we present general random matrix models with similar limiting eigenvalue distribution as &#956; whose Stieltjes transform is also determined by ( <ref type="formula">15</ref>) and ( <ref type="formula">16</ref>).</p><p>Theorem 2.1 can be extended to the NTK model as well. Denote by</p><p>As an approximation of in the spectral norm, we define</p><p>where f k 's are defined in <ref type="bibr">(11)</ref>, a &#963; is defined in <ref type="bibr">(10)</ref>, and the kth Hermite coefficient of &#963; is</p><p>Then, a similar deformed semicircle law can be obtained for the empirical NTK matrix H . THEOREM 2.2 (Limiting spectral distribution of the NTK). Under Assumptions 1.1, 1.2, and 1.4 of the input matrix X, the empirical eigenvalue distribution of (20)</p><p>weakly converges to &#956; almost surely, where 0 and 0 are defined in <ref type="bibr">(11)</ref> and <ref type="bibr">(18)</ref>, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Nonasymptotic estimations.</head><p>With our nonlinear Hanson-Wright inequality (Corollary 3.5), we attain the following concentration bound on the CK matrix in the spectral norm.</p><p>Then with probability at least</p><p>where C &gt; 0 is a universal constant. REMARK 2.4. Theorem 2.3 ensures that the empirical spectral measure &#956; n of the centered random matrix d 1 n ( 1 d 1 Y Y -) has a bounded support for all sufficiently large n. Together with the global law in Theorem 2.1, our concentration inequality <ref type="bibr">(22)</ref> is tight up to a constant factor. Additionally, by the weak convergence of &#956; n to &#956; proved in Theorem 2.1, we can take a test function x 2 to obtain that</p><p>Based on the foregoing estimation, we have the following lower bound on the smallest eigenvalue of the empirical conjugate kernel, denoted by &#955; min ( 1</p><p>THEOREM 2.5. Suppose Assumptions 1.1 and 1.2 hold and &#963; is not a linear function, X is (&#949; n , B)-orthonormal. Then with probability at least 1 -4e -2n ,</p><p>where C B is a constant depending on B. In particular, if</p><p>REMARK 2.6. A related result in <ref type="bibr">[63]</ref>, Theorem 5, assumed</p><p>. We relax the assumption on the column vectors of X, and extend the range of d 1 down to</p><p>) is lower bounded by an absolute constant, with an extra assumption that E[&#963; (&#958; )] = 0. This assumption can always be satisfied by shifting the activation function with a proper constant. Our bound for &#955; min ( ) is derived via Hermite polynomial expansion, similar to <ref type="bibr">[63]</ref>, Lemma 15. However, we apply an &#949;-net argument for concentration bound for 1 d 1 Y Y around , while a matrix Chernoff concentration bound with truncation was used in <ref type="bibr">[63]</ref>, Theorem 5.</p><p>Additionally, the concentration for the NTK matrix H can be obtained in the next theorem. Recall that H is defined by <ref type="bibr">(7)</ref> and the columns of S are defined by ( <ref type="formula">8</ref>) with Assumption 1.1. THEOREM 2.7. Suppose d 1 &#8805; log n, and &#963; is &#955; &#963; -Lipschitz. Then with probability at least</p><p>Moreover, if the assumptions in Theorem 2.3 hold, then with probability at least</p><p>REMARK 2.8. Compared to Proposition D.3 in <ref type="bibr">[39]</ref>, we assume a is a Gaussian vector instead of a Rademacher random vector and attain a better bound. If a i &#8712; {+1, -1}, then one can apply matrix Bernstein inequality for the sum of bounded random matrices. In our case, the boundedness condition is not satisfied. Section S1.1 in <ref type="bibr">[3]</ref> applied matrix Bernstein inequality for the sum of bounded random matrices when a is a Gaussian vector, but the boundedness condition does not hold in equation (S7) of <ref type="bibr">[3]</ref>.</p><p>Based on Theorem 2.7, we get a lower bound for the smallest eigenvalue of the NTK. THEOREM 2.9. Under Assumptions 1.1 and 1.2, suppose that X is (&#949; n , B)-orthonormal, &#963; is not a linear function, and d 1 &#8805; log n. Then with probability at least 1n -7/3 ,</p><p>where C B is a constant depending only on B, and &#951; k (&#963; ) is defined in <ref type="bibr">(19)</ref>. In particular, if</p><p>, and d 1 = &#969;(log n), then with high probability,</p><p>REMARK 2.10. We relax the assumption in <ref type="bibr">[61]</ref> to d 1 = &#969;(log n) for the 2-layer case and our result is applicable beyond the ReLU activation function and to more general assumptions on X. Our proof strategy is different from <ref type="bibr">[61]</ref>. In <ref type="bibr">[61]</ref>, the authors used the inequality &#955; min ((S S) (X X)) &#8805; min i S i 2 2 &#8226; &#955; min (X X) where S i is the ith column of S. Then, getting the lower bound is reduced to show the concentration of the 2-norm of the column vectors of S. Here we apply a matrix concentration inequality to (S S) (X X) and gain a weaker assumption on d 1 to ensure the lower bound on &#955; min (H ). REMARK 2.11. In Theorems 2.5 and 2.9, we exclude the linear activation function. When &#963; (x) = x, it is easy to check both 1 d 1 &#955; min (Y Y ) and &#955; min (H ) will trivially determined by &#955; min (X X), which can be vanishing. In this case, the lower bounds of the smallest eigenvalues of CK and NTK rely on the assumption of &#956; 0 or the distribution of X. For instance, when the entries of X are i.i.d. Gaussian random variables, &#955; min (X X) has been analyzed in <ref type="bibr">[75]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2.3.</head><p>Training and test errors for random feature regression. We apply the results of the preceding sections to a two-layer neural network at random initialization defined in <ref type="bibr">(1)</ref>, to estimate the training errors and test errors with mean-square losses for random feature regression under the ultra-wide regime where d 1 /n &#8594; &#8734; and n &#8594; &#8734;. In this model, we take the random feature 1</p><p>&#963; (W X) and consider the regression with respect to &#952; &#8712; R d 1 based on</p><p>with training data X &#8712; R d 0 &#215;n and training labels y &#8712; R n . Considering the ridge regression with ridge parameter &#955; &#8805; 0 and squared loss defined by</p><p>we can conclude that the minimization &#952; := arg min &#952; L(&#952; ) has an explicit solution</p><p>where Y = &#963; (W X) is defined in <ref type="bibr">(5)</ref>. When &#963; is nonlinear, by Theorem 2.5, it is feasible to take inverse in <ref type="bibr">(26)</ref> for any &#955; &#8805; 0. Hence, in the following results, we will focus on nonlinear activation functions. <ref type="foot">2</ref> In general, the optimal predictor for this random feature with respect to <ref type="bibr">(25)</ref> is</p><p>where we define an empirical kernel</p><p>The n-dimension row vector is given by</p><p>and the (i, j ) entry of K n (X, X) is defined by</p><p>The optimal kernel predictor with a ridge parameter &#955; &#8805; 0 for the kernel ridge regression is given by (see <ref type="bibr">[10,</ref><ref type="bibr">18,</ref><ref type="bibr">41,</ref><ref type="bibr">46,</ref><ref type="bibr">51,</ref><ref type="bibr">71]</ref> for more details)</p><p>where K(X, X) is an n &#215; n matrix such that its (i, j ) entry is K(x i , x j ), and K(x, X) is a row vector in R n similarly with <ref type="bibr">(29)</ref>. We compare the characteristics of the two different predictors f (RF ) &#955; (x) and f (K) &#955; (x) when the kernel function K is defined in (4). Denote the optimal predictors for random features and kernel K on training data X by <ref type="bibr">(5)</ref>. We aim to compare the training and test errors for these two predictors in ultra-wide random neural networks, respectively. Let training errors of these two predictors be</p><p>In the following theorem, we show that, with high probability, the training error of the random feature regression model can be approximated by the corresponding kernel regression model with the same ridge parameter &#955; &#8805; 0 for ultra-wide neural networks. THEOREM 2.12 (Training error approximation). Suppose Assumptions 1.1, 1.2, and 1.4 hold, and &#963; is not a linear function. Then, for all large n, with probability at least 1 -4e -2n ,</p><p>where constants C 1 and C 2 only depend on &#955;, B and &#963; .</p><p>Next, to investigate the test errors (or generalization errors), we introduce further assumptions on the data and the target function that we want to learn from training data. Denote the true regression function by f * : R d 0 &#8594; R. Then, the training labels are defined by</p><p>where " &#8712; R n is the training label noise. For simplicity, we further impose the following assumptions, analogously to <ref type="bibr">[50]</ref>. ASSUMPTION 2.13. Assume that the target function is a linear function f * (x) = &#179; &#179; &#179; * , x , where random vector satisfies &#179; &#179; &#179; * &#8764; N (0, &#963; 2 &#179; &#179; &#179; Id). Then, in this case, the training label vector is given by y</p><p>, and a test data x &#8712; R d 0 is independent with X and y such that X := [x 1 , . . . , x n , x] &#8712; R d 0 &#215;(n+1) is also (&#949; n , B)-orthonormal. For convenience, we further assume the population covariance of the test data is</p><p>REMARK 2.15. Our Assumption 2.14 of the test data x ensures the same statistical behavior as training data in X, but we do not have any explicit assumption of the distribution of x. It is promising to adopt such assumptions to handle statistical models with real-world data <ref type="bibr">[48,</ref><ref type="bibr">49]</ref>. Besides, it is possible to extend our analysis to general population covariance for E x <ref type="bibr">[xx ]</ref>.</p><p>For any predictor f , define the test error (generalization error) by</p><p>We first present the following approximation of the test error of a random feature predictor via its corresponding kernel predictor. THEOREM 2.16 (Test error approximation). Suppose that Assumptions 1.1, 1.2, 2.13, and 2.14 hold, and &#963; is not a linear function. Then, for any &#949; &#8712; (0, 1/2), the difference of test errors satisfies</p><p>Theorems 2.12 and 2.16 verify that the random feature regression achieves the same asymptotic errors as the kernel regression, as long as n/d 1 &#8594; &#8734;. This is closely related to <ref type="bibr">[55]</ref>, Theorem 1, with different settings. Based on that, we can compute the asymptotic training and test errors for the random feature model by calculating the corresponding quantities for the kernel regression in the ultra-wide regime where n/d 1 &#8594; 0. THEOREM 2.17 (Asymptotic training and test errors). Suppose Assumptions 1.1 and 1.2 hold, and &#963; is not a linear function. Suppose the target function f * , training data X, and test data x &#8712; R d 0 satisfy Assumptions 2.13 and 2.14. For any &#955; &#8805; 0, let the effective ridge parameter be <ref type="bibr">(36)</ref> &#955; eff (&#955;, &#963; )</p><p>If the training data has some limiting eigenvalue distribution &#956; 0 = lim spec X X as n &#8594; &#8734; and n/d 0 &#8594; &#180;&#8712; (0, &#8734;), then when n/d 1 &#8594; 0 and n &#8594; &#8734;, the training error satisfies</p><p>and the test error satisfies</p><p>where the bias and variance functions are defined by</p><p>We emphasize that in the proof of Theorem 2.17, we also get n-dependent deterministic equivalents for training/test errors of the kernel regression to approximate the performance of random feature regression. This is akin to <ref type="bibr">[49]</ref>, Theorem 3, and <ref type="bibr">[18]</ref>, Theorem 4.13, but in different regimes. In the following Figure <ref type="figure">3</ref>, we present implementations of test errors for random feature regressions on standard Gaussian random data and their limits <ref type="bibr">(38)</ref>. For simplicity, we fix n, d 0 , only let d 1 &#8594; &#8734;, and use empirical spectral distribution of X X to approximate &#956; 0 in B K (&#955; eff (&#955;, &#963; )) and V K (&#955; eff (&#955;, &#963; )), which is actually the n-dependent deterministic equivalent. However, for Gaussian random matrix X, &#956; 0 is actually a Marchenko-Pastur law with ratio &#180;, so B K (&#955; eff (&#955;, &#963; )) and V K (&#955; eff (&#955;, &#963; )) can be computed explicitly according to <ref type="bibr">[50]</ref>, Definition 1. REMARK 2.18 (Implicit regularization). For nonlinear &#963; , the effective ridge parameter <ref type="bibr">(36)</ref> can be viewed as an inflated ridge parameter since b 2 &#963; &#8712; [0, 1) and &#955; eff &gt; &#955; &#8805; 0. This &#955; eff leads to implicit regularization for our random feature and kernel ridge regressions even for the ridgeless regression with &#955; = 0 <ref type="bibr">[18,</ref><ref type="bibr">41,</ref><ref type="bibr">46,</ref><ref type="bibr">57]</ref>. This effective ridge parameter &#955; eff also shows the effect of the nonlinearity in the random feature and kernel regressions induced by ultra-wide neural networks. REMARK 2.19. For convenience, we only consider the linear target function f * , but in general, the above theorems can also be obtained for nonlinear target functions, for instance, Test errors in solid lines with error bars are computed using an independent test set of size 5000. We average our results over 50 repetitions. Limiting test errors in black dash lines are computed by <ref type="bibr">(38)</ref>, and we take &#956; 0 to be the corresponding Marchenko-Pastur distributions.</p><p>f * is a nonlinear single-index model. Under (&#949; n , B)-orthonormal assumption with n&#949; 4 n &#8594; 0, our expected kernel K(X, X) &#8801; is approximated in terms of <ref type="bibr">(39)</ref> lim spec</p><p>Id , whence, this kernel regression can only learn linear functions. So if f * is nonlinear, the limiting test error should be decomposed into the linear part as <ref type="bibr">(38)</ref> and the nonlinear component as a noise <ref type="bibr">[18]</ref>, Theorem 4.13. For more conclusions of this kernel machine, we refer to <ref type="bibr">[46,</ref><ref type="bibr">47,</ref><ref type="bibr">51,</ref><ref type="bibr">55]</ref>. REMARK 2.20 (Neural tangent regression). In parallel to the above results, we can obtain a similar analysis of the limiting training and test errors for random feature regression in <ref type="bibr">(27)</ref> with empirical NTK given by either K n (X, X) = 1 d 1 (S S) (X X) or K n (X, X) = H . This random feature regression also refers to neural tangent regression <ref type="bibr">[57]</ref>. With the help of our concentration results in Theorem 2.7 and the lower bound of the smallest eigenvalues in Theorem 2.9, we can directly extend the above Theorems 2.12, 2.16, and 2.17 to this neural tangent regression. We omit the proofs in these cases and only state the results as follows.</p><p>If K n (X, X) = 1 d 1 (S S) (X X) with expected kernel K(X, X) = defined by ( <ref type="formula">17</ref>), the limiting training and test errors of this neural tangent regression can be approximated by the kernel regression with respect to , as long as d 1 = &#969;(log n). Analogously to <ref type="bibr">(39)</ref>, we have an additional approximation <ref type="bibr">(40)</ref> lim spec = lim spec</p><p>Under the same assumptions of Theorem 2.17 and replacing n/d 1 &#8594; 0 with d 1 = &#969;(log n), we can conclude that the test error of this neural tangent regression has the same limit as <ref type="bibr">(38)</ref> but changing the effective ridge parameter <ref type="bibr">(36)</ref> </p><p>. This result is akin to <ref type="bibr">[57]</ref>, Corollary 3.2, but permits more general assumptions on X. The limiting training error of this neural tangent regression can be obtained by slightly modifying the coefficient in <ref type="bibr">(37)</ref>.</p><p>Similarly, if K n (X, X) = H defined by ( <ref type="formula">7</ref>) possesses an expected kernel K(X, X) = + , this neural tangent regression in ( <ref type="formula">27</ref>) is close to kernel regression (30) with kernel</p><p>under the ultra-wide regime, n/d 1 &#8594; 0. Combining ( <ref type="formula">39</ref>) and ( <ref type="formula">40</ref>), Theorem 2.17 can directly be extended to this neural tangent regression but replacing <ref type="bibr">(36)</ref> with</p><p>. Section 6.1 of <ref type="bibr">[3]</ref> also calculated this limiting test error when data X is isotropic Gaussian.</p><p>Organization of the paper. The remaining parts of the paper are structured as follows. In Section 3, we first provide a nonlinear Hanson-Wright inequality as a concentration tool for our spectral analysis. Section 4 gives a general theorem for the limiting spectral distributions of generalized centered sample covariance matrices. We prove the limiting spectral distributions for the empirical CK and NTK matrices (Theorem 2.1 and Theorem 2.2) in Section 5. Nonasymptotic estimates in Section 2.2 are proved in Section 6. In Section 7, we justify the asymptotic results of the training and test errors for the random feature model (Theorem 2.12 and Theorem 2.16). Auxiliary lemmas and additional simulations are included in Appendices A and B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">A nonlinear Hanson-Wright inequality.</head><p>We give an improved version of Lemma 1 in <ref type="bibr">[52]</ref> with a simple proof based on a Hanson-Wright inequality for random vectors with dependence <ref type="bibr">[1]</ref>. This serves as the concentration tool for us to prove the deformed semicircle law in Section 5 and provide bounds on extreme eigenvalues in Section 6. We first define some concentration properties for random vectors. DEFINITION 3.1 (Concentration property). Let X be a random vector in R n . We say X has the K-concentration property with constant K if for any 1-Lipschitz function f : R n &#8594; R, we have E|f (X)| &lt; &#8734; and for any t &gt; 0,</p><p>There are many distributions of random vectors satisfying K-concentration property, including uniform random vectors on the sphere, unit ball, hamming or continuous cube, uniform random permutation, etc. See <ref type="bibr">[78]</ref>, Chapter 5, for more details. DEFINITION 3.2 (Convex concentration property). Let X be a random vector in R n . We say X has the K-convex concentration property with the constant K if for any 1-Lipschitz convex function f : R n &#8594; R, we have E|f (X)| &lt; &#8734; and for any t &gt; 0,</p><p>We will apply the following result from <ref type="bibr">[1]</ref> to the nonlinear setting. LEMMA 3.3 (Theorem 2.5 in <ref type="bibr">[1]</ref>). Let X be a mean zero random vector in R n . If X has the K-convex concentration property, then for any n &#215; n matrix A and any t &gt; 0,</p><p>for some universal constant C &gt; 1. THEOREM 3.4. Let w &#8712; R d 0 be a random vector with K-concentration property, X = (x 1 , . . . , x n ) &#8712; R d 0 &#215;n be a deterministic matrix. Define y = &#963; (w X) , where &#963; is &#955; &#963; -Lipschitz, and = Eyy . Let A be an n &#215; n deterministic matrix.</p><p>where C &gt; 0 is an absolute constant.</p><p>2. If E[y] = 0, for any t &gt; 0,</p><p>PROOF. Let f be any 1-Lipschitz convex function. Since y = &#963; (w X) , f (y) = f (&#963; (w X) ) is a &#955; &#963; X -Lipschitz function of w. Then by the Lipschitz concentration property of w in (41), we obtain</p><p>Therefore, y satisfies the K&#955; &#963; X -convex concentration property. Define f (x) = f (x -Ey), then f is also a convex 1-Lipschitz function and f (y) = f (y -Ey). Hence &#7929; := y -Ey also satisfies the K&#955; &#963; X -convex concentration property. Applying Theorem 3.3 to &#7929;, we have for any t &gt; 0,</p><p>.</p><p>Since E&#7929; = 0, the inequality above implies <ref type="bibr">(42)</ref>. Note that</p><p>Hence,</p><p>Since y (A + A )Ey is a (2 A Ey X &#955; &#963; )-Lipschitz function of w, by the Lipschitz concentration property of w, we have</p><p>Then combining <ref type="bibr">(43)</ref>, <ref type="bibr">(44)</ref>, and (45), we have</p><p>This finishes the proof.</p><p>Since the Gaussian random vector w &#8764; N (0, I d 0 ) satisfies the K-concentration inequality with K = &#8730; 2 (see, e.g., <ref type="bibr">[20]</ref>), we have the following corollary.</p><p>COROLLARY 3.5. Let w &#8764; N (0, I d 0 ), X = (x 1 , . . . , x n ) &#8712; R d 0 &#215;n be a deterministic matrix. Define y = &#963; (w X) , where &#963; is &#955; &#963; -Lipschitz, and = Eyy . Let A be an n &#215; n deterministic matrix.</p><p>for some absolute constant C &gt; 0.</p><p>2. If E[y] = 0, for any t &gt; 0,</p><p>where</p><p>REMARK 3.6. Compared to <ref type="bibr">[52]</ref>, Lemma 1, we identify the dependence on A F and Ey in the probability estimate. By using the inequality A F &#8804; &#8730; n A , we obtain a similar inequality to the one in <ref type="bibr">[52]</ref> with a better dependence on n. Moreover, our bound in t 0 is independent of d 0 , while the corresponding term t 0 in <ref type="bibr">[52]</ref>, Lemma 1, depends on X and d 0 . In particular, when E&#963; (&#958; ) = 0 and X is (&#949; n , B)-orthonormal, t 0 is of order 1. Hence, <ref type="bibr">(47)</ref> with the special choice of t 0 is the key ingredient in the proof of Theorem 2.3 to get a concentration of the spectral norm for CK. PROOF OF COROLLARY 3.5. We only need to prove <ref type="bibr">(47)</ref>, since other statements follow immediately by taking K = &#8730; 2. Let x i be the ith column of X. Then</p><p>Let &#958; &#8764; N (0, 1). We have</p><p>Therefore</p><p>and ( <ref type="formula">47</ref>) holds.</p><p>We include the following corollary about the variance of y Ay, which will be used in Section 5 to study the spectrum of the CK and NTK. COROLLARY 3.7. Under the same assumptions of Corollary 3.5, we further assume that t 0 &#8804; C 1 n, and A , X &#8804; C 2 . Then as n &#8594; &#8734;,</p><p>Thanks to Theorem 3.5 (2), we have that for any t &gt; 0,</p><p>where constant C &gt; 0 only relies on C 1 , C 2 , &#955; &#963; , and K. Therefore, we can compute the variance in the following way:</p><p>as n &#8594; &#8734;. Here, we use the dominant convergence theorem for the first integral in the last line.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>4.</head><p>Limiting law for general centered sample covariance matrices. Independent of the subsequent sections, this section focuses on the generalized sample covariance matrix where the dimension of the feature is much smaller than the sample size. We will later interpret such sample covariance matrix specifically for our neural network applications. Under certain weak assumptions, we prove the limiting eigenvalue distribution of the normalized sample covariance matrix satisfies two self-consistent equations, which are subsumed into a deformed semicircle law. Our findings in this section demonstrate some degree of universality, indicating that they hold across various random matrix models and may have implications for other related fields. -&#8594; 0, and</p><p>Then the empirical eigenvalue distribution of matrix A n weakly converges to &#956; almost surely, whose Stieltjes transform m(z) is defined by</p><p>for each z &#8712; C + , where &#179;(z) &#8712; C + is the unique solution to</p><p>In particular, &#956; = &#956; s &#956; . REMARK 4.2. In <ref type="bibr">[85]</ref>, it was assumed that d n 3 E|y D n y -Tr D n n | 2 &#8594; 0, where n 3 /d &#8594; &#8734; and n/d &#8594; 0 as n &#8594; &#8734;. By martingale difference, this condition implies <ref type="bibr">(49)</ref>. However, we are not able to verify a certain step in the proof of <ref type="bibr">[85]</ref>. Hence, we will not directly adopt the result of <ref type="bibr">[85]</ref> but consider a more general situation without assuming n 3 /d &#8594; &#8734;. The weakest conditions we found are conditions ( <ref type="formula">49</ref>) and <ref type="bibr">(50)</ref>, which can be verified in our nonlinear random model.</p><p>The self-consistent equations we derived are consistent with the results in <ref type="bibr">[16,</ref><ref type="bibr">85]</ref>, where they studied the empirical spectral distribution of separable sample covariance matrices in the regime n/d &#8594; 0 under different assumptions. When n &#8594; &#8734; and n/d &#8594; 0, our goal is to prove that the Stieltjes transform m n (z) of the empirical eigenvalue distribution of A n and &#179; n (z) := tr[R(z) n ] pointwisely converges to m(z) and &#179;(z), respectively.</p><p>For the rest of this section, we first prove a series of lemmas to get n-dependent deterministic equivalents related to (51) and ( <ref type="formula">52</ref>) and then deduce the proof of Theorem 4.1 at the end of this section. Recall</p><p>, and y is a random vector independent of A n with the same distribution of y i . </p><p>, where y j 's are independent copies of y defined in Theorem 4.1. Notice that, for any deterministic matrix D &#8712; R n&#215;n ,</p><p>Without loss of generality, we assume D &#8804; 1. Taking normalized trace, we have</p><p>For each 1</p><p>where the leave-one-out resolvent R (i) is defined as</p><p>.</p><p>Hence, by <ref type="bibr">(55)</ref>, we obtain</p><p>Combining equations ( <ref type="formula">54</ref>) and ( <ref type="formula">56</ref>), and applying expectation at both sides implies</p><p>because of the assumption that all y i 's have the same distribution as vector y for all i &#8712; [d +1].</p><p>With <ref type="bibr">(57)</ref>, to prove (53), we will first show that when n, d &#8594; &#8734;,</p><p>and spectral norms R , R(z) &#8804; 1/v due to Proposition C.2 in <ref type="bibr">[29]</ref>. Notice that n &#8804; C. Hence, we can deduce that</p><p>as n &#8594; &#8734;. The same argument can be applied to the error of</p><p>Therefore ( <ref type="formula">58</ref>) and ( <ref type="formula">59</ref>) hold. For (60), we denote &#7929; := y/(nd) 1/4 and observe that</p><p>&#7929; 2 &#181; &#955; i . Then, we can control the real part of &#7929; R(z)&#7929; by Lemma A.4: <ref type="bibr">(61)</ref> Re</p><p>We now separately consider two cases in the following:</p><p>&#8226; If the right-hand side of the above inequality (61) is at most 1/2, then</p><p>where we exploit the fact that (see also equation (A.1.11) in <ref type="bibr">[13]</ref>)</p><p>Finally, combining ( <ref type="formula">62</ref>) and <ref type="bibr">(63)</ref> in the above two cases, we can conclude the asymptotic result ( <ref type="formula">60</ref>) because E y 2 = Tr n &#8804; Cn in terms of the assumptions of Theorem 4.1.</p><p>Then with ( <ref type="formula">58</ref>), <ref type="bibr">(59)</ref>, and (60), we get</p><p>as n &#8594; &#8734;. We utilize the notion E y to clarify the expectation only with respect to random vector y, conditioning on other independent random variables. So the conditional expectation is E y [ 1 n y DR(z)y] = tr DR(z) n and</p><p>Therefore, based on (64), the conclusion (53) holds.</p><p>In the next lemma, we apply the quadratic concentration condition <ref type="bibr">(50)</ref> to simplify <ref type="bibr">(53)</ref>. </p><p>for each z &#8712; C + and any deterministic matrix D with D &#8804; C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>PROOF. Let us denote</head><p>In other words, &#181; n can be expressed by</p><p>.</p><p>Observe that</p><p>Thus, &#181; n has the same expectation as the last term</p><p>, since we can first take the expectation for y conditioning on the resolvent R(z) and then take the expectation for R(z). Besides, notice that | Q1 |, | Q2 | &#8804; C v uniformly. Hence, n d Q2 converges to zero uniformly and there exists some constant C &gt; 0 such that (66) 1</p><p>for all large d and n. In addition, observe that</p><p>where &#7929; is defined in the proof of Lemma 4.3. In terms of ( <ref type="formula">62</ref>) and ( <ref type="formula">63</ref>), we verify that (67)</p><p>where C &gt; 0 is some constant depending on v. Next, recall that condition <ref type="bibr">(50)</ref> exposes that ( <ref type="formula">68</ref>)</p><p>as n &#8594; &#8734;. The first convergence is derived by viewing D n = R(z) and taking expectation conditional on R(z). To sum up, we can bound | n | based on ( <ref type="formula">66</ref>) and ( <ref type="formula">67</ref>) in the subsequent way:</p><p>Here, | tr n | &#8804; n and n is uniformly bounded by some constant. Then, by H&#246;lder's inequality, <ref type="bibr">(68)</ref> implies that E[| n |] &#8594; 0, as n approaching to infinity. This concludes  Likewise, in Lemma 4.5, we consider D = ( 3n (z</p><p>Because n is uniformly bounded, D &#8804; C U . In terms of Lemma A.6, we only need to provide a lower bound for the imaginary part of U . Observe that Im U = Im 3n (z) n + v Id v Id since &#955; min ( n ) &#8805; 0 and Im 3n (z) &gt; 0. Thus, D &#8804; Cv -1 for all n. Meanwhile, we have the equation 3n (z) n D = n -zD and hence,</p><p>So applying Lemma 4.5 again, we have another limiting equation tr D + 3n (z) &#8594; 0. In other words, <ref type="bibr">(71)</ref> lim</p><p>Thanks to the identity</p><p>we can modify ( <ref type="formula">70</ref>) and (71) to get (72) lim n,d&#8594;&#8734; " mn (z) + tr " 3n (z) n + z Id -1 = 0.</p><p>Since 3n (z) and mn (z) are uniformly bounded, for any subsequence in n, there is a further convergent sub-subsequence. We denote the limit of such sub-subsequence by &#179;(z) and m(z) &#8712; C + respectively. Hence, by ( <ref type="formula">71</ref>) and ( <ref type="formula">72</ref>), one can conclude</p><p>Because of the convergence of the empirical eigenvalue distribution of n , we obtain the fixed point equation <ref type="bibr">(52)</ref> for &#179;(z). Analogously, we can also obtain <ref type="bibr">(51)</ref> for m(z) and &#179;(z). The existence and the uniqueness of the solutions to (51) and ( <ref type="formula">52</ref>) are proved in <ref type="bibr">[15]</ref>, Theorem 2.1, and <ref type="bibr">[80]</ref>, Section 3.4, which implies the convergence of mn (z) and 3n (z) to m(z) and &#179;(z) governed by the self-consistent equations ( <ref type="formula">51</ref>) and ( <ref type="formula">52</ref>) as n &#8594; &#8734;, respectively. Then, by virtue of condition <ref type="bibr">(49)</ref> in Theorem 4.1, we know m n (z) -mn (z) a.s.</p><p>-&#8594; 0 and &#179; n (z) -3n (z) a.s.</p><p>-&#8594; 0. Therefore, the empirical Stieltjes transform m n (z) converges to m(z) almost surely for each z &#8712; C + . Recall that the Stieltjes transform of &#956; is m(z). By the standard Stieltjes continuity theorem (see, e.g., <ref type="bibr">[13]</ref>, Theorem B.9), this finally concludes the weak convergence of empirical eigenvalue distribution of A n to &#956;. Now we show &#956; = &#956; s &#956; . The fixed point equations ( <ref type="formula">51</ref>) and ( <ref type="formula">52</ref>) induce <ref type="bibr">(73)</ref> &#179; 2 (z) + 1 + zm(z) = 0, since &#179;(z) &#8712; C + for any z &#8712; C + . Together with (51), we attain the same self-consistent equations for the convergence of the empirical spectral distribution of the Wigner-type matrix studied in <ref type="bibr">[15]</ref>, Theorem 1.1. Define W n , the n-by-n Wigner matrix, as a Hermitian matrix with independent entries</p><p>The Wigner-type matrix studied in <ref type="bibr">[15]</ref>, Definition 1.2, is indeed 1</p><p>n . Hence, such Wigner-type matrix 1</p><p>n has the same limiting spectral distribution as A n defined in Theorem 4.1. Both limits are determined by self-consistent equations ( <ref type="formula">51</ref>) and <ref type="bibr">(73)</ref>.</p><p>On the other hand, based on <ref type="bibr">[5]</ref>, Theorem 5.4.5, 1</p><p>&#8730; n W n and n are almost surely asymptotically free, that is, the empirical distribution of { 1 &#8730; n W n , n } converges almost surely to the law of {s, d}, where s and d are two free noncommutative random variables (s is a semicircle element and d has the law &#956; ). Thus, the limiting spectral distribution &#956; of 1</p><p>is the free multiplicative convolution between &#956; s and &#956; . This implies &#956; = &#956; s &#956; in our setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Proofs of Theorem 2.1 and Theorem 2.2.</head><p>To prove Theorem 2.1, we first establish the following proposition to analyze the difference between Stieltjes transform of ( <ref type="formula">12</ref>) and its expectation. This will assist us to verify condition <ref type="bibr">(49)</ref> in Theorem 4.1. The proof is based on <ref type="bibr">[29]</ref>, Lemma E.6. PROPOSITION 5.1. Let D &#8712; R n&#215;n be any deterministic symmetric matrix with a uniformly bounded spectral norm. Following the notions in Theorem 2.1, assume X &#8804; C for some constant C and Assumption 1.2 holds. Let R(z) be the resolvent</p><p>, for any fixed z &#8712; C + . Then, there exist some constants s, n 0 &gt; 0 such that for all n &gt; n 0 and any t &gt; 0,</p><p>PROOF. Define function F : R d 1 &#215;d 0 &#8594; R by F (W ) := tr R(z)D. Fix any W, &#8712; R d 1 &#215;d 0 where F = 1, and let W t = W + t . We want to verify F (W ) is a Lipschitz function in W with respect to the Frobenius norm. First, recall</p><p>where the last two terms are deterministic with respect to W . Hence, vec( )</p><p>where is the Hadamard product, and &#963; is applied entrywise. Here we utilize the formula</p><p>Therefore, based on the assumption of D, we have vec( )</p><p>for some constant C &gt; 0. For the first term in the product on the right-hand side,</p><p>For the second term,</p><p>This holds for every such that F = 1, so F (W ) is C/ &#8730; n-Lipschitz in W with respect to the Frobenius norm. Then the result follows from the Gaussian concentration inequality for Lipschitz functions.</p><p>Next, we investigate the approximation of = E w [&#963; (w X) &#963; (w X)] via the Hermite polynomials {h k } k&#8805;0 . The orthogonality of Hermite polynomials allows us to write as a series of kernel matrices. Then we only need to estimate each kernel matrix in this series. The proof is directly based on <ref type="bibr">[34]</ref>, Lemma 2. The only difference is that we consider the deterministic input data X with the (&#949; n , B)-orthonormal property, while in Lemma 2 of <ref type="bibr">[34]</ref>, the matrix X is formed by independent Gaussian vectors. LEMMA 5.2. Recall the definition of 0 in <ref type="bibr">(11)</ref>. If X is (&#949; n , B)-orthonormal and Assumption 1.2 holds, then we have the spectral norm bound</p><p>where</p><p>PROOF. By Assumption 1.2, we know that</p><p>For any fixed t, &#963; (tx) &#8712; L 2 (R, ). This is because &#963; (x) &#8712; L 2 (R, ) is a Lipschitz function and by triangle inequality |&#963; (tx)&#963; (x)| &#8804; &#955; &#963; |tx -x|, we have, for &#958; &#8764; N (0, 1),</p><p>For 1 &#8804; &#945; &#8804; n, let &#963; &#945; (x) := &#963; ( x &#945; x) and the Hermite expansion of &#963; a can be written as</p><p>where the coefficient</p><p>where (&#958; &#945; , &#958; &#179; ) = (w u &#945; , w u &#179; ) is a Gaussian random vector with mean zero and covariance</p><p>.</p><p>By the orthogonality of Hermite polynomials with respect to and Lemma A.5, we can obtain</p><p>which leads to</p><p>For any k &#8712; N, let T k be an n-by-n matrix with (&#945;, &#179;)th entry</p><p>Specifically, for any k &#8712; N, we have</p><p>. At first, we consider twice differentiable &#963; in Assumption 1.2. Similar with <ref type="bibr">[34]</ref>, equation <ref type="bibr">(26)</ref>, for any &#949; &gt; 0 and |t -1| &#8804; &#949;, we take the Taylor approximation of &#963; (tx) at point x, then there exists &#951; between tx and x such that</p><p>Replacing x by &#958; and taking expectation, since &#963; is uniformly bounded, we can get</p><p>where constant C does not depend on k. As for piecewise linear &#963; , it is not hard to see <ref type="bibr">(79)</ref> E &#963; (t&#958; )&#963; (&#958; ) = E &#963; (&#958; )&#958; (t -1). Now, we begin to approximate T k separately based on ( <ref type="formula">77</ref>), <ref type="bibr">(78)</ref>, and <ref type="bibr">(79)</ref>. Denote diag(A) the diagonal submatrix of a matrix A.</p><p>(1) Approximation for k&#8805;4 (T kdiag(T k )). At first, we estimate the L 2 norm with respect to of the function &#963; &#945; . Recall that</p><p>Hence, &#963; &#945; L 2 is uniformly bounded with some constant for all large n. Next, we estimate the off-diagonal entries of T k when k &#8805; 4. From <ref type="bibr">(76)</ref>, we obtain that</p><p>when n is sufficiently large.</p><p>(2) Approximation for T 0 . Recall E[&#963; (&#958; )] = 0 and by Gaussian integration by part,</p><p>Then, we have</p><p>If &#963; is twice differentiable, then E[&#963; (&#958; )] = &#8730; 2&#950; 2 (&#963; ) as well. Thus, taking t = x &#945; in ( <ref type="formula">77</ref>) and ( <ref type="formula">79</ref>) implies that for any 1 &#8804; &#945; &#8804; n, <ref type="bibr">(11)</ref>. Then, <ref type="bibr">(83)</ref> </p><p>Hence the difference between T 0 and &#956;&#956; is controlled by</p><p>(3) Approximation for T k for k = 1, 2, 3. For 0 &#8804; k &#8804; 3, Assumption 1.4 and <ref type="bibr">(78)</ref> show that</p><p>when n is sufficiently large. Notice that T k = D k f k (X X)D k , where D k is the diagonal matrix. Hence, by <ref type="bibr">(86)</ref>,</p><p>And for k = 1, 2, 3, by the triangle inequality,</p><p>From Lemma A.1 in Appendix A, we have that</p><p>So the left-hand side of ( <ref type="formula">87</ref>) is bounded. Analogously, we can verify f 3 (X X) is also bounded. Therefore, we have</p><p>for some constant C and k = 1, 2, 3 when n is sufficiently large. (4) Approximation for k&#8805;4 diag(T k ). Since u &#945; u &#945; = 1, we know</p><p>.</p><p>First, by ( <ref type="formula">80</ref>) and ( <ref type="formula">81</ref>), we can claim that</p><p>Second, in terms of (86), we obtain</p><p>for k = 1, 2 and 3. Combining these together, we conclude that</p><p>In terms of approximations ( <ref type="formula">82</ref>), ( <ref type="formula">85</ref>), (88), and (89), we can finally manifest</p><p>for some constant C &gt; 0 as &#8730; n&#949; 2 n &#8594; 0. The spectral norm bound of is directly deduced by the spectral norm bound of 0 based on ( <ref type="formula">84</ref>) and ( <ref type="formula">87</ref>), together with (90). REMARK 5.3 (Optimality of &#949; n ). For general deterministic data X, our pairwise orthogonality assumption with rate n&#949; 4 n = o(1) is optimal for the approximation of by 0 in the spectral norm. If we relax the decay rate of &#949; n in Assumption 1.4, the above approximation may require including terms of higher-degree f k (X X) for k &#8805; 4 in 0 , which will lead to the invalidation of some of our following results and simplifications. Subsequent to the initial completion of our paper, this weaker regime has been considered in our follow-up work <ref type="bibr">[81]</ref>.</p><p>Next, we continue to provide an additional estimate for , but in the Frobenius norm to further simplify the limiting spectral distribution of . LEMMA 5.4. If Assumptions 1.2 and 1.4 hold, then has the same limiting spectrum as</p><p>PROOF. By the definition of b &#963; , we know that b &#963; = &#950; 1 (&#963; ). As a direct deduction of Lemma 5.2, the limiting spectrum of is identical to the limiting spectrum of 0 . To prove this lemma, it suffices to check the Frobenius norm of the difference between 0 and</p><p>By the definition of vector &#956; and the assumption of X, we have</p><p>For k = 2, 3, the Frobenius norm can be controlled by</p><p>""</p><p>Hence, as n &#8594; &#8734;, we have</p><p>Hence, lim spec is the same as lim spec</p><p>Moreover, the proof of Lemma 5.4 can be modified to prove <ref type="bibr">(40)</ref>, so we omit its proof. Now, based on Corollary 3.7, Proposition 5.1, Lemma 5.2, and Lemma 5.4, applying Theorem 4.1 for general sample covariance matrices, we can finish the proof of Theorem 2.1. PROOF OF THEOREM 2.1. Based on Corollary 3.7 and Proposition 5.1, we can verify the conditions ( <ref type="formula">49</ref>) and ( <ref type="formula">50</ref>) in Theorem 4.1. By Lemma 5.2 and Lemma 5.4, we know that the limiting eigenvalue distributions of and (1b 2 &#963; ) Id +b 2 &#963; X X are identical and is uniformly bounded. So the limiting eigenvalue distribution of denoted by &#956; is just</p><p>Hence, the first conclusion of Theorem 2.1 follows from Theorem 4.1. For the second part of this theorem, we consider the difference</p><p>where we employ Lemma 5.2 and the assumption d 1 &#949; 4 n = o(1). Thus, because of Lemma A.7,</p><p>) has the same limiting eigenvalue distribution as <ref type="bibr">(12)</ref>, &#956; s ((1b 2 &#963; ) + b 2 &#963; &#956; 0 ). This finishes the proof of Theorem 2.1.</p><p>Next, we move to study the empirical NTK and its corresponding limiting eigenvalue distribution. Similarly, we first verify that such NTK concentrates around its expectation and then simplify this expectation by some deterministic matrix only depending on the input data matrix X and nonlinear activation &#963; . The following lemma can be obtained from ( <ref type="formula">23</ref>) in Theorem 2.7. LEMMA 5.5. Suppose that Assumption 1.1 holds, sup x&#8712;R |&#963; (x)| &#8804; &#955; &#963; and X &#8804; B.</p><p>n n, where and 0 are defined in <ref type="bibr">(17)</ref> and <ref type="bibr">(18)</ref>, respectively, and C B is a constant depending on B.</p><p>PROOF. We can directly apply methods in the proof of Lemma 5.2. Notice that ( <ref type="formula">6</ref>) and ( <ref type="formula">8</ref>) imply</p><p>for any standard Gaussian random vector w &#8764; N (0, Id). Recall that <ref type="bibr">(19)</ref> defines the kth coefficient of Hermite expansion of &#963; (x) by &#951; k (&#963; ) for any k &#8712; N. Then, Assumption 1.2 indicates b &#963; = &#951; 0 (&#963; ) and a &#963; = &#8734; k=0 &#951; 2 k (&#963; ). For 1 &#8804; &#945; &#8804; n, we introduce &#966; &#945; (x) := &#963; ( x &#945; x) and the Hermite expansion of this function as</p><p>where (&#958; &#945; , &#958; &#179; ) = (w u &#945; , w u &#179; ) is a Gaussian random vector with mean zero and covariance <ref type="bibr">(74)</ref>. Following the derivation of formula <ref type="bibr">(75)</ref>, we obtain</p><p>For any k &#8712; N, let T k &#8712; R n&#215;n be an n-by-n matrix with (&#945;, &#179;) entry</p><p>We can write</p><p>Then, adopting the proof of (88), we can similarly conclude that</p><p>for some constant C and k = 0, 1, 2, when n is sufficiently large. Likewise, <ref type="bibr">(82)</ref> indicates</p><p>and a similar proof of (89) implies that</p><p>Based on these approximations, we can conclude the result of this lemma.</p><p>PROOF OF THEOREM 2.2. The first part of the statement is a straight consequence of (91) and Theorem 2.1. Denote by</p><p>Hence, (91) indicates B -A &#8594; 0 as n &#8594; &#8734;. This convergence implies that limiting laws of A and B are identical because of Lemma A.3. The second part is because of Lemma 5.2 and Lemma 5.6. From ( <ref type="formula">7</ref>) and ( <ref type="formula">17</ref>), E[H ] = + . Then almost surely,</p><p>as &#949; 4 n d 1 &#8594; 0 by the assumption of Theorem 2.2. Therefore, the limiting eigenvalue distribution of ( <ref type="formula">21</ref>) is the same as <ref type="bibr">(20)</ref>.</p><p>6. Proof of the concentration for extreme eigenvalues. In this section, we obtain the estimates of the extreme eigenvalues for the CK and NTK we studied in Section 5. The limiting spectral distribution of 1</p><p>) tells us the bulk behavior of the spectrum. An estimation of the extreme eigenvalues will show that the eigenvalues are confined in a finite interval with high probability. We first provide a nonasymptotic bound on the concentration of 1 d 1 Y Y under the spectral norm. The proof is based on the Hanson-Wright inequality we proved in Section 3 and an &#949;-net argument.</p><p>PROOF OF THEOREM 2.3. Recall notation in Section 1. Define</p><p>where y i = &#963; (w i X).</p><p>For any fixed z &#8712; S n-1 , we have</p><p>where</p><p>and column vector (y 1 , . . . , y d 1 ) &#8712; R nd 1 is the concatenation of column vectors y 1 , . . . , y d 1 . Then (y 1 , . . . ,</p><p>Notice that</p><p>Denote &#7929; = (y 1 , . . . , y d 1 ). With <ref type="bibr">(48)</ref>, we obtain</p><p>where the last line is from the assumptions on X and &#963; . When B = 0, applying <ref type="bibr">(47)</ref> to (92) implies P " (y 1 , . . . , y d 1 ) A z (y 1 , . . . ,</p><p>Let N be a 1/2-net on S n-1 with |N | &#8804; 5 n (see, e.g., <ref type="bibr">[78]</ref>, Corollary 4.2.13), then</p><p>Taking a union bound over N yields</p><p>to conclude</p><p>the upper bound in ( <ref type="formula">22</ref>) is then verified. When B = 0, we can apply <ref type="bibr">(46)</ref> and follow the same steps to get the desired bound.</p><p>By the concentration inequality in Theorem 2.3, we can get a lower bound on the smallest eigenvalue of the conjugate kernel 1 d 1 Y Y as follows.</p><p>LEMMA 6.1. Assume X satisfies n i=1 ( x i 2 -1) 2 &#8804; B 2 for a constant B &gt; 0, and &#963; is &#955; &#963; -Lipschitz with E&#963; (&#958; ) = 0. Then with probability at least 1 -4e -2n ,</p><p>PROOF. By Weyl's inequality <ref type="bibr">[5]</ref>, Corollary A.6, we have</p><p>Then (93) follows from <ref type="bibr">(22)</ref>.</p><p>The lower bound in (93) relies on &#955; min ( ). Under certain assumptions on X and &#963; , we can guarantee that &#955; min ( ) is bounded below by an absolute constant. LEMMA 6.2. Assume &#963; is not a linear function and &#963; (x) is Lipschitz. Then</p><p>PROOF. Suppose that sup{k &#8712; N : &#950; k (&#963; ) 2 &gt; 0} is finite. Then &#963; is a polynomial of degree at least 2 from our assumption, which is a contradiction to the fact that &#963; is Lipschitz. Hence, (94) holds. LEMMA 6.3. Assume Assumption 1.2 holds, &#963; is not a linear function, and X satisfies (&#949; n , B)-orthonormal property. Then,</p><p>REMARK 6.4. This bound will not hold when &#963; is a linear function. Suppose &#963; is a linear function, under Assumption 1.2, we must have &#963; (x) = x and = X X. Then we will not have a lower bound on &#955; min ( ) based on the Hermite coefficients of &#963; .</p><p>PROOF OF LEMMA 6.3. From Lemma 5.2, under our assumptions, we know that</p><p>where 0 is given by <ref type="bibr">(11)</ref>. Thus, &#955; min ( ) &#8805; &#955; min ( 0 ) -C B &#949; 2 n &#8730; n, and, from Weyl's inequality <ref type="bibr">[5]</ref>, Theorem A.5, we have</p><p>0 &#215;n , and each column of K k is given by the kth Kronecker product</p><p>Since &#963; is nonlinear and Lipschitz, (94) holds for &#963; . Therefore,</p><p>and (95) holds. Theorem 2.5 then follows directly from Lemma 6.1 and Lemma 6.3. Next, we move on to nonasymptotic estimations for NTKs. Recall that the empirical NTK matrix H is given by <ref type="bibr">(7)</ref> and the &#945;th column of S is defined by diag(&#963; (W x &#945; ))a, for 1 &#8804; &#945; &#8804; n, in <ref type="bibr">(8)</ref>.</p><p>The ith row of S is given by z i := &#963; (w i X)a i , and E[z i ] = 0, where a i is the ith entry of a. Define D &#945; = diag(&#963; (w &#945; X)a &#945; ), for 1 &#8804; &#945; &#8804; d 1 . We can rewrite (S S) (X X) as</p><p>Let us define L and further expand it as follows:</p><p>Here Z i is a centered random matrix, and we can apply matrix Bernstein's inequality to show the concentration of L. Since Z i does not have an almost sure bound on the spectral norm, we will use the following sub-exponential version of the matrix Bernstein inequality from <ref type="bibr">[77]</ref>. <ref type="bibr">LEMMA 6.5 ([77]</ref>, Theorem 6.2). Let Z k be independent Hermitian matrices of size n &#215; n. Assume</p><p>for any integer p &#8805; 2. Then for all t &#8805; 0,</p><p>PROOF OF THEOREM 2.7. From (97), EZ i = 0, and</p><p>where C 1 = &#955; 2 &#963; X 2 and where a i &#8764; N (0, 1) is the ith entry of the second layer weight a. Then</p><p>So we can take R = 2C 2 1 , a 2 = 8C 4 1 in (98) and obtain</p><p>.</p><p>Hence, L defined in (96) has a probability bound:</p><p>.</p><p>Under the assumption that d 1 &#8805; log n, we conclude that, with high probability at least 1n -7/3 ,</p><p>Thus, as a corollary, the two statements in Lemma 5.5 follow from (99). Meanwhile, since</p><p>the bound in <ref type="bibr">(24)</ref> follows from Theorem 2.3 and (99).</p><p>We now proceed to provide a lower bound of &#955; min (H ) from Theorem 2.7.</p><p>PROOF OF THEOREM 2.9. Note that from ( <ref type="formula">7</ref>), <ref type="bibr">(17)</ref>, and (96), we have</p><p>Then with Lemma 5.6, we can get</p><p>Therefore, from Theorem 2.7, with probability at least 1n -7/3 ,</p><p>Since &#963; is Lipschitz and nonlinear, we know &#963; (x) is not a linear function (including the constant function) and |&#963; (x)| is bounded. Suppose that &#963; (x) has finite many nonzero Hermite coefficients, &#963; (x) is a polynomial, then we get a contradiction. Hence, the Hermite coefficients of &#963; satisfy</p><p>This finishes the proof.</p><p>7. Proofs of Theorem 2.12 and Theorem 2.17. By definitions, the random matrix K n (X, X) is 1 d 1 Y Y and the kernel matrix K(X, X) = is defined in <ref type="bibr">(3)</ref>. These two matrices have already been analyzed in Theorem 2.3 and Theorem 2.5, so we will apply these results to estimate how great the difference between training errors of random feature regression and its corresponding kernel regression. PROOF OF THEOREM 2.12. Denote K &#955; := (K + &#955; Id). From the definitions of training errors in <ref type="bibr">(31)</ref> and <ref type="bibr">(32)</ref>, we have</p><p>(100)</p><p>Here, in (100), we employ the identity (101)</p><p>for A = (K(X, X) + &#955; Id) -2 and B = (K n (X, X) + &#955; Id) -2 , and the fact that (K(X, X) + &#955; Id) -1 &#8804; &#955; -1 min (K(X, X)) and (K n (X, X) + &#955; Id) -1 &#8804; &#955; -1 min (K n (X, X)). Next, before providing uniform upper bounds for &#955; -2 min (K(X, X)) and &#955; -2 min (K n (X, X)) in (100), we can first get a bound for the last term of (100) as follows:</p><p>for some constant C &gt; 0, with probability at least 1 -4e -2n , where the last bound in (102) is due to Theorem 2.3 and Lemma A.9 in Appendix A. Additionally, combining Theorem 2.3 and Theorem 2.5, we can easily get (103)</p><p>for all large n and some universal constant C, under the same event that (102) holds. Theorem 6.3 also shows &#955; -1 min (K(X, X)) &#8804; C for all large n. Hence, with the upper bounds for &#955; -2 min (K(X, X)) and &#955; -2 min (K n (X, X)), <ref type="bibr">(33)</ref> follows from the bounds of (100) and (102).</p><p>For ease of notation, we denote K := K(X, X) and K n := K n (X, X). Hence, from (34), we can further decompose the test errors for K and K n into</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Let us denote</head><p>As we can see, to compare the test errors between random feature and kernel regression models, we need to control |E 1 -&#274;1 | and |E 2 -&#274;2 |. First, it is necessary to study the concentrations of</p><p>LEMMA 7.1. Under Assumption 1.2 for &#963; and Assumption 2.14 for x and X, with probability at least 1 -4e -2n , we have</p><p>where C &gt; 0 is a universal constant. Here, we only consider the randomness of the weight matrix in K n (x, X) defined by <ref type="bibr">(28)</ref> and <ref type="bibr">(29)</ref>.</p><p>PROOF. We consider X = [x 1 , . . . , x n , x], its corresponding kernels K n ( X, X), and K( X, X) &#8712; R (n+1)&#215;(n+1) . Under Assumption 2.14, we can directly apply Theorem 2.3 to get the concentration of K n ( X, X) around K( X, X), namely,</p><p>with probability at least 1 -4e -2n . Meanwhile, we can write K n ( X, X) and K( X, X) as block matrices:</p><p>Since the 2 -norm of any row is bounded above by the spectral norm of its entire matrix, we complete the proof of (106).</p><p>LEMMA 7.2. Assume that training labels satisfy Assumption 2.13 and X &#8804; B, then for any deterministic A &#8712; R n&#215;n , we have</p><p>F , where constant c only depends on &#963; &#179; &#179; &#179; , &#963; ", and B. Moreover,</p><p>PROOF. We follow the idea in Lemma C.8 of <ref type="bibr">[56]</ref> to investigate the variance of the quadratic form for the Gaussian random vector by <ref type="bibr">(108)</ref> Var</p><p>F , for any deterministic square matrix A and standard normal random vector g. Notice that the quadratic form</p><p>where g is a standard Gaussian random vector in R d 0 +n . Similarly, the second quadratic form can be written as</p><p>and similarly &#195;2 F &#8804; c A 2 F for a constant c, we can complete the proof.</p><p>As a remark, in Lemma 7.2, for simplicity, we only provide a variance control for the quadratic forms to obtain convergence in probability in the following proofs of Theorems 2.16 and 2.17. However, we can apply Hanson-Wright inequalities in Section 3 to get more precise probability bounds and consider non-Gaussian distributions for &#179; &#179; &#179; * and ".</p><p>PROOF OF THEOREM 2.16. Based on the preceding expansions of L( f (RF ) &#955; (x)) and L( f (K) &#955; (x)) in ( <ref type="formula">104</ref>) and (105), we need to control the right-hand side of</p><p>In the subsequent procedure, we first take the concentrations of E 1 and E 2 with respect to normal random vectors &#179; &#179; &#179; * and ", respectively. Then, we apply Theorem 2.3 and Lemma 7.1 to complete the proof of <ref type="bibr">(35)</ref>. For simplicity, we start with the second term</p><p>where I 1 and I 2 are quadratic forms defined below</p><p>and their expectations with respect to random vectors &#179; &#179; &#179; * and " are denoted by</p><p>We first consider the randomness of the weight matrix in K n and define the event E where both (103) and (107) hold. Then, Theorem 2.5 and the proof of Lemma 7.1 indicate that event E occurs with probability at least 1 -4e -2n for all large n. Notice that E does not rely on the randomness of test data x.</p><p>We now consider A = E x [x(K n (x, X) -K(x, X))](K n + &#955; Id) -1 in Lemma 7.2. Conditioning on event E, we have</p><p>for some constant C, where we utilize the assumption E[ x 2 ] = 1. Hence, based on Lemma 7.2, we know Var ",&#179; &#179; &#179; * (I 1 ) &#8804; cn/d 1 , for some constant c. By Chebyshev's inequality and event E , (111)</p><p>), we can apply ( <ref type="formula">101</ref>) and Moreover, conditioning on the event E ,</p><p>for some constant C. In the same way, with (112</p><p>Analogously, the first term | &#274;1 -E 1 | is controlled by the following four quadratic forms</p><p>where we define by J i := y A i y for 1 &#8804; i &#8804; 4 and</p><p>Similarly with (110) and (113), it is not hard to verify</p><p>d 1 conditioning on the event E . Then, like (111), we can invoke Lemma 7.2 for each A i to apply Chebyshev's inequality and conclude | &#274;1 -E 1 | = o((n/d 1 ) 1/2-&#949; ) with probability 1o(1) when d 1 /n &#8594; &#8734;, for any &#949; &#8712; (0, 1/2). LEMMA 7.3. With Assumptions 1.2 and 2.14, for (&#949; n , B)-orthonormal X, we have that</p><p>for some constant C &gt; 0. PROOF. By Lemma A.8, we have an entrywise approximation</p><p>Then, we can verify (114) based on the following approximation</p><p>for some universal constant C. The same argument can also be employed to prove (115), so details will be omitted here.</p><p>PROOF OF THEOREM 2.17. From ( <ref type="formula">33</ref>) and ( <ref type="formula">35</ref>), we can easily conclude that train under the ultra-wide regime.</p><p>Recall that K &#955; = (K + &#955; Id) and the test error is given by</p><p>where</p><p>The spectral norm of K &#955; is bounded from above and the smallest eigenvalue is bounded from below by some positive constants.</p><p>We first focus on the last two terms L 1 and L 2 in the test error. Let us define </p><p>To obtain the last approximation, we define K(X, X) := b 2 &#963; X X + (1b 2 &#963; ) Id and (119) K&#955; := b 2 &#963; X X + " 1 + &#955;b 2 &#963; Id . We aim to replace K &#955; by K&#955; in L1 and L2 . Recalling the identity (101), we have</p><p>K(X, X) -K(X, X) K -1 &#955; . Since &#963; is not a linear function, 1b 2 &#963; &gt; 0. Then, with (103), the proof of Lemma 5.4 indicates</p><p>where we apply the fact that &#955; min ( K(X, X)) &#8805; 1b 2 &#963; &gt; 0. Let us denote Then, with the help of (120) and uniform bounds of the spectral norms of X X, K -1 &#955; and K-1 &#955; , we obtain that</p><p>as n &#8594; &#8734;, n/d 0 &#8594; &#180;and n&#949; 4 n &#8594; 0. Combining all the approximations, we conclude that L i and L 0 i have identical limits in probability for i = 1, 2. On the other hand, based on the assumption of X and definitions in (119), (121), and (122), it is not hard to check that</p><p>Therefore, L 1 and L 2 converge in probability to the above limits, respectively, as n &#8594; &#8734;. In the end, we apply the concentration of the quadratic form &#179; &#179; &#179; * &#179; &#179; &#179; * in (118) to get 1 d 0 &#179; &#179; &#179; * 2 P -&#8594; &#963; 2 &#179; &#179; &#179; . Then, by (117), we can get the limit in <ref type="bibr">(38)</ref> for the test error L( f (RF ) &#955;</p><p>). As a byproduct, we can even use L 0 1 and L 0 2 to form an n-dependent deterministic equivalent of L( f (RF ) &#955; ) as well.</p><p>Thanks to Lemma 7.2, the training error, E (K,&#955;) train = &#955; 2 n y K -2 &#955; y, analogously, concentrates around its expectation with respect to &#179; &#179; &#179; * and ", which is &#963; 2 &#179; &#179; &#179; &#955; 2 tr K -2 &#955; X X + &#963; 2 " &#955; 2 tr K -2 &#955; . Moreover, because of (120), we can further substitute K -2 &#955; by K-2 &#955; defined in (119). Hence, we know that, asymptotically, E (K,&#955;) train&#963; 2 &#179; &#179; &#179; &#955; 2 tr K-2 &#955; X X&#963; 2 " &#955; 2 tr K-2 &#955; P -&#8594; 0, where as n, d 0 &#8594; &#8734;,</p><p>&#963; ) 2 d&#956; 0 (x). The last two limits are due to &#956; 0 = lim spec X X as n, d 0 &#8594; &#8734;. Therefore, by (116), we obtain our final result (37) in Theorem 2.17. APPENDIX A: AUXILIARY LEMMAS LEMMA A.1 (Equation (3.7.9) in <ref type="bibr">[43]</ref>). Let A, B be two n &#215; n matrices, A be positive semidefinite, and A B be the Hadamard product between A and B. Then,</p><p>LEMMA A.2 (Sherman-Morrison formula, <ref type="bibr">[17]</ref>). Suppose A &#8712; R n&#215;n is an invertible square matrix and u, v &#8712; R n are column vectors. Then</p><p>LEMMA A.3 (Theorem A.45 in <ref type="bibr">[13]</ref>). Let A, B be two n &#215; n Hermitian matrices. Then A and B have the same limiting spectral distribution if A -B &#8594; 0 as n &#8594; &#8734;.</p><p>LEMMA A.4 (Theorem B.11 in <ref type="bibr">[13]</ref>). Let z = x + iv &#8712; C, v &gt; 0 and s(z) be the Stieltjes transform of a probability measure. Then | Re s(z)| &#8804; v -1/2 &#8730; Im s(z).</p><p>LEMMA A.5 (Lemma D.2 in <ref type="bibr">[60]</ref>). Let x, y &#8712; R d such that x = y = 1 and w &#8764; N (0, I d ). Let h j be the j th normalized Hermite polynomial given in <ref type="bibr">(1.5)</ref>. Then PROOF. When &#963; is twice differentiable in Assumption 1.2, this result follows from Lemma D.3 in <ref type="bibr">[29]</ref>. When &#963; is a piecewise linear function defined in case 2 of Assumption 1.2, the second inequality follows from (79) with t = x &#945; . For the first inequality, the Hermite expansion of &#945;&#179; is given by <ref type="bibr">(75)</ref>  "</p><p>for &#949; &#8712; (0, 1) and (&#949;, B)-orthonormal X. This completes the proof of this lemma.</p><p>With the above lemma, the proof of Lemma D.4 in <ref type="bibr">[29]</ref> yields the following lemma.</p><p>LEMMA A.9. Under the same assumptions as Lemma A.8, there exists a constant C such that K(X, X) &#8804; CB 2 . Additionally, with Assumption 2.14, we have K( X, X) &#8804; CB 2 . APPENDIX B: ADDITIONAL SIMULATIONS Figures 4 and 5 provide additional simulations for the eigenvalue distribution described in Theorem 2.1 with different activation functions and scaling. Here, we compute the empirical eigenvalue distributions of centered CK matrices in histograms and the limiting spectra in terms of self-consistent equations. All the input data X's are standard random Gaussian matrices. Interestingly, in Figure 5, we observe an outlier that emerges outside the bulk distribution for the piecewise linear activation function defined in case 2 of Assumption 1.2. The analysis of the emergence of the outlier, in this case, would be interesting for future work. Acknowledgments. Z.W. would like to thank Denny Wu for his valuable suggestions and comments. Both authors would like to thank Lucas Benigni, Ioana Dumitriu, and Kameron Decker Harris for their helpful discussion. Funding. Z.W. is partially supported by NSF DMS-2055340 and NSF DMS-2154099. This material is based upon work supported by the National Science Foundation under Grant No. DMS-1928930 while Y.Z. was in residence at the Mathematical Sciences Research Institute in Berkeley, California, during the Fall 2021 semester for the program "Universality </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>This linear-width regime is also known as the high-dimensional regime, while our ultra-wide regime is also called a highly overparameterized regime in literature, see<ref type="bibr">[56]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>As Remark 2.11 stated, when &#963; (x) = x, &#955; min of CK may be possibly vanishing. To include the linear activation function, we can alternatively assume that the ridge parameter &#955; is strictly positive and focus on random feature ridge regressions.</p></note>
		</body>
		</text>
</TEI>
