<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022 Fall</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10423426</idno>
					<idno type="doi"></idno>
					<title level='j'>A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>L. Zhou</author><author>F. Koehler</author><author>P. Sur</author><author>D. J. Sutherland</author><author>N. Srebro</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
<abstract><ab><![CDATA[We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss ℓ can control the test error under all Moreau envelopes of the loss ℓ. We use our finite-sample bound to directly recover the “optimistic rate” of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal-ℓ2-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand’s well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Modern machine learning models often contain more parameters than the number of training samples. Despite the capacity to overfit, training these models without explicit regularization has been empirically shown to achieve good generalization performance <ref type="bibr">(Neyshabur et al. 2015;</ref><ref type="bibr">C. Zhang et al. 2017;</ref><ref type="bibr">Belkin et al. 2019</ref>). On the theoretical side, the study of minimal-norm interpolation has revealed fascinating phenomena that challenge traditional understandings of machine learning.</p><p>We now have a better understanding of how properties of the data distribution and algorithmic bias can affect generalization in high-dimensional linear regression. For example, data with a spiked covariance structure can ensure that the test error of ridge regression will be approximately constant once the regularization strength is small enough for the model to fit the signal <ref type="bibr">(Zhou et al. 2021;</ref><ref type="bibr">Tsigler and Bartlett 2020)</ref>, contradicting the classical U-shaped curve expected from arguments about the bias-variance tradeoff. Surprisingly, even when the signal is sparse, the risk of the minimal-ℓ1 norm interpolator can be shown to converge much more slowly to the Bayes risk than the minimal-ℓ2 norm interpolator in the junk feature setting <ref type="bibr">(Chatterji and Long 2021;</ref><ref type="bibr">Koehler et al. 2021)</ref>. In contrast, the minimal-ℓ2 norm interpolator fails to achieve consistency in the isotropic setting, while the minimal-ℓ1 norm interpolator is consistent with sparsity but suffers from an exponentially slow rate in the number of parameters d <ref type="bibr">(G. Wang et al. 2021;</ref><ref type="bibr">Muthukumar et al. 2020</ref>). 
However, we can still achieve the minimax rate with minimal-ℓp norm interpolators for p extremely close to 1 <ref type="bibr">(Donhauser et al. 2022</ref>).</p><p>In fact, many of the intriguing phenomena in the work above may be understood through the norm of a predictor; localized notions of uniform convergence have emerged as essential tools for doing so. Compared to other techniques, uniform convergence analyses can have the benefit of requiring neither particular proportional scaling regimes nor closed-form expressions for the learned model, since only an approximate estimate of its complexity is needed. Despite uniform convergence's potential for wider applicability, though, work in this area has mostly focused on linear regression settings with strong assumptions: that the conditional expectation of the label is linear with respect to the features, and that the residual has constant variance. In contrast, classical agnostic learning guarantees established by uniform convergence usually need only much weaker assumptions on the data distribution, and apply to a broader range of losses and function classes. For example, <ref type="bibr">Srebro et al. (2010)</ref> show that bounds with an "optimistic rate" hold generally for any smooth, nonnegative loss, though the hidden logarithmic factor in their result is too loose to explain noisy interpolation; this was recently addressed by <ref type="bibr">Zhou et al. (2021)</ref> in the special case of well-specified linear regression.</p><p>In this work, we take a step further towards agnostic interpolation learning and consider a high-dimensional generalized linear model (GLM) setting where the label is generated by a potentially misspecified model. We show a new generalization bound that allows us to use the Moreau envelopes of any continuous loss function as an intermediate tool. 
By optimizing over the smoothing parameter λ to balance the approximation and generalization errors, our general Moreau envelope theory yields sharp non-asymptotic generalization bounds in a wide variety of settings. Applied to linear regression with the square loss, we recover the optimistic rate of <ref type="bibr">Zhou et al. (2021)</ref> and show that it can more generally be extended to handle model misspecification, such as nonlinear trends and heteroskedasticity. The generality of our result comes from the fact that taking the Moreau envelope of the square loss only scales it by a constant; this property alone suffices to obtain a generalization guarantee in terms of the original square loss. The squared hinge loss enjoys the same property, and hence a completely analogous argument shows an optimistic rate in that setting. Combined with an analysis of the margin, we show a novel consistency result for max-margin classifiers.</p><p>More generally, we apply the Moreau envelope theory to obtain a generic bound for any Lipschitz loss and any smooth, nonnegative loss with sharp constants. Looking specifically at the test error of an empirical risk minimizer (ERM), we show that our generalization bound with localized Gaussian width will be asymptotically sharp even when overfitting is not necessarily benign, yielding a version of the asymptotic Moreau envelope framework for analyzing M-estimators <ref type="bibr">(El Karoui et al. 2013;</ref><ref type="bibr">Bean et al. 2013;</ref><ref type="bibr">Donoho and Montanari 2016;</ref><ref type="bibr">Thrampoulidis et al. 2018;</ref><ref type="bibr">Sur and Cand&#232;s 2019)</ref> but for the problem of generalization. Numerical simulations on a variety of feature distributions and label-generating processes confirm the wide applicability of our theory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>The Moreau envelope has been useful in characterizing asymptotic properties of M-estimators in linear models <ref type="bibr">(Bean et al. 2013;</ref><ref type="bibr">El Karoui et al. 2013;</ref><ref type="bibr">Donoho and Montanari 2016;</ref><ref type="bibr">El Karoui 2018;</ref><ref type="bibr">Thrampoulidis et al. 2018</ref>) and logistic regression <ref type="bibr">(Sur and Cand&#232;s 2019;</ref><ref type="bibr">Sur et al. 2019;</ref><ref type="bibr">Cand&#232;s and Sur 2020;</ref><ref type="bibr">Salehi et al. 2019;</ref><ref type="bibr">Zhao et al. 2022)</ref>. This theory focuses on estimation and inference under proportional asymptotics, rather than generalization, and does not provide any non-asymptotic results. <ref type="bibr">Loureiro et al. 2021</ref> studied proportional asymptotics in a student-teacher model similar to the one we consider, and gave finite-sample bounds on the speed of convergence to the limit. Like other works using the CGMT framework, they give formulas directly in terms of the regularization parameter used in M-estimation, and they also give results to characterize the distribution of the output of the estimator. Our work does not study the properties of the M-estimator in this level of detail. Instead, our approach is fundamentally based on establishing a nonasymptotic generalization bound. This enables our results to apply beyond proportional scaling (which is crucial to analyze benign overfitting), and yield guarantees for predictors besides the exact optima of convex M-estimation objectives. Another difference is that the class of generalized linear models we consider is somewhat broader: we allow multi-index models for the teacher (which may also be possible with their approach) and our loss function does not need to be convex.</p><p>For linear regression, <ref type="bibr">Bartlett et al. 
(2020)</ref> identify nearly-matching necessary and sufficient conditions for the consistency of minimal-ℓ2 norm interpolation; their subsequent work <ref type="bibr">(Tsigler and Bartlett 2020)</ref> shows generalization bounds for overparametrized ridge regression. Following their work, <ref type="bibr">Negrea et al. (2020)</ref> and <ref type="bibr">Zhou et al. (2020)</ref> explore the role of uniform convergence, including showing that uniformly bounding the difference between training error and test error fails to explain interpolation learning. <ref type="bibr">Zhou et al. (2020)</ref> argue, however, that uniform convergence of interpolators is sufficient to establish consistency in a toy example. <ref type="bibr">Koehler et al. (2021)</ref> extend their result to arbitrary data covariance and norm, recovering the benign overfitting conditions of <ref type="bibr">Bartlett et al. (2020)</ref> as well as proving novel consistency results for the minimal-ℓ1 norm interpolator. Based on this uniform convergence framework, <ref type="bibr">G. Wang et al. (2021)</ref> establish tight bounds for the minimal-ℓ1 norm interpolator under a sparse signal with isotropic data. Earlier work <ref type="bibr">(Ju et al. 2020;</ref><ref type="bibr">Chinot et al. 2020;</ref><ref type="bibr">Li and Wei 2021</ref>) also studied the minimal-ℓ1 norm interpolator, without showing consistency. Though the minimal-ℓ1 norm interpolator suffers from an exponentially slow rate, <ref type="bibr">Donhauser et al. (2022)</ref> show that the minimal-ℓp norm interpolator can achieve faster rates with p close to 1. <ref type="bibr">Zhou et al. (2021)</ref> show a risk-dependent ("localized") bound that extends the uniform convergence of interpolators guarantee to predictors with arbitrary training loss, and use it to establish generalization for regularized estimators such as ridge regression and LASSO. 
Our Moreau envelope theory builds on the techniques developed in this line of work to apply uniform convergence in the interpolation regime.</p><p>In terms of requirements on the data distribution, <ref type="bibr">Bartlett et al. (2020)</ref> and <ref type="bibr">Tsigler and Bartlett (2020)</ref> only require the feature vector to be sub-Gaussian, but assume a well-specified linear model for the conditional distribution of the label. The uniform convergence-based works also assume a well-specified linear model, but their assumptions are more restrictive in the sense that the marginal distribution of the features needs to be exactly Gaussian, because their proof techniques rely on the Gaussian Minimax Theorem (GMT). Our Moreau envelope theory's application to linear regression significantly relaxes the assumption on the label-generating process, though it is still constrained by the Gaussian data assumption. Shamir (2022) also studies model misspecification in linear regression, but allows non-Gaussian features, and shows that benign overfitting does not necessarily occur in the most general setting, even with a spiked-covariance structure (see his Example 1).</p><p>For linear classification, <ref type="bibr">Muthukumar et al. (2021)</ref> analyze the ℓ2 max-margin classifier by connecting it to minimal-norm interpolation in regression. Similarly, our analysis in the classification case depends on the fact that the squared hinge loss goes through the same transformation as the square loss under smoothing by the Moreau envelope. <ref type="bibr">Donhauser et al. (2022)</ref> prove generalization bounds for ℓp max-margin classifiers in the isotropic setting and do not consider the spiked-covariance case. <ref type="bibr">Deng et al. (2021)</ref>, <ref type="bibr">Montanari et al. (2019)</ref>, and <ref type="bibr">Liang and Sur (2020)</ref> derive exact expressions for the asymptotic prediction risk of the ℓ2 and ℓp (with p ∈ [1, 2)) max-margin classifiers. 
Though their proof techniques also rely on the GMT, our approaches are drastically different. We use the GMT to show uniform convergence for a class of predictors and establish a non-asymptotic bound, whereas their results are asymptotic and assume a proportional scaling limit. This is a key distinction, because overfitting usually cannot be benign under proportional scaling (e.g. <ref type="bibr">Donhauser et al. 2022, Proposition 1)</ref>. Similar lower bounds have also been shown in the context of linear regression <ref type="bibr">(Muthukumar et al. 2020;</ref><ref type="bibr">G. Wang et al. 2021;</ref><ref type="bibr">Zhou et al. 2020)</ref>. Some concurrent works have obtained consistency results for max-margin classification in the spiked covariance setting. In particular, the May 2022 version of the work by Shamir (2022) also studies convergence to the minimum of the squared hinge loss, and obtains consistency under conditions similar to the benign covariance condition of <ref type="bibr">Bartlett et al. (2020)</ref>. During preparation of this manuscript we learned of concurrent work by <ref type="bibr">Montanari et al.</ref>, not yet publicly available, which also studies consistency results for classification. Comparing our Corollary 3 to <ref type="bibr">Shamir (2022)</ref>, their result applies to some non-Gaussian settings, but in the Gaussian setting their result is not as general as ours. (Combining Assumptions 1 and 2 of Theorem 7 there, they require the norm of the data to be bounded, whereas our Corollary 3 applies even if o(n) eigenvalues of Σ grow arbitrarily quickly with n.) More conceptually, our result follows from a norm-based generalization bound that applies to all predictors and outside of the "benign overfitting" conditions, generalizing the result of <ref type="bibr">Koehler et al. (2021)</ref>, unlike the analyses of prior work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Formulation</head><p>GLM setting. Given a continuous loss function f : R × R → 
R and i.i.d. sample pairs (x_i, y_i) from some data distribution D, we can learn a linear model (ŵ, b) by minimizing the empirical loss L̂_f, with the goal of achieving small population loss L_f:</p><p>Multi-index model. We assume that the data distribution D over (x, y) is such that 1. x ∼ N(0, Σ) is a centered Gaussian with unknown covariance matrix Σ.</p><p>2. There are unknown weight vectors w*_1, ..., w*_k ∈ R^d such that the Σ^{1/2} w*_i are orthonormal, a function g : R^{k+1} → R, and a random variable ξ ∼ D_ξ independent of x (not necessarily Gaussian) such that</p><p>We can assume that the distribution of x is centered without loss of generality, since the presence of a mean term simply corresponds to changing the bias term b:</p><p>We can also assume that Σ^{1/2} w*_1, ..., Σ^{1/2} w*_k are orthonormal without loss of generality, since we have not imposed any assumption on the link function g. The multi-index model includes well-specified linear regression, by setting k = 1 and g(η, ξ) = η + ξ. It also allows nonlinear trends and heteroskedasticity (such as the model in Figure <ref type="figure">1</ref>) by changing the definition of g. Since g need not be continuous, the label y can be binary, as in linear classification.</p></div>
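As a concrete illustration of the multi-index model above, here is a minimal NumPy sketch that draws (x, y) pairs. The particular covariance, index vector, and link g below are our own illustrative choices (a heteroskedastic, nonlinear link), not an experiment from the paper:

```python
import numpy as np

def sample_multi_index(n, d, Sigma_sqrt, W_star, g, rng):
    """Draw (x, y) from the multi-index model (2):
    x ~ N(0, Sigma) and y = g(<w*_1, x>, ..., <w*_k, x>, xi)."""
    X = rng.standard_normal((n, d)) @ Sigma_sqrt.T   # x = Sigma^{1/2} z
    xi = rng.standard_normal(n)                      # noise independent of x
    return X, g(X @ W_star.T, xi)

rng = np.random.default_rng(0)
n, d, k = 10_000, 50, 1
Sigma_sqrt = np.eye(d)                    # isotropic example, so Sigma = I
W_star = np.zeros((k, d))
W_star[0, 0] = 1.0                        # Sigma^{1/2} w*_i orthonormal
# illustrative nonlinear, heteroskedastic link: y = eta + |eta| * xi
g = lambda H, xi: H[:, 0] + np.abs(H[:, 0]) * xi
X, y = sample_multi_index(n, d, Sigma_sqrt, W_star, g, rng)
assert X.shape == (n, d) and y.shape == (n,)
```

Swapping in a binary-valued g recovers the classification setting, since g need not be continuous.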
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Moreau Envelope Generalization Theory</head><p>Our theory vitally depends on the Moreau envelope, defined as follows. Definition 1. The Moreau envelope of f : R × R → R with parameter λ ∈ R+ is defined as the function f_λ : R × R → R given by</p><p>f_λ(ŷ, y) := min_{u ∈ R} [f(u, y) + λ(u − ŷ)²]. (3)</p><p>The Moreau envelope can be viewed as a smooth approximation to the original function f: in our parameterization, smaller λ corresponds to more smoothing. The map that outputs the minimizer u, known as the proximal operator, plays an important role in convex analysis <ref type="bibr">(Parikh and Boyd 2014;</ref><ref type="bibr">Bauschke, Combettes, et al. 2011)</ref>.</p><p>Our general theory, as stated in Theorem 1 below, essentially upper bounds the generalization gap between the population Moreau envelope loss L_{f_λ} and the original training loss L̂_f by the sum of two parts: a parametric component that can be controlled by the dimension k of the "meaningful" part of x, and a non-parametric component that can be controlled by a dimension-free complexity measure such as the Euclidean norm of the predictor. Typically, the first term is negligible since k is small, and the complexity of fitting all the noise is absorbed into the second term. More precisely, we introduce the following definitions to formalize separating out a low-dimensional component: Definition 2. Under the model assumptions (2), define a (possibly oblique) projection matrix Q onto the space orthogonal to w*_1, ..., w*_k and a corresponding low-dimensional mapping Φ, as in (4).</p><p>We let Σ⊥ = QᵀΣQ denote the covariance matrix of Qᵀx. We also define a low-dimensional surrogate distribution D̃ over R^{k+1} × R by</p><p>x̃ ∼ N(0, I_{k+1}), ξ ∼ D_ξ, and ỹ = g(x̃_1, ..., x̃_k, ξ). (5)</p><p>This surrogate distribution compresses the "meaningful part" of x while maintaining the test loss, as shown by our main result, Theorem 1 (proved in Appendix D). 
Note that, as a non-asymptotic statement, the functions ε_{λ,δ} and C_δ only need to hold for a specific choice of n and D.</p><p>Theorem 1. Suppose λ ∈ R+ is such that for any δ ∈ (0, 1), there exists a continuous function ε_{λ,δ} : R^{k+1} → R such that with probability at least 1 − δ/4 over independent draws (x̃_i, ỹ_i) from the surrogate distribution D̃ defined in (5), we have uniformly over all (w̃, b) ∈ R^{k+2} that</p><p>Further, assume that for any δ ∈ (0, 1), there exists a continuous function C_δ : R^d → [0, ∞] such that with probability at least 1 − δ/4 over x ∼ N(0, Σ), uniformly over all w ∈ R^d,</p><p>Then it holds with probability at least 1 − δ that uniformly over all (w, b) ∈ R^{d+1}, we have</p><p>If we additionally assume that (6) holds uniformly for all λ ∈ R+, then (8) does as well.</p><p>As we will see, we can generally bound the difference between L_f and L_{f_λ} when the loss is assumed to be Lipschitz. If f is not Lipschitz but smooth (i.e., ∇f is Lipschitz, as for the squared loss), we can always write it as the Moreau envelope of another function f̃. In the special case of the square loss or the squared hinge loss, the Moreau envelope f_λ is proportional to f, meaning that (8) becomes a generalization guarantee in terms of L_f. Optimizing over λ will establish optimal bounds that recover the results of <ref type="bibr">Koehler et al. (2021)</ref> and <ref type="bibr">Zhou et al. (2021)</ref>, and lead to other novel results.</p><p>Remark 1. The complexity functional C_δ(w) should be thought of as a localized, high-probability version of Rademacher complexity. This is because the Gaussian width of a convex set K, E sup_{w ∈ K} ⟨w, x⟩, is the same as the Rademacher complexity of the class of linear functions {x ↦ ⟨w, x⟩ : w ∈ K} <ref type="bibr">(Zhou et al. 2021</ref>, Proposition 1). A somewhat similar complexity functional appears in <ref type="bibr">Panchenko (2003)</ref>. Also, note that (6) requires only one-sided concentration; see Remark 3.</p></div>
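Definition 1 can be sanity-checked numerically. The sketch below is our own helper (SciPy's scalar minimizer stands in for the exact minimization, under the λ(u − ŷ)² parameterization of Definition 1); it verifies the square-loss proportionality that drives Section 5.1:

```python
from scipy.optimize import minimize_scalar

def moreau(f, yhat, y, lam):
    """Moreau envelope f_lam(yhat, y) = min_u [ f(u, y) + lam * (u - yhat)**2 ],
    in the parameterization of Definition 1 (smaller lam = more smoothing)."""
    return minimize_scalar(lambda u: f(u, y) + lam * (u - yhat) ** 2).fun

# Square loss: the envelope is proportional to the loss itself,
# f_lam = (lam / (1 + lam)) * f, the key identity behind Corollary 2.
sq = lambda u, y: (u - y) ** 2
for lam in [0.1, 0.7, 5.0]:
    for yhat, y in [(2.0, -1.0), (0.3, 1.5)]:
        assert abs(moreau(sq, yhat, y, lam)
                   - lam / (1 + lam) * sq(yhat, y)) < 1e-6
```

As λ grows, the factor λ/(1 + λ) approaches 1 and the envelope approaches the original loss, matching the "less smoothing" intuition.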
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">VC Theory for Low-dimensional Concentration</head><p>To apply our generalization result (Theorem 1), we should check the low-dimensional concentration assumption (6). The quantitative bounds in the low-dimensional concentration (i.e., the precise form of the error term ε_{λ,δ}) will inevitably depend on the exact setting we consider (see, e.g., the applications in Section 5).</p><p>The assumption (9) is standard (indeed, this is the setting primarily focused on in <ref type="bibr">Vapnik 1982)</ref> and is sometimes referred to as hypercontractivity or norm equivalence in the literature; a variant of the result holds with 4 replaced by 1 + ε. In many settings of interest, this can be directly checked using the fact that x is Gaussian (for instance, see Theorem 9 and Appendix E). The last conclusion (uniformity over λ) follows by going through the proof of Theorem 2, since it is based on a reduction to uniform control of indicators. In every situation we will consider, it is easy to check that the VC dimension h in the theorem statement is O(k), generally by reducing to the fact that halfspaces in R^k have VC dimension k + 1.</p></div>
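The norm-equivalence constant in assumption (9) can be illustrated with a quick Monte Carlo estimate of the fourth-to-second moment ratio; the specific polynomials below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(1_000_000)

def tau(v):
    # the constant in the norm-equivalence assumption: E[v^4] <= tau * (E[v^2])^2
    return float(np.mean(v ** 4) / np.mean(v ** 2) ** 2)

# Linear function of a Gaussian: E[z^4] = 3 (E[z^2])^2, so tau = 3.
tau_lin = tau(z)
# Degree-2 polynomial of a Gaussian (Hermite polynomial z^2 - 1): tau = 15,
# still a dimension-free constant, as guaranteed by Gaussian hypercontractivity.
tau_quad = tau(z ** 2 - 1)
assert 2.9 < tau_lin < 3.1
assert 12.0 < tau_quad < 18.0
```

For bounded-degree polynomials of Gaussians the ratio stays bounded by a constant depending only on the degree, which is the form in which (9) is used later.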
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Applications</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Linear Regression with Square Loss</head><p>In this section, we show how to recover optimistic rates <ref type="bibr">(Zhou et al. 2021)</ref> for linear regression without assuming the model is well-specified. We will consider the square loss, f(ŷ, y) = (ŷ − y)². A key property of the square loss is that its Moreau envelope is proportional to the loss itself:</p><p>f_λ(ŷ, y) = (λ/(1 + λ)) f(ŷ, y).</p><p>Thus we can multiply by (1 + λ)/λ in our generalization bound and solve for the optimal choice of λ.</p><p>Corollary 2. Suppose f is the square loss and the surrogate distribution D̃ satisfies assumption (9) uniformly over (w̃, b) ∈ R^{k+1}; then with probability at least 1 − δ, uniformly over all w, b we have</p><p>As mentioned earlier, assumption (9) usually holds under mild conditions on y. For example, τ can be chosen to be a constant when y is a bounded-degree polynomial of a Gaussian, due to Gaussian hypercontractivity (O'Donnell 2014, Section 11.1). Specializing Corollary 2 to interpolators (L̂_f = 0) recovers the uniform convergence of interpolators guarantee from <ref type="bibr">Koehler et al. (2021)</ref>.</p><p>Combined with a more general norm analysis in Section 6, we establish ℓ2 benign overfitting with misspecification. In the well-specified case, see <ref type="bibr">Zhou et al. (2021)</ref> for detailed examples on ordinary least squares, ridge regression, and LASSO.</p></div>
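To see the norm-based bound in action, a small simulation in the spirit of the junk-features setting can compare the test risk of the minimum-ℓ2-norm interpolator against the expression (√L̂ + √(‖w‖² Tr(Σ⊥)/n))². All constants below are our own illustrative choices, and we only check agreement up to a constant factor, since the low-dimensional error term is dropped:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_junk, sigma = 100, 2000, 0.5
# Junk-features covariance: one signal direction of variance 1 plus
# d_junk junk directions of total variance Tr(Sigma_perp) = 1.
X = np.hstack([rng.standard_normal((n, 1)),
               rng.standard_normal((n, d_junk)) / np.sqrt(d_junk)])
y = X[:, 0] + sigma * rng.standard_normal(n)

w_hat = np.linalg.pinv(X) @ y             # minimum-l2-norm interpolator
train_loss = float(np.mean((X @ w_hat - y) ** 2))

# Population square loss, in closed form for this diagonal covariance:
test_loss = (w_hat[0] - 1) ** 2 + np.sum(w_hat[1:] ** 2) / d_junk + sigma ** 2
# Norm-based risk expression (sqrt(train) + sqrt(||w||^2 Tr(Sigma_perp)/n))^2,
# ignoring the low-dimensional error term.
tr_perp = 1.0
bound = (np.sqrt(train_loss) + np.sqrt(np.sum(w_hat ** 2) * tr_perp / n)) ** 2

assert train_loss < 1e-6                           # exact interpolation
assert test_loss < 3 * bound and bound < 3 * test_loss   # bound tracks the risk
```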
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Classification with Squared Hinge Loss</head><p>In this section, we show a novel optimistic rate bound for max-margin linear classification with the squared hinge loss, f(ŷ, y) = max(0, 1 − yŷ)². Its Moreau envelope is given by</p><p>f_λ(ŷ, y) = (λ/(1 + λ)) max(0, 1 − yŷ)².</p><p>We consider the case of a general binary response y valued in {±1} satisfying the model assumptions in equation (<ref type="formula">2</ref>). In this case as well, f is proportional to its Moreau envelope; thus, the same proof as for the squared loss shows that Corollary 2 continues to hold when the square loss is replaced by the squared hinge loss! In Appendix E.3.1, we discuss certain settings (including noisy settings) where minimizing the squared hinge loss also minimizes the zero-one loss, i.e., the misclassification rate.</p><p>(Caption of Figure 1.) For regression, the number of leading eigenvalues is k = 3, and the label is generated according to y = 1.5 x_1 + |x_1| cos(x_2) + x_3 · N(0, 0.5). For classification, the number of leading eigenvalues is k = 1 and Pr(y = 1 | x) = sigmoid(5x_1 + 3). The risk bound is calculated using the expression (√L̂ + √(‖w‖₂² Tr(Σ⊥)/n))², which corresponds to the choice of C_δ(w) from Lemma 1, and is expected to be tight in the junk features setting. In the isotropic case, we use an easy improvement of the bound where w is projected onto the orthogonal complement of the Bayes predictor w* <ref type="bibr">(Zhou et al. 2021</ref>). In the other cases, we use covariance splitting without projection of w. Each point on the curve is the average over repeated experiments, and shaded areas correspond to 95% bootstrap confidence intervals.</p></div>
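The proportionality claimed above is easy to verify numerically. The snippet below is our own check, reusing the envelope computation from Definition 1; it confirms f_λ = (λ/(1 + λ)) f at several margin values, including the zero-loss region:

```python
from scipy.optimize import minimize_scalar

sq_hinge = lambda u, y: max(0.0, 1.0 - y * u) ** 2

def moreau(f, yhat, y, lam):
    # f_lam(yhat, y) = min_u [ f(u, y) + lam * (u - yhat)**2 ]
    return minimize_scalar(lambda u: f(u, y) + lam * (u - yhat) ** 2).fun

# The squared hinge behaves like the square loss under smoothing:
# its Moreau envelope is again (lam / (1 + lam)) times the loss itself.
for lam in [0.3, 2.0]:
    for yhat, y in [(0.2, 1.0), (-1.5, 1.0), (0.7, -1.0), (2.0, 1.0)]:
        assert abs(moreau(sq_hinge, yhat, y, lam)
                   - lam / (1 + lam) * sq_hinge(yhat, y)) < 1e-6
```

The last test point (ŷ = 2, y = 1) has margin above 1, so both the loss and its envelope vanish there, which is why the identity survives the kink at yŷ = 1.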
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Further applications</head><p>In this section, we discuss further interesting examples where our general theory applies. As before, we can obtain a generalization bound by appealing to the general Corollary 1: we omit stating the formal corollary for each case, and simply show the Moreau envelope and its consequences.</p><p>L1 loss (LAD) and hinge. For the L1 loss f(ŷ, y) = |ŷ − y|, the Moreau envelope f_λ is 2λ times the Huber loss with parameter 1/(2λ). Therefore, the population Huber loss is controlled by the empirical L1 loss. Clearly, interpolators have zero training error under both the L1 and L2 training losses. We already saw that Corollary 2 implies (1 − o(1)) E(⟨w, x⟩ + b − y)² ≤ C_δ(w)²/n. Now, considering Corollary 1 with f the L1 loss and using the above Moreau envelope calculation, we can check that as λ → 0 we reproduce the exact same bound, since the Huber loss becomes the squared loss in the limit. Further insight into this phenomenon appears later in Theorem 4: the Huber loss naturally shows up when computing the training error of the LAD estimator. An entirely analogous situation occurs when f is the hinge loss f(ŷ, y) := max(0, 1 − ŷy): its Moreau envelope f_λ will be a rescaling of the Huber hinge loss (cf. T. Zhang 2004a).</p><p>Lipschitz loss: improved contraction. If f is M-Lipschitz in ŷ, Proposition 3.4 of Planiden and X. <ref type="bibr">Wang (2019)</ref> gives that 0 ≤ f − f_λ ≤ M²/(4λ). Thus, assuming k/n = o(1), Corollary 1 implies a bound with an additional additive M²/(4λ) term; optimizing over λ yields a bound whose second term scales as M C_δ(w)/√n. For w ∈ K, a high-probability upper bound on the Rademacher complexity of the function class x ↦ ⟨w, x⟩ upper bounds the second term (Remark 1). In comparison, the standard symmetrization and contraction argument <ref type="bibr">(Bartlett and Mendelson 2002</ref>) loses a factor of two. 
Note that if f is the L1 loss, this applies with M = 1, and it can further be shown that the constant factor cannot be improved (see Appendix E.4); this bound is, however, not as tight as the version with the Huber test loss.</p><p>Uniform Convergence of Interpolators for Smooth Losses. Suppose that f is an H-smooth (in the sense that |∂²f/∂ŷ²| ≤ H) and convex function of ŷ. In addition, assume that the minimum of f(ŷ, y) is zero for any fixed y. Since f is H-smooth, there exists a function f̃ such that f = f̃_{H/2} <ref type="bibr">(Bauschke, Combettes, et al. 2011, Corollary 18.18)</ref>. If w satisfies L̂_f(w) = 0, then L̂_{f̃}(w) = 0 as well.<ref type="foot">foot_0</ref> Finally, by Corollary 2, if k/n = o(1) we have uniformly over all w such that L̂_f(w) = 0 that</p><p>which generalizes the main result of <ref type="bibr">Koehler et al. (2021)</ref> to arbitrary smooth losses.</p></div>
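Both L1-loss facts in this subsection are easy to verify numerically; the sketch below is our own check (`huber` is the textbook Huber loss, our assumption about the intended parameterization), confirming the envelope identity and the Lipschitz gap bound with M = 1:

```python
from scipy.optimize import minimize_scalar

l1 = lambda u, y: abs(u - y)

def moreau(f, yhat, y, lam):
    # f_lam(yhat, y) = min_u [ f(u, y) + lam * (u - yhat)**2 ]
    return minimize_scalar(lambda u: f(u, y) + lam * (u - yhat) ** 2).fun

def huber(r, delta):
    # Huber loss: r^2/2 inside [-delta, delta], linear with slope delta outside
    return r * r / 2 if abs(r) <= delta else delta * (abs(r) - delta / 2)

lam = 0.8
for r in [0.1, 0.5, 2.0, -3.0]:
    env = moreau(l1, r, 0.0, lam)
    # envelope of the L1 loss = 2*lam times the Huber loss with parameter 1/(2*lam)
    assert abs(env - 2 * lam * huber(r, 1 / (2 * lam))) < 1e-6
    # Lipschitz gap bound (M = 1): 0 <= f - f_lam <= M**2 / (4 * lam)
    assert -1e-9 <= l1(r, 0.0) - env <= 1 / (4 * lam) + 1e-9
```

The gap bound is attained at |r| above the Huber threshold, which is the sense in which the constant cannot be improved.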
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">ℓ2 Benign Overfitting</head><p>We can obtain conditions for consistency, like those of <ref type="bibr">Bartlett et al. (2020)</ref>, by simply combining a norm-based uniform generalization bound with an upper bound on the norm of interpolators. <ref type="bibr">Koehler et al. (2021)</ref> used the same strategy, but our more powerful and general tools give better results:</p><p>1. For squared loss regression, we show that the assumption that the ground truth is generated by a linear model (i.e., well-specified), made in previous work, is not required. The same result holds under the much more general model<ref type="foot">foot_1</ref> assumption of Section 3.</p><p>2. We show an analogous result in the classification setting, replacing the squared loss with the squared hinge loss. In fact, the argument is exactly the same in the two cases: all we need is that f = ((1 + λ)/λ) f_λ and that f is the square of a Lipschitz function.</p><p>First, the following lemma (essentially just the standard Rademacher complexity bound for the ℓ2 ball) combines with Corollary 2 and its squared hinge loss version to give generalization bounds. These bounds are demonstrated in Figure <ref type="figure">1</ref> and <ref type="figure">Appendix B</ref>.</p><p>Lemma 1. In the setting of Theorem 1, letting Σ⊥ = QᵀΣQ, the following C_δ(w) will satisfy (7):</p><p>Next, we provide a sufficient condition for a zero-training-error predictor w to exist in an ℓ2 ball. In the case of classification, this allows us to lower-bound the margin of the max-margin halfspace. Lemma 2. Suppose that f(ŷ, y) is either the squared loss or the squared hinge loss. Let (w♯, b♯) ∈ R^{d+1} be an arbitrary vector satisfying Qw♯ = 0 and, with probability at least 1 − δ/4,</p><p>for some ρ_1(w♯, b♯) > 0. 
Then for any ρ_2 ∈ (0, 1), provided</p><p>we have with probability at least 1 − δ that min_{‖w‖ ≤ B} L̂_f(w, b♯) = 0 for B > 0 defined by</p><p>We note that for any vector w♯, we have</p><p>by Jensen's inequality over Qᵀx, so the assumption Qw♯ = 0 in the lemma is always satisfied by the minimizer of L_f(w, b).</p><p>Combining the norm bound of Lemma 2 and the generalization bound of Lemma 1 yields the following,</p><p>where ρ_3 > 0 is defined by</p><p>and we recall ρ_1(w♯, b♯) from (13).</p><p>We now show that this formally implies convergence to the optimal test loss (i.e., consistency) under the ℓ2 benign overfitting conditions (15) from <ref type="bibr">Bartlett et al. (2020)</ref> and Tsigler and Bartlett (2020): Corollary 3. Suppose that D_n is a sequence of data distributions following our model assumptions (2), with k_n such that y = g(η_1, . . . , η_{k_n}, ξ), and projection operator Q_n defined as in (4). Suppose f is either the squared loss or the squared hinge loss, and define</p><p>where L_f(w, b; D_n) denotes the population loss over distribution D_n with loss f. Suppose that the hypercontractivity assumption (9) holds with some fixed τ > 0 for all D_n. Define Σ_n := E_{D_n}[xxᵀ] and Σ⊥_n := Q_nᵀ Σ_n Q_n.</p><p>Then we have the following convergence in probability, as n → ∞:</p><p>where (ŵ_n, b̂_n) = argmin_{w ∈ R^d, b ∈ R : L̂_f(w,b)=0} ‖w‖_2 is the minimum-norm interpolator, and L̂_{f,n} is the training error based on n i.i.d. samples from the distribution D_n.</p><p>Note that when applying Corollary 3, we have the flexibility to increase k_n and shrink Σ⊥_n by choosing additional weights w*_i and letting the link function g ignore the extra components. Remark 2 (Flatness of the test loss along the regularization path). 
Our method can easily show a slightly stronger statement: let (ŵ_n, b̂_n) ∈ arg min_{‖w‖ ≤ B_n, b ∈ ℝ} L̂_{f,n}(w, b), where, if there are multiple minima, we pick the one with smallest ‖w‖. As long as B_n ≥ ‖w♮_n‖, we still have (<ref type="formula">16</ref>), and this is established uniformly over all sequences B_n satisfying the constraint. Therefore, under the benign overfitting conditions, we get consistency as long as we do not over-regularize the predictor. See Figure <ref type="figure">1</ref> for an experimental demonstration of this flatness.</p></div>
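The identity behind point 2 of this section, that both the squared loss and the squared hinge loss satisfy f_µ = f/(1 + µ), can be checked numerically. The sketch below is illustrative only: it assumes the envelope parameterization f_µ(ŷ, y) = min_u [f(u, y) + (u − ŷ)²/µ]; conventions for the quadratic penalty differ by constant factors across the literature.

```python
import numpy as np

def moreau_envelope(f, yhat, y, mu, grid):
    """Brute-force Moreau envelope: min over u of f(u, y) + (u - yhat)^2 / mu.

    NOTE: this parameterization is an assumption for illustration; other
    constant conventions (e.g. 1/(2*mu)) change the scaling factor.
    """
    return np.min(f(grid, y) + (grid - yhat) ** 2 / mu)

def sq_loss(u, y):            # squared loss
    return (u - y) ** 2

def sq_hinge(u, y):           # squared hinge loss, labels y in {-1, +1}
    return np.maximum(0.0, 1.0 - u * y) ** 2

grid = np.linspace(-20.0, 20.0, 400001)   # fine grid for the inner minimization
mu = 0.7

# For both losses, the envelope equals f / (1 + mu) pointwise.
for yhat, y in [(2.0, 0.5), (-1.0, 3.0)]:
    env = moreau_envelope(sq_loss, yhat, y, mu, grid)
    assert abs(env - sq_loss(yhat, y) / (1.0 + mu)) < 1e-6

for yhat, y in [(0.3, 1.0), (-0.5, -1.0), (2.0, 1.0)]:
    env = moreau_envelope(sq_hinge, yhat, y, mu, grid)
    assert abs(env - sq_hinge(yhat, y) / (1.0 + mu)) < 1e-6
```

The last squared-hinge case, with ŷy ≥ 1, also illustrates that the envelope, like the loss itself, vanishes on points classified beyond the margin.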
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Training Error and Local Gaussian Width</head><p>Theorem 1 shows how to upper-bound the test error of a predictor (under the Moreau envelope loss) by its training error and an upper bound on the class complexity. The following theorem is the dual result, which upper-bounds the training error of the constrained ERM (empirical risk minimizer) by the Moreau envelope and a complexity term. In particular, this general result is used to derive the norm bound for interpolators in Lemma 2 above. Theorem 4. Let K, B be bounded convex sets, and let f(ŷ, y) be convex in ŷ. Suppose that τ is such that, with probability at least 1 − δ over (x_i, ỹ_i)_{i=1}^n sampled i.i.d. from D̃, we have</p><p>Then with probability at least 1 − 2δ, min_{w ∈ K, b ∈ B} L̂_f(w, b) ≤ τ.</p><p>Note that assumption (17) implicitly encodes a low-dimensional concentration assumption: we expect (1/n) Σ_{i=1}^n f(⟨w̄, x_i⟩ + b̄, ỹ_i) to be approximately the test loss of (w̄, b̄) under the surrogate distribution D̃. As we discuss further in Appendix F, combining this training error bound with the correct choice of C(w) in Theorem 1, which is essentially C(w) = (E max_{w′ ∈ B1(w) ∩ K} ⟨x, Qw′⟩²)^{1/2}, yields a matching lower bound to (8) on the Moreau envelope test loss, and so our generalization bound is asymptotically sharp. This establishes a non-asymptotic analogue of the existing asymptotic Moreau envelope theory (see Section 2), and recovers the special case of well-specified linear models <ref type="bibr">(Zhou et al. 2021)</ref>.</p></div>
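The central object that these norm and training-error bounds control in the ℓ2 case is the minimum-norm interpolator. The following toy simulation (with hypothetical parameters n, d, spike_std, tail_std, noise_std chosen purely for illustration, not the paper's construction) computes it via the pseudoinverse under a spiked covariance, confirms exact interpolation, and evaluates its population squared loss in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                            # heavily overparameterized: d >> n
spike_std, tail_std, noise_std = 3.0, 0.1, 0.1

# Gaussian features: one high-variance "spike" direction plus a flat
# high-dimensional tail (a toy stand-in for a benign covariance).
X = tail_std * rng.standard_normal((n, d))
X[:, 0] = spike_std * rng.standard_normal(n)
w_true = 0.5                              # signal lives on the spike coordinate
y = w_true * X[:, 0] + noise_std * rng.standard_normal(n)

# Minimum-l2-norm interpolator: w_hat = X^+ y (pseudoinverse solution).
w_hat = np.linalg.pinv(X) @ y
train_err = np.mean((X @ w_hat - y) ** 2)

# Population squared loss, exact under this generative model:
# E[(<w_hat, x> - y)^2] = spike_var*(w_hat[0] - w_true)^2
#                         + tail_var*||w_hat[1:]||^2 + noise_var
test_err = (spike_std**2 * (w_hat[0] - w_true) ** 2
            + tail_std**2 * np.sum(w_hat[1:] ** 2)
            + noise_std**2)

assert train_err < 1e-12   # exact interpolation (X has full row rank a.s.)
assert test_err < 0.1      # well below the trivial predictor's loss
```

Under these toy settings the interpolator's test loss sits near the noise floor noise_std² = 0.01, the qualitative behavior that the consistency results of the previous section establish rigorously under the benign overfitting conditions.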
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Discussion</head><p>In this work, we significantly extend the localized uniform convergence technique developed in the study of noisy interpolation to arbitrary loss functions and label-generating processes under mild conditions. Though the application of the Moreau envelope to the study of GLMs is not new in the statistics literature, our general theory establishes novel non-asymptotic generalization bounds in a wide variety of overparameterized settings. We believe the generality of our framework may allow further applications in other areas of statistics, such as robust statistics and high-dimensional inference.</p><p>As mentioned in Section 2, the applicability of our theory is still considerably limited by the Gaussian data assumption, which is required by our use of the Gaussian minimax theorem. It does appear experimentally that our theory may hold much more broadly (Appendix B); proving that this is the case could allow us to study kernel methods and bring us closer to a theoretical understanding of deep neural networks. Some work has been done in related settings to extend Gaussian-based results to broader distributions via universality arguments (e.g. <ref type="bibr">Hu and Lu 2022;</ref> <ref type="bibr">Liang and Sur 2020;</ref> <ref type="bibr">Montanari and Saeed 2022</ref>), but it is not yet clear how to apply those techniques to our general framework. The GMT formulation also does not allow for multi-class classification or two-layer networks, because of their vector-valued outputs. Overcoming these two challenges seems to be a crucial avenue for future work.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Since f is nonnegative, f_µ ≤ f is nonnegative as well. When f_µ(ŷ, y) = 0, there must be u_ε such that f(u_ε, y) + (u_ε − ŷ)²/µ</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>&lt; ε; this is only possible if u_ε → ŷ, implying, since f is smooth, that f(ŷ, y) = 0. If L̂_{f_µ}(w) = 0, we must have for every i that f_µ(ŷ_i, y_i) = 0; thus f(ŷ_i, y_i) = 0 and L̂_f(w) = 0.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Theorem 2 of concurrent work by <ref type="bibr">Shamir (2022)</ref> also proves benign overfitting for some misspecified models, but requires the very strong assumption n²‖Σ^⊥‖_op → 0, which fails to hold in several examples from <ref type="bibr">Bartlett et al. (2020)</ref>.</p></note>
		</body>
		</text>
</TEI>
