<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Robust implicit regularization via weight normalization</title></titleStmt>
			<publicationStmt>
				<publisher>IMA Journal of Numerical Analysis</publisher>
				<date>07/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10663162</idno>
					<idno type="doi">10.1093/imaiai/iaae022</idno>
					<title level='j'>Information and Inference: A Journal of the IMA</title>
<idno>2049-8772</idno>
<biblScope unit="volume">13</biblScope>
<biblScope unit="issue">3</biblScope>					

					<author>Hung-Hsu Chou</author><author>Holger Rauhut</author><author>Rachel Ward</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>Overparameterized models may have many interpolating solutions; implicit regularization refers to the hidden preference of a particular optimization method towards a certain interpolating solution among the many. A by now established line of work has shown that (stochastic) gradient descent tends to have an implicit bias towards low rank and/or sparse solutions when used to train deep linear networks, explaining to some extent why overparameterized neural network models trained by gradient descent tend to have good generalization performance in practice. However, existing theory for square-loss objectives often requires very small initialization of the trainable weights, which is at odds with the larger scale at which weights are initialized in practice for faster convergence and better generalization performance. In this paper, we aim to close this gap by incorporating and analysing gradient flow (continuous-time version of gradient descent) with weight normalization, where the weight vector is reparameterized in terms of polar coordinates, and gradient flow is applied to the polar coordinates. By analysing key invariants of the gradient flow and using Lojasiewicz’s Theorem, we show that weight normalization also has an implicit bias towards sparse solutions in the diagonal linear model, but that in contrast to plain gradient flow, weight normalization enables a robust bias that persists even if the weights are initialized at practically large scale. Experiments suggest that the gains in both convergence speed and robustness of the implicit bias are improved dramatically using weight normalization in overparameterized diagonal linear network models.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The success of learning algorithms trained with stochastic gradient descent (SGD) variants, dating back to the seminal AlexNet architecture <ref type="bibr">(Krizhevsky et al., 2012)</ref>-arguably initiating the 'deep learning revolution <ref type="bibr">'-and empirically verified comprehensively in Sutskever et al. (2013)</ref>, emphasizes the importance of incorporating simple momentum for achieving rapid convergence in neural network learning. Although more complex momentum (or acceleration) algorithms have been proposed <ref type="bibr">(Nesterov, 1983;</ref><ref type="bibr">Bubeck et al., 2015;</ref><ref type="bibr">Liu &amp; Wright, 2015;</ref><ref type="bibr">Van Scoy et al., 2017;</ref><ref type="bibr">Allen-Zhu, 2018;</ref><ref type="bibr">Cyrus et al., 2018;</ref><ref type="bibr">Jain et al., 2018</ref><ref type="bibr">, Jin et al., 2018)</ref> demonstrating faster convergence rates than plain SGD on standard classes of loss functions, Polyak's original simple momentum update <ref type="bibr">(Polyak, 1964)</ref> defies theory by remaining highly effective in practice and remains a popular choice for many applications. Despite several studies analyzing the performance of stochastic momentum methods <ref type="bibr">(Flammarion &amp; Bach, 2015;</ref><ref type="bibr">Loizou &amp; Richt&#225;rik, 2017;</ref><ref type="bibr">Gadat et al., 2018;</ref><ref type="bibr">Kidambi et al., 2018;</ref><ref type="bibr">Yan et al., 2018;</ref><ref type="bibr">Can et al., 2019;</ref><ref type="bibr">Gitman et al., 2019;</ref><ref type="bibr">Liu et al., 2020;</ref><ref type="bibr">Loizou &amp; Richt&#225;rik, 2020;</ref><ref type="bibr">Sebbouh et al., 2021)</ref> a gap persists between existing theoretical guarantees and their superior practical performance. We aim to bridge this gap by analyzing the properties of simple stochastic momentum methods in the context of quadratic optimization.</p><p>Given an n &#215; d matrix A and a length n vector b, the linear least squares problem min</p><p>is one of the most fundamental problems in optimization. One approach to solving (1.1) is Polyak's heavy ball momentum (HBM) <ref type="bibr">(Polyak, 1964)</ref>, also called 'standard' or 'classical' momentum. (HBM) updates the parameter estimate x k for the solution x * as</p><p>where &#945; k and &#946; k are the step-size and momentum parameters, respectively. This is equivalent to the update</p><p>The gradient of the objective (1.1) is easily computed to be &#8711;f (x) = A T (Ax-b), and when A T A has finite condition number &#954; = &#955; max /&#955; min , where &#955; max and &#955; min are the largest and smallest eigenvalues of A T A, (HBM) with properly chosen constant step-size and momentum parameters &#945; k = &#945; and &#946; k = &#946; provably attains the optimal linear rate</p><p>When &#946; k = 0, (HBM) reduces to the standard gradient descent algorithm which, with the optimal choice of step-sizes &#945; k , converges at a sub-optimal linear rate</p><p>On large-scale problems, computing the gradient &#8711;f (x) can be prohibitively expensive. For problems such as (1.1), it is common to replace applications of the gradient with a minibatch stochastic gradient</p><p>where S k contains B indices drawn independently with replacement from {1, 2, . . . , n} where, at each draw, p j is the probability that an index j is chosen. Note that this sampling strategy ensures E[&#8711;f S k (x)] = &#8711;f (x).</p><p>We denote by minibatch-heavy ball momentum (Minibatch-HBM) the following algorithm: starting from initial conditions x 1 = x 0 , iterate until convergence</p><p>( M i n i b a t c h -H B M )</p><p>In the case of (1.1), the minibatch stochastic gradient can be written</p><p>where a T j is the jth row of A and b j is the jth entry of b. A..u/p01o, 1.1 Throughout, we will assume that, for some &#951; &#8805; 1, the sampling probabilities are such that <ref type="bibr">( 1 . 3 )</ref> where &#8226; F represents the matrix Frobenious norm.</p><p>Re/a(2 1.2 If rows are sampled proportionally to their squared norm p j = a j 2 / A 2 F , we have &#951; = 1. If rows are sampled i.i.d. uniformly, p j = 1/n for j = 1, . . . , n, (1.3) is satisfied with &#951; = n max j a j 2 / A 2 F . Applied to the problem (1.1), plain SGD (B = 1, &#946; = 0) with an appropriately chosen step-size (p j = a j 2 / A 2 F ) is equivalent to the randomized Kaczmarz (RK) algorithm if importance weighted sampling (&#951; = 1) is used <ref type="bibr">(Needell et al., 2014)</ref>. The standard version of RK extends the original cyclic Kaczmarz algorithm <ref type="bibr">(Kaczmarz, 1937)</ref> and, as proved in <ref type="bibr">Strohmer &amp; Vershynin (2008)</ref>, converges in a number of iterations scaling with d &#954;, where &#954; = &#955; ave /&#955; min is the smoothed condition number of A T A, and &#955; ave and &#955; min are the average and smallest eigenvalues of A T A, respectively. Note the important relationship between the smoothed and the standard condition numbers:</p><p>This relationship implies that, when n &#8805; d &#8730; &#954;, RK at least matches (up to constants) the performance of (HBM). If &#954; &#954; or n + d &#8730; &#954; then RK can significantly outperform (HBM), at least in terms of total number of row products.</p><p>While the number of row products of RK is reduced compared to (HBM), the number of iterations required to converge is increased. In practice, running times are not necessarily directly associated with the number of row products, and instead depend on other factors such as communication and memory access patterns. These costs often scale with the number of iterations, so it is desirable to understand whether the iteration complexity of (Minibatch-HBM) can be reduced to that of (HBM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Contributions</head><p>We aim to make precise the observation that stochastic momentum affords acceleration in the minibatch setting. An illustration of this phenomenon is provided in Fig. <ref type="figure">1</ref>. Our main theoretical result is a proof Gradient descent with heavy ball momentum (HBM) allows for accelerated convergence over gradient descent (GD). SGD allows for lower per iteration costs, and the use of batching (Minibatch-SGD) reduces the variance of the iterates. While batching and momentum are often used simultaneously (Minibatch-HBM), convergence guarantees have remained elusive, even for quadratic objectives (1.1). In this paper we prove that, on such objectives, (Minibatch-HBM) converges linearly at the same rate as (HBM), provided the batch size is sufficiently large in a precise sense.</p><p>that the linear rate of convergence of (HBM) can be matched by (Minibatch-HBM), provided the batch size is larger than a critical size. Informally, this result can be summarized as follows:</p><p>T#eo(e/ 1.3 Consider (Minibatch-HBM) applied to a strongly convex quadratic objective (1.1) with stochastic gradients (1.2) whose sampling probabilities satisfy (1.3) with parameter &#951; &#8805; 1. Suppose that the minimizer x * satisfies Ax * = b. Then, with the same fixed step-size and momentum parameters &#945; k = &#945;, &#946; k = &#946; as (HBM), there exists a constant C &gt; 0 such that, if &#954; is sufficiently large and B &#8805; C&#951;d log(d) &#954;&#8730; &#954;, the (Minibatch-HBM) iterates converge in expected norm at least at a linear rate 1 -1/ &#8730; &#954;.</p><p>For a more precise statement of Theorem 1.3, see Corollary 3.4.</p><p>Re/a(2 1. <ref type="bibr">4</ref> The bound for B does not depend directly on n, and in many situations B n. In these cases, (Minibatch-HBM) offers a provable speedup over (HBM). Theorem 3.2 provides a more finegrained relationship between the momentum and step-size parameters than Theorem 1.3. In particular, it shows that it is always possible to take B &lt; n and have (Minibatch-HBM) converge similar to (HBM), albeit at the cost of slowed convergence compared to the rate of convergence of (HBM) with the optimal &#945;, &#946;.</p><p>Owing to the equivalence between SGD on convex quadratics and the RK algorithm <ref type="bibr">(Needell et al., 2014)</ref> our convergence guarantees give a provable iteration complexity &#8730; &#954; for RK-type algorithms. Our analysis method is quite general, and can be used to certify fast linear convergence for various forms of momentum beyond heavy ball momentum; in Appendix A we illustrate the generality of the approach by proving an analogous convergence result for a minibatch Nesterov's accelerated gradient method in the setting of linear regression.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Literature review</head><p>In the remainder of this section, we provide an overview of state-of-art existing results for row-sampling methods for solving (consistent) linear squares problems. A summary is given in Table <ref type="table">1</ref>.</p><p>Randomized Kaczmarz. A number of improvements to the standard RK algorithm have been proposed. <ref type="bibr">Liu and Wright (Liu &amp; Wright, 2015)</ref> introduce an accelerated randomized Kaczmarz (ARK) method that, through the use of Nesterov's acceleration, can achieve a faster rate of convergence</p><p>Ta3&amp;e 1 Runtime comparisons for row-sampling iterative methods for solving a consistent linear least squares problem (1.1) to constant accuracy when A &#8712; R n&#215;d for large condition number &#954;. Constants and a logarithmic dependence on the accuracy parameter are suppressed. Here &#954; and &#954; are the regular and smoothed condition numbers of A T A. Due to practical considerations such as parallelization, data movement, caching, energy efficiency, etc., the real-world cost of an iteration does not necessarily scale linearly with the number of row products Algorithm Iterations # row prods/iter. References (HBM) &#8730; &#954; n Hestenes et al. (1952), Polyak (1964), Liesen &amp; Strako4 (2013) SGD/RK d &#954; 1 Strohmer &amp; Vershynin (2008), Needell et al. (2014) ARK d &#8730; &#954; 1 Liu &amp; Wright (2015) (Minibatch-HBM) &#8730; &#954; B d log(d) &#954;&#8730; &#954; (this paper)</p><p>compared to RK. However, their rate is still sub-optimal compared to the rate attained by (HBM). Moreover, ARK is less able to take advantage of potential sparsity in the data matrix A than the standard RK algorithm and (Minibatch-HBM). This issue is partially addressed by a special implementation of ARK for sparse matrices, but is still of concern for particularly sparse matrices.</p><p>Minibatching in the setting of the RK has been studied extensively in 'small' batch regimes <ref type="bibr">(Needell &amp; Tropp, 2014;</ref><ref type="bibr">Needell &amp; Ward, 2016;</ref><ref type="bibr">Moorman et al., 2020)</ref>. These works view minibatching as a way to reduce the variance of iterates and improve on the standard RK algorithm. In general, the bounds for the convergence rates for such algorithms are complicated, but can improve on the convergence rate of RK by up to a factor of B in the best case. This 'best case' improvement, however, can only be attained for small B; indeed, RK reduces to standard gradient descent in the deterministic gradient limit. In contrast to these works, we study minibatching in the RK method as a necessary algorithmic structure for unlocking the fast convergence rate of HBM. Minibatch-HBM. Several recent works provide theoretical convergence guarantees for (Minibatch-HBM). <ref type="bibr">Loizou &amp; Richt&#225;rik (2017</ref><ref type="bibr">, 2020)</ref> show that (Minibatch-HBM) can achieve a linear rate of convergence for solving convex linear regression problems. However, the linear rate they show is slower than the rate of (deterministic) (HBM) in the same setting. <ref type="bibr">Gitman et al. (2019)</ref> establish local convergence guarantees for the (Minibatch-HBM) method for general strongly convex functions and appropriate choice of the parameters &#945; k and &#946; k . <ref type="bibr">Liu et al. (2020)</ref> show that (Minibatch-HBM) converges as fast as SGD for smooth strongly convex and nonconvex functions. Under the assumption that the stochastic gradients have uniformly bounded variance, <ref type="bibr">Can et al. (2019)</ref> provide a number of convergence guarantees for stochastic HBM. In particular, it is shown that the same fast rate of convergence can be attained for full-batch quadratic objectives with bounded additive noise. The results of <ref type="bibr">Can et al. (2019)</ref> however do not apply to the setting of RK, where the variance of the stochastic gradients necessarily grows proportionally to the squared norm of the full gradient.</p><p>Jain <ref type="bibr">et al. (2018)</ref> demonstrate that (Minibatch-HBM) with a batch size of B = 1 provably fails to achieve faster convergence than SGD. They acknowledged that the favorable empirical results of (Minibatch-HBM), such as those found in <ref type="bibr">Sutskever et al. (2013)</ref>, should be seen as an 'artifact' of large minibatching, where the variance of stochastic gradients is sufficiently reduced that the deterministic convergence behavior of (HBM) dominates. In this paper, we aim to precisely quantify this observation by providing a characterization of the minimal batch size required for (Minibatch-HBM) to achieve fast linear convergence comparable to that of (HBM).</p><p>In concurrent work, <ref type="bibr">Lee et al. (2022)</ref> analyze the dynamics of (Minibatch-HBM) applied to quadratic objectives corresponding to a general class of random data matrices. Their results show that when the batch size is sufficiently large, (Minibatch-HBM) converges like its deterministic counterpart, but convergence is necessarily slower for smaller batch sizes. The batch size requirement of <ref type="bibr">Lee et al. (2022)</ref> is a factor of &#954; better than what we obtain (see Theorem 1.3). However, while our analysis makes no assumptions on A, <ref type="bibr">Lee et al. (2022)</ref> requires certain invariance assumptions on the singular vectors of A. It would be interesting to understand whether the extra factor of &#954; in our bound can be improved or whether is a necessary artifact of the lack of assumptions on A. Stochastic Nesterov's Accelerated Gradient (SNAG). Several recent works have analyzed the theoretical convergence properties of stochastic Nesterov's accelerated gradient (SNAG) methods and their variants in both strongly convex and nonconvex settings <ref type="bibr">(Ghadimi &amp; Lan, 2016;</ref><ref type="bibr">Can et al., 2019;</ref><ref type="bibr">Vaswani et al., 2019;</ref><ref type="bibr">Aybat et al., 2020)</ref>. <ref type="bibr">Can et al. (2019) and</ref><ref type="bibr">Aybat et al. (2020)</ref> demonstrate accelerated convergence guarantees of SNAG method variants to a neighborhood of the solution for problems with uniformly bounded noise. Additionally, Ghadimi &amp; Lan (2016) provide convergence guarantees for SNAG variants in nonconvex settings. <ref type="bibr">Vaswani et al. (2019)</ref> show that SNAG methods achieve accelerated convergence rates to the solution for over-parameterized machine learning models under the assumption of the strong gradient growth condition, where the &#8467; 2 norm of the stochastic gradients is assumed to be bounded by the norm of the gradient. Our contributions imply that the strong gradient growth conditions hold in the consistent under-parameterized least squares setting for the stochastic minibatch gradient, and demonstrating that this condition implies acceleration for (Minibatch-HBM), in addition to Nesterov momentum. Acceleration techniques have also been integrated with the variance reduction techniques to achieve optimal convergence rate guarantees for finite-sum problems <ref type="bibr">(Frostig et al., 2015;</ref><ref type="bibr">Defazio, 2016;</ref><ref type="bibr">Allen-Zhu, 2018;</ref><ref type="bibr">Lin et al., 2018;</ref><ref type="bibr">Zhou et al., 2019)</ref>. <ref type="bibr">Ma et al. (2018)</ref> established critical batch size for SGD to retain the convergence rate of deterministic gradient descent method where as in this work we establish the critical batch size for (Minibatch-HBM) to retain the convergence properties of (HBM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Notation</head><p>We denote vectors using lower case roman letters and matrices using upper case roman letters. We use x &#8712; R d to denote the variables of the optimization problem and x * to denote the minimizer of f (x). We use &#8226; to represent the Euclidean norm for vectors and operator norm for matrices and &#8226; F to represent the matrix Frobenious norm. Throughout, A will be an n &#215; d matrix and b a length n vector. The eigenvalues of A T A are denoted &#955; 1 , &#955; 2 , . . . , &#955; d , and we write &#955; max , &#955; min , and &#955; ave for the largest, smallest, and average eigenvalue of A T A. The regular and smoothed condition numbers &#954; and &#954; of A T A are respectively defined as &#954; = &#955; max /&#955; min and &#954; = &#955; ave /&#955; min . All logarithms are natural, and we denote complex numbers by i and Euler's number 2.718 . . . by e.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Preliminaries</head><p>In this section we provide an overview of (HBM) analysis and important statements from random matrix theory that are used in proving the convergence of (Minibatch-HBM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Standard analysis of heavy ball momentum for quadratics</head><p>We review the standard analysis for heavy ball momentum (HBM) in the setting of strongly convex quadratic optimization problems (1.1) (see <ref type="bibr">Recht (2010)</ref>). Here and henceforth, we take &#945; k = &#945; and &#946; k = &#946; as constants.</p><p>First, we re-write the (HBM) updates as</p><p>so, by definition, the (HBM) updates satisfy</p><p>This can be written more concisely as</p><p>where T is the transition matrix taking us from the error vectors at steps k and k -1 to the error vectors at steps k + 1 and k. Repeatedly applying this relation, we find</p><p>from which we obtain the error bound</p><p>The assumption x 1 = x 0 allows us to write</p><p>The difficulty in analyzing (HBM) compared to plain gradient descent (when &#946; = 0) lies in the fact T k need not provide a useful upper bound for T k . Indeed, while T k = T k for symmetric matrices, this is not necessarily the case for nonsymmetric matrices. To get around this issue, it is common to bound the spectral radius &#961;(T) = max j {|&#955; j |} and use Gelfand's formula &#961;(T) = lim k&#8594;&#8734; T k 1/k to derive the rate of convergence <ref type="bibr">(Recht, 2010)</ref>.</p><p>To bound the spectral radius of T, note that T is orthogonally similar to a block diagonal matrix consisting of 2 &#215; 2 components {T j }. Specifically,</p><p>, where</p><p>for each j = 1, 2, . . . , n, U is a certain orthogonal matrix (see Recht ( <ref type="formula">2010</ref>)), and {&#955; j } are the eigenvalues of A T A. For each j = 1, 2, . . . , n, the eigenvalues z &#177; j of T j are easily computed to be</p><p>and are therefore nonreal if and only if</p><p>In this case, the magnitude of both the eigenvalues of T j is</p><p>Here we have used that</p><p>2 whenever (2.5) holds. Therefore, provided</p><p>we have that</p><p>( 2 . 7 )</p><p>We would like to choose &#8730; &#946; = &#961;(T) as small as possible subject to the condition that (2.6) holds.<ref type="foot">foot_2</ref> Note that (2.6) is equivalent to the condition</p><p>which we can rewrite as</p><p>( 2 . 9 )</p><p>As noted at the start of this section, this gives an asymptotic rate of convergence &#8730; &#946;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">A closer look at the quadratic case</head><p>To derive a bound for stochastic (HBM) at finite k, it is desired to understand the eigendecomposition of the matrix T from (2.4) more carefully. Thus, we might aim to diagonalize T as</p><p>where U is the previously described orthogonal matrix rotating T into block diagonal form (2.4) and D and C are block-diagonal matrices with 2 &#215; 2 blocks {D j } and {C j } where, for each j = 1, . . . , d, T j is diagonalized as</p><p>Given such a factorization, we would have T k = (UC)D k (UC) -1 . Then, since that U is unitary, we would obtain the bound</p><p>is the condition number of the eigenvector matrix UC.</p><p>However, if &#945; and &#946; are chosen as in (2.9), T is defective (that is, does not have a complete basis of eigenvectors) and no such diagonalization exists. <ref type="foot">2</ref> To avoid this issue, we perturb the choices of &#945; and &#946;, and define, for some ( &#8712; (0, &#955; min ), L = &#955; max + ( and &#8467; = &#955; min -( .</p><p>( 2 . 1 1 )</p><p>ensures that (2.6) holds and that T is diagonalizable. Indeed, since we can write z &#177; j = a j &#177; ib j for a j , b j &#8712; R with b j = 0, it is easily verified that the (up to a scaling of the eigenvectors) eigendecomposition</p><p>We clearly have D = max j |z &#177; j | = &#8730; &#946;, so the (HBM) iterates satisfy the convergence guarantee</p><p>( 2 . 1 4 )</p><p>Note that M(&#945;, &#946;), L and &#8467; each depend on &#955; max , &#955; min and ( . The dependency of M(&#945;, &#946;) on these values is through C, which, up to a unitary scaling by U, is the eigenvector matrix of the transition matrix T. We can bound M(&#945;, &#946;) by the following lemma.</p><p>Le//a 2.1 For any ( &#8712; (0, &#955; min ) set &#8467; = &#955; min -( and L = &#955; max + ( and choose &#945; = 4/(</p><p>, where UC is the eigenvector matrix for T. Then,</p><p>&#945; ( (( + &#955; max&#955; min ) .</p><p>Proof. In order to bound M(&#945;, &#946;), we note that the block diagonal structure of C implies that C = max{ C j : j = 1, . . . , d} and C -1 = max{ C -1 j : j = 1, . . . , d}. By construction, the specified values of &#945; and &#946; ensure that (1+&#946; -&#945;&#955; i ) 2 &lt; 4&#946; for all i = 1, 2 . . . , d, i.e. condition (2.5). This implies |z &#177; j | = &#8730; &#946;, so we easily compute</p><p>By direct computation, we find</p><p>To bound C -1 j we first note that the condition (2.5) is also equivalent to</p><p>Since L &#8805; &#955; max we therefore have the bound</p><p>.</p><p>The result follows by combining the above expressions.</p><p>ON THE FAST CONVERGENCE OF MINIBATCH HBM 1407 2.3 Lemmas from nonasymptotic random matrix theory Before we prove our main result, we need to introduce tools from nonasymptotic random matrix theory which are crucial components of the proof. P(opo.101o, 2.2 Consider a finite sequence {W k } of independent random matrices with common dimension d 1 &#215; d 2 . Assume that E[W i ] = 0 and W i &#8804; W for each index i and introduce the random matrix</p><p>Let v(Z) be the matrix variance statistic of the sum:</p><p>Then,</p><p>The bound on E[ Z ] is Theorem <ref type="bibr">6.1.1 in Tropp (2015)</ref>, and the bound on E[ Z 2 ] follows from equation <ref type="bibr">6.1.6 in Tropp (2015)</ref> and the fact E[max i W i 2 ] &#8804; W. Equation <ref type="formula">6</ref>.1.6 in Tropp (2015) comes from applying Theorem A.1 in <ref type="bibr">Chen et al. (2012)</ref> to the Hermitian dilation of Z. Under the stated conditions, the logarithmic dependence on the dimension is necessary <ref type="bibr">(Tropp, 2015)</ref>.</p><p>We will also use a theorem on products of random matrices from Huang et al. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Main results</head><p>We are now prepared to analyze (Minibatch-HBM) applied to strongly convex least squares problems of the form (1.1). We begin by considering the case of consistent linear systems, i.e. systems for which b is in the column span of A. In Section 3.1 we then provide an analogous result for inconsistent least squares problems. We begin with a useful technical lemma that bounds the batch size required to ensure that a certain random matrix is near it's expectation.</p><p>Le//a 3.1 Define W j = B -1 (-p -1 j a j a T j + A T A) and let</p><p>where S is a list of B indices each chosen independently according to (1.3). Then,</p><p>Proof. Since the sampling probabilities satisfy (1.3), a j a T j = a j 2 &#8804; &#951;p j A 2 F . Then, since &#951; &#8805; 1,</p><p>Next, observe that</p><p>Using that E[(p j ) -1 a j a T j ] = A T A and a j 2 &#8804; &#951;p j A 2 F , we find that</p><p>Here we write M 1 M 2 if M 2 -M 1 is positive semi-definite. Note that M 1 M 2 implies the largest eigenvalue of M 2 is greater than the largest eigenvalue of M 1 . Therefore, using that 0 E W T j W j followed by the fact 0 A T A A 2 F I,</p><p>Thus, since W j is symmetric and the samples in S are i.i.d., we obtain a bound for the variance statistic</p><p>Together with (3.1) and (3.2), Theorem 2.2 implies</p><p>The first term is bounded by &#948;/2 when</p><p>whereas the second term is bounded by &#948;/2 when</p><p>The result follows by taking the max of these quantities.</p><p>Our main result is the following theorem.</p><p>T#eo(e/ 3.2 Consider (Minibatch-HBM) applied to a strongly convex quadratic objective (1.1) with stochastic gradients (1.2) whose sampling probabilities satisfy (1.3). Fix parameters &#945; and &#946; satisfying</p><p>Then, for all k &gt; 0, assuming that the minimizer x * satisfies Ax * = b, the (Minibatch-HBM) iterates satisfy</p><p>where M(&#945;, &#946;) is the condition number of eigenvector matrix UC for T defined in (2.10).</p><p>Re/a(2 3.3 For k &#8804; k * , (k * ) k/k * &#8804; max{e, k}. Thus, (Minibatch-HBM) converges at at rate nearly &#8730; &#946; for k &#8804; max{d, k * }. For k &gt; max{d, k * }, the convergence is still linear for k * sufficiently large, but at a rate slower than &#8730; &#946;. Specifically, (Minibatch-HBM) converges at a rate</p><p>Proof. Due to assumption of consistency, we have that Ax * = b. Therefore, we can write the minibatch gradient (1.2) as <ref type="bibr">. 4 )</ref> Define the random matrix</p><p>Then, analogously to (2.2), the (Minibatch-HBM) iterates satisfy the recurrence</p><p>where Y S k is the stochastic transition matrix at iteration k. After k iterations, the error satisfies</p><p>so our goal is to bound the norm of the random matrix</p><p>This is a product of random matrices, so we may hope to apply Proposition 2.3. However, while E[Y S i ] = T, where T is the deterministic transition matrix (2.2), E[Y S i ] is not necessarily bounded by &#8730; &#946;. Thus, to apply Proposition 2.3 we will instead consider</p><p>where</p><p>and U and C are the matrices from (2.10). Then, as desired,</p><p>Thus, if we can guarantee that the variances { E[ X S i -E[X S i ] 2 ]} are not too large, we can apply Proposition 2.3 to obtain a rate similar to (HBM). Towards this end, note that</p><p>where W j is as in Lemma 3.1. This and the fact that <ref type="bibr">(3.6)</ref> where W = j&#8712;S j W j . Using <ref type="bibr">Lemma 3.1 and (3.6)</ref>, we have E X S i -EX S i 2 &#8804; &#948; provided that the batch size B satisfies</p><p>Applying Proposition 2.3 to the product (3.5) with the parameters</p><p>gives the bound</p><p>Set &#948; 2 = &#946; log(k * )/(2k * ) so that 2v = k log(k * )/k * . This gives the desired expression for B. Moreover, we then have</p><p>Thus, we find that</p><p>giving the desired bound for the iterates.</p><p>The expressions for the required batch size in Theorem 3.2 is somewhat complicated, but can be simplified in the large condition number limit.</p><p>Co(o&amp;&amp;a(y 3.4 Fix c &#8712; (0, 2). There exist parameters &#945; and &#946; and a constant C &gt; 0 (depending on c) such that, for all &#954; sufficiently large, the (Minibatch-HBM) iterates converge in expected norm at least at a linear rate 1c/ &#8730; &#954; provided that B &#8805; C&#951;d log(d) &#954;&#8730; &#954;.</p><p>Proof. Suppose ( = c 1 &#955; min for c 1 &#8712; (0, 1). Then, using that the definitions of L and &#8467; from (2.11) imply that L/&#8467; = (&#955; max + c 1 &#955; min )/(&#955; minc 1 &#955; min ) = &#954;/(1c 1 ) + c 1 /(1c 1 ), we have</p><p>we have</p><p>Therefore, if we take c 1 = 1 -((c + 6)/8) 2 and c 2 = (2c)/4 we have that, for &#954; sufficiently large,</p><p>Using Lemma 2.1 we have that</p><p>This, with the fact that &#946; = O(1), implies</p><p>and</p><p>Thus, the bound on B becomes B &#8805; O(&#951;d log(d) &#954;&#8730; &#954;).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Inconsistent least squares problems</head><p>Our results can be extended to inconsistent systems. On such systems, the stochastic gradients at the optimal point x * need not equal zero, even though &#8711;f (x * ) = 0. As a result, stochastic gradient methods will only converge to within a convergence horizon of the minimizer x * .</p><p>Re/a(2 3. T#eo(e/ 3.6 In the setting of Theorem 3.2, define r = Ax * -b and ) = max i |r i |/ a i . Let the batch size B satisfy the conditions in Theorem 3.2. Then, provided k * is chosen so that &#948; = 2 log(k * )/(k * log(1/&#946;)) &lt; 1, the (Minibatch-HBM) iterates satisfy</p><p>Re/a(2 3.7 The term in R containing r scales with 1/ &#8730; B whereas the term containing ) scales with 1/B. Thus, when B is large, the term in R involving ) becomes small relative to the term involving r .</p><p>Re/a(2 3.8 In the large &#954; limit considered in <ref type="bibr">Corollary 3.4</ref></p><p>Proof of Theorem 3.6. Since b = Ax * -r, our stochastic gradients for inconsistent systems satisfy</p><p>where M S k is as in the proof of Theorem 3.2, and define</p><p>We therefore find the iterates satisfy the update formula</p><p>Thus, after k iterations, the error satisfies</p><p>The first term is identical to the case r = 0, so we have</p><p>Here we have used the triangle inequality and definition of operator norm followed by the linearity of expectation and independence of the minibatch draws across iterations.</p><p>As in the proof of Theorem 3.2,</p><p>Therefore, assuming all minibatch draws {S j } are identically distributed, we have</p><p>It remains to bound E[ r S 1 ], and to do so we again turn to the matrix Bernstein inequality. Define the sum</p><p>and note that, with ) = max i |r i |/ a i , and the assumption (1.3) on the sampling probabilities,</p><p>By direct computation we also observe that</p><p>Therefore, applying Theorem 2.2 we obtain the bound</p><p>Combining everything gives the desired result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Numerical results</head><p>We conduct numerical experiments on quadratic objectives (1.1) to illustrate the performance of (Minibatch-HBM). Throughout this section, we use the value</p><p>as a heuristic for the batch size needed to observe linear convergence at a rate similar to that of (HBM). This value is obtained from Theorem 3.2 by making several simplifying assumptions. Specifically, we drop the dependence on M(&#945;, &#946;) which results from a change of basis followed by a return to the original basis (which we believe is likely an artifact of our analysis approach) and replace k * / log(k * ) by 1/ log(1/&#946;) = O( &#8730; &#954;). In all experiments we use n = 10 6 , d = 10 2 and set ( = &#955; min /10 3 . Each experiment is repeated 100 times and the median and 5th to 95th percentile range for each algorithm/parameter choice are plotted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Row-norm and uniform sampling</head><p>In this example, we study the dependence of (Minibatch-HBM) on the sampling probabilities. Our bounds are sharpest when p j &#8733; a j 2 , but in practice it is common to use uniform sampling p j = 1/n to avoid the need for computing row-norms which requires accessing the entire data matrix. We take A = DG, where G is an n &#215; d matrix whose entries are independently 1 with probability 1/10 or 0 with probability 9/10 and D is an n &#215; n diagonal matrix whose diagonal entries are 1 with probability 9/10 and 10 with probability 1/10. Thus, uniform sampling probabilities p j = 1/n satisfy (1.3) provided &#951; &#8805; n max j a j 2 / A 2 F &#8776; 23. We use a planted solution x * with i.i.d. standard normal entries. In Fig. <ref type="figure">2</ref> we show the convergence of (Minibatch-HBM) with row norm sampling and uniform sampling at several values of B. As expected, row norm sampling works better than uniform sampling for a fixed value of B. However, since the norms of rows are not varying too significantly, the convergence rates are still comparable. See <ref type="bibr">Needell et al. (2014)</ref> for a further discussion on sampling probabilities in the context of RK and SGD.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Sensitivity to batch size</head><p>The fact (Minibatch-HBM) exhibits accelerated convergence is an artifact of batching, and we expect different behavior at different batch sizes. In fact, we have already observed this phenomenon on the previous example. In Theorem 3.2, we provide an upper bound on the required batch size depending on spectral properties of A T A such as &#954; = &#955; ave /&#955; min and &#954; = &#955; max /&#955; min . To study whether the stated dependence on such quantities is reasonable, we construct a series of synthetic problems with prescribed spectra.</p><p>F1g. 2. Median and 5th to 95th percentile error norm x k -x * of (Minibatch-HBM) for row norm sampling and uniform sampling at varying values of batch size B. For reference, we also show the convergence of (HBM) and the (HBM) bound (2.14).</p><p>We construct problems A = U&#931;V T by choosing the singular vectors U and V uniformly at random, and selecting singular values {) i } with exponential or algebraic decay. For exponential decay we use the squared singular values<ref type="foot">foot_5</ref> </p><p>and for algebraic decay we use the squared singular values</p><p>In both cases, the condition number of the A T A is &#954; and &#961; determines how fast the singular values of A decay. We again use a planted solution x * with i.i.d. standard normal entries.</p><p>In Figs <ref type="figure">3</ref> and <ref type="figure">4</ref> we report the results of our experiments. Here we consider consistent equations with condition numbers &#954; = 10, 30, 100. For each value of &#954;, we generate two problems according to (4.1) with &#961; = 0.1 and &#961; = 0.8 and two problems according to (4.2) with &#961; = 2 and &#961; = 1. In Fig. <ref type="figure">3</ref> we run (Minibatch-HBM) (row norm sampling) with B = cB * for c = 10 -3 , 10 -2 , 10 -1 , 10 0 and show the convergence as a function of the iterations k. In Fig. <ref type="figure">4</ref> we run (Minibatch-HBM) for a fixed number of iterations, and show the convergence as a function of the batch size B. This empirically illustrates that when the batch size is near B * , the rate of convergence is nearly that of (HBM). computation. <ref type="foot">5</ref> We also note that, while the error in the iterates stagnates at different points, the value of the objective function is quite similar in all cases, nearly matching the residual norm of the true least squares solution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Computational tomography</head><p>One of the most prevalent applications of Kaczmarz-like methods is in tomographic image reconstruction, notably in medical imaging. In X-ray tomography (e.g. CT scans), X-rays are passed through an object and the intensity of resulting X-ray beam is measured. This process is repeated for numerous known orientations of the X-ray beams relative to the object of interest. With a sufficient number of F1g. 5. Median and 5th to 95th percentile error norm and residual norm of (Minibatch-HBM) for varying values of batch size B on an inconsistent problem. For reference, we also show the convergence of (HBM) as well as optimal residual norm. measurements, it is possible to reconstruct a 'slice' of the interior of the object of interest. In theory, this reconstruction involves solving a large, sparse consistent linear system.</p><p>In this example we consider the performance of (Minibatch-HBM) on a tomography problem corresponding to a parallel beam geometry scanner with 128 sensor pixels. Measurements are taken for 360 degrees of rotation at half-degree increments, totaling 720 measurements. The goal is to reconstruct a 64 &#215; 64 pixel image. This results in a measurement matrix of dimensions (720 &#8226; 128) &#215; (64 &#8226; 64) = 92160 &#215; 4096, which we construct using the ASTRA toolbox <ref type="bibr">(van Aarle et al., 2015)</ref>.</p><p>We employ a planted solution of a walnut, which we aim to recover from the resulting measurement data. Uniform sampling is employed due to the roughly similar norms of all rows. The step-size &#945; and momentum parameter &#946; are chosen so that (HBM) converges at a reasonable rate and so that B * = 407 is not too large relative to n. In Fig. <ref type="figure">6</ref> we report the convergence of (HBM) and (Minibatch-HBM), along with the resulting images recovered by the algorithms after k = 500 iterations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>We provided a first analysis of accelerated convergence of minibatch heavy ball momentum method for quadratics using standard choices of the momentum step-sizes. Our proof method involves a refined quantitative analysis of the convergence proof for (deterministic) heavy ball momentum, combined with matrix concentration results for sums and products of independent matrices. Our proof technique is general, and also can be used to verify the accelerated convergence of a minibatch version of Nesterov's acceleration for quadratics, using the constant step-size parameters suggested in Nesterov's original paper. An interesting direction for future work is to combine the simple minibatch momentum algorithm with an adaptive gradient update such as AdaGrad <ref type="bibr">(McMahan &amp; Streeter, 2010;</ref><ref type="bibr">Duchi et al., 2011;</ref><ref type="bibr">Ward et al., 2019;</ref><ref type="bibr">D&#233;fossez et al., 2020)</ref>, to potentially learn the momentum parameters &#945; k and &#946; k adaptively. Such an analysis would also shed light on the convergence of ADAM <ref type="bibr">(Kingma &amp; Adam, 2014)</ref>, an extension of momentum which combines adaptive gradients and momentum in a careful way to achieve state-of-the-art performance across various large-scale optimization problems. Again, T is unitarily similar to a block diagonal matrix whose blocks are</p><p>and it is easy to see the eigenvalues of T j are</p><p>Rather than aiming to optimize the parameters &#945; and &#946; as we did for (HBM), we will simply use the standard choices of parameters suggested in Nesterov ( <ref type="formula">2013</ref>):</p><p>By direct computation, we find 4&#946;</p><p>and therefore that</p><p>Thus, the (NAG) iterates satisfy the convergence guarantee</p><p>where M(&#945;, &#946;) is the eigenvector condition number of the transition matrix T (note that this value is different from the value for (HBM) bounded in Lemma 2.1). We now provide a bound, analogous to Theorem 3.2, for minibatch-NAG, the algorithm obtained by using the stochastic gradients (1.2) in (NAG).</p><p>T#eo(e/ A.1 Set &#8467; = &#955; min , L = &#955; max and define &#945; = 1/L and &#946; = ( &#8730; L/&#8467; -1)/( &#8730; L/&#8467; + 1). For any</p><p>Then, for all k &gt; 0, assuming that the minimizer x * satisfies Ax * = b, the (Minibatch-HBM) iterates satisfy</p><p>The proof of Theorem A.1 is almost the same as the proof identical to the proof of Theorem 3.2, so we skip repeated parts.</p><p>Proof. For NAG we have that Thus, using the submultiplicativity of the operator norm, analogous to <ref type="bibr">(3.6)</ref>, we have that</p><p>Again, using <ref type="bibr">Lemma 3.1 and (3.6)</ref>, we see that E X S i -EX S i 2 &#8804; &#948; provided that the batch size B satisfies B &#8805; 8e&#951; log(2d) max 5 A 2 F A 2 &#945; 2 M(&#945;, &#946;) 2 &#948; -2 , (20|A 4 F &#945; 2 M(&#945;, &#946;) 2 &#948; -2 ) 1/2 . Using the same choice of &#948; 2 = &#946; log(k * )/(2k * ) we get the desired bound.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Downloaded from https://academic.oup.com/imajna/article/45/3/1397/7729526 by guest on 01 February 2026 ON THE FAST CONVERGENCE OF MINIBATCH HBM</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Downloaded from https://academic.oup.com/imajna/article/45/3/1397/7729526 by guest on 01 February 2026</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_2"><p>If (2.6)  does not hold, then the larger of |z &#177; j | will be greater than &#8730; &#946;.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_3"><p>In particular, the blocks T j corresponding to the smallest and largest eigenvalues each have only a single eigenvector. Downloaded from https://academic.oup.com/imajna/article/45/3/1397/7729526 by guest on 01 February 2026</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_4"><p>Without this assumption we still have a (messy) bound for R.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_5"><p>This spectrum is often referred to as the 'model problem' in numerical analysis and is commonly used to study the convergence of iterative methods for solving linear systems of equations(Strakos, 1991; Strakos &amp; Greenbaum, 1992).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_6"><p>While (Minibatch-HBM) uses significantly more floating point operations than RK, the runtime to fixed accuracy (for all batch sizes tested) was actually significantly lower than RK due to vectorized operations. Since runtime is highly system dependent, we do not provide actual timings.</p></note>
		</body>
		</text>
</TEI>
