<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Compressible Dynamics in Deep Overparameterized Low-Rank Learning &amp; Adaptation</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the 41 st International Conference on Machine Learning</publisher>
				<date>05/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10578917</idno>
					<idno type="doi"></idno>
					
					<author>Can Yaras</author><author>Peng Wang</author><author>Laura Balzano</author><author>Qing Qu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[While overparameterization in machine learning models offers great benefits in terms of optimization and generalization, it also leads to increased computational requirements as model sizes grow. In this work, we show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of overparameterization without the computational burdens. In practice, we demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models. Our approach is grounded in theoretical findings for deep overparameterized low-rank matrix recovery, where we show that the learning dynamics of each weight matrix are confined to an invariant lowdimensional subspace. Consequently, we can construct and train compact, highly compressed factorizations possessing the same benefits as their overparameterized counterparts. In the context of deep matrix completion, our technique substantially improves training efficiency while retaining the advantages of overparameterization. For language model fine-tuning, we propose a method called "Deep LoRA", which improves the existing low-rank adaptation (LoRA) technique, leading to reduced overfitting and a simplified hyperparameter setup, while maintaining comparable efficiency. We validate the effectiveness of Deep LoRA on natural language tasks, particularly when fine-tuning with limited data.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In recent years, there has been a growing interest within the realm of deep learning in overparameterization, which refers to employing a greater number of model parameters than necessary to interpolate the training data. While this may appear counterintuitive initially due to the risk of overfitting, it has been demonstrated to be an effective modeling approach <ref type="bibr">(Zhang et al., 2021;</ref><ref type="bibr">Wu et al., 2017;</ref><ref type="bibr">Allen-Zhu et al., 2019a;</ref><ref type="bibr">Buhai et al., 2020;</ref><ref type="bibr">Xu et al., 2018)</ref>, primarily attributed to improved optimization landscape and implicit algorithmic regularization. In the context of large language models (LLMs) <ref type="bibr">(Radford et al., 2019;</ref><ref type="bibr">Brown et al., 2020)</ref>, empirical scaling laws <ref type="bibr">(Kaplan et al., 2020)</ref> suggest that larger models are more sample efficient, often requiring fewer samples to reach the same test loss.</p><p>Taking the problem of low-rank matrix recovery as an illustrative example, the seminal work of <ref type="bibr">Arora et al. (2019)</ref> showed that deeper factorizations better promote low-rank solutions as a function of depth, consequently mitigating overfitting in the overparameterized regime compared to a classical two-layer factorization approach; see Figure <ref type="figure">1</ref> (left). On the other hand, increasing the width of each layer substantially reduces the number of iterations to reach the same training error; see Figure <ref type="figure">1 (right)</ref>.</p><p>While overparameterization offers remarkable benefits, it also comes with its computational challenges. The significantly increased number of parameters inevitably results in dramatically higher computational costs. This naturally raises a fundamental question: can we attain the benefits of Figure <ref type="figure">2</ref>. Invariant low-dimensional subspaces in deep overparameterized adaptation of language models. Fine-tuning BERT <ref type="bibr">(Devlin et al., 2019)</ref> with deep overparameterized adaptation on the STS-B dataset <ref type="bibr">(Cer et al., 2017)</ref>. Left: Singular value spectra across all adapted layers at the end of fine-tuning. Middle: Alignment of subspaces formed by top 8 right singular vectors between current adapted weights and final adapted weights throughout training. Right: Training loss continues to decrease in iterations after subspace alignment with final adapted weights. See Section 4 for more details.</p><p>overparameterization with a substantial reduction in computational costs?</p><p>In this work, we show that we can achieve this by exploiting low-dimensional structures of data and compressible learning dynamics in the model weights. In the context of low-rank matrix recovery via deep overparameterized factorizations, we discover an interesting phenomenon that for each weight matrix, the learning dynamics only happen within an approximately invariant low-dimensional subspace throughout all iterations. We rigorously prove this for deep matrix factorization, which also allows us to compress the number of training parameters significantly when dealing with deep matrix completion. Consequently, we can construct and train a nearly equivalent, yet much smaller, compressed factorization without sacrificing the advantages of its overparameterized counterpart.</p><p>Interestingly, we empirically find that the above phenomenon can also be observed when employing deep overparameterized weight updates for fine-tuning language models; see Figure <ref type="figure">2</ref> for an illustration. Therefore, we can adapt our idea of compressing deep matrix factorization to improve language model fine-tuning. For fine-tuning large-scale pretrained language models, recently low-rank adaptation (LoRA) stands out as the most commonly-used technique due to its effectiveness and efficiency <ref type="bibr">(Hu et al., 2021)</ref>. The basic idea of LoRA is to freeze the pretrained weights and adapt each one to new tasks by adding and optimizing an update in the form of a two-layer low-rank decomposition. Nonetheless, in practical scenarios, selecting the optimal rank of the decomposition can pose a significant challenge. If the rank is not chosen properly, it may lead to overfitting, particularly when we overestimate the rank or when there is limited downstream data available.</p><p>We deal with this drawback of LoRA by employing a deep (three-layer) overparameterized factorization for the trainable update, which is constructed and optimized via the compression technique used for deep matrix completion. As such, our new method, which we term as Deep LoRA, enjoys notable advantages over the original LoRA method, namely (i) less overfitting by exploiting depth, and (ii) fewer hyperparameters without rank r and scale &#945; having to be carefully tuned across all layers, all while having a comparable parameter efficiency due to compression.</p><p>Contributions. We summarize our contributions below.</p><p>&#8226; Practical contributions. We develop efficient compression methods by exploring compressible learning dynamics in overparameterized factorizations. Our method enjoys the benefits of overparameterization while significantly improving its efficiency. We demonstrate the effectiveness not only on deep matrix completion, but also for improving LoRA for language model fine-tuning.</p><p>&#8226; Theoretical contributions. Our methods are inspired by our theoretical results for deep matrix factorization. Mathematically, we rigorously prove the existence of invariant low-dimensional subspaces throughout gradient descent for each weight matrix, and show how they can constructed in practice.</p><p>Related Works There is a great deal of literature on implicit regularization in the setting of matrix factorization/linear networks <ref type="bibr">(Neyshabur et al., 2015;</ref><ref type="bibr">Gunasekar et al., 2017;</ref><ref type="bibr">Arora et al., 2019;</ref><ref type="bibr">Moroshko et al., 2020;</ref><ref type="bibr">Timor et al., 2023;</ref><ref type="bibr">Ji &amp; Telgarsky, 2019;</ref><ref type="bibr">Gidel et al., 2019;</ref><ref type="bibr">You et al., 2020;</ref><ref type="bibr">Liu et al., 2022)</ref>, as well as low-rank learning in deep networks <ref type="bibr">(Jaderberg et al., 2014;</ref><ref type="bibr">Sainath et al., 2013;</ref><ref type="bibr">Denil et al., 2013;</ref><ref type="bibr">Khodak et al., 2020;</ref><ref type="bibr">Oymak et al., 2019;</ref><ref type="bibr">Min Kwon et al., 2024;</ref><ref type="bibr">Tarzanagh et al., 2023)</ref>. Similarly, there is an abundance of work discussing the benefits of overparameterization <ref type="bibr">(Du &amp; Hu, 2019;</ref><ref type="bibr">Arora et al., 2018b;</ref><ref type="bibr">Allen-Zhu et al., 2019b;</ref><ref type="bibr">Arpit &amp; Bengio, 2019)</ref>.</p><p>Following the advent of LoRA <ref type="bibr">(Hu et al., 2021)</ref>, there have been many follow-up works <ref type="bibr">(Zhang et al., 2022;</ref><ref type="bibr">Chavan et al., 2023;</ref><ref type="bibr">Kopiczko et al., 2024)</ref>. Importantly, no existing work (to the best of our knowledge) explicitly deals with overfitting, particularly in the limited sample regime. A more detailed discussion of related works can be found in Appendix A.</p><p>Notation. Given any L &#8712; N, we use [L] to denote the index set {1, . . . , L}. We use t &#8712; Z &#8805;0 to index a countable set of objects, such as W (t). We denote by I n the identity matrix of size n &#215; n, and by 1 n a vector of length n with all entries equal to 1. We denote by &#8741;A&#8741; 2 F the squared Frobenius norm of matrix A, i.e., the sum of squares of all entries of A. We denote by N (A) the nullspace of the matrix A. We denote by O m&#215;n the set of m &#215; n matrices with either orthogonal rows or columns. We denote by &#8857; the Hadamard product, i.e., entry-wise multiplication. For convenience, whenever j &gt; i we adopt the abbreviations</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Warm-up Study: Deep Matrix Factorization</head><p>Towards gaining theoretical insights into the phenomena in Figure <ref type="figure">2</ref>, we first build some intuition based on the problem of deep matrix factorization. Under simplified settings, we rigorously unveil the emergence of low-dimensionality and compressibility in gradient descent learning dynamics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Basic Setup</head><p>Given a low-rank matrix &#934; &#8712; R dx&#215;dy with rank(&#934;) = r * , we approximate the matrix &#934; by an L-layer deep overparameterized factorization</p><p>where &#920; = (W l ) L l=1 are the parameters with weights W l &#8712; R d l &#215;d l-1 for l &#8712; [L]. We consider the case where the weights are all square</p><p>and learn the parameters &#920; by solving</p><p>via gradient descent (GD) from scaled orthogonal initialization, i.e., we initialize parameters &#920;(0) such that</p><p>(3) where &#1013; l &gt; 0. We assume this for ease of analysis, and believe that our results could hold for arbitrary small initial-ization. For each weight matrix, the GD iterations can be written as</p><p>for t = 0, 1, 2, . . . , where &#951; &gt; 0 is the learning rate and &#955; &#8805; 0 is an optional weight decay parameter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Main Theorem</head><p>We show that learning only occurs within an invariant lowdimensional subspace of the weight matrices, whose dimensionality depends on rank(&#934;).</p><p>Theorem 2.1. Let W l (t) satisfy the initialization scheme (3) and updates (4), and suppose &#934; &#8712; R d&#215;d is at most rank r and let m := d -2r &gt; 0. Then there exist orthogonal matrices</p><p>for all l &#8712; [L] and t &#8805; 0, where W l (t) &#8712; R 2r&#215;2r with W l (0) = &#1013; l I 2r , and</p><p>for all l &#8712; [L] and t &#8805; 1 with &#961; l (0) = &#1013; l .</p><p>In the following, we discuss several implications of our result and its relationship to previous work.</p><p>&#8226; SVD dynamics of weight matrices. The decomposition (5) is closely related to the singular value decomposition (SVD) of W l (t). Specifically, let d-2r) . Let W l (t) = U l (t) &#931; l (t) V &#8868; l (t) be an SVD of W l (t), where U l (t), V l (t) &#8712; O 2r and &#931; l (t) &#8712; R 2r&#215;2r is a diagonal matrix. Then, by (5) we can write W l (t) as</p><p>&#8868; which is essentially an SVD of W l (t) (besides the ordering of singular values). According to this, we can verify that &#961;(t) is a (repeated) singular value undergoing minimal changes across iterations illustrated in Figure <ref type="figure">3</ref> (left). Additionally, these repeated singular values correspond to invariant subspaces U l,2 , V l,2 that are stationary throughout GD, as seen in Figure <ref type="figure">3</ref> (middle and right).</p><p>&#8226; Low-rank bias. From (6), we can show under mild assumptions that the GD trajectory for each weight matrix either remains or tends towards a solution with rank at most 2r. This is true whether we employ implicit or explicit regularization. Indeed, if we use small initialization &#1013; l &#8776; 0 with no weight decay &#955; = 0, then the fact that &#961; l is a decreasing sequence (w.r.t. iteration) implies that the approximate rank of W l (t) can be no more than 2r throughout the entire trajectory. On the other hand, if we use weight decay with &#955; &gt; 0, then we have &#961; l (t) &#8594; 0 as t &#8594; &#8734;. This forces W l (t) towards a solution of rank at most 2r when the training converges. See Appendix E.2 for a formal statement and proof. This result is consistent with previous findings on low-rank and simplicity bias in deep networks <ref type="bibr">(Huh et al., 2022;</ref><ref type="bibr">Galanti &amp; Poggio, 2022;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Chou et al., 2024)</ref>.</p><p>&#8226; Comparison to prior arts. In contrast to existing work studying implicit bias of GD towards low-rank solutions <ref type="bibr">(Gunasekar et al., 2017;</ref><ref type="bibr">Arora et al., 2019)</ref>, our result explicitly shows how GD finds these solutions. Moreover, unlike previous work on implicit bias <ref type="bibr">(Min et al., 2021;</ref><ref type="bibr">Gissin et al., 2019;</ref><ref type="bibr">Arora et al., 2019;</ref><ref type="bibr">Vardi &amp; Shamir, 2021)</ref>, we also examine the effect of weight decay, which is commonly employed during the training of deep networks. Our analysis is distinct from that of <ref type="bibr">(Saxe et al., 2014;</ref><ref type="bibr">2019)</ref>, which studied continuous time dynamics under the special (separable) setting</p><p>In comparison, our result applies to discrete time dynamics and holds for initialization that is agnostic to the target matrix. It should also be noted that our result does not depend on balanced initialization like those in <ref type="bibr">(Arora et al., 2018a)</ref>, as the initialization scale &#1013; l for each layer can be arbitrarily different from one another.</p><p>A sketch of analysis. We now provide a rough sketch for the beginning of the proof of Theorem 2.1 in the special case of small initialization &#1013; l = &#1013; &#8776; 0 for all l &#8712; [L] and &#955; = 0, highlighting the construction of the invariant subspace at initialization. The full proof can be found in Appendix E.1.</p><p>Proof sketch. Since &#1013; L &#8776; 0, from the gradient of &#8467;(&#920;) (see Appendix E), we have</p><p>implying that the rank of G 1 is (approximately) at most r. Now consider the subspace</p><p>), where we have dim S &#8805; d -2r. Then, there exist orthonormal sets {v i } d-2r</p><p>i=1 and</p><p>form singular vector pairs of both W 1 (0) and W 1 (1) simultaneously as they remain unchanged by the gradient update G 1 , giving the last d -2r columns of V 1 and U 1 respectively. To see that we can take V 2 = U 1 , for instance, we note that</p><p>showing that u i are invariant under gradient updates in the second layer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Compression of Overparameterized Factorization</head><p>We now show that, as a consequence of Theorem 2.1 and the proof sketch, we can run GD on dramatically fewer parameters to achieve a near identical end-to-end trajectory as the original (full-width) factorization; see Figure <ref type="figure">4</ref>.</p><p>Constructing the "equivalent" compressed factorization. More specifically, given that &#934; is at most rank r and d -2r &gt; 0, from Theorem 2.1 we observe that</p><p>when we use initialization of small scale (i.e., (&#1013; l ) L l=1 are small). Here,</p><p>) denotes the compressed function with compressed parameters &#920; = ( W l ) L l=1 . As such, we can expect that solving</p><p>will approximately give the same solution as (2).</p><p>Constructing the factors (U l , V l ) L l=1 . As Theorem 2.1 only showed the existence of (U l , V l ) L l=1 , to solve (9) via GD, we need a practical recipe for constructing (U l , V l ) L l=1 efficiently at initialization of small scale &#1013; l . This can be achieved based upon our proof sketch in Section 2.2: we compute</p><p>), and complete to an orthonormal basis to yield V 1 . The remaining U l , V l can then be iteratively constructed via</p><p>It should be noted that these compressed factors are related to, yet distinct from, spectral initialization, which is wellstudied in the literature <ref type="bibr">(Chi et al., 2019;</ref><ref type="bibr">Khodak et al., 2020;</ref><ref type="bibr">St&#246;ger &amp; Soltanolkotabi, 2021)</ref>. Since U l,1 , V l,1 are constructed via orthogonal complements to nullspaces involving the gradient, these directions do indeed correlate with the top singular subspaces of &#934; in the deep matrix factorization case (although we do not use the singular value information). On the other hand, our approach is more general through the lens of compression, as it can be applied to a given deep overparameterized factorization trained on an arbitrary loss.</p><p>Optimization, complexity, and approximation error. In summary, we can approximately solve the original problem by solving (9) via GD for the compressed parameters &#920; = ( W l ) L l=1 , starting from small initialization (&#1013; l &#8776; 0). The factors U L,1 , V 1,1 can be efficiently constructed based upon an iterative scheme that we discussed above from the initial weights.</p><p>Comparing the parameter counts of the compressed</p><p>Since r &#8810; d, our approach leads to significant improvement in efficiency during GD; see Figure <ref type="figure">4</ref> (right). On the other hand, compression requires some additional computation to construct the factors U L,1 and V 1,1 prior to training, which involves taking a gradient of the first weight in the original factorization followed by an SVD or QR decomposition to compute an orthonormal basis for S. While this requires an additional O(d 3 ) compute, this has the same complexity as a single iteration of GD for the original factorization and is therefore a negligible overhead when comparing the two. Finally, the following result demonstrates that our compression method can achieve an almost identical end-to-end trajectory when we use small initializations; see Figure <ref type="figure">4</ref> (left).</p><p>Proposition 2.2. For r such that m := d -2r &gt; 0, if we run GD on the compressed weights &#920; as described above for the loss (9), we have</p><p>Here, &#1013; l is the initialization scale for the weight W l (0).</p><p>The key idea of Proposition 2.2 is that GD is invariant under orthogonal transformations, and each factor W l in the endto-end factorization in (9) is the result of an orthogonal transformation of W l . Then, the approximation error m &#8226; L l=1 &#1013; 2 l is only due to the approximation we showed in (8). We defer the full proof to Appendix E.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Application I: Deep Matrix Completion</head><p>In this section, we show that we can generalize our method in Section 2 from vanilla matrix factorization to solving low-rank matrix completion problems <ref type="bibr">(Candes &amp; Recht, 2012;</ref><ref type="bibr">Cand&#232;s &amp; Tao, 2010;</ref><ref type="bibr">Davenport &amp; Romberg, 2016)</ref> via compressed deep factorizations. Given a ground-truth &#934; &#8712; R d&#215;d with rank r * &#8810; d, the goal of low-rank matrix completion is to recover &#934; from only a few number of observations encoded by a mask &#8486; &#8712; {0, 1} d&#215;d . Adopting a matrix factorization approach, we minimize the objective</p><p>where f (&#920;) is the deep overparameterized factorization introduced in (1). The problem simplifies to deep matrix factorization (2) that we studied earlier when &#8486; = 1 d 1 &#8868; d in the full observation case. Additionally, (10) reduces to vanilla (shallow) matrix factorization when L = 2, whose global optimality and convergence have been widely studied under various settings <ref type="bibr">(Jain et al., 2013;</ref><ref type="bibr">Zheng &amp; Lafferty, 2016;</ref><ref type="bibr">Sun &amp; Luo, 2016;</ref><ref type="bibr">Ge et al., 2016;</ref><ref type="bibr">Bhojanapalli et al., 2016;</ref><ref type="bibr">Ge et al., 2017;</ref><ref type="bibr">Gunasekar et al., 2017;</ref><ref type="bibr">Li et al., 2019;</ref><ref type="bibr">Chi et al., 2019;</ref><ref type="bibr">Li et al., 2018b;</ref><ref type="bibr">Soltanolkotabi et al., 2023;</ref><ref type="bibr">Sun et al., 2018;</ref><ref type="bibr">Zhang et al., 2020;</ref><ref type="bibr">Ding et al., 2021)</ref>.</p><p>A double-edged sword of overparameterization. In practice, the true rank r * is not known -instead, we assume to have an upper bound r of the same order as r * , i.e., r * &#8804; r &#8810; d. Surprisingly, overparameterization has advantages in terms of both depth L and width r:</p><p>&#8226; Benefits of depth: mitigating overfitting. When r &gt; r * , it has been demonstrated <ref type="bibr">(Arora et al., 2019</ref>) that optimizing deeper factorizations (i.e., L &#8805; 3) generalize better in the low sample regime, while their shallow counterparts overfit; see Figure <ref type="figure">1</ref> (left).</p><p>&#8226; Benefits of width: improving convergence. On the other hand, increasing the width r of the deep factorization beyond r * results in accelerated convergence in terms of iterations, see Figure <ref type="figure">1</ref> (right).</p><p>However, the advantages of overparameterization come with the challenges of much higher computational costs. For an L-layer factorization of (full)-width d, we require O(L &#8226; d 3 ) multiplications per iteration to evaluate gradients and need to store O(L &#8226; d 2 ) parameters, where d is often very large.</p><p>Using ideas from Section 2, however, we can obtain the benefits of overparameterization without the extra computational costs.</p><p>Compression for deep matrix completion. Given the similarity between deep matrix factorization and completion (i.e., &#8486; = 1 d 1 &#8868; d vs arbitrary &#8486;), it seems straightforward to generalize our compression methods in Section 2.3 to deep matrix completion. However, as shown by the orange trace in Figure <ref type="figure">5</ref>, direct application does not work well, as the compressed factorization's trajectory diverges from that of the original. This is because the compressed subspaces</p><p>] can be misaligned with the true subspace due to the perturbation by the mask &#8486;.</p><p>Nonetheless, this issue can be mitigated by slowly updating U L,1 , V 1,1 during training. Specifically, compared to (9), we minimize min</p><p>via GD by updating &#920;, U L,1 , V 1,1 simultaneously every iteration, with a learning rate &#951; on &#920; along with a discrepant learning rate &#947;&#951; on U L,1 , V 1,1 . Because we update U L,1 , V 1,1 slower than updating &#920;, we generally choose &#947; &gt; 0 to be small and tuned accordingly for the given problem.</p><p>As a result, we reduce computational costs to O((L+d)&#8226;r 2 ) multiplications per iteration for computing gradients, and O(d &#8226; r + L &#8226; r 2 ) parameters. Yet still the trajectory of the deep compressed factorization ultimately aligns with that of the original, while converging roughly 5&#215; faster w.r.t. wall-time, as demonstrated in Figure <ref type="figure">5</ref>. Moreover, the accelerated convergence induced by the full-width trajectory results in the compressed factorization being 3&#215; faster than randomly initialized factorizations of similar width -see Appendix D.1 for more details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Application II: Model Fine-tuning</head><p>In this section, we show that our compression idea can be further extended to parameter-efficient fine-tuning of   For each choice of rank r, we draw 16 samples at random from STS-B over 5 trials with different seeds, and measure performance on the validation split of each method using the same train set.</p><p>pretrained language models, specifically via low-rank adaptation (LoRA) <ref type="bibr">(Hu et al., 2021)</ref>. In particular, inspired by our approach for deep matrix completion, we propose Deep Low-Rank Adaptation (Deep LoRA), which consistently outperforms vanilla LoRA in the limited sample regime.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Deep Low-Rank Adaptation (Deep LoRA)</head><p>Background on LoRA. With the ever-growing size of pretrained models and countless downstream tasks, full model fine-tuning is often computationally infeasible. Given a pretrained model whose parameters consist of a collection of dense weight matrices {W 0k } m k=1 &#8834; R d&#215;d (e.g., the query/key/value projections of a transformer <ref type="bibr">(Vaswani et al., 2017)</ref>), LoRA seeks to adapt each layer to a given task by freezing the pretrained weight {W 0k } m k=1 and optimizing an extra trainable low-rank factorization on top. In other words, the fine-tuned weight W k is given by</p><p>where &#945; &gt; 0 is a tunable scale parameter and</p><p>&#8712; R r&#215;d with r &#8810; d, thereby substantially reducing the number of trainable parameters during finetuning.</p><p>Proposed method: Deep LoRA. For vanilla LoRA, if we view adapting each weight matrix of model as an individual low-rank factorization problem, we have demonstrated in previous sections that overparameterization and subsequently compressing factorizations improves generalization with minimal extra computational costs. With this in mind, we can employ a deep overparameterized adaptation of each pretrained weight as</p><p>where each</p><p>Here, L &gt; 0 is the depth, and typically we choose L = 3, which is precisely the setting of Figure <ref type="figure">2</ref>. From the figure, we can see that (i) all the converged weights {&#8710;W k } m k=1 are very low-rank (left panel), (ii) the learning dynamics each for each weight approximately stay within the same invariant subspace throughout the iterations (middle panel), and (iii) this happens independent of the training loss decreasing (right panel).</p><p>These observations imply that deep overparameterized factorizations in Deep LoRA are highly compressible, so we can apply the compression method from deep matrix completion in Section 3 to compress the learning dynamics for each individual weight for model fine-tuning. Here, the major differences of our compression approach for deep LoRA from that of deep matrix completion is that (i) we have a separate compressed factorization for each layer to be adapted, and (ii) the fine-tuned loss function can be tailored for specific tasks (e.g., the cross-entropy) besides the &#8467; 2 loss.</p><p>Advantages of Deep LoRA. Compared to vanilla LoRA, Deep LoRA has clear advantages that we highlight below. More details are provided in Section 4.2.</p><p>&#8226; Less overfitting in limited data regime. Fine-tuning overparameterized models using LoRA can still result in overfitting in few shot or limited data regime <ref type="bibr">(Sebastian Raschka, 2023)</ref>. In comparison, the extra depth in (12) of Deep LoRA can help prevent overfitting (see Fig- ure 6), which is similar to deep matrix completion in Figure <ref type="figure">1</ref>.</p><p>&#8226; Robustness to the hyperparameter r. As shown in Figure <ref type="figure">8</ref>, by exploiting the intrinsic low-dimensional dynamics in GD via overparameterization in width, our approach is robust to the choice of the rank r in fine-tuning.</p><p>Deep LoRA only requires 3r 2 additional trainable parameters for each adapted layer compared to vanilla LoRA, where r is relatively small (e.g., r = 8).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">More Experimental Details</head><p>To evaluate our approach, we use a pretrained BERT <ref type="bibr">(Devlin et al., 2019)</ref> base model and apply adaptation on all attention and feedforward weights in the transformer, resulting in 72 adapted layers in total. Unless stated otherwise, we use r = 8 for both vanilla and Deep LoRA throughout all experiments, in which case Deep LoRA has roughly 0.01% more parameters (with respect to BERT) than vanilla LoRA. We utilize Adam <ref type="bibr">(Kingma &amp; Ba, 2014)</ref> as an optimizer for both methods. See Appendix B for more details on the experimental setup.</p><p>Advantage I: Better generalization with limited data.</p><p>We first evaluate our approach on tasks in the GLUE benchmark <ref type="bibr">(Wang et al., 2018)</ref>, which is a standard benchmark for natural language understanding. To test the performance in a limited data setting, for one given trial of a single task, we randomly sample 1024 examples from the task data for fine-tuning, and compare the difference in performance on the same train set between Deep LoRA and vanilla LoRA on the entire validation split. From the results shown in Table <ref type="table">1</ref>, we can see that Deep LoRA delivers significant improvements across most tasks compared to vanilla LoRA, and on average improves performance by nearly 3 points, a notable margin.</p><p>This improvement in performance becomes more pronounced in scenarios with severely limited data, such as few-shot settings. Applying the same sampling procedure as in the prior study to the STS-B dataset, we assess both approaches using only n &#8712; {16, 64, 256} training instances.</p><p>Experiments in Figure <ref type="figure">6</ref> illustrate that Deep LoRA consistently surpasses vanilla LoRA across all sample sizes, with the most significant difference observed when n = 16.</p><p>Deep LoRA finds lower rank solutions. We find that at the end of fine-tuning, Deep LoRA finds lower rank solutions for &#8710;W k than vanilla LoRA, as shown in Figure <ref type="figure">7</ref>.</p><p>In the limited data setting (256 samples), we see that all adapted layers in the vanilla LoRA saturate the constrained numerical rank r = 8, while most layers in Deep LoRA are perturbed by matrices with numerical ranks between 0 and 4.<ref type="foot">foot_1</ref> This suggests that Deep LoRA can adaptively select the appropriate rank for each layer depending on the task. This low-rank bias induces implicit regularization during the fine-tuning process and ultimately prevents overfitting to the task, particularly when only few training samples are available. As a practical consideration, Deep LoRA also requires a fraction of the memory cost to store compared to vanilla LoRA due to the parsimony in adapted weights.</p><p>Advantage II: Robustness to choice of rank r. Due to the scarcity of the target training data, choosing the rank r in LoRA is a delicate process -it needs to be large enough to capture the complexity in modeling the downstream task, while small enough to prevent overfitting. The proposed Deep LoRA, on the other hand, avoids catastrophic overfitting as we increase r, as demonstrated in Figure <ref type="figure">8</ref>. This observation mirrors the behavior seen in deep matrix completion, as illustrated in Figure <ref type="figure">1</ref>. For shallow factorizations, an overestimation of rank r leads to an increase in the generalization error. In contrast, deep factorizations remain resilient to overfitting.</p><p>Finally, we show that Deep LoRA outperforms vanilla LoRA for few-shot natural language generation fine-tuning in Appendix C. We also provide an ablation study in Appendix D.2 on the compression mechanism for Deep LoRA and show that it is crucial for accelerating training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion &amp; Future Directions</head><p>In this work, we have provided an in-depth exploration of low-dimensionality and compressibility in the dynamics of deep overparameterized learning, providing theoretical understandings in the setting of deep matrix factorization and applications to efficient deep matrix completion and language model adaptation. Finally, we outline a couple potential future directions following this work.</p><p>Compressibility in non-linear settings. Although the results on network compression in Section 2 exploit the specific gradient structure of deep matrix factorizations, we believe that our analysis can provide meaningful direction for analyzing the fully non-linear case.</p><p>To sketch an idea, consider the setting of Section 2.1 except with a non-linear factorization, i.e., (1) becomes</p><p>where &#963; is (for example) the entry-wise ReLU activation. For concreteness, consider the L = 3 case. The gradient of the loss with respect to, e.g., W 2 in ( <ref type="formula">2</ref>) is given by</p><p>) &#8868; where h is the entry-wise unit step function and E = f (&#920;) -&#934;. Comparing this to the gradient in the linear setting ( <ref type="formula">14</ref>), there is a great deal of shared structure, with the two main differences being the non-linearity applied to W 1 in the post factor and a projection on the inner term via h(W 2 &#963;(W 1 )). However, we still have the lowrank structure of W &#8868; 3 E, and the zeroing out of certain entries preserves approximate spectral properties of the matrix <ref type="bibr">(Chatterjee, 2015)</ref>. Moreover, this projection is akin to the masking via &#8486; as in deep matrix completion from Section 3, for which we do find compressible dynamics. In Figure <ref type="figure">9</ref>, we plot the singular value spectrum of the above gradient at small initialization, finding that the top r * singular values separate from the rest of the spectrum. This suggests that we may be able to identify a low-dimensional subspace along which we can achieve similar dynamics to the full parameter space.</p><p>Extensions to Deep LoRA. We have demonstrated the efficacy of Deep LoRA for natural language understanding and generation in Section 4 and Appendix C respectively. However, it would be meaningful to evaluate Deep LoRA in other modalities, e.g., diffusion models, where fine-tuning on limited data is commonplace. Moreover, the high degree of alignment at initialization to the final adapted subspaces shown in Figure <ref type="figure">2</ref> suggests that SGD (rather than Adam) can be used for the outer factors of Deep LoRA, further reducing memory costs. Finally, exploring the use of second-order methods to accelerate fine-tuning along the rank-r subspace could be a potential improvement. Implications for representation learning. The low-rank bias in the end-to-end features of deep networks may have important connections to emergent phenomena in representation learning, such as deep neural collapse <ref type="bibr">(Zhai et al., 2024;</ref><ref type="bibr">Zhou et al., 2022b;</ref><ref type="bibr">Yaras et al., 2022;</ref><ref type="bibr">Wang et al., 2022;</ref><ref type="bibr">Zhou et al., 2022a;</ref><ref type="bibr">Zhu et al., 2021;</ref><ref type="bibr">Beaglehole et al., 2024;</ref><ref type="bibr">Li et al., 2024)</ref>, whereby the last-layer representations exhibit surprisingly simple structures. Moreover, by uncovering the low-rank evolution of individual weights, we could shed light on more intricate phenomena such as progressive neural collapse <ref type="bibr">(He &amp; Su, 2023;</ref><ref type="bibr">Wang et al., 2023)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Impact Statement</head><p>This paper presents work on training and fine-tuning large language models (LLMs), the consequences of which include (but are not limited to) the propagation of biases, privacy violations from data misuse, and significant environmental costs due to high energy consumption. results in faster convergence. However, increasing the width alone also increases computational costs -by employing compression, we can achieve the best of both worlds. We verify that compression is crucial for the efficiency of Deep LoRA. We compare the performance of three different approaches: (i) Original, where we use a three-layer full-width factorization as in ( <ref type="formula">12</ref>), (ii) Compressed, which is the rank-r compression of (12) (a.k.a. Deep LoRA), and (iii) Random, where the W (i) k in ( <ref type="formula">12</ref>) are initialized randomly with W</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2. Deep LoRA</head><p>(2) k &#8712; R r&#215;r . We can see that via compression, Deep LoRA can achieve similar convergence behavior to the original overparameterized factorization with much fewer parameters, while the randomly initialized version takes much longer to train, similar to the result for deep matrix completion in Appendix D.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Proofs</head><p>The analytic form of the gradient &#8711; &#920; &#8467;(&#920;) is given by</p><p>where E = f (&#920;) -&#934;, which when substituted into (4) gives the update rules</p><p>/&#1013; k+1 . This implies that W k+1 (0)v</p><p>. Moreover, we have</p><p>i /&#1013; k+1 = 0, where the first two equalities follow from orthogonality and u</p><p>/&#1013; k+1 , and the last equality is due to v</p><p>where the second equality follows from v</p><p>and the third equality is due to</p><p>i . Therefore E(k + 1) holds, so we have E(l) for all l &#8712; [L]. As a result, we have shown the base cases A(0), B(0), C(0), and D(0). Now we proceed by induction on t &#8805; 0. Suppose that A(t), B(t), C(t), and D(t) hold for some t &#8805; 0. First, we show A(t + 1) and B(t + 1). We have</p><p>where the first equality follows from (15), the second equality follows from definition of E(t), the third equality follows from D(t), and the fourth equality follows from A(t) and B(t) applied repeatedly along with v</p><p>proving A(t + 1). Similarly, we have</p><p>where the third equality follows from C(t), and the fourth equality follows from A(t) and B(t) applied repeatedly along with v</p><p>Repeatedly applying the above equality for k = l, l + 1, . . . , L -1, we obtain</p><p>which follows from C(t), proving C(t + 1). Finally, we show D(t + 1). For any k &#8712; {2, . . . , L}, it follows from v</p><p>Repeatedly applying the above equality for k = l, l -1, . . . , 2, we obtain</p><p>which follows from D(t). Thus we have proven D(t + 1), concluding the proof.</p><p>Proof of Theorem 2.1. By A(t) and B(t) of Lemma E.1, there exists orthonormal matrices {U l,2 } L l=1 &#8834; R d&#215;m and {V l,2 } L l=1 &#8834; R d&#215;m for l &#8712; [L] satisfying V l+1,2 = U l,2 for all l &#8712; [L -1] as well as W l (t)V l,2 = &#961; l (t)U l,2 and W l (t) &#8868; U l,2 = &#961; l (t)V l,2</p><p>for all l &#8712; [L] and t &#8805; 0, where &#961; l (t) satisfies (6) for t &#8805; 1 with &#961; l (0) = &#1013; l . First, complete V 1,2 to an orthonormal basis for</p><p>Then for each l &#8712; [L -1], set U l = [U l,1 U l,2 ] where U l,1 = W l (0)V l,1 /&#1013; l and V l+1 = [V l+1,1 V l+1,2 ] where V l+1,1 = U l,1 , and finally set U L = [U L,1 U L,2 ] where U L,1 = W L (0)V L,1 /&#1013; L . We note that V l+1 = U l for each l &#8712; [L -1] and U l , V l are orthogonal since W l (0)/&#1013; l is orthogonal for all l &#8712; [L]. Then we have</p><p>for all l &#8712; [L] and t &#8805; 0, where the first equality follows from ( <ref type="formula">16</ref>). Similarly, we also have</p><p>for all l &#8712; [L] and t &#8805; 0, where the first equality also follows from ( <ref type="formula">16</ref>). Therefore, combining ( <ref type="formula">16</ref>), (17), and (18) yields</p><p>for all l &#8712; [L], where W l (0) = &#1013; l I 2r by construction of U l,1 . This directly implies (5), completing the proof.</p><p>E.2. Low-rank bias in Theorem 2.1</p><p>Here, we verify the claims following Theorem 2.1 and give a precise characterization of the rate of decay of &#961; l as given by ( <ref type="formula">6</ref>) and the conditions on learning rate &#951; needed to achieve such behavior. These are given in the following lemma.</p><p>Lemma E.2. In the setting of Theorem 2.1, suppose 0 &lt; &#1013; l = &#1013; &#8804; 1 for all l &#8712; [L] and 0 &lt; &#951; &#8804; 1 &#955;+&#1013; . Then for all t &#8805; 0, the updates of &#961; l (t) in (6) satisfy &#961; l (t) = &#961;(t) for some &#961;, and</p><p>Since &#955; and &#1013; are often chose to be small, the above lemma implies that a small learning rate is not required to achieve a low-rank solution. Moreover, by choice of &#951;, when weight decay is employed (i.e., &#955; &gt; 0) the above inequality implies that &#961;(t) &#8594; 0 as t &#8594; &#8734;. When &#955; = 0, we instead have that &#961; is bounded by &#1013;.</p><p>Proof of Lemma E.2. If &#961; l (0) = &#1013; for all l &#8712; [L], it is clear that &#961; l (t) = &#961;(t) for some &#961; for all t &#8805; 0, and the updates take the form</p><p>for each t &#8805; 0. We proceed by induction. For t = 0, since &#961;(0) = &#1013;, the claim holds trivially. Now suppose (19) holds for some t &#8805; 0. By choice of &#951;, we have that 1 -&#951; &#8226; (&#955; + &#1013;) &#8805; 0, so &#961;(t) &#8805; 0. It then follows that</p><p>by the fact that &#961;(t) &#8804; &#1013; &#8226; (1 -&#951;&#955;) t . Next, by choice of &#951; and initial condition, we have that &#961;(t) &#8804; &#1013;, so that &#961;(t + 1) = &#961;(t) &#8226; 1 -&#951; &#8226; &#955; + &#961;(t) 2(L-1) &#8805; &#961;(t)</p><p>since &#1013; 2(L-1) &#8804; &#1013; by &#1013; &#8804; 1. The claim follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.3. Proof of Proposition 2.2</head><p>Proof. First, it follows from Theorem 2.1 that for any 1 &#8804; i &#8804; j &#8804; L we have</p><p>for all t &#8805; 0, where U l,1 , V l,1 &#8712; R d&#215;2r and U l,2 , V l,2 &#8712; R d&#215;m are the first 2r and last m columns of U l , V l &#8712; R d&#215;d respectively.</p><p>The key claim to be shown here is that W l (t) = W l (t) for all l &#8712; [L] and t &#8805; 0. Afterwards, it follows straightforwardly from (20) that</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>EECS Department, University of Michigan, Ann Arbor, USA. Correspondence to: Can Yaras &lt;cjyaras@umich.edu&gt;.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>A majority of them are in fact zero, i.e., no change from pretrained weights.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>Of course linear activation has been considered throughout the history of artificial neural nets, but the fact that a multilayer network with linear activation has an equivalent one-layer network meant this architecture was summarily ignored. This is evidenced in<ref type="bibr">(Dalton &amp; Deshmane, 1991)</ref>: "In summary, it makes no sense to use a multilayered neural network when linear activation functions are used."</p></note>
		</body>
		</text>
</TEI>
