<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>APOLLO: SGD-like Memory, AdamW-level Performance</title></titleStmt>
			<publicationStmt>
				<publisher>https://doi.org/10.48550/arXiv.2412.05270</publisher>
				<date when="2025-02-17">2025-02-17</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10631518</idno>
					<idno type="doi">10.48550/arXiv.2412.05270</idno>
					<author>H Zhu</author><author>Z Zhang</author><author>W Cong</author><author>X Liu</author><author>S Park</author><author>V Chandra</author><author>B Long</author><author>D Z Pan</author><author>Z Wang</author><author>J Lee</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Large language models (LLMs) are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face critical challenges: (i) reliance on costly SVD operations; (ii) significant performance trade-offs compared to AdamW; and (iii) still substantial optimizer memory overhead to maintain competitive performance. In this work, we identify that AdamW's learning rate adaptation rule can be effectively coarsened as a structured learning rate update. Based on this insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates learning rate scaling using an auxiliary low-rank optimizer state based on pure random projection. This structured learning rate update rule makes APOLLO highly tolerant to further memory reductions while delivering comparable pre-training performance. Even its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance compared to AdamW with SGD-level memory costs. Extensive experiments demonstrate that the APOLLO series performs on-par with or better than AdamW, while achieving greater memory savings by nearly eliminating the optimization states of AdamW. These savings provide significant system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training LLaMA-7B on a single GPU using less than 12 GB of memory with weight quantization.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">C</head><p>update rule with a gradient scaling factor S = G̃_t ⊘ G_t ∈ ℝ^(m×n) over the raw gradient G_t, where m × n is the dimension of the weight tensor. The element-wise scaling factor S is then simplified into a channel-wise format, s ∈ ℝ^(1×n), where each element s_j for channel j is s_j = ‖G̃_t[:, j]‖_2 / ‖G_t[:, j]‖_2. The gradient norm can exhibit a significant spike at the early stage of training, due to the unstable early gradients. Instead of applying the vanilla gradient clipping method, we use the Norm-growth Limiter (NL) in (Chen et al., 2024) to limit the consecutive gradient growth, as it is shown to be slightly more effective than gradient clipping: if ‖G_t‖ / ‖G_(t−1)‖ > γ then G_t ← (G_t / ‖G_t‖) · γ‖G_(t−1)‖ (4), where γ is a threshold ensuring that the rate of gradient growth remains controlled. This approach limits the magnitude of gradient-norm increases, particularly for the unstable gradients in the early stage. The update proceeds as: G_t ∈ ℝ^(m×n) ← −∇_W φ_t(W_t); if t mod T = 0 then P_t ← N_seed(0, 1/r) and seed ← an independent new random seed; R_t ← P_t G_t. Step 2: obtain the low-rank optimization states M_(R_t), V_(R_t) ← AdamW(R_t, β_1, β_2, λ = 0), and R̃_t ← M_(R_t) / (√V_(R_t) + ε). Step 3: obtain the approximated gradient scaling factor: if APOLLO then S ← diag(s_(R_t)), computed in the compressed low-rank space rather than the original full-rank one, as shown in Algorithm 1. 
Specifically, an auxiliary low-rank optimizer state is stored by taking the low-rank gradient R_t as input, computed as R_t = P_t G_t ∈ ℝ^(r×n) with a projection matrix P_t ∈ ℝ^(r×m). It only maintains the low-rank version of the first and second moments as M_(R_t) = β_1 M_(R_(t−1)) + (1 − β_1) R_t and V_(R_t) = β_2 V_(R_(t−1)) + (1 − β_2) R_t².</p><p>These low-rank moments, M_(R_t) and V_(R_t), are then converted into lightweight, channel-wise scaling factors: in the low-rank space, s̃_j = ‖R̃_t[:, j]‖_2 / ‖R_t[:, j]‖_2, approximating the factor in the original space, s_j = ‖G̃_t[:, j]‖_2 / ‖G_t[:, j]‖_2.</p><p>We quantify the approximation error in the following.</p><p>Proof: please refer to Appendix A.1.3.</p><p>For any channel j, with probability ≥ 1 − δ, the low-rank scaling factor stays within a multiplicative factor of the original-space one. <ref type="table"/></p></div>
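<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the update described above, here is a minimal NumPy sketch of one APOLLO-style step, combining the random projection, the low-rank AdamW moments, the channel-wise scaling, and the norm-growth limiter of Equation 4. The function name, state layout, and hyperparameter defaults are our own illustrative choices, not the paper's reference implementation; the paper also resamples the projection every T steps, which this sketch folds into a stored seed.

```python
import numpy as np

def apollo_step(W, grad, state, r=8, beta1=0.9, beta2=0.999,
                eps=1e-8, lr=1e-3, gamma=1.01):
    """One APOLLO-style update: low-rank AdamW states drive a
    channel-wise scaling of the raw, full-rank gradient."""
    m, n = grad.shape
    # Step 1: project the gradient to a rank-r space, R_t = P_t G_t,
    # with P_t entries drawn i.i.d. from N(0, 1/r) (std = 1/sqrt(r)).
    rng = np.random.default_rng(state.setdefault("seed", 0))
    P = rng.normal(0.0, 1.0 / np.sqrt(r), size=(r, m))
    R = P @ grad                                        # (r, n)
    # Step 2: AdamW-style first/second moments, kept only in low rank.
    state["M"] = beta1 * state.get("M", 0.0) + (1 - beta1) * R
    state["V"] = beta2 * state.get("V", 0.0) + (1 - beta2) * R**2
    R_tilde = state["M"] / (np.sqrt(state["V"]) + eps)
    # Step 3: channel-wise factors s_j = ||R~[:, j]|| / ||R[:, j]||,
    # applied to the full-rank gradient by broadcasting over channels.
    s = np.linalg.norm(R_tilde, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    G_tilde = grad * s
    # Norm-growth limiter (eq. 4): cap consecutive gradient-norm growth.
    g_norm = np.linalg.norm(G_tilde)
    prev = state.get("prev_norm")
    if prev is not None and g_norm / prev > gamma:
        G_tilde *= gamma * prev / g_norm
        g_norm = gamma * prev
    state["prev_norm"] = g_norm
    return W - lr * G_tilde
```

Note that only the rank-r moments M, V (plus a seed and a scalar norm) persist between steps, which is where the memory saving over full-rank AdamW comes from.</p></div>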
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Moreover, we find the tensor-wise scaling factor estimated in a rank-1 space is typically smaller than that obtained with a higher rank, which the theorem can explain theoretically.</head><p>AdamW uses 1.37 G of optimizer memory with perplexity 18.80 and task scores 0.5881, 0.4729, 0.3286, 0.5335, 0.304, 0.3615, 0.2167, 0.6387, 0.591, 0.2047, 0.3554; APOLLO uses 0.34 G with perplexity 16.85 and scores 0.5165, 0.4729, 0.3528, 0.5146, 0.318, 0.3792, 0.2517, 0.6632, 0.592, 0.2188, 0.3681; APOLLO-Mini uses 0.00 G with perplexity 17.18 and scores 0.5434, 0.4729, 0.3481, 0.5162, 0.320, 0.3653, 0.2474, 0.6469, 0.591, 0.2178, 0.3654. A further table reports results for sequence length 1024 (columns Method, Memory, Perplexity, Bool…).</p><p>The expectation of ‖Px‖² is E[‖Px‖²] = ‖x‖² for P with i.i.d. N(0, 1/r) entries, and the norm is preserved up to a factor (1 ± ε) with probability ≥ 1 − 2 exp(−r ε² / 8).</p></div>
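<div xmlns="http://www.tei-c.org/ns/1.0"><p>The norm-preservation property behind this bound can be checked empirically. The following sketch (our own illustration, with arbitrary vector sizes and trial counts) estimates E[‖Px‖²]/‖x‖² and its spread for P with i.i.d. N(0, 1/r) entries, showing the concentration tighten as the rank r grows.

```python
import numpy as np

def projected_norm_stats(x, r, trials=100, seed=0):
    """Mean and spread of ||P x||^2 / ||x||^2 over random projections
    P in R^(r x m) with i.i.d. N(0, 1/r) entries (std = 1/sqrt(r))."""
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    vals = np.empty(trials)
    for i in range(trials):
        P = rng.normal(0.0, 1.0 / np.sqrt(r), size=(r, m))
        vals[i] = np.sum((P @ x) ** 2)
    true_sq = np.sum(x ** 2)
    return vals.mean() / true_sq, vals.std() / true_sq

x = np.random.default_rng(1).normal(size=1024)
for r in (1, 8, 64, 512):
    mean_ratio, spread = projected_norm_stats(x, r)
    print(f"r={r:4d}  mean ratio {mean_ratio:.3f}  spread {spread:.3f}")
```

The mean ratio stays near 1 for every rank, while the spread shrinks roughly like 1/√r, matching the exponential concentration in r stated above.</p></div>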
<div xmlns="http://www.tei-c.org/ns/1.0"><head>are the second moments in the original and projected spaces, respectively.</head></div>
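<div xmlns="http://www.tei-c.org/ns/1.0"><p>The recursive expansion of the second moment used in the proof that follows can be verified numerically. This is a small self-contained check (our own illustration), comparing the AdamW-style recursion against its expanded weighted sum over past squared gradients.

```python
import numpy as np

# Check that the second-moment recursion
#   V_t = beta2 * V_{t-1} + (1 - beta2) * G_t**2,   V_0 = 0,
# expands to the weighted sum
#   V_t = (1 - beta2) * sum_{k=0}^{t-1} beta2**k * G_{t-k}**2.
beta2 = 0.999
rng = np.random.default_rng(0)
grads = [rng.normal(size=(4, 3)) for _ in range(50)]

V = np.zeros((4, 3))
for g in grads:
    V = beta2 * V + (1 - beta2) * g**2             # the recursion

t = len(grads)
V_sum = (1 - beta2) * sum(
    beta2**k * grads[t - 1 - k]**2 for k in range(t)
)                                                  # the expanded sum

print("recursion matches expansion:", np.allclose(V, V_sum))
```

Because each squared gradient enters the sum linearly, any per-term norm bound (such as the projection bound of Theorem A.1) carries over to the whole sum, which is exactly how the proof proceeds.</p></div>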
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof. Our goal is to show that the norm of the second moment V_t in the original</head><p>By expanding recursively, we can write V_t[:, j] as a weighted sum of the squared gradients from all past iterations: starting from V_t[:, j] = β_2 V_(t−1)[:, j] + (1 − β_2) G_t[:, j]² and expanding recursively, we have V_t[:, j] = (1 − β_2) Σ_(k=0)^(t−1) β_2^k G_(t−k)[:, j]².</p><p>We can swap the summation order, and similarly for the first moment.</p><p>Step 4: constructing the bounds for V_(R_t)[:, j]. By Theorem A.1, we know that for each k, the ℓ2 norm of the projected gradient ‖R_(t−k)[:, j]‖ satisfies the upper bound ‖R_(t−k)[:, j]‖² ≤ (1 + ε) ‖G_(t−k)[:, j]‖². Similarly, we can obtain the lower bound with the factor (1 − ε).</p><p>We obtain the following bounds for the ℓ1 norm of the full projected second moment V_(R_t)[:, j]: (1 − ε) ‖V_t[:, j]‖_1 ≤ ‖V_(R_t)[:, j]‖_1 ≤ (1 + ε) ‖V_t[:, j]‖_1.</p><p>Applying Theorem A.2 to the first part, we obtain the error bound for the first part. We can easily apply Theorem A.3 to the first-moment term in Equation 11. Multiplying the inequalities from the theorems, we then have the bounded ratio between the scaling factors in the projected and original spaces.</p><p>Smaller directional sharpness implies the possibility of taking larger effective steps, potentially yielding a greater local decrease in the objective. 
In contrast, if the directional sharpness is large, we have no choice but to take a tiny step, as otherwise the loss would blow up due to the second-order term. </p></div></body>
		</text>
</TEI>
