<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Fast Equilibrium of SGD in Generic Situations</title></titleStmt>
			<publicationStmt>
				<publisher>2024 International Conference on Learning Representations</publisher>
				<date>05/11/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10547404</idno>
					<idno type="doi"></idno>
					
					<author>Zhiyuan Li</author><author>Yi Wang</author><author>Zhiren Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Normalization layers are ubiquitous in deep learning, greatly accelerating optimization. However, they also introduce many unexpected phenomena during training, for example, the Fast Equilibrium conjecture proposed by Li et al. (2020), which states that the scale-invariant normalized network, when trained by SGD with learning rate η and weight decay λ, mixes to an equilibrium in Õ(1/(ηλ)) steps, as opposed to the classical e^{O((ηλ)^{-1})} mixing time. Recent works by Wang & Wang (2022); Li et al. (2022c) proved this conjecture under different sets of assumptions. This paper aims to answer the fast equilibrium conjecture in full generality by removing the non-generic assumptions of Wang & Wang (2022); Li et al. (2022c) that the minima are isolated, that the region near minima forms a unique basin, and that the set of minima is an analytic set. Our main technical contribution is to show that with probability close to 1, in exponential time trajectories will not escape the attracting basin containing their initial position.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Normalization layers are ubiquitous and play a fundamental role in modern deep learning, e.g., Batch Normalization <ref type="bibr">(Ioffe &amp; Szegedy, 2015)</ref>, Group Normalization <ref type="bibr">(Wu &amp; He, 2018)</ref>, Layer Normalization <ref type="bibr">(Ba et al., 2016)</ref>, and Weight Normalization <ref type="bibr">(Salimans &amp; Kingma, 2016)</ref>. Normalization layers not only greatly facilitate optimization and improve trainability, they also bring intriguing new optimization behaviors to neural networks. For example, <ref type="bibr">Li &amp; Arora (2020)</ref> showed that normalized networks can be trained by SGD with exponentially increasing learning rates, because training with exponentially increasing learning rates turns out to be equivalent to training with constant learning rates but with weight decay turned on, as shown in (1). Here x k &#8712; R d is the parameter of a neural network after the k-th step and is updated by</p><p>x k+1 = (1 -&#951;&#955;)x k -&#951;&#8711;L B k (x k ), (1)</p><p>where &#955; and &#951; are respectively the weight decay parameter and the learning rate, and L B k is the loss function evaluated using a randomly chosen mini-batch B k .</p><p>The result of <ref type="bibr">Li &amp; Arora (2020)</ref> holds not only for normalized networks but more broadly for all scale-invariant training losses, which are a popular abstraction of normalized networks in optimization analysis. Mathematically, scale invariance refers to the following property of the loss: L B (cx) = L B (x), &#8704;c &gt; 0, x &#8712; R d and every batch B.</p><p>Later, <ref type="bibr">Li et al. (2020)</ref>; <ref type="bibr">Wan et al. (2021)</ref> discovered that it is the intrinsic learning rate &#951;&#955; that controls the long-term convergence behavior of SGD on a scale-invariant loss with weight decay, (1). The approach that <ref type="bibr">Li et al. 
(2020)</ref> take to study (1) is to approximate it by the stochastic differential equation (SDE) model <ref type="bibr">(Li et al., 2017;</ref> <ref type="bibr">2019)</ref>, which is quite common in the literature.</p><p>dX t = (-&#951;&#8711;L(X t ) -&#951;&#955;X t )dt -&#951;&#963;(X t )dB K t . (2)</p><p>Here L is the average</p><p>is a d &#215; K matrix, and B K t is the K-dimensional Wiener process. Scale invariance of L B implies that L is scale-invariant and that &#963; is (-1)-homogeneous, i.e. L(cx) = L(x), &#963;(cx) = c^{-1} &#963;(x), &#8704;c &gt; 0, x &#8712; R d . (3)</p><p><ref type="bibr">Li et al. (2020)</ref> further proposed the following Fast Equilibrium Conjecture for the SDE approximation of SGD. Conjecture 1.1. [Fast Equilibrium Conjecture] <ref type="bibr">(Li et al., 2020)</ref> Let F (X, input) denote the output of a neural network NN with parameters X, and let X t denote the value of SDE (2) at time t, starting from initial parameter X 0 . Suppose NN has normalization steps so that F (X, input) is scale-invariant in X, i.e. F (X, input) = F (cX, input) for all c &gt; 0. Then for all input values input, the probability distribution of F (X t , input) stabilizes to an equilibrium in &#213;(1/(&#951;&#955;)) steps of SGD updates.</p><p>The original paper <ref type="bibr">(Li et al., 2020)</ref>, where the Fast Equilibrium Conjecture was first posed, contains experiments in which the empirically observed rates of convergence are polynomial. The rate &#213;(1/(&#951;&#955;)) is considered to be fast because, according to Langevin dynamics, the time it takes to converge to the Gibbs equilibrium is of exponential order e^{O((&#951;&#955;)^{-1/2})}. This estimate can be obtained by an analysis similar to those in <ref type="bibr">(Bovier et al., 2004;</ref> <ref type="bibr">Shi et al., 2020)</ref>. 
The works by <ref type="bibr">Bovier et al. (2004)</ref> and <ref type="bibr">Shi et al. (2020)</ref> dealt with models without normalization, and the convergence times there are of order e^{O((&#951;&#955;)^{-1})}. Using a method similar to that of <ref type="bibr">(Bovier et al., 2004)</ref> and <ref type="bibr">(Shi et al., 2020)</ref>, when normalization is used the convergence time can be shown to be of order e^{O((&#951;&#955;)^{-1/2})}. This is because <ref type="bibr">Li et al. (2020)</ref> proved that the intrinsic learning rate &#951;&#955; is replaced by an effective learning rate (&#947;_t^{-1/2} in <ref type="bibr">Li et al. (2020)</ref>) for the renormalized parameter vector, which is of order O((&#951;&#955;)^{1/2}).</p><p>The recent paper <ref type="bibr">(Li et al., 2022c)</ref>, using a mathematical framework from <ref type="bibr">(Li et al., 2022b)</ref>, established the fast equilibrium conjecture for &#951;&#955; &#8594; 0 under a mixed set of generic and non-generic assumptions. See <ref type="bibr">(Damian et al., 2021)</ref>, <ref type="bibr">(Gu et al., 2022)</ref> for more work on analyzing the dynamics of SGD near the manifold of minimizers. The goal of the current paper is to remove the non-generic assumptions and thus provide a general proof in the aforementioned range of parameters.</p></div>
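<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a numerical illustration of the scale invariance discussed above, the following minimal sketch uses a hypothetical toy loss (the squared sine of the angle between x and a fixed direction u), which is not a loss from the paper. It checks that L(cx) = L(x), that the gradient is orthogonal to x and (-1)-homogeneous, and that one step of SGD with weight decay, assumed here to take the standard form x &#8592; (1 - &#951;&#955;)x - &#951;&#8711;L(x) with a full-batch gradient standing in for the mini-batch gradient, changes the squared norm by an exact recursion. All function names are illustrative.</p><p><![CDATA[
```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def loss(x, u):
    """Toy scale-invariant loss: sin^2 of the angle between x and a fixed direction u."""
    return 1.0 - dot(x, u) ** 2 / (dot(x, x) * dot(u, u))

def grad(x, u):
    """Exact gradient of `loss` with respect to x."""
    xu, xx, uu = dot(x, u), dot(x, x), dot(u, u)
    return [-2.0 * xu * ui / (xx * uu) + 2.0 * xu ** 2 * xi / (xx ** 2 * uu)
            for xi, ui in zip(x, u)]

def sgd_wd_step(x, u, eta, lam):
    """One step x <- (1 - eta*lam) x - eta * grad L(x).

    Assumption: a full-batch gradient is used in place of the mini-batch gradient.
    """
    g = grad(x, u)
    return [(1.0 - eta * lam) * xi - eta * gi for xi, gi in zip(x, g)]

x = [1.0, 2.0, -0.5]
u = [0.3, -1.0, 2.0]
c = 3.7
cx = [c * xi for xi in x]

# L(cx) = L(x)
scale_gap = abs(loss(cx, u) - loss(x, u))
# <grad L(x), x> = 0
radial_component = abs(dot(grad(x, u), x))
# grad L(cx) = grad L(x) / c
homogeneity_gap = max(abs(c * a - b) for a, b in zip(grad(cx, u), grad(x, u)))

eta, lam = 0.1, 0.05
x_next = sgd_wd_step(x, u, eta, lam)
g = grad(x, u)
# Since grad L(x) is orthogonal to x:
# |x_next|^2 = (1 - eta*lam)^2 |x|^2 + eta^2 |grad L(x)|^2
norm_gap = abs(dot(x_next, x_next)
               - ((1.0 - eta * lam) ** 2 * dot(x, x) + eta ** 2 * dot(g, g)))

print(scale_gap, radial_component, homogeneity_gap, norm_gap)
```
]]></p></div>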
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">NOTATIONS AND ASSUMPTIONS</head><p>To introduce assumptions from previous authors as well as our results, we first need to set up some notation. Let &#915; &#8838; R d \{0} be the set of local minima of L. Notice that by (3), &#915; is a cone, i.e. x &#8712; &#915; if and only if cx &#8712; &#915; for all c &gt; 0. For all r &gt; 0, write &#915; r = {x &#8712; &#915; : |x| = r}. In particular &#915; 1 is a subset of the unit sphere S d-1 = {|x| = 1}.</p><p>In general, &#915; may have multiple connected components. Decompose &#915; = &#8852; i &#915; i where each &#915; i is a connected cone. We then write &#915; i r = &#915; i &#8745; {|x| = r}. Then &#915; i 1 are the connected components of &#915; 1 . In particular, there are only finitely many &#915; i 's and we index them by i = 1, &#8226; &#8226; &#8226; , m.</p><p>In addition to the scaling properties (3) guaranteed by the use of normalization, <ref type="bibr">Li et al. (2022c)</ref> made certain assumptions, which we will need in the following. Assumption 1.2. The functions L and &#963; satisfy:</p><p>(i). (Scale invariance) L and &#963; satisfy (3).</p><p>(ii). (Regular critical locus) Each loss function L B k is C 4 on R d \{0}, and the critical points of L form a C 2 submanifold &#8486;. For all x &#8712; &#8486;, &#8711; 2 L(x) is of rank d -dim T x &#8486;.</p><p>(iii). (Controllability) For all x &#8712; &#915; 1 , span{&#8706;&#934;(x)&#963; k (x)} K k=1 = T x &#915; 1 . Here and below, &#934;(x) = lim t&#8594;&#8734; X t , with X t being the solution to the deterministic gradient flow dX t /dt = -&#8711;L(X t ) with initial value x.</p><p>By <ref type="bibr">(Arora et al., 2022, Lemma B.15)</ref>, under Assumption 1.2.(i) &amp; (ii), &#934; is well defined and C 2 -differentiable on a neighborhood of &#915; as long as L is C 4 -differentiable. We also note that in general the noise structure does affect the convergence rate. 
But as long as Assumption 1.2.(iii) is satisfied, the noise structure does not affect the asymptotic order of the convergence rate.</p><p>On the other hand, we will not need the following assumptions.</p><p>Assumption 1.3. &#915; satisfies:</p><p>(i). (Unique basin) &#915; 1 is compact and connected;</p><p>(ii). (Analyticity) &#915; is a real analytic manifold and Tr &#931; is a real analytic function on R d \{0} where &#931; = &#963;&#963; &#8868; .</p><p>Restricting to an attracting basin U of &#915; and assuming both Assumptions 1.2 and 1.3, <ref type="bibr">Li et al. (2022c)</ref> proved Conjecture 1.1 when &#955;&#951; &#8594; 0 in the natural range of &#951; &#8804; O(&#955;) &#8804; O(1) and the parameter X 0 is initialized within U . Note that since U is an attracting basin, &#8486; and &#915; coincide in U and thus Assumption 1.2 is equivalent to <ref type="bibr">(Li et al., 2022c, Assumption 2.1)</ref> for the purpose of that paper. Note that &#915; is always a submanifold of &#8486;.</p><p>Remark 1.4. All three assumptions in Assumption 1.2 are very natural for the following reasons:</p><p>&#8226; As remarked earlier, the scale-invariance (3) is a consequence of the use of normalization steps inside neural networks.</p><p>&#8226; It is a widely used assumption, at least in the case of local minimizers, that the locus is a manifold for overparametrized neural networks, for example in <ref type="bibr">(Fehrman et al., 2020;</ref> <ref type="bibr">Arora et al., 2022;</ref> <ref type="bibr">Li et al., 2022b)</ref>. For the locus of global minimizers, this assumption was proved by <ref type="bibr">Cooper (2021)</ref>. As remarked in <ref type="bibr">(Li et al., 2022b;</ref> <ref type="bibr">Cooper, 2021)</ref>, a main reason for the local minimizers to form manifolds is the overparametrization of modern neural networks. <ref type="bibr">&#350;im&#351;ek et al. (2021)</ref> further identify the reason as symmetries arising from overparametrization. 
In fact, they studied loci of critical points that are not necessarily minima and proved that symmetry-induced critical points form a manifold that satisfies Assumption 1.2.(ii).</p><p>&#8226; The philosophy behind Assumption 1.2.(iii) is that the generation of random batches in training is independent of the aforementioned symmetries in the setup of the neural network, and thus the resulting noise generically should not live in subspaces that are invariant under such symmetries. In particular, the same symmetries are generically capable of moving the noise from the random batches to span a tangent space of the same dimension as that of the local manifold of critical points.</p><p>Remark 1.5. On the other hand, both conditions in Assumption 1.3 are non-generic:</p><p>&#8226; Some evidence from <ref type="bibr">(Draxler et al., 2018)</ref> suggests that all local minima appearing in realistic training generically come from a connected, relatively flat region of small variation in height, so empirically Assumption 1.3.(i) could be a reasonable approximate assumption. However, the experiments in <ref type="bibr">(Draxler et al., 2018, Fig. 5)</ref> show at the same time that in many settings, this region is not completely flat and contains non-trivial saddle points. In particular, there could be multiple disconnected basins. In light of this, it is more reasonable to work without Assumption 1.3.(i).</p><p>&#8226; The analyticity of &#915; and Tr &#931; depends on that of the activation functions chosen in the neural network. While many popular activation functions are real analytic, one may always choose to use functions that are differentiable but not analytic, in which case Assumption 1.3.(ii) is in general not guaranteed.</p><p>In this paper, we will give a general proof of Conjecture 1.1 in the same natural range as in <ref type="bibr">(Li et al., 2022c)</ref>, assuming only the generic conditions from Assumption 1.2. 
In particular, we will introduce two arguments that respectively remove the two hypotheses (Unique basin and Analyticity) in Assumption 1.3.</p><p>Here we would like to comment on why Assumption 1.3 is restrictive. Assuming analyticity is restrictive because the regularity of the loss function is determined by that of the activation function. Even though popular activation functions such as Sigmoid are analytic, a priori one could use functions that are smooth but not analytic. The unique basin assumption is restrictive as we see neither empirical evidence nor a proof that L has only one basin. In fact, the experiments at the end of the paper suggest that there are multiple basins.</p><p>We would also like to remark on why the three assumptions in Assumption 1.2 are essential. (i) is essential because without this assumption, the SDE would not be equivalent to an SDE on the sphere S d-1 , which is crucial to our analysis. Without this assumption, a similar analysis can probably be formulated on R d instead of S d-1 , but there will be new technical obstacles to overcome. Since the original fast equilibrium conjecture was posed for normalized neural nets, we restrict our study to the current setting. (ii) is important because otherwise a trajectory may stay near a critical point (for example a saddle point) for a very long period of time, and would then not be able to converge within polynomial time. Finally, the reason why we need (iii) is that if the span is not the whole tangent space but only a proper subspace of it, then the diffusion will be restricted to this subspace, and its support may a priori be very fractal, so existing mathematical theory is not sufficient to guarantee a unique equilibrium in the limit.</p></div>
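<div xmlns="http://www.tei-c.org/ns/1.0"><p>The claim in this section that &#915; is a cone, and the fact that the rank condition in Assumption 1.2.(ii) is compatible with scaling, both follow from differentiating the scale invariance in (3). The short derivation below is a worked sketch under the sole assumption that L is twice differentiable on R d \{0}; it is not quoted from the paper.</p><p><![CDATA[
```latex
% Assumption: L is twice differentiable on R^d \ {0} and L(cx) = L(x) for all c > 0.
\begin{align*}
\text{differentiate } L(cx) = L(x) \text{ in } x:\quad
  & c\,\nabla L(cx) = \nabla L(x)
  \;\Longrightarrow\; \nabla L(cx) = c^{-1}\,\nabla L(x), \\
\text{differentiate once more in } x:\quad
  & \nabla^2 L(cx) = c^{-2}\,\nabla^2 L(x), \\
\text{differentiate in } c \text{ at } c = 1:\quad
  & \langle \nabla L(x),\, x \rangle = 0 .
\end{align*}
% Consequences:
%   \nabla L(x) = 0 \iff \nabla L(cx) = 0, and
%   \operatorname{rank} \nabla^2 L(cx) = \operatorname{rank} \nabla^2 L(x).
% If L(y) \ge L(x) for all y in a neighbourhood N of x, then L(y') \ge L(cx)
% for all y' in cN, which is a neighbourhood of cx.
% Hence x \in \Gamma \iff cx \in \Gamma, i.e. \Gamma is a cone.
```
]]></p></div>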
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">MULTIPLE EQUILIBRIA WHEN BASIN IS NOT UNIQUE</head><p>It is worth explaining in more detail what happens when the basin is not unique, i.e. &#915; has multiple connected components and Assumption 1.3.(i) fails. In this situation, our analysis generalizes the work of Wang &amp; Wang (2022) and reveals a three-stage equilibrium phenomenon. The most important property of this phenomenon is the mismatch between practical training and theoretical bounds: the equilibrium distribution of network parameters observed in the time window under a realistic budget is both local in space and temporary in time. It is concentrated near the bottom of the attracting basin containing the initial parameter, and differs from the global Gibbs equilibrium to which the distribution of parameters will eventually converge in exponentially long time. This phenomenon explains the gap between the empirically based Conjecture 1.1 and the previous theoretical estimates from e.g. <ref type="bibr">(Bovier et al., 2004;</ref> <ref type="bibr">Shi et al., 2020)</ref>. See <ref type="bibr">(Frankle et al., 2020)</ref>, <ref type="bibr">(Gupta et al., 2019)</ref> for more work on how the iterates stay in the same basin for a significant amount of time when starting from the same initialization.</p><p>The major shortcoming of (Wang &amp; Wang, 2022) is that, while not relying on the uniqueness of the basin, the arguments therein are subject to other non-generic assumptions, namely: (1) all minima are isolated points; (2) the noise &#963; is a standard isotropic Gaussian noise.</p><p>Our methods allow us to remove these assumptions simultaneously, together with Assumption 1.3. 
This is made possible by avoiding the semi-classical analysis of spectra of differential operators, which was developed by Simon <ref type="bibr">(Simon, 1983)</ref> and used in an essential way by previous authors in <ref type="bibr">(Bovier et al., 2004;</ref> <ref type="bibr">Shi et al., 2020;</ref> <ref type="bibr">Wang &amp; Wang, 2022)</ref>. Instead, our method is purely probabilistic and predicts that the exit time from a given basin is exponentially long. This method is based on an adaptation of the large deviation principle of <ref type="bibr">Dembo &amp; Zeitouni (2010)</ref>.</p></div>
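<div xmlns="http://www.tei-c.org/ns/1.0"><p>To give a feel for the statement that the exit time from a basin is exponentially long, the following sketch simulates a hypothetical one-dimensional toy model that is not the SDE (2) and plays no role in the proof: the diffusion dX = -V'(X)dt + &#8730;&#949; dB with the double-well potential V(x) = (x^2 - 1)^2 / 4, discretized by the Euler-Maruyama scheme. Starting at the minimum x = 1, it records the first time the path crosses the barrier at x = 0. The potential, the noise levels, the step size and the number of trials are all illustrative assumptions; the only point is that the mean exit time grows rapidly as the noise level &#949; decreases.</p><p><![CDATA[
```python
import math
import random

def exit_time(eps, rng, dt=0.01, max_steps=2_000_000):
    """Euler-Maruyama for dX = -V'(X) dt + sqrt(eps) dB, V(x) = (x^2 - 1)^2 / 4.

    Starts at the minimum x = 1 and returns the first time X drops below the
    barrier at x = 0 (or the time horizon if no crossing is observed).
    """
    x = 1.0
    noise_scale = math.sqrt(eps * dt)
    for step in range(1, max_steps + 1):
        drift = -(x ** 3 - x)  # -V'(x)
        x += drift * dt + noise_scale * rng.gauss(0.0, 1.0)
        if x < 0.0:
            return step * dt
    return max_steps * dt

def mean_exit_time(eps, trials, seed):
    """Average exit time over independent seeded trials."""
    rng = random.Random(seed)
    return sum(exit_time(eps, rng) for _ in range(trials)) / trials

# Barrier height V(0) - V(1) = 1/4 for this toy potential.
mean_large_noise = mean_exit_time(eps=0.5, trials=10, seed=0)
mean_small_noise = mean_exit_time(eps=0.1, trials=10, seed=0)
print("mean exit time, eps = 0.5:", mean_large_noise)
print("mean exit time, eps = 0.1:", mean_small_noise)
```
]]></p></div>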
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">STATEMENT OF MAIN RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">PRELIMINARIES ON SDE MODEL</head><p>A polar coordinate system has been adopted in <ref type="bibr">(Li et al., 2020)</ref> to study the SDE model (2). For this purpose, denote by X t = Xt |Xt| the unit renormalization of X t , and</p><p>. By <ref type="bibr">(Li et al., 2020, Theorem 5</ref>.1), ( <ref type="formula">2</ref>) is equivalent to</p><p>Recall that &#931; = &#963;&#963; &#8868; is a d &#215; d positive semidefinite symmetric matrix.</p><p>One may view the motion of X t as an intrinsic one inside the unit sphere S d-1 , instead of one inside R d . From this perspective, (Wang &amp; Wang, 2022, Theorem 3.1) shows that (4) can be rewritten as an intrinsic SDE on S d-1</p><p>where &#963;(&#8226;)</p><p>1 2 is a tensor field along S d-1 whose value is given by the restriction of &#963;(&#8226;)</p><p>1 2 and &#8711; is the gradient operator on S d-1 . (See the remark after (Wang &amp; Wang, 2022, Theorem 3.1) for the meaning of being intrinsic. In particular, &#8711;L(X t )dt, &#963;(X t )dB K t are viewed as vector fields along S d-1 .) Remark 2.1. Instead of the term &#963;(X t )dB K t in (4) and ( <ref type="formula">6</ref>), the papers <ref type="bibr">(Li et al., 2020;</ref><ref type="bibr">Wang &amp; Wang, 2022</ref>) actually used the restriction of &#931;(X t ) 1 2 dB d t to S d-1 . However, these two expressions are equivalent as Wiener processes because &#931; = &#963;&#963; &#8868; .</p><p>Since ( <ref type="formula">6</ref>) is a perturbation with Brownian noise of the gradient flow dX t = -&#947; -1 2 &#8711;L(X t ) with varying learning rate &#947; -1 2 t , it makes sense to first understand the constant speed gradient flow X t = -&#8711;L(X t ), (7) which is an ODE on the compact manifold S d-1 . Following earlier notation, the local minima of L on S d-1 is &#915; 1 and has connected components</p><p>for the attracting basin of &#915; i 1 , i.e. 
the set of X 0 &#8712; S d-1 such that the solution X t to (7) with initial value X 0 satisfies lim t&#8594;&#8734; X t &#8712; &#915; i 1 ). Lemma 2.2. Under Assumption 1.2.(ii), the</p><p>The proof of Lemma 2.2 is standard and we leave it to the reader. The key observation is that the complement</p><p>1 is the union of attracting basins of the connected components of critical points that are not local minima. By Assumption 1.2.(i), those critical points are saddle-like and their attracting basins are proper submanifolds.</p><p>Note that L is constant on &#915; i 1 and &#915; i , and</p><p>Then U i is the attracting basin of &#915; i under the gradient flow dX t /dt = -&#8711;L(X t ). (8)</p><p>By <ref type="bibr">(Arora et al., 2022, Lemma B.15)</ref>, under Assumption 1.2.(i), U i is open and the function &#934;(x) = lim t&#8594;&#8734; X t is C 2 -differentiable on U i .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">MAIN RESULT</head><p>Definition 2.3. We define the Lipschitz distance between two probability measures &#181;, &#957; on a metric space X as</p><p>where the Lipschitz norm of a function &#966; is given by &#8741;&#966;&#8741;</p><p>.</p><p>We are now able to state our main theorem, which strengthens both <ref type="bibr">(Li et al., 2022c, Theorem 5.5)</ref> and (Wang &amp; Wang, 2022, Theorem 4.6). Theorem 2.4. Under Assumption 1.2, for all &#1013; &gt; 0 and every compact interval [&#961; -, &#961; + ] &#8834; (0, &#8734;), there exist a constant c &gt; 0 and a set &#923; &#8838;</p><p>|x0| &#8712; &#923;, all growth rates K such that K &#8594; &#8734; as &#951;&#955; &#8594; 0, and all time values</p><p>the random trajectory of (2) with initial value x 0 satisfies dist(P X0=x0 (X t ), &#957; i ) &lt; &#1013;, where &#957; i is a probability measure supported on the attractor &#915; i of the unique attracting basin U i containing x 0 , and &#957; i depends only on L, &#963; and i.</p><p>Here vol S d-1 is the renormalized volume on the sphere S d-1 so that the total mass of S d-1 is 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">REMOVAL OF ANALYTICITY ASSUMPTION</head><p>In this part, we will prove that Assumption 1.3.(ii) on analyticity (which is <ref type="bibr">(Li et al., 2022c, Assumption 5.3)</ref>) is unnecessary for <ref type="bibr">(Li et al., 2022c, Theorem 5.4)</ref>, and thus <ref type="bibr">(Li et al., 2022c, Theorem 1.2)</ref> holds without such an assumption as well. Proposition 3.1. Under Assumption 1.2 and Assumption 1.3.(i), the conclusion of <ref type="bibr">(Li et al., 2022c, Theorem 1.2)</ref> holds.</p><p>For this purpose, we temporarily adopt the setting of <ref type="bibr">(Li et al., 2022c)</ref>. In other words, Assumption 1.3.(i) is assumed, &#915; 1 is a compact connected submanifold of S d-1 and &#915; = {rx : r &gt; 0; x &#8712; &#915; 1 }. By <ref type="bibr">(Li et al., 2022c, Theorem 4.1)</ref> it suffices to prove the (qualitative) mixing of the SDE <ref type="bibr">(Li et al., 2022c, Equation (13))</ref> on &#915; towards a unique invariant probability measure.</p><p>Using the notation from <ref type="bibr">(Li et al., 2022c, Chapter E.3)</ref>, this SDE reads:</p><p>where f k are certain vector fields along the radial cone</p><p>and they span T x &#915; &#8745; x &#8869; = T x &#915; |x| for all x &#8712; &#915;, where &#915; r = {x &#8712; &#915; : |x| = r}, and f 0 has the form</p><p>In the proofs from <ref type="bibr">(Li et al., 2022c)</ref>, analyticity is only used in Chapter F.4, when Tr &#931; is not a constant on &#915; 1 . 
In this case, it was proved there (without using analyticity) that &#915; * := {y &#8712; &#915; :</p><p>is the unique invariant control set for the control problem corresponding to (9).</p><p>Instead of using Kliemann's condition <ref type="bibr">(Kliemann, 1987)</ref>, which requires the vector field Lie algebra l generated by f 0 , &#8226; &#8226; &#8226; , f N to be of maximal dimension at all points in &#915; * and is the reason for the need of analyticity in <ref type="bibr">(Li et al., 2022c)</ref>, we will use Arnold-Kliemann's condition <ref type="bibr">(Arnold &amp; Kliemann, 1987)</ref>, which only requires the vector field Lie algebra l to be of maximal dimension at one point in &#915; * . This condition holds because the projection of f 0 to the radial direction is not constantly 0 if Tr &#931; is not a constant. (Otherwise &#915; * would not be the unique invariant set.) Under this condition, <ref type="bibr">(Arnold &amp; Kliemann, 1987, Theorem 5.1)</ref> proved that there is a unique invariant probability measure &#957; supported on &#915; * . The measure &#957; then has to be ergodic. Moreover, <ref type="bibr">(Arnold &amp; Kliemann, 1987, Theorem 5.2)</ref> showed that &#957; is absolutely continuous with respect to the Riemannian volume on the manifold &#915;. In particular, &#957;(&#8706;&#915; * ) = 0 and &#957;(&#915; * ) = 1.</p><p>The main issue is to prove the convergence of the distribution S t,x towards &#957; as t &#8594; &#8734;, where S t,x denotes the measure of all trajectories of solutions to (9) at time t starting from x &#8712; &#915; * . A priori, such convergence is only known to hold for the time average (1/T) &#8747;_0^T S t,x dt, by the ergodic theorem. 
Applying <ref type="bibr">(Duflo &amp; Revuz, 1969, Theorem II.4</ref>) to (&#915; * , &#957;), it suffices to check two conditions to guarantee S t,x &#8594; &#957; (in total variation distance):</p><p>(Harris's recurrence condition) For all sets A with &#957;(A) &gt; 0 and all x &#8712; &#915; * ,</p><p>Let us first verify (Harris's recurrence condition). Define D &#8834; &#915; * by</p><p>Then D is open in &#915; * . For all x &#8712; &#915; * and A &#8838; &#915; * , we define the random variable &#964; x,A &#8805; 0 to be the first entering time into A for a trajectory of (9) starting at y. Lemma 3.2.</p><p>Proof of (Harris's recurrence condition). By <ref type="bibr">(Arnold &amp; Kliemann, 1987, Theorem 6</ref>.1), ( <ref type="formula">11</ref>) holds for all x &#8712; D &#8745; int&#915; * = D. By Lemma 3.2, it then holds for all x &#8712; &#915; * . This verifies Harris's recurrence condition.</p><p>We now verify the (Regularity condition): In addition to the Lie algebra l and set D, define a Lie algebra l 0 &#8838; T &#915; and an open set D 0 by</p><p>It is easy to see that l 0 is indeed a Lie algebra and D 0 is a relatively open subset in &#915; * Lemma 3.3. The set D 0 has non-empty interior.</p><p>We postpone the proofs of Lemma 3.2 and 3.3 to Appendix A. Lemma 3.4. For all x &#8712; D 0 and t &gt; 0, S 0 t,x (&#915; * ) &gt; 0.</p><p>Proof. By <ref type="bibr">(Ichihara &amp; Kunita, 1974</ref>, Lemma 2.1), at every x &#8712; intD 0 , the second order differential operator on the right hand side of ( <ref type="formula">9</ref>) is elliptic (non-degenerate) at x. The lemma follows.</p><p>Proof of (Regularity condition). By (Harris's recurrence condition) and Lemma 3.3, for all x &#8712; &#915; * , P(&#964; x,D0 &lt; &#8734;) &gt; 0, and thus there exists t 0 (x) &gt; 0 such that P(&#964; x,D0 &lt; t 0 (x)) &gt; 0. 
In other words, on a subset &#8486; x,D0 &#8834; &#8486; of stochastic incidences &#969; with P(&#8486; x,D0 ) &gt; 0, there exists</p><p>) (&#915; * ) &gt; 0. This shows that the statement "For &#957;-a.e. x, S 0 t,x (&#915; * ) = 0 for all t &gt; 0" is false. By <ref type="bibr">(Duflo &amp; Revuz, 1969, Proposition, p235)</ref>, this guarantees (Regularity condition).</p><p>We have by now completed the proof of the mixing property S t,x &#8594; &#957; under the generic Assumption 1.2, and the Assumption 1.3.(i).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">REMOVAL OF UNIQUE BASIN ASSUMPTION</head><p>We now stop assuming Assumption 1.3.(i) and decompose &#915; = &#915; i where each</p><p>] is an invariant control set of (2). Moreover, it was proved in <ref type="bibr">(Li et al., 2020)</ref> that for a given initial radius |x 0 |, the radius |X t | of (2) starting at x 0 will be almost surely inside</p><p>Then &#915; i 1 are the connected components of the manifold &#915; 1 . In particular, there are only finitely many &#915; i 's. Write</p><p>Fix from now on a sufficiently small parameter p 0 such that U i 1,p0 are disjoint for distinct i's, and &#915; i 1 is the set of all critical points of L inside U i 1,10p0 . This is possible because of Assumption 1.2.(ii). Proposition 4.1. For all sufficiently small p 1 (to be determined later), given K &gt; 1, &#1013; &gt; 0, there exists a subset</p><p>|x0| &#8712; &#923; K,&#1013; , with probability &gt; 1 -&#1013;, the trajectory of (2) starting at x 0 will remain inside</p><p>Published as a conference paper at ICLR 2024</p><p>The proof of the proposition, which we omit, is a simple combination of <ref type="bibr">(Li et al., 2020, Equation (7)</ref>) and the proof of (Wang &amp; Wang, 2022, Theorem 4.5).</p><p>The proposition allows us to assume that our initial point is inside one of finitely many basins U i #,p1 . To prove the main result, it now suffices to make two observations: First, with very high probability, the trajectory will not escape from the basin in exponential time. Second, as long as the trajectory remains in the basin, its distribution always mixes towards a unique probability measure &#957; i supported at the bottom &#915; i of the basin.</p><p>The first property is stated as Proposition 4.2 below. Proposition 4.2. 
There exist C &gt; 0, and sufficiently small p 0 &gt; p 1 &gt; 0, such that for all i and</p><p>The convergence is uniform with respect to the initial data x 0 .</p><p>Proposition 4.2 is similar in nature to (Wang &amp; Wang, 2022, Lemma E.6) but requires a more sophisticated proof. This is the main theoretical component of this paper. The argument will be based on the large deviation principle of Dembo-Zeitouni in <ref type="bibr">Dembo &amp; Zeitouni (2010, Chapter 5)</ref>, which was an adaptation of <ref type="bibr">(Freidlin &amp; Wentzell, 2012, Chapter 6)</ref>. The reason why Freidlin-Wentzell's original theory cannot be applied here as in (Wang &amp; Wang, 2022) is that the diffusion in the SDE system (<ref type="formula">44</ref>), (<ref type="formula">45</ref>) is degenerate. Dembo-Zeitouni's work allows degenerate diffusions. However, further modifications to <ref type="bibr">(Dembo &amp; Zeitouni, 2010)</ref> are needed in our case, as the first order drift in (45) depends on &#947; t = |X t |^4 &#951;^{-2} . We will treat &#947; t as a control variable. The full proof is given in Appendix B.</p><p>The second property follows from the main results (Theorem 5.1 &amp; Theorem 6.7) in <ref type="bibr">(Li et al., 2022c)</ref> and is restated as Proposition 4.3 here with an additional emphasis on uniformity. A more detailed discussion can be found in Appendix C. Recall that <ref type="bibr">(Li et al., 2022c)</ref> also assumes Assumption 1.3.(ii), but that can be dropped by the discussion in Section 3 above. Proposition 4.3. 
Under Assumption 1.2, for all K &gt; 0 and sufficiently small p 0 &gt; p 1 &gt; 0, in the regime &#951; &#8804; O(&#955;) &#8804; O(1) and &#951;&#955; &#8594; 0, the following holds: for each index i and for all initial parameters x 0 &#8712; U i #,p1 , the distribution of all trajectories X t 0&#8804;t&#8804;</p><p>that do not leave U i #,p1 converges in distribution to the trajectories { Xt } of a fixed SDE model (the Katzenberger model) supported on &#915; i with initial position &#934;(x 0 ). The convergence is uniform in x 0 . Moreover, as K &#8594; &#8734; the trajectories { Xt } are uniformly mixing (with respect to different x 0 's) towards a fixed equilibrium measure &#957; i .</p><p>Our main theorem, Theorem 2.4, then follows from combining Propositions 4.1, 4.2 and 4.3.</p><p>Proof of Theorem 2.4. By Proposition 4.1, after ignoring O(1/(&#951;&#955;)) time at the beginning, as well as an o(1) portion of stochastic incidences, one may assume x 0 &#8712; U i #,p for some i. For t within the range in the statement of Theorem 2.4, again after ignoring an o(1) portion of incidences, one may assume that all trajectories under consideration stay within U i #,p up to time t. Because of the lower bound for t, one may consider a window of length K log(1 + &#955;/&#951;)/(&#951;&#955;) that ends at t, where K &#8594; &#8734;. The distribution of trajectories along this window is the average of distributions over different initial positions. By Proposition 4.3, all such components uniformly mix toward &#957; i . The proof is completed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTS</head><p>Mixing on local manifold: The key technical observation of this paper (Proposition 4.2) is that the distribution of trajectories starting from a given initial position is trapped locally in the attracting basin containing that initial position during any practical observation window. Using the method from (Wang &amp; Wang, 2022, Fig. <ref type="figure">13</ref>), this observation is supported by the experiment below: using a reduced MNIST dataset with only 1280 samples and a small CNN with 1786 parameters (so that the model is still overparametrized), we ran 15 independent instances of SGD, at &#955; = &#951; = 1/32, for each of two randomly chosen initial parametrizations. Each instance lasts 0.8 million steps of SGD. A similar experiment was run for a reduced CIFAR10 dataset with 1280 samples, a CNN model with 2658 parameters, &#951; = 1/1024, &#955; = 1/32 and 1.28 million SGD steps. In order to show that the distribution arising from each initial position does stabilize toward an equilibrium and that the two equilibria are different, we track the variance within each group and compare it with the average squared distance over all pairs of points from different groups. Namely, denoting by {X k i,t } the i-th trajectory starting at initial point x k where k = 1, 2, we compute the quantities V 11 , V 22 and V 12 . Figure <ref type="figure">2</ref> shows that V 11 and V 22 stabilize near similar but different values, while V 12 stabilizes at a much bigger value. This suggests that the distributions of trajectories with starting points x 1 and x 2 mix towards equilibria whose supports have similar scales, but that these two equilibria are far apart from each other.
Figure 4: small CIFAR: Comparison between training losses before and after SWAP</p><p>Prediction on the failure of stochastic weight averaging in parallel (SWAP): Our theory predicts that if the local manifold of minimizers &#915; i has non-trivial geometry, that is, the average of parameters on the manifold may fall off the manifold, then weight averaging might fail to decrease the loss, or even increase it, once SGD mixes to the local manifold.</p><p>We apply stochastic weight averaging in parallel (SWAP) to the neural network parameters at each step over the 15 independent instances with the same initial position and compute the loss function at the averaged parameter. SWAP, a variant of stochastic weight averaging (SWA) from <ref type="bibr">Izmailov et al. (2018)</ref>, was proposed by <ref type="bibr">Gupta et al. (2020)</ref>. Figures <ref type="figure">3</ref>, <ref type="figure">4</ref> show that although the loss at the SWAP parameter improves on the average loss over the independent instances at the beginning, the improvement quickly breaks down after a few thousand training steps. This phenomenon verifies our theoretical prediction and also suggests that the support of the equilibrium is not a convex set but rather a manifold of curved shape.</p></div>
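<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two diagnostics used in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the reported figures: the function names (mean_point, within_group_variance, cross_group_sq_dist, toy_loss, arc_samples) are hypothetical, V 11 and V 22 are assumed to be the mean squared distance to the group mean, V 12 is assumed to be the mean squared distance over all cross-group pairs, and the SGD trajectories are replaced by synthetic points on the unit circle, which serves as a curved manifold of minimizers for a toy loss.</p><p><![CDATA[
```python
import math
import random

def mean_point(points):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[j] for p in points) / n for j in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_group_variance(group):
    """Assumed form of V_kk: mean squared distance to the group mean."""
    center = mean_point(group)
    return sum(sq_dist(p, center) for p in group) / len(group)

def cross_group_sq_dist(group_a, group_b):
    """Assumed form of V_12: mean squared distance over all cross-group pairs."""
    total = sum(sq_dist(a, b) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))

def toy_loss(x):
    """Toy loss whose minimizers form the unit circle (a curved manifold)."""
    return (sum(c * c for c in x) - 1.0) ** 2

def arc_samples(center_angle, spread, count, rng):
    """Synthetic stand-in for the end points of `count` independent SGD runs."""
    angles = [center_angle + rng.uniform(-spread, spread) for _ in range(count)]
    return [[math.cos(a), math.sin(a)] for a in angles]

rng = random.Random(0)
group_1 = arc_samples(0.0, 0.3, 15, rng)      # stand-in for 15 runs from x^1
group_2 = arc_samples(math.pi, 0.3, 15, rng)  # stand-in for 15 runs from x^2

v11 = within_group_variance(group_1)
v22 = within_group_variance(group_2)
v12 = cross_group_sq_dist(group_1, group_2)

# SWAP-style comparison within one group: loss at the averaged parameter
# versus the average of the individual losses.
swap_loss = toy_loss(mean_point(group_1))
avg_loss = sum(toy_loss(p) for p in group_1) / len(group_1)

print(v11, v22, v12, swap_loss, avg_loss)
```
]]></p><p>In this toy setting v12 is much larger than v11 and v22 because the two arcs sit on opposite sides of the circle, and swap_loss exceeds avg_loss because the average of points on an arc lies strictly inside the circle, that is, off the manifold of minimizers.</p></div>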
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>We give a rigorous proof of the fast equilibrium conjecture in generic situations, removing the previous assumptions that there is only one basin and that the set of minima is analytic. The main technical contribution is to show that most trajectories of the SDE do not escape from their basin within exponential time. Instead of using spectral analysis, we adopt a large-deviation-principle type of argument. Interesting future directions include understanding the dependence of the mixing time on dimension, architecture and noise structure.</p><p>7 ACKNOWLEDGEMENT</p><p>Y.W. and Z.W. each acknowledge support from NSF.</p><p>Barry Simon. Semiclassical analysis of low lying eigenvalues. I. Nondegenerate minima: asymptotic expansions. Ann. Inst. H. Poincar&#233; Sect. A (N.S.), 38(3):295-308, 1983. ISSN 0246-0211.</p><p>Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of neural network with normalization, weight decay, and SGD. In Advances in Neural Information Processing Systems, volume 34, 2021.</p><p>Yi Wang and Zhiren Wang. Three-stage evolution and fast equilibrium for SGD with non-degenerate critical points. In Proceedings of the 39th International Conference on Machine Learning, pp. 23092-23113, 2022.</p><p>Yuxin Wu and Kaiming He. Group normalization. In European Conference on Computer Vision, 2018.</p><p>A PROOFS FOR SECTION 3</p><p>Proof of Lemma 3.2. Instead of ( <ref type="formula">9</ref>), define</p><p>where</p><p>thus the solutions Y and Y &#8741; to ( <ref type="formula">9</ref>) and ( <ref type="formula">12</ref>) starting at y coincide up to &#964; x,D . Moreover, the trajectories of ( <ref type="formula">12</ref></p><p>Proof of Lemma 3.4. The proof is similar to that of <ref type="bibr">(Li et al., 2022c, Lemma F.18)</ref>. Recall that Tr &#931; is (-2)-homogeneous and assumed to be non-constant on &#915; 1 . 
Therefore, there is an open set</p><p>By homogeneity, this is also true on rV 1 &#8838; &#915; r for all r &gt; 0. As Tr &#931;(x) = -1 2 &#10216;x, f 0 (x) +</p><p>x 2 &#10217; and &#10216;x, f * &#10217; = 0, this implies &#10216;&#8711;&#10216;x, f 0 (x)&#10217;, f * &#10217; &#824; = 0 on rV 1 . By <ref type="bibr">(Li et al., 2022c, Lemma F.16)</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B EXITING TIME WITH DEGENERATE DIFFUSION AND EXTRA CONTROL VARIABLE</head><p>In this appendix, we establish a probabilistic lower bound on the exiting time of a stochastic process from a basin (Theorem B.21), based on a one-sided large deviation principle (Theorem B.12). Our proofs adapt those from the work of Dembo-Zeitouni <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Chapter 5</ref>) to a more general setting. The main differences in the setting are:</p><p>1. The stochastic process in the basin is now governed not only by the current location and a Brownian motion, but also by an extra control variable as stated in equation ( <ref type="formula">17</ref>);</p><p>2. The local minima set in the basin is no longer assumed to be a unique isolated fixed point.</p><p>The diffusion in the stochastic process is allowed to be degenerate, which was the main novelty in the Dembo-Zeitouni theory compared to the earlier work of Freidlin-Wentzell <ref type="bibr">(Freidlin &amp; Wentzell, 2012, Chapter 6)</ref>.</p><p>This appendix consists solely of mathematical analysis. With the exception of &#167;B.4, all notations are chosen independently from those used in other parts of the current paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 PROPERTIES OF UPPER LARGE DEVIATION PRINCIPLE</head><p>In this section we define the notion of upper large deviation principle (upper LDP). This is the upper bound part of the large deviation principle defined in <ref type="bibr">(Dembo &amp; Zeitouni, 2010</ref>, Chapter 1.2). The principles proved in <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Chapters 4.1 &amp; 4.2)</ref>, which allow one to pass the LDP property between random processes, are still valid for upper LDP because the upper and lower bounds are treated separately in their proofs. The purpose of this section is to briefly list which facts are relevant and to justify that their proofs in <ref type="bibr">(Dembo &amp; Zeitouni, 2010)</ref> carry over to upper LDP.</p><p>Definition B.1. A rate function I is a lower semicontinuous mapping I : X &#8594; [0, &#8734;] on a metric space X , i.e. I -1 ([0, a]) is closed for all finite a. A good rate function is a rate function I such that I -1 ([0, a]) is compact for all finite a. The effective domain D I of I is I -1 ([0, &#8734;)).</p><p>Definition B.2. A family of Borel probability measures {&#181; &#1013; } on (X , B) satisfies the upper large deviation principle (upper LDP) with rate function I if for all measurable subsets A of X , lim sup &#1013;&#8594;0</p><p>A family of Borel probability measures {&#181; &#1013; } on (X , B) satisfies the weak upper large deviation principle (weak upper LDP) with rate function I if ( <ref type="formula">13</ref>) holds for all compact subsets A.</p><p>For background, recall that {&#181; &#1013; } is said to satisfy the large deviation principle with rate function I if in addition to (13) it also satisfies the lower bound</p><p>Remark about the proofs. Theorem B.3 and Theorem B.4 are the upper bound directions of <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Theorem 4.1.11 &amp; 4.2.1)</ref>. Their proofs are identical to those therein. 
For Theorem B.3, note that this direction only uses the equality (4.1.14), but not (4.1.12) and (4.1.13), in <ref type="bibr">(Dembo &amp; Zeitouni, 2010)</ref>.</p><p>Definition B.5. For families {&#957; &#1013;,m } and {&#957; &#1013; } of probability measures on a metric space Y, where m &#8712; N and &#1013; &gt; 0, we say {&#957; &#1013;,m } are exponentially good approximations of {&#957; &#1013; } if there exist probability spaces (&#8486;, B &#1013; , P &#1013;,m ) and two families of random variables y &#1013;,m , y &#1013; with joint distribution P &#1013; and marginal distributions &#957; &#1013;,m , &#957; &#1013; such that for all &#948; &gt; 0, the event dist(y &#1013;,m , y &#1013; ) &gt; &#948; is B &#1013; -measurable and</p><p>If in addition &#957; &#1013;,m = &#957; &#1013; is independent of m, we say { &#957; &#1013; } and {&#957; &#1013; } are exponentially equivalent. Remark about the proof. The proof is the same as that of <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Theorem 4.2.16</ref>). For part (a), by Theorem 3.3 applied to the topological base consisting of all metric balls B &#948; (y) in Y, {&#957; &#1013;,m } satisfies weak upper LDP with rate function</p><p>So it suffices to prove I(y) &#8804; I * (y). This was done in the proof of <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Theorem 4.2.16, part (a)</ref>) via the inequality</p><p>I m (z).</p><p>Hence</p><p>The proof of part (b) is verbatim as in <ref type="bibr">(Dembo &amp; Zeitouni, 2010)</ref>.</p><p>Theorem B.7. Suppose a family of probability measures {&#181; &#1013; } satisfies upper LDP with a good rate function I on a Hausdorff topological space X . 
Suppose moreover that a sequence of continuous maps {F m } from X to another Hausdorff topological space Y approximates a measurable map F in the sense that for all a &lt; &#8734;, lim sup</p><p>Finally, assume the families {&#181; &#1013; &#8728; (F m ) -1 } are exponentially good approximations of another family of probability distributions {&#957; &#1013; } on Y. Then {&#957; &#1013; } satisfies upper LDP with the good rate function I &#8242; (y) := inf F -1 ({y}) I.</p><p>Remark about the proof. The theorem is the upper bound part of <ref type="bibr">(Dembo &amp; Zeitouni, 2010</ref>, Theorem 4.2.23). The proof stays the same with (Dembo &amp; Zeitouni, 2010, Theorems 4.2.1 &amp; 4.2.16) replaced by their respective upper bound directions, namely Theorem B.4 and Theorem B.6.</p><p>B.2 UPPER LDP FOR DEGENERATE DIFFUSION WITH EXTRA CONTROL VARIABLE</p><p>In this part, we prove a variation of the upper bound direction in the large deviation principle proved in <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Theorem 5.6.7)</ref>. The notations in this section are self-contained and independent of those from other parts of this paper.</p><p>Throughout this section we will consider the following setting:</p><p>&#8226; b and &#963; are bounded, and h is bounded on [0, &#1013; 0 ] &#215; U &#215; D for some &#1013; 0 &gt; 0. We will fix &#1013; 0 and a common upper bound H for b, &#963; and h respectively on these domains.</p><p>Consider the following families of stochastic differential equations on (Y, Z) &#8712; U &#215; R l and &#1013; &gt; 0:</p><p>The main difference of our setting from that in <ref type="bibr">(Dembo &amp; Zeitouni, 2010)</ref> is the existence of the additional control variable Z &#1013; t , whose evolution depends on &#1013; in a less prescribed way. We will assume the following throughout this section: Assumption B.8. 
For all &#1013; &gt; 0 and initial values (y 0 , z 0 ) &#8712; U &#215; D, the solution (Y &#1013; t , Z &#1013; t ) to ( <ref type="formula">16</ref>) and ( <ref type="formula">17</ref>) starting at (y 0 , z 0 ) almost surely remains in U &#215; D for all t &gt; 0. Definition B.9. Given the functions b, &#963;, the upper bound H on |h|, and T &gt; 0. The associated path space S T is defined as the family of triples (f, g, u)</p><p>Here C Lip H ([0, T ], D) is the subspace in C 0 ([0, T ], D) of functions g with Lipschitz constant bounded by H, i.e. that satisfy</p><p>And W 1,2 is the square integrable Sobolev space of first order differentiability.</p><p>Proof. As C 0 ([0.T ], D) is a metric space, it suffices to show any sequence g (k) in C Lip H ([0, T ], D) has a convergent subsequence with limit in C Lip H ([0, T ], D) Because D is a bounded domain, the g (k) 's are uniformly bounded. As they are also uniformly Lipschitz with Lipschitz constant bounded by H, by Arzel&#224;-Ascoli Theorem we may assume g (k)  converges in C 0 to some g &#8712; C 0 ([0, T ], D). Then g would be H-Lipschitz continuous as well.</p></div>
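<div xmlns="http://www.tei-c.org/ns/1.0"><p>The closing step of the proof above, that the uniform limit g inherits the Lipschitz bound, can be spelled out as follows. This is a short sketch that uses only the convergence of g (k) to g in C 0 stated in the proof, together with the uniform bound H on the Lipschitz constants:</p><p><![CDATA[
```latex
\[
  |g(t) - g(s)|
  \;=\; \lim_{k \to \infty} \bigl| g^{(k)}(t) - g^{(k)}(s) \bigr|
  \;\le\; H \, |t - s|
  \qquad \text{for all } s, t \in [0, T].
\]
```
]]></p></div>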
<div xmlns="http://www.tei-c.org/ns/1.0"><p>From now on, C Lip H ([0, T ], D) will be equipped with the C 0 topology without further notice.</p><p>Definition B.11. Given a function f &#8712; C 0 ([0, T ], U ), the corresponding energy functional is</p><p>By convention, &#934; T (f ) = &#8734; if S T is empty.</p><p>Theorem B.12. Given a closed subset A &#8834; C 0 ([0, T ], U ) and an initial value y 0 , for the solution (Y &#1013; t , Z &#1013; t ) to ( <ref type="formula">16</ref>) and ( <ref type="formula">17</ref>) with initial value (y 0 , z 0 ), the following inequality holds:</p><p>In order to prove Theorem B.12, we need more notation to study Y &#1013; t while keeping the path Z &#1013; t fixed. For this purpose, we define some distributions of (Z &#1013; t , &#8730; &#1013;B d t ) and Y &#1013; t respectively. Definition B.13. Denote by &#181; &#1013; (y0,z0),T the joint distribution of (Z &#1013; t , <ref type="formula">16</ref>), ( <ref type="formula">17</ref>) with initial value (y 0 , z 0 ) at t = 0. Write &#955; &#1013; for the distribution of For all initial values (y 0 , z 0 ) &#8712; U &#215; D, Y &#1013; t , Z &#1013; t are progressively measurable processes. In consequence,</p><p>Lemma B.15. The function</p><p>1 satisfies upper LDP with rate function I.</p><p>Proof. We first check that I is a rate function. That is, it is lower semicontinuous on</p><p>, and thus by definition I(g, u) = I 0 (u) in this case. It is known by Schilder's Theorem ( <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Theorem 5.2.3</ref>)) that the function</p><p>is a good rate function, and in particular lower semicontinuous. Thus</p><p>Thus I is a rate function, and it remains to show that it is good, i.e. I -1 ([0, a]) is compact for all finite a. Remark that</p><p>) and the second factor is compact as I 0 is a good rate function. Note that I 0 (u) = I(g, u). 
So it suffices to know that C Lip H ([0, 1], D) is a compact space in C 0 topology, which is the assertion of Lemma B.10.</p><p>Finally we need to show that (13) holds for {&#181; &#1013; } and I. The same inequality holds for {&#955; &#1013; } and I 0 , i.e. for all</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>For all measurable set</head><p>In the last inequality, we used the fact that A 0 contains the projection of A and that I(g, u) = I 0 (u). This completes the proof.</p><p>For all (g, u)</p><p>In addition, given an integer m &#8712; N, we also define Y &#1013;,m (g,u),t as the solution on [0, 1], also with initial value y 0 , to the following stochastic differential equation</p><p>We emphasize that Y &#1013; (g,u),t and Y &#1013;,m (g,u),t are deterministic once the pair (g, u) are given and all randomness comes from &#181; &#1013; (y0,z0),T . Lemma B.16. For all any &#948; &gt; 0 and all initial values y 0 &#8712; D,</p><p>Proof of Lemma B.16. Fix &#948; &gt; 0. Let &#8710; &#1013;,m (g,u),t := Y &#1013;,m (g,u),t -Y &#1013; (g,u),t , for any &#961; &gt; 0, define the stopping time &#964; &#1013;,m,&#961; := min(inf{t :</p><p>&#964; &#1013;,m,&#961; depends on (g, u), but we skip it to simplify the notation.</p><p>The process &#8710; &#1013;,n (g,u),t satisfies the SDE</p><p>By the uniform Lipschitz continuity of b and &#963;, and the boundedness of g t &#8712; D, there is a constant C such that for all t &#8804; &#964; &#1013;,m,&#961; ,</p><p>By <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Lemma 5.6.18)</ref>, for all &#1013; &#8712; (0, 1), &#948; &gt; 0,</p><p>where K is a constant independent of &#948;, &#1013;, &#961;, &#181; and m. Then</p><p>the lemma is proved if we show for all &#961; &gt; 0,</p><p>To prove this, recall that |b| and |&#963;| are bounded by a constant H, and</p><p>Therefore, for m &gt; H &#961; , sup</p><p>This guarantees (23) and proves the lemma.</p><p>Proof of Theorem B.12. First of all, notice that one can assume without loss of generality that T = 1 by rescaling the time interval [0, T ] to [0, 1]. 
To see this, notice that the rescaled Brownian motion</p><p>We will only deal with the [0, 1] interval below.</p><p>Define maps F m , F :</p><p>For F , the image f = F (g, u) instead satisfies f 0 = y 0 and</p><p>It is not hard to check by Lipschitz boundedness of b, &#963;, and the compactnes of</p><p>Moreover, ( <ref type="formula">21</ref>) and ( <ref type="formula">22</ref>) can be reformulated as Y &#1013;,m (g,u),t = F m (g, u) and Y &#1013; (g,u),t = F (g, u) respectively. ( <ref type="formula">26</ref>)</p><p>&#920; m is a function of t, and by the Lipschitz bounds on b and &#963;, </p><p>In other words, the families {&#181; &#1013; (y0,z &#1013; 0 ),1 &#8226; (F m ) -1 } are exponentially good approximations of the distribution of Y &#1013; (g,u),t with (g, u) &#8764; &#181; &#1013; (y0,z &#1013; 0 ),1 . Thus if we can prove: for all a &lt; &#8734;,</p><p>then by Lemma B.7 and Lemma B.15, the distribution of Y &#1013; (g,u),t with (g, u) &#8764; &#181; &#1013; (y0,z &#1013; 0 ),1 satisfies upper LDP with good rate function y &#8594; inf F (g,u)=y I(g, u), which is exactly &#934; 1 (y). Because for each different value of &#1013;, z &#1013; 0 is arbitrarily chosen in D and the distribution of <ref type="formula">16</ref>), ( <ref type="formula">17</ref>), this exactly yields the statement of the theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>It remains to show (27). Let</head><p>holds, then for f m = F m (g, u) and f = F (g, u) given in ( <ref type="formula">24</ref>), ( <ref type="formula">25</ref>),</p><p>for some constant C 0 by the bound on the coefficients and Cauchy-Schwarz inequality. Write &#951; m for the right hand side in (28). Then by Lipschitz continuity of b and &#963;, for some other constant C,</p><p>Thus by Gronwall's inequality, (&#8710; m t ) 2 &#8804; 4C 2 (1 + a)e 4C 2 (1+a)t (&#951; m ) 2 . The equality ( <ref type="formula">27</ref>) follows by letting m &#8594; &#8734;, This completes the proof.</p><p>The following is a strengthen version of Theorem B.12. Theorem B.17. Given a closed subset A of the metric space A &#8834; C 0 ([0, 1], U ) and an initial value (y * , z * ) &#8712; U &#215; V , for solutions (Y &#1013; t , Z &#1013; t ) to ( <ref type="formula">16</ref>) and ( <ref type="formula">17</ref>), the following inequality holds: lim sup</p><p>Proof of Theorem B.17. As in the proof of Theorem B.12, we can assume T = 1.</p><p>By Theorem B.6, it suffices to prove that for any family of points y &#1013; 0 &#8594; y * as &#1013; &#8594; 0 and arbitrary z &#1013; 0 &#8712; D, the family of distribution {&#957; &#1013;</p><p>, and</p><p>Moreover,</p><p>Remark that the coefficients &#945; &#1013; t , &#946; &#1013; t are progressively measurable processes with respect to the filter generated by the Brownian motion {B d t }. 
Moreover, by Lipschitz continuity of b, &#963;, h, for some constant</p><p>By applying <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Lemma 5.6.18)</ref> with &#964; 1 = 1, we know that there is another constant C such that for all &#961; &gt; 0, &#948; &gt; 0,</p><p>By letting &#961; &#8594; 0 first and then &#1013; &#8594; 0, it follows that lim sup &#1013;&#8594;0 &#1013; log P( sup</p><p>This shows {&#957; &#1013; (y &#1013; 0 ,z &#1013; 0 ),1 } is exponentially equivalent to {&#957; &#1013; (y * ,z &#1013; 0 ),1 }, which suffices to conclude the proof.</p><p>Corollary B.18. Given a closed subset A of the metric space A &#8834; C 0 ([0, 1], U ) and a compact set K &#8838; U , for solutions (Y &#1013; t , Z &#1013; t ) to ( <ref type="formula">16</ref>) and ( <ref type="formula">17</ref>), the following inequality holds: lim sup &#1013;&#8594;0 &#1013; log sup</p><p>. By Theorem B.17, for all y &#8712; K, there is a value &#1013; y such that for all y 0 &#8712; B &#1013;y (y) and 0 &lt; &#1013; &lt; &#1013; y ,</p><p>Cover K by finitely many balls of the form B &#1013;y i (y i ), then for all 0 &lt; &#1013; &lt; min i &#1013; yi ,</p><p>The proof is completed by letting M &#8594; inf f &#8712;A f0&#8712;K &#934; T (f ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 EXITING TIME FROM BASIN</head><p>In this section, it will be assumed, in addition to Assumption B.8, that: Assumption B.19. There exist a C 2 function L : U &#8594; [0, &#8734;) and a bounded open set V &#8834; U such that:</p><p>(1) &#8711; b(y,z) L(y) &#8804; 0 for all (y, z) &#8712; V &#215; D and the equality holds if and only if L(y) = 0;</p><p>(2) L is strictly positive on &#8706;V .</p><p>Write V q = L -1 ([0, q)) &#8745; V and choose q 0 sufficiently small such that V q0 lies in the interior of V .</p><p>In particular, for all q &#8712; [0, q 0 ], L| &#8706;Vq &#8801; q. Definition B.20. Suppose 0 &lt; q &lt; Q &#8804; q 0 . For a solution (Y &#1013; t , Z &#1013; t ) of ( <ref type="formula">16</ref>), ( <ref type="formula">17</ref>) with initial value in V Q &#215; D, denote by &#964; &#1013; q,Q the first time Y &#1013; t hits V q &#8746; &#8706;V Q , and by &#964; &#1013; Q the first time Y &#1013; t hits &#8706;V Q .</p><p>Our goal is to prove the following main theorem: Theorem B.21. Under Assumptions B.8 and B.19, for all 0 &lt; q &lt; Q &lt; q 0 there exists I Q &gt; 0 such that for all, lim &#1013;&#8594;0 sup (y0,z0)&#8712;Vq&#215;D</p><p>The following lemma is an analogue of <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Lemma 5.7.22)</ref>. Lemma B.22. For all 0 &lt; q &lt; Q &#8242; &lt; Q &#8804; q 0 , the stopping time &#964; &#1013; q,Q satisfies</p><p>Proof. Consider the solution Y 0 t to the deterministic flow</p><p>starting at y 0 . By Assumption B.19, L(Y 0 t ) is decreasing and must converge to 0 as t &#8594; &#8734;, and</p><p>It would in turn follow that</p><p>For simplicity, write J &#1013; t := (Y &#1013; t , Z &#1013; t ), (Y 0 t , z 0 ) . 
Let M &lt; &#8734; be a common upper bound on the C 0 and the Lipschitz norms of the maps b, &#963; and h over the domains &#1013; &#8712;</p><p>By Gronwall's inequality,</p><p>Therefore, in order to make sup t&#8712;[0, T ] J &#1013; t &#8805; &#948;, we must have</p><p>and thus</p><p>T must hold for at least one of the row vectors &#963; k of &#963;.</p><p>The process</p><p>By the Burkholder-Davis-Gundy inequality (see e.g. <ref type="bibr">(Dembo &amp; Zeitouni, 2010</ref>, Chapter E)),</p><p>for an absolute constant C. And thus by the Chebyshev's theorem</p><p>From the earlier discussion, after summing over all k's, we obtain that</p><p>We deduce ( <ref type="formula">30</ref>) from ( <ref type="formula">31</ref>) by letting &#1013; &#8594; 0, which concludes the proof.</p><p>The following lemma is the analogue of <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Lemma 5.7.23)</ref>. Lemma B.23. For all &#948;, a &gt; 0 and bounded set K &#8834; U , there exists a constant T 0 = T 0 (&#948;, a, K) &gt; 0 such that lim sup &#1013;&#8594;0 &#1013; log sup</p><p>Proof. Without loss of generality, assume &#1013;, &#948; &#8712; [0, 1] with &#948; fixed and &#1013; varying. Let the stopping time &#950; &#1013; be the first time such that</p><p>and thus they all are uniformly bounded by a constant M . For all</p><p>As in the proof of <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Lemma 5.7.23)</ref>, it suffices to consider each row vector &#963; k of &#963;. The stochastic process</p><p>t 0 &#963; k (Y &#1013; s , Z &#1013; s )dB 1 s is equivalent to B 1 &#964; &#1013; k,t by the time change theorem (see (Dembo &amp; Zeitouni, 2010, Chapter E.2)) where &#964; &#1013; k,t = t 0 &#963; 2 k (Y &#1013; s , Z &#1013; s )ds. The function &#964; &#1013; k,t is increasing in t and almost surely &#964; &#1013; k,t &#8804; M 2 t if T 0 &#8804; &#950; &#1013; as the C 0 and the Lipschitz norms of b, &#963;, h are bounded by M before the stopping time &#950; &#1013; . 
By <ref type="bibr">(Dembo &amp; Zeitouni, 2010, Lemma 5.2.1)</ref>,</p><p>Summing over different k's, we conclude from the two inequalities above that for all T 0 &#8712; [0, &#948; 4M ]</p><p>It now suffices to take T 0 = min( &#948; 4M , &#948; 2 8d 2 M 2 a ). Definition B.24. For y 0 , y 1 &#8712; U and T &gt; 0, define an energy cost &#936; T (y 0 , y 1 ) by</p><p>The following lemma claims there is a minimum energy cost required to increase the loss function L between different levels. Lemma B.25. For all 0 &lt; q &lt; Q &#8804; q 0 , inf</p><p>Moreover, there is constant T q,Q &gt; 0 such that inf</p><p>Proof. Suppose for now y 0 , y 1 &#8712; V Q \&#8706;V q , and (f, g, u) &#8712; S T with f 0 = y 0 , f T = y 1 . Then by (18), dL</p><p>Because q &gt; 0, by Assumption B.19, as &#8711; &#8868; L(y) b(y, z) = &#8711; b(y,z) L(y) &lt; 0 for all y &#8712; V Q \V q and z &#8712; D. In particular, this also shows &#8711;L(y) &#824; = 0 for all y &#8712; V Q \V q . Since both V Q \V q and D are compact, There exists positive constant &#954; = &#954;(q, Q) &gt; 0 and &#951; = &#951;(q, Q) &gt; 0, such that</p><p>Integrating from 0 to T , we get</p><p>In order to show (32), note that if y 0 &#8712; V q and y 1 &#8712; &#8706;V Q , then L(y 1 ) -L(y 0 ) = Q -q &gt; 0. By (34) and Cauchy-Schwarz inequality,</p><p>This provides a positive lower bound for &#936; T (y 0 , y 1 ) that is uniform for T &gt; 0, y 0 &#8712; V q and y 1 &#8712; &#8706;V Q . This proves (32).</p><p>Published as a conference paper at ICLR 2024</p><p>We now prove (33). Note that for all y 0 , y 1 &#8712; V Q \V q , L(y 1 ) -L(y 0 ) &#8805; q -Q. By (34),</p><p>The last expression is uniformly positive for T &#8805; T q,Q := 2(Q -q)&#954; -1 &#951; -2 . Thus &#936; T (y 0 , y 1 ) is uniformly positive over the region given by T &#8805; T q,Q , y 0 , y 1 &#8712; V Q \&#8706;V q . This proves (33).</p><p>Lemma B.26. 
For all 0 &lt; q &lt; Q &#8804; q 0 and initial value (y 0 , z 0 ) &#8712; V Q &#215; D, the stopping time &#964; &#1013; q,Q satisfies lim</p><p>and in consequence it suffices to show lim</p><p>Assume, for the sake of contradiction, that (36) is false. Then for some M &lt; &#8734; and all k &#8712; N, there exists f k &#8712; C 0 ([0, kT q,Q ], V Q \V q ) with &#934;(f k ) &#8804; M , where T q,Q is given by Lemma B.25. After breaking f k into k segments on subintervals of length T q,Q , it follows that there exists f * k &#8712; C 0 ([0, T q,Q ], V Q \V q ) such that &#934;(f * k ) &#8804; M/k. By taking a limit in C 0 norm (which is permitted by Lemma B.10) and using the lower semicontinuity of the good rate function &#934;, there exists f * &#8712; C 0 ([0, T q,Q ], V Q \V q ) with &#934;(f * ) = 0. This implies &#936; T (y 0 , y 1 ) = 0 for T = T q,Q and the endpoints y 0 , y 1 of f * , which contradicts the inequality (33) of Lemma B.25. This completes the proof.</p><p>Denote by I q,Q &gt; 0 the left hand side in (32). Lemma B.27. The solutions to ( <ref type="formula">16</ref>) and ( <ref type="formula">17</ref>) satisfy lim q&#8594;0 lim sup &#1013;&#8594;0 &#1013; log sup (y0,z0)&#8712;V2q&#215;D</p><p>Proof. Fix an arbitrarily small &#948;, and let I Q,&#948; := min(lim q&#8594;0 I q,Q -&#948;, 1/&#948;). Note that the right hand side in (37) always exists because I q,Q is a decreasing function in q by construction. In particular, I 2q,Q &#8805; I Q,&#948; when q is sufficiently small depending on Q and &#948;.</p><p>By Lemma B.26, there exists a large T * = T * (q, Q, &#948;) &lt; &#8734; such that lim sup &#1013;&#8594;0 &#1013; log sup (y0,z0)&#8712;V2q&#215;D</p><p>In addition, inf</p><p>and thus by Corollary B.18, lim sup &#1013;&#8594;0 &#1013; log sup (y0,z0)&#8712;V2q&#215;D</p><p>Note that the event {Y &#1013; &#964; &#1013; q,Q &#8712; &#8706;V Q } is contained in the union of the events {&#964; &#1013; q,Q &gt; T * } and {sup t&#8712;[0,T * ] L(Y &#1013; t ) &#8805; Q}. 
Therefore, we obtain by combining ( <ref type="formula">38</ref>) and ( <ref type="formula">39</ref>) that the inequality lim sup &#1013;&#8594;0 &#1013; log sup (y0,z0)&#8712;V2q&#215;D</p><p>holds for all sufficiently small q. The lemma then follows by letting &#948; &#8594; 0.</p><p>We are now ready to establish Theorem B.21.</p><p>Proof of Theorem B.21. Choose q &lt; Q 2 to be sufficiently small. Define a sequence of stopping times</p><p>&#8706;V 2q }. By Lemma B.22, all these stopping times are finite almost surely. Write Y m = Y &#1013; &#964;m , which is a Markov chain.</p><p>Recall that by Lemma B.25 I q,Q &gt; 0 for all 0 &lt; q &lt; Q and is an decreasing function in q. Let I Q := 1 2 lim q&#8594;0 I q,Q &gt; 0 and fix 0 &lt; &#945; &lt; 1 4 I Q . By Lemma B.27, if we fix a sufficiently small q, then lim sup &#1013;&#8594;0 &#1013; log sup (y0,z0)&#8712;V2q&#215;D</p><p>By plugging in (Y &#1013; &#952;m , Z &#1013; &#952;m ) as the the value for (y, z), we deduce that there exists &#1013; 0 &gt; 0 such that for all 0 &lt; &#1013; &lt; &#1013; 0 and m &#8805; 1, sup</p><p>On the other hand, assuming &#1013; 0 is sufficiently small, applying Lemma B.23 with a = I Q + 2&#945;, &#948; = 1 2 q and K = V Q yields that, for some fixed T 0 &gt; 0 depending on q and Q and all &#1013; &#8712; (0,</p><p>For a given M , the event {&#964; &#1013; Q &#8804; M T 0 } is contained in the union of the events</p><p>Combining this fact and the inequalities (41), (42) yield sup (y0,z0)&#8712;Vq&#215;D </p><p>which tends to 0 as &#1013; &#8594; 0. This completes the proof of Theorem B.21.</p><p>B.4 PROOF OF PROPOSITION 4.2</p><p>Recall that the movement of (2) stays in &#915; # , i.e. 
|X t | &#8712; [R -, R + ] and can be characterized as the new model ( <ref type="formula">6</ref>), (5).</p><p>After applying a time change T = (&#951;&#955;)</p><p>we get the following system of equations on (X T , &#947;T )</p><p>To deduce (44), we used the standard fact that for any a &gt; 0, a 1 2 B d a -1 T and B d T are equivalent as Wiener processes.</p></div>
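<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scaling fact quoted above can be checked directly from the covariance of Brownian motion. In the sketch below the symbol \tilde B is our own notation for the rescaled process and I_d denotes the d-dimensional identity matrix; neither is notation from the paper:</p><p><![CDATA[
```latex
\[
  \tilde B_T := a^{1/2} B^d_{a^{-1} T}, \qquad
  \mathbb{E}\bigl[ \tilde B_S \, \tilde B_T^{\top} \bigr]
  = a \, \min\bigl( a^{-1} S, \, a^{-1} T \bigr) \, I_d
  = \min(S, T) \, I_d .
\]
```
]]></p><p>Since the rescaled process is also a centered Gaussian process with continuous paths starting at 0, this covariance identifies it as a standard d-dimensional Brownian motion.</p></div>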
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Note &#947;T stays in the fixed interval</head><p>as long as X t &#8712; &#915; # . As there are only finitely many basins, we may fix an index i without impact Proposition B.28. In addition, without loss of generality let us assume L = 0 on &#915; i . For our purpose, it might be better to measure the distance to &#915; i on U i #,p0 where p 0 &gt; p 1 is sufficiently small and in particular U j #,p0 are disjoint for distinct j's. For q &#8805; 0, we will write V i r,q = {x &#8712; U i r,p0 , L(x) &lt; q} and V i #,q = {x &#8712; U i #,p0 , L(x) &lt; q}. By Assumption 1.2.(ii), one may manipulate p 0 , q 0 , p 1 , q 1 (p 0 &gt; p 1 , q 0 &gt; q 1 ) such that</p><p>, and thus reformulate Proposition 4.2 as Proposition B.28. There exists c &gt; 0 such that if q 0 &gt; q 1 &gt; 0 are fixed, but q 1 is sufficiently small compared to q 0 , then in the regime &#951;</p><p>The convergence is uniform with respect to (y 0 , z 0 ).</p><p>Proof. In order to apply Theorem B.21 with domain V i 1,q0 and control domain D = [&#947; -, &#947;+ ], we first make the following remark.</p><p>Theorem B.21 is in the setting where the domain is an open neighborhood in R n , while our current V i q0 is a neighborhood in the sphere. This is not a problem because unless L is a constant function, &#915; i 1 is a proper subset of the sphere S d-1 . And V i q is also a proper subset in S d-1 for small q. One can then change coordinates and identify V i q with a subset of the Euclidean space.</p><p>This converts the problem to the equations ( <ref type="formula">16</ref>), ( <ref type="formula">17</ref>) with the following dictionary: &#1013; = (&#951;&#955;) 1 2 ; X T and &#947;T play the roles of Y &#1013; t , and Z &#1013; t respectively; b(y, z) = -z -1 2 &#8711;(y); &#963;(y, z) = z -1 2 &#963;(y); and h(&#1013;, y, z) = &#1013;(-4z + 2 Tr &#931;(y)).</p><p>It remains to check Assumptions B.8 and B.19. We start with the latter. 
The strict positivity of L directly follows from the construction of L and the neighborhood V i q0 . The property (1) in Assumption B.19 holds because b is negatively proportional to the gradient &#8711;L of L with respect to the spherical coordinates.</p><p>Unfortunately Assumption B.8 does not automatically hold, as trajectories may escape from V i 1,q0 . In order to adapt to this case, we smoothly modify the value of &#963; on V i 1,q0 so that it remains unchanged on V i 1,q1 and vanishes near &#8706;V i 1,q0 . Then near the boundary, the SDEs ( <ref type="formula">16</ref>), ( <ref type="formula">17</ref>) become deterministic. Because of Assumption B.19.(1), the trajectories do not escape through &#8706;V i 1,q0 . Moreover, we know that &#947;t remains in D = [&#947; - , &#947; + ]. This verifies Assumption B.8 for the modified model.</p><p>We conclude by applying Theorem B.21 that, for some fixed I &gt; 0, for all initial positions in V i 1,q1 &#215; [&#947; - , &#947; + ], the probability that a trajectory (X T , &#947;T ) (with respect to the modified model) leaves</p><p>2 ) converges to 0 as &#951;&#955; &#8594; 0. The convergence is in addition uniform with respect to the initial position.</p><p>Since such modifications only take place outside V i 1,q1 &#215; [&#947; - , &#947; + ], the same statement also holds for the original model. As V i 1,q1 &#8834; V i 0,q0 , we obtain the statement of Proposition B.28 after reparametrization of variables.</p></div>
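<div xmlns="http://www.tei-c.org/ns/1.0"><p>The trapping behavior used in the proof above can be illustrated numerically on a toy model. The sketch below is not the system (16), (17): it assumes a one-dimensional diffusion dY = -Y dt + &#8730;&#1013; dB with no control variable, the basin (-1, 1), and an Euler-Maruyama discretization. The function name exit_fraction and all parameter values are hypothetical choices made for illustration only.</p><p><![CDATA[
```python
import math
import random

def exit_fraction(eps, horizon=5.0, dt=0.01, trials=200, seed=0):
    """Fraction of simulated paths of dY = -Y dt + sqrt(eps) dB, Y_0 = 0,
    that leave the interval (-1, 1) before time `horizon`.

    Paths are generated with the Euler-Maruyama scheme, so the result is
    only an approximation of the continuous-time exit probability.
    """
    rng = random.Random(seed)
    steps = int(round(horizon / dt))
    noise_scale = math.sqrt(eps * dt)  # standard deviation of one noise increment
    exits = 0
    for _ in range(trials):
        y = 0.0
        for _ in range(steps):
            # Drift pulls y back toward the minimum at 0; noise pushes it around.
            y += -y * dt + noise_scale * rng.gauss(0.0, 1.0)
            if abs(y) >= 1.0:
                exits += 1
                break
    return exits / trials

frac_large_noise = exit_fraction(eps=1.0)
frac_small_noise = exit_fraction(eps=0.05)
print(frac_large_noise, frac_small_noise)
```
]]></p><p>In this toy model, decreasing &#1013; makes exits before the fixed horizon markedly rarer, which is the qualitative behavior that Theorem B.21 quantifies.</p></div>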
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C UNIQUE KATZENBERGER LIMIT INSIDE EACH BASIN</head><p>The results from &#167;B.2, stated in the form of Proposition 4.2, guarantee that, after discarding an exponentially small subset of random realizations, the trajectories of (2) stay inside the basin that contains the initial position for an exponentially long time O(e^{C(&#951;&#955;)^{-1}}). We now justify Proposition 4.3.</p><p>We restart (2) from an initial point, still written as x 0 by abuse of notation, in some U i #,p1 . We now apply a uniform approximation theorem, which is a stronger version of <ref type="bibr">(Li et al., 2022a, Theorem 4.6)</ref>. We can prove this uniform approximation result because we can strengthen <ref type="bibr">(Katzenberger, 1991, Theorem 6.3</ref>) to a compactness theorem that is uniform with respect to the initial points of the SDE. To describe this result, let (&#8486; n , F n , {F n t } t&#8805;0 , P) be a filtered probability space, Z n an R^e-valued cadlag {F n t }-semimartingale with Z n (0) = 0, and A n a real-valued cadlag {F n t }-adapted nondecreasing process with A n (0) = 0. Let &#963; n : U &#8594; M(d, e) be continuous with &#963; n &#8594; &#963; uniformly on compact subsets of U . Let X n be an R d -valued cadlag {F n t }-semimartingale satisfying, for all compact K &#8834; U ,</p><p>for all t &#8804; &#955; n (K), where &#955; n (K) = inf{t &#8805; 0 | X n (t-) &#8713; K or X n (t) &#8713; K} is the stopping time of X n leaving K. (47)</p><p>We will present <ref type="bibr">Assumptions B.3, B.4 and Conditions B.5, B.6, B.7 and B.8 from (Li et al., 2022a)</ref> in Appendix B.</p><p>The main difference between Theorem C.1 and <ref type="bibr">(Katzenberger, 1991, Theorem 6.3</ref>) is that we allow the initial point X n (0) to vary within U .</p><p>Theorem C.2. Let the manifold &#915; and its open neighborhood U satisfy Assumptions 3.1 and 3.2. Let K &#8834; U be any compact set and x n,0 &#8712; K be a sequence of initial points. 
Consider the SGD formulated in (46) where X &#951;n (0) &#8801; x n,0 . Define Y &#951;n (t) = X &#951;n (t) - &#934;(X &#951;n (0), A &#951;n (t)) + &#934;(X &#951;n (0)) and &#181; &#951;n (K) = min{t &#8712; N | Y &#951;n (t) &#8713; K}. Then the sequence {Y where {W (s)} s&#8805;0 is the standard Brownian motion.</p><p>Proof. The proof of Theorem C.2 follows the proof of <ref type="bibr">(Li et al., 2022a, Theorem B.8</ref>), which uses <ref type="bibr">(Li et al., 2022a, Lemma B.6</ref>) and the standard Katzenberger theorem <ref type="bibr">(Katzenberger, 1991, Theorem 6.3</ref>). One difference is that here not all trajectories stay inside one basin. However, we claim that the probability that a trajectory escapes the basin goes to zero as &#951;&#955; tends to zero. Once this claim is proved, Theorem C.2 is an immediate consequence of <ref type="bibr">(Li et al., 2022a, Lemma B.6</ref>) and Theorem C.1.</p><p>To prove the claim, we adopt the same idea as in the proof of Proposition 4.2 in Appendix B.4. Although trajectories may escape from the level set V i 1,q0 , we can smoothly modify the value of &#963; on the closure of V i 1,q0 so that it remains unchanged on V i 1,q1 and vanishes near the boundary &#8706;V i 1,q0 . Near the boundary, the SDEs become deterministic, and thus the trajectories of the modified model do not escape through the boundary &#8706;V i 1,q0 . Now, as &#951;&#955; tends to zero, by Theorem B.21 the probability that a trajectory of the modified model leaves V i 1,q1 &#215; [&#947; - , &#947; + ] before a fixed time T tends to 0. Since such modifications only take place outside V i 1,q1 &#215; [&#947; - , &#947; + ], the same statement holds for the original model. This finishes the proof of the claim.</p><p>Proof of Proposition 4.3. 
The above uniform version of Katzenberger's theorem guarantees that, starting from different initial points in the same compact neighborhood of the basin, the distribution of trajectories associated with (2) is still close to that of Katzenberger's SDE (47). By Proposition 3.1, the latter is mixing towards a unique equilibrium &#957; i . Although in Section 3 we proved this only in the one-basin case, Theorem 4.1 shows that, with large probability, the trajectories do not escape from the basin. Those trajectories satisfy a modified SDE as before, under which no trajectory escapes from this basin. At this point, we can directly apply Proposition 3.1. It follows that, within any polynomial time window under consideration, the distribution of trajectories associated with (2) is also mixing towards &#957; i . This proves Proposition 4.</p><p>(Then &#934; is C 2 on U by <ref type="bibr">(Falconer, 1983)</ref>.)</p><p>Condition C.5. <ref type="bibr">(Li et al., 2022a, Lemma B.</ref>2) The integrator sequence {A n } n&#8805;1 is asymptotically continuous: sup t&gt;0 |A n (t) - A n (t-)| &#8658; 0, where A n (t-) = lim s&#8594;t- A n (s) is the left limit of A n at t.</p><p>Condition C.6. <ref type="bibr">(Li et al., 2022a, Lemma B.</ref>3) The integrator sequence {A n } n&#8805;1 increases infinitely fast: &#8704;&#1013; &gt; 0, inf t&#8805;0 (A n (t + &#1013;) - A n (t)) &#8658; &#8734;.</p><p>Condition C.7. <ref type="bibr">((Katzenberger, 1991, Equation 5</ref>.1), <ref type="bibr">(Li et al., 2022a, Lemma B.4</ref>)) For every T &gt; 0, as n &#8594; &#8734;, it holds that sup 0&lt;t&#8804;T &#8743;&#955;n(K) &#8741;&#8710;Z n (t)&#8741; 2 &#8658; 0.</p><p>Condition C.8. <ref type="bibr">((Katzenberger, 1991, Condition 4.2)</ref>, <ref type="bibr">(Li et al., 2022a, Lemma B.5</ref>)) For each n &#8805; 1, let Y n be a {F n t }-semimartingale with sample paths in D R d [0, &#8734;). 
Assume that for some &#948; &gt; 0 (allowing &#948; = &#8734;) and every n &#8805; 1 there exist stopping times {&#964; m n | m &#8805; 1} and a decomposition of Y n - J &#948; (Y n ) into a local martingale M n plus a finite variation process F n such that P[&#964; m n &#8804; m] &#8804; 1/m, {[M n ](t &#8743; &#964; m n ) + T t&#8743;&#964; m n (F n )} n&#8805;1 is uniformly integrable for every t &#8805; 0 and m &#8805; 1, and lim</p><p>It was shown in <ref type="bibr">(Li et al., 2022a, Lemma B.6</ref>) that for SGD formulated in ( <ref type="formula">46</ref> </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Different connected components of &#8486; are not required to have the same dimension.</p></note>
		</body>
		</text>
</TEI>
