<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Two-Timescale Linear Stochastic Approximation: Constant Stepsizes Go a Long Way</title></titleStmt>
			<publicationStmt>
				<publisher>International Conference on Artificial Intelligence and Statistics (AISTATS), 2025</publisher>
				<date>01/31/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10569500</idno>
					<idno type="doi"></idno>
					
					<author>Jeongyeol Kwon</author><author>Luke Dotson</author><author>Yudong Chen</author><author>Qiaomin Xie</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Previous studies on two-timescale stochastic approximation (SA) mainly focused on bounding mean-squared errors under diminishing stepsize schemes. In this work, we investigate constant stpesize schemes through the lens of Markov processes, proving that the iterates of both timescales converge to a unique joint stationary distribution in Wasserstein metric. We derive explicit geometric and non-asymptotic convergence rates, as well as the variance and bias introduced by constant stepsizes in the presence of Markovian noise. Specifically, with two constant stepsizes α < β, we show that the biases scale linearly with both stepsizes as Θ(α) + Θ(β) up to higher-order terms, while the variance of the slower iterate (resp., faster iterate) scales only with its own stepsize as O(α) (resp., O(β)). Unlike previous work, our results require no additional assumptions such as β 2 ≪ α nor extra dependence on dimensions. These fine-grained characterizations allow tail-averaging and extrapolation techniques to reduce variance and bias, improving mean-squared error bound to O(β 4 + 1 t ) for both iterates.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Stochastic Approximation (SA) is an iterative procedure to find the root of unknown operators from their noisy samples <ref type="bibr">[39]</ref>. There has been a long line of work understanding the convergence behavior of SA both asymptotically <ref type="bibr">[4,</ref><ref type="bibr">21]</ref> and in a finite-time <ref type="bibr">[40]</ref>, with a wide range of applications including stochastic optimization <ref type="bibr">[21,</ref><ref type="bibr">37]</ref> and reinforcement learning <ref type="bibr">[42,</ref><ref type="bibr">32,</ref><ref type="bibr">40]</ref>.</p><p>Two-Timescale Stochastic Approximation (TTSA) is a variant of the SA algorithm, designed to find the root of two intertwined operators <ref type="bibr">[3]</ref>. In particular, given two operators F and G, we aim to find the solution (x * , y * ) satisfying the fixed-point equations F (x * , y * ) = 0, G(x * , y * ) = 0. This work considers linear TTSA with constant stepsizes driven by Markovian data as the following:</p><p>x t+1 = x t -&#945; t (F (x t , y t ) + w x (x t , y t ; &#958; t )), y t+1 = y t -&#946; t (G(x t , y t ) + w y (x t , y t ; &#958; t )), t &#8805; 0,</p><p>where &#945; t &#8801; &#945;, &#946; t &#8801; &#946; &gt; 0 are constant stepsizes for slower and faster iterates respectively, F and G are linear operators, and w x and w y are linear Markovian noises driven by exogenous Markovian states &#958; t (see Section 2 for precise formulation). The updates in <ref type="bibr">(1)</ref> arise in many applications: examples include popular reinforcement learning algorithms such as actor-critic <ref type="bibr">[29,</ref><ref type="bibr">18]</ref> and gradient temporal-difference (GTD) methods <ref type="bibr">[35,</ref><ref type="bibr">43]</ref>, and iterative algorithms for stochastic Bilevel optimization <ref type="bibr">[7,</ref><ref type="bibr">16,</ref><ref type="bibr">22,</ref><ref type="bibr">31]</ref>. The core idea of TTSA is the use of different stepsizes for two iterations, which establishes a hierarchy between the two fixed-point equations. For example, in actor-critic algorithms <ref type="bibr">[18]</ref>, the y-variable minimizes the temporal-difference (TD) error, while the x-variable represents policy parameters to maximize long-term Existing Results for TTSA. TTSA arises as a popular iterative solution in various domain; from the classical iterateaveraging schemes <ref type="bibr">[38]</ref> and off-policy reinforcement learning algorithms <ref type="bibr">[42]</ref> to gradient descent-ascent algorithms for saddle-point problems <ref type="bibr">[27]</ref> and single-loop algorithms for Bilevel optimization <ref type="bibr">[22]</ref>. Asymptotic convergence and central limit theorems for TTSA with diminishing step sizes were initially established for linear cases with i.i.d. noise <ref type="bibr">[30]</ref>, followed by extensions to non-linear and Markovian noise settings <ref type="bibr">[36,</ref><ref type="bibr">23]</ref>.</p><p>More recent work has shifted focus to non-asymptotic results, deriving finite-time convergence rates for both linear <ref type="bibr">[9,</ref><ref type="bibr">8]</ref> and nonlinear cases <ref type="bibr">[28,</ref><ref type="bibr">19,</ref><ref type="bibr">20]</ref>. However, these studies primarily address MSE bounds with diminishing stepsizes. In contrast, we investigate distributional convergence under constant stepsizes, providing explicit decoupling of biases and variances. Additionally, we establish new results for tail-averaging and extrapolation in TTSA schemes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem Setup</head><p>Let F : R dx &#215; R dy &#8594; R dx and G : R dx &#215; R dy &#8594; R dy be linear mean-field operators in the following form:</p><p>where J 11 , . . . , J 22 (resp., b 1 , b 2 ) are fixed matrices (resp., vectors), and linear Markovian noises defined as the following: w x (x, y; &#958;) = W 11 (&#958;)x + W 12 (&#958;)y + u 1 (&#958;), w y (x, y; &#958;) = W 21 (&#958;)x + W 22 (&#958;)y + u 2 (&#958;).</p><p>Let J max := max i,j&#8712;{1,2} &#8741;J ij &#8741; op be the smoothness parameter of the system. The first assumption is on the mean-field operators being Hurwitz: Assumption 1 The matrices -J 22 and -&#8710; := -J 11 + J 12 J -1 22 J 21 are Hurwitz, that is, all real parts of the eigenvalues of J 22 and &#8710; are strictly positive.</p><p>Therefore, a fixed point in the slower timescale is uniquely defined y * (x) = -J -1 22 (J 21 x + b 2 ) for every x, and the target joint fixed point (x * , y * ) is given as:</p><p>Furthermore, for all &#958; &#8712; &#926;, the following holds:</p><p>for some constants W max , u max &#8805; 0. For simplicity, we further assume that W max &#8804; J max .</p><p>The above two assumptions are common in the analysis of SA schemes with Markovian noises <ref type="bibr">[8,</ref><ref type="bibr">24]</ref>. We introduce the notion of noise variances in our setting:</p><p>which reflect the mean-squared fluctuation of the stochastic update around the fixed point.</p><p>We study the convergence of TTSA iterations (1) via L 2 -Wasserstein distance <ref type="bibr">[44]</ref>. Let P 2 (R d ) denote the space of square-integrable distributions on R d where d := d x + d y . Note that L 2 -Wasserstein distance between two distributions &#181; and &#957; in P 2 (R d ) is defined as the following:</p><p>, where &#928;(&#181;, &#957;) is a set of all possible couplings with marginal distributions &#181; and &#957;. To study the distribution convergence of the joint iterate-data sequence (x t , y t , &#958; t ) t&#8805;0 , we slightly extend the definition above to add hamming distance in &#926;.</p><p>Let P 2 (R d &#215;&#926;) be the set of distributions &#956; on R d &#215;&#926; with the property that the marginal of &#956; on R d is square-integrable.</p><p>Definition 1 For any two probability measures &#181;, &#957; in P 2 (R dx+dy &#215; &#926;) over (x, y, &#958;), we define the distance between &#181; and &#957; as</p><p>where &#928;(&#181;, &#957;) is a set of all possible couplings with marginal distributions &#181;, &#957;.</p><p>To establish the finite-time convergence of TTSA iterations (1), we define a few error metrics. Let Q x , Q y &#8827; 0 be the unique solutions of the Lyapunov equations</p><p>The solutions Q x , Q y , which are guaranteed to exist since -&#8710;, -J 22 are Hurwitz under Assumption 1 <ref type="bibr">[5]</ref>, are used for constructing the drift of potentials in our analysis. For the slower and faster iterates, we use &#8741; &#8226; &#8741; Qx and &#8741; &#8226; &#8741; Qy norms respectively, and define &#181; x := &#8741;Q x &#8741; -1 op and &#181; y := &#8741;Q y &#8741; -1 op . Note that &#963; min (&#8710;) &#8805; &#181; x /2 and &#963; min (Q x ) &#8805; &#8741;&#8710;&#8741; -1 op /2, and similarly for Q y and &#181; y . Consequently, we let the condition number of two iterations as &#954; x := &#954;yJmax &#181;x and &#954; y := Jmax &#181;y .</p><p>Notation. For a positive definite matrix Q &#8827; 0 let &#8741;x&#8741; Q := x &#8868; Qx for a vector x. With a general real-valued matrix A, we define</p><p>For two real-valued matrices A, B, we denote &#10216;A, B&#10217; = Tr(A &#8868; B). We define 1-Schatten norm &#8741;A&#8741; 1 := i |&#963; i (A)| as the absolute sum of singular values (sometimes we call it S 1 -norm), and &#8734;-Schatten norm &#8741;A&#8741; &#8734; := max i |&#963; i (A)| be the maximum singular value, which is equivalent to matrix operator norm &#8741;A&#8741; op . For a positive semi-definite matrix Q &#10928; 0,</p><p>ii is the sum of diagonal elements. For a random vector x, we denote the covariance</p><p>We often use shorthands w x t := w x (x t , y t ; &#958; t ) and w y t := w y (x t , y t ; &#958; t ). We denote the fixed point of the faster iterates given x as y * (x), such that G(x, y * (x)) = 0. If we just write y * , then it means y * (x * ). For two probability distributions p, q, we denote &#8741;p -q&#8741; 1 as the total-variation distance between p and q. We use the notation O(&#8226;) to hide absolute constants, and O P (&#8226;) to omit up to polynomial factors in instance-dependent constants (smoothness, minimum eigenvalues, and noise variances) and up to logarithmic factors in stepsizes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Main Results</head><p>We start with two conditions for stepsizes to ensure the stability of TTSA iterations: Assumption 4 We assume that the stepsizes (&#945;, &#946;) satisfy the following:</p><p>where &#964; &#945; := log(&#945;&#181;x/c&#961;) log &#961; with some sufficiently small absolute constants c 1 , c 2 &gt; 0.</p><p>The first condition in (4) ensures &#946; less than the inverse smoothness of operators, and the second condition bounds the ratio between two-timescale iterations. We mention that the dependence on the condition numbers is not fully optimized. In the sequel, we start with a fine-grained convergence in MSE in Section 3.1. We then show the convergence in distribution and characterize the biases and variances of the limit distribution in 3.2, which is followed by our final result on tail-averaging and extrapolation in Section 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Convergence in MSE</head><p>We analyze the MSE convergence of linear TTSA in terms of the centered iterates xt := x t -x * , &#563;t := y t -y * (x t ).</p><p>To this end, we first rewrite the stochastic recursion as the following:</p><p>Then equation (1) can be rewritten as:</p><p>Note that the slower iterates view the error in faster iterates as an additional noise. We are now ready to state our first main convergence theorem with constant step-sizes.</p><p>Theorem 3.2 Suppose Assumptions 1-3 hold, and the step sizes &#945;, &#946; satisfy Assumption 4. Then, for all t &#8805; 0 following the TTSA recursion (1), we have</p><p>where we define potential functions as</p><p>The theorem states that after sufficiently large iterations t &#8811; &#945; -1 , the convergence of TTSA in MSE can be characterized as the following:</p><p>To our best knowledge, this is the first result that explicitly characterizes the fine-grained scaling of MSE w.r.t. the stepsizes and noise variances of each iteration. The work in <ref type="bibr">[9,</ref><ref type="bibr">40]</ref> only obtained an O(&#946; 2 /&#945;) asymptotic bound for the slower iterate. More recent work in <ref type="bibr">[28,</ref><ref type="bibr">20]</ref> obtained an O(&#945;) asymptotic bound but required &#946; 2 &#8804; &#945;, hence not strong enough to reveal the dependence on &#946;. Our result shows that noises from slower iterates only change x t by O(&#945;), while noises from faster iterates influence x t by O(&#945; + &#946; 2 ), without requiring &#946; 2 &#8804; &#945;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Convergence to a Limit Distribution</head><p>Now we state the distributional convergence of the process (x t , y t , &#958; t ) in Wasserstein distance as defined in Definition 1.</p><p>We require a mild assumption on the fourth-order moments of initial distributions:</p><p>Assumption 5 We assume that the fourth-order moments of the initial distribution are bounded, i.e., E[&#8741;x 0 &#8741; 4 2 +&#8741;&#563; 0 &#8741; 4 2 ] &lt; &#8734;.</p><p>Our main theorem establishes the linear convergence of the Markovian process (x t , y t , &#958; t ) t&#8805;0 in W2 -distance to a unique stationary distribution: Theorem 3.3 Suppose Assumptions 1-3 hold, and step sizes &#945;, &#946; satisfy Assumption 4. If we start from an arbitrary initial distribution (x 0 , y 0 , &#958; 0 ) &#8764; &#181; 0 satisfying Assumption 5, then there exists a unique stationary distribution &#181; such that the process (x t , y t , &#958; t ) &#8764; &#181; t linearly converges in W2 -distance:</p><p>Furthermore, there exists vectors bx i , by</p><p>and variances of x &#8734; and y &#8734; are bounded by</p><p>A few remarks follow below. First, the theorem states that any sequence following TTSA (1) converges to some unique stationary distribution depending on problem instances and step sizes. Given the existence of the unique stationary distribution &#181;, henceforth, we can define random variables from the limit distribution (x &#8734; , y &#8734; , &#958; &#8734; ) &#8764; &#181;.</p><p>Second, the limit distribution has a bias, whose dominating term grows linearly with the stepsizes. The &#946;-wise growth in the bias of faster iterates y t is not surprising in light of known results for Markovian single-timescale SA <ref type="bibr">[33,</ref><ref type="bibr">24]</ref>. More interesting is the bias of the slower iterates x t , which also grows linearly with &#946;, even though the size of the update is only O(&#945;) in each slow iteration. This is a unique phenomenon of two-timescale SA: the slower iterate effectively views the error from faster iterates, y t -y * (x t ), as additional "biased" noise.</p><p>Finally, the theorem shows that the limit distribution of slower iterates has an interesting property: the bias in x (slower iterates) is dominated by the faster step-size &#946;, while its variance only scales with the slower step-size &#945;. This is another key property of two-timescale SA that has been overlooked in prior work. In particular, we can deduce that the asymptotic MSE of slower iterates is resulted from two factors:</p><p>Focusing separately on the two iterates, we have the following more fine-grained convergence results:</p><p>Then for all t &#8805; 0, we have the bounds</p><p>This corollary explicitly states how the optimization error decays from arbitrary initial points, and will be used in showing the convergence of tail-averaging next.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Tail-Averaging and Extrapolation</head><p>Using the explicit characterization of bias and variance in Theorem 3.3, we derive improved convergence rates for tail-averaging and extrapolation.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Averaging</head><p>We first consider the tail-averaging variant of Polyak-Ruppert averaging <ref type="bibr">[26]</ref>:</p><p>where t 0 &#8819; &#945; -1 is the length of the warm-up period. With the result from Theorem 3.2, we can analyze the MSE of tail-averaged sequence:</p><p>Theorem 3.5 Suppose Assumptions 1-5 hold and t 0 &gt; C(&#945;&#181; x ) -1 for some sufficiently large absolute constant C &gt; 0.</p><p>Then for all t &gt; t 0</p><p>In the above result, we omitted an additional optimization error exp(-c&#945;&#181; x t 0 ) since it is dominated by other terms with t 0 &#8811; 1/(&#945;&#181; x ). As we can observe, O(&#946; 2 ) is attribted to the squared-bias, and O(1/t) convergence is the variance decaying effect of tail-averaging. We also observe that the faster iterates has extra O( 1 t &#946; 2 /&#945;)-term. In part, this is because we measure the MSE of &#7929;t from y * = y * (x * ), not from y * (x t ). However, we are not fully aware whether this is an artifact of an analysis, or can be removed, and we leave the question as an open problem. Note that when &#946; 2 &#8804; &#945;, both iterates enjoy the same O(1/t)-decaying rate of variances as if the two iterates are decoupled.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2">Extrapolation</head><p>When tail-averaging can reduce the variance, extrapolation can reduce the biases of each iterate. As one example, using the fact that biases of iterates grow linearly with step sizes, we can extrapolate two sequences, (x &#945;,&#946; t , y &#945;,&#946; t ) and  (a) Absolute error in the slower timescale. (b) Absolute error in faster timescale. (x 2&#945;,2&#946; t , y 2&#945;,2&#946; t</p><p>) with pairing stepsizes (&#945;, &#946;) and (2&#945;, 2&#946;). The extrapolated iterates are computed as</p><p>As a corollary of our main theorems, we have the following result characterizing the MSE of the extrapolated sequences. Extrapolation achieves reduced biases by canceling out the leading &#945; and &#946; terms in the asymptotic biases <ref type="bibr">(6)</ref>, improving the MSE bounds of both iterates from &#946; 2 to &#946; 4 . Corollary 3.6 Suppose Assumptions 1-5 hold and t 0 &gt; C(&#945;&#181; x ) -1 for some sufficiently large absolute constant C &gt; 0. Then for all t &gt; t 0 ,</p><p>Remark 1 If one uses pairing stepsizes (&#945;, &#946;) and (&#945;, 2&#946;), then only the leading &#946; terms in the asymptotic biases <ref type="bibr">(6)</ref> are cancelled.</p><p>Remark 2 It is possible to further reduce the order of bias via higher-order extrapolation using more than two sets of stepsizes as in <ref type="bibr">[24,</ref><ref type="bibr">25]</ref>, though it comes at the price of potentially slower convergence and higher variance due to using additional stepsizes <ref type="bibr">[15,</ref><ref type="bibr">40]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>We consider the TTSA iteration (1) in dimension d x = d y = 2 driven by a 10-state, irreducible, aperiodic Markov chain. We construct the transition matrix randomly and choose J 11 , J 12 , J 21 , J 22 such that Assumption 1 hold.</p><p>We tested the dependence of the bias and variance of both iterates with respect to &#945; and &#946; by varying each individually. After the tail-averaged iterates converged, we calculated the bias as the average distance between the averaged iterate and the true solution, and calculated the variance Tr V(&#8226;) as the average square distance from the iterate to the sample mean of the iterates. For the dependence on &#946;, we held &#945; constant and varied &#946; between 0.03 and 0.07. For the dependence on &#945;, we held &#946; constant and varied &#945; between 0.0001 and 0.0005.</p><p>For the slower iterate x t , Figure <ref type="figure">1</ref> shows that the bias scales with both &#946; and &#945;, while the variance is dependent mostly on &#945; only. For the faster iterate y t , Figure <ref type="figure">2</ref> shows that the bias depends both on &#946; and &#945;, and the variance depends on &#946;. Both results are consistent with our theory.</p><p>We also tested the effects of tail-averaging (TA) and Richardson-Romberg (RR) extrapolation with a similar setup. We fixed &#945; = 0.0003 and let &#946; = {0.01, 0.02, 0.04, 0.08}. In Figure <ref type="figure">3</ref>, for each &#946;, we plotted the absolute errors achieved by tail-averaging at stepsize &#946; (labeled as "TA &#946; = stepsize"), as well as the errors achieved by RR extrapolation with stepsizes &#946; and 2&#946; (labeled as "RR &#946; = stepsize, 2 * stepsize"), which aims to cancel the &#946; term in the bias. Compared to the TA iterates (solid line), the corresponding RR extrapolated iterate (the dashed line of the same color) achieved lower errors, corresponding to reduced asymptotic biases.</p><p>In addition, we examined the effectiveness of applying RR extrapolation to cancel both the &#945; and &#946; bias terms. Letting &#945; = 0.0003, &#946; = 0.02, we compared RR extrapolating on only &#946; (using stepsizes &#946;, &#945; and 2&#946;, &#945;) with RR extrapolating on both &#946; and &#945; (using stepsizes &#946;, &#945; and 2&#946;, 2&#945;). In Figure <ref type="figure">4</ref>, we see that while the former (red curves) already reduced a large amount of the bias, the latter (black curves) reduced it even further, as predicted by our theoretical results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Analysis</head><p>We outline the proofs of our main theorems. We focus on slower iterates; similar ideas apply to faster iterates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Proof Outline of Theorem 3.2</head><p>The first step is to analyze the descent formula for each iterate separately. For the slower iterate, we have</p><p>The term T 1 would have been 0 if the noise sequence were martingale, and can be effectively handled with Markovian noises in a standard way by exploiting Assumption 2. More pressing issue is handling T 2 : with naively applying Young's inequality to bound (ii), i.e., with &#10216;x t , J 12 &#563;t &#10217; &#8804; (c&#8741;x t &#8741; 2 + Jmax 4c &#8741;&#563; t &#8741; 2 ), the asymptotic error easily end up being O(&#946; 2 /&#945;) as in <ref type="bibr">[9,</ref><ref type="bibr">17]</ref>, and such an approach can be improved up to at best O(&#946;) <ref type="bibr">[11]</ref>.</p><p>Recent results in <ref type="bibr">[28,</ref><ref type="bibr">20]</ref> directly analyzed the descent behavior of &#8741;T 2 &#8741; op , achieving O(&#945;) asymptotic error for the slower iterate. However, using operator norm often results in extra dependence on dimensions d x , d y , despite the smoothness condition J max = O(1) in operator norm.</p><p>Our tweak for this issue is simple: to track the convergence of cross-correlation norm, we employ the Schatten S 1 -measure for &#8741;Q</p><p>term is incorporated to ensure decreasing Lyapunov potential with asymmetric operators. The S 1 -norm is the best suited for exploiting the smoothness condition without incurring dimension dependence, thanks to the Holder's inequality for matrix Schattern norm:</p><p>Leveraging this property, we can construct the potential function as the sum of three terms (omitting constants):</p><p>With similar techniques for analyzing the faster iterates and cross-correlation norms, we can obtain a clean O(&#945;) asymptotic error without additional dimension dependence. The full proof is given in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Proof Outline of Theorem 3.3</head><p>Once we have the MSE convergence result, extending the strategies in prior work for the single-timescale SA <ref type="bibr">[10,</ref><ref type="bibr">24]</ref>, we first consider two coupling sequences via sharing the common noise sequence (x</p><p>1 t , y 1 t , &#958; t ) and (x 2 t , y 2 t , &#958; t ). The idea is to show that the coupled sequences &#948; x t := x1 t -x2 t , &#948; y y := &#563;1 t -&#563;2 t converge linearly (Lemma B.1),</p><p>Then we can design two sequences coupled in such a way that (x 2 t , y 2 t , &#958; t )</p><p>Combining the two results, the sequence (x 1 t , y 1 t , &#958; t ) converges in L 2 -Wasserstein distribution to a unique stationary point. The remaining details can be found in Appendix B.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Bias and Variance</head><p>Turning to the stationary distributions of the iterates, we observe that x &#8734; satisfies</p><p>Conditioned on the event &#958; &#8734;+1 = &#958;, we have &#958; &#8734; &#8764; P &#8224; (&#8226;|&#958;), where P &#8224; is the adjoint of the transition kernel P. Using this relation, we can construct a stationary equation for E[x &#8734; |&#958; &#8734; = &#958;], and find the explicit expression for biases by integrating the conditional expectation over a stationary distribution &#960;, i.e.,</p><p>The variance of &#563;&#8734; is relatively simple to bound:</p><p>However, showing the variance upper bound O(&#945;) can not be derived in the same fashion since the MSE bound for x&#8734; is O(&#945; + &#946; 2 ). To derive this, we also construct a stationary equation for the covariance:</p><p>and show that S 1 -norm of the above is O(&#945;). Using the inequality Tr(A) &#8804; &#8741;A&#8741; 1 completes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Technical Lemmas</head><p>Lemma A.1 For any two real matrices A, B, we have</p><p>Lemma A.2 For a positive definite matrix Q &#8827; 0 and any real matrix A, the following holds:</p><p>Lemma A. <ref type="bibr">3</ref> For a positive definite matrix Q &#8827; 0, and for any vectors x, y and a matrix M ,</p><p>is the condition number of Q.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma A.4 (Lemma C.13 in [20])</head><p>Let -A be a Hurwitz matrix and Q be the solution to</p><p>Then for all &#1013; &#8712; 0,</p><p>, for any matrix B, we have</p><p>Lemma A. <ref type="bibr">5</ref> For any two positive definite matrices Q 1 , Q 2 and a vector x, we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 Auxiliary Lemmas</head><p>We list some useful facts and lemmas here.</p><p>Lemma A. <ref type="bibr">6</ref> For any t &#8805; &#964; , for all i, j &#8712; {1, 2}, we have</p><p>where v t , u t are any vectors that can be constructed at the t th iteration. Lemma A. <ref type="bibr">7</ref> Let two intermediate variables:</p><p>Then, w x t , w y t can be rewritten as</p><p>Lemma A.8 For any t &#8805; &#964; &#8805; &#964; &#945; , we have</p><p>The following corollary is convenient:</p><p>Lemma A.10 For any t &#8805; &#964; &#8805; c log( &#954;y &#945;&#181;y ) with an absolute constant c &gt; 0, we have</p><p>Similarly, we can derive the same upper bound for &#8741;E[w y t &#563;&#8868; t ]&#8741; 1 .</p><p>Lemma A.11 For any t &#8805; &#964; &#8805; c log( &#954;x &#945;&#181;x ) with an absolute constant c &gt; 0, we have</p><p>Similarly, we can derive the same upper bound for &#8741;E[w</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proof of Main Theorems</head><p>We recall the definition of &#963; x , &#963; y in ( <ref type="formula">2</ref>)</p><p>Recall that we assume &#946;/&#945; &#8811; &#954; y in Assumption 4, and &#8741;&#8710;&#8741; op &#8804; J max &#954; y .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Proof of Theorem 3.2</head><p>The proof first investigates the convergence of three terms</p><p>Then, by constructing the potential function as the following:</p><p>and show that they decay in exponential rates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1.1 Convergence of &#563;t</head><p>We first study the descent behavior of &#563;t :</p><p>We bound each term:</p><p>1. Using Lemma A.4, we have</p><p>2. Using the formula in Lemma A.7,</p><p>where we also used Lemma A.3 to have</p><p>Qy , and &#954;(Q y ) = O(&#954; y ). 3. Using Cauchy-Schwarz inequality, we have</p><p>4. We separate the cross-product term across &#563;t and xt :</p><p>|.</p><p>For (i), we can derive that</p><p>For (ii), we can simply apply Cauchy-Schwarz inequality with</p><p>For (iii), we bound the term as</p><p>Combining (i)-(iii), we get</p><p>5. For the cross-product with noise, we get</p><p>6. For the last term, we simply apply Cauchy-Schwartz inequality and use inequalities used before:</p><p>Hence to summarize, with &#945; &#8810; &#946;/&#954; 3 y and &#946; &#8810; 1/(J max &#954; 2 y ), we get</p><p>Then we can invoke Lemma A.10 with &#964; = O(&#964; &#945; ), and noting that &#946;J 2 max &#954; y &#8810; &#181; y to conclude that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1.2 Convergence of xt</head><p>We start with taking squared-&#8741; &#8226; &#8741; Qx norm for the slower iterates:</p><p>Following the similar steps for the analysis of &#563;t , we show the followings:</p><p>1. The main drift term satisfies</p><p>2. For the squared terms,</p><p>3. For the cross-product term,</p><p>4. For the product term with noise, we have</p><p>Writing down the intermediate result, with &#945; &#8810; 1/(J max &#954; x &#954; 3 y ), we have</p><p>Invoke Lemma A.11 with &#964; = O(&#964; &#945; ), and we can conclude that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1.3 Convergence of Cross-Correlations in S 1 -Norm</head><p>We start with unfolding the equation:</p><p>&#946;J 22 )&#563; t x&#8868; t (I -&#945;&#8710;) -&#945;(I -&#946;J 22 )&#563; t (J 12 &#563;t + w x t ) &#8868; -&#945;J -1 22 J 21 (J 12 &#563;t + &#8710;x t + w x t )x &#8868; t -&#946;w y t x&#8868; t + &#945; 2 J -1 22 J 21 (J 12 &#563;t + &#8710;x t + w x t )(&#8710;x t + J 12 &#563;t + w x t ) &#8868; + &#945;&#946; &#8226; w y t (&#8710;x t + J 12 &#563;t + w x t ) &#8868; . The target norm is &#8741; &#8226; &#8741; 1 bound on the expectation of the cross-product term. The trick is to multiply Q 1/2 y from left on both sides, and use identity</p><p>We observe the following:</p><p>&#8741; op = &#8741;I -&#946;J 22 &#8741; Qy &#8804; 1 -&#181; y &#946;, and therefore</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">In all other terms, we use inequality &#8741;E[uv</head><p>). We omit some algebraic details, and state the desired bounds:</p><p>Applying Lemma A.11 and A.10, and using &#945; &#8810; &#946;/&#954; 2 y in Assumption 4, we can conclude that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1.4 Overall Convergence</head><p>Recall the potential function V t and U t in (10):</p><p>We note that &#946; &#8810; 1/(&#954; 3 y &#954; x J max ) and &#946;/&#945; &#8810; 1/(&#954; 3 y &#954; x ) in Assumption 4, and</p><p>Putting altogether, we have</p><p>Solving this recursion,</p><p>x , for all t, hence for all sufficiently large t &#8811; &#945; -1 , we have bounds for</p><p>Next, we consider the potential for faster iterates U t . We see that</p><p>E[&#8741;y t &#8741; 2 Qy ] &#8804; U t &#8804; exp(-&#946;&#181; y t/2)U 0 + &#946;&#954; 4 y &#964; &#945; exp(-&#945;&#181; x t/4)V 0 + O &#954; 2 y &#964; &#181; 2 y &#945; &#946; + &#964; 2 J max &#954; 4 y &#181; 2 x &#946; &#945;&#963; 2 x + O &#964; &#181; 2 y &#946; &#963; 2 y , assuming &#946;&#181; y &#8811; &#945;&#181; x . Thus for sufficiently large t &#8811; &#945; -1 log(1/&#946;), we have E[&#8741;&#563; t &#8741; 2 Qy ] = O P (&#945; 2 /&#946; + &#945;&#946;)&#963; 2 x + O P (&#946;)&#963; 2 y . This concludes the final error rates as t &#8594; &#8734;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Proof of Theorem 3.3</head><p>Showing the distributional convergence consists of two steps. First, we setup two sequences {(x 1 t , y 1 t , &#958; t )} t&#8805;0 , {(x 2 t , y 2 t , &#958; t )} t&#8805;0 coupled with the same sequence of Markovian states {&#958; t } t&#8805;0 . We show that these two sequences will converge in the squared-L 2 expectation sense:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma B.1 Under Assumptions 1-4, for any two sequences coupled with the same Markovian nosie (x 1</head><p>t , y 1 t , &#958; t ) and (x 2 t , y 2 t , &#958; t ), the following holds:</p><p>We first prove the above lemma and use it to conclude that the distribution of iteration variables converges in Wasserstein distance to a unique stationary distribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2.1 Proof of Lemma</head><p>B.1 Let us define &#948; x t = x1 t -x2 t , &#948; y t = &#563;1 t -&#563;2 t . Then the stochastic recursion (1) becomes &#948; x t+1 = (I -&#945;&#8710;)&#948; x t -&#945;J 12 &#948; y t -&#945;&#948; wx t , and for y, we have &#948; y t+1 = (I -&#946;J 22 )&#948; y t -&#945;J -1 22 J 21 (J 12 &#948; y t + &#8710;&#948; x t ) -(&#945;J -1 22 J 21 &#948; wx t + &#946;&#948; wy t ),</p><p>where the noise differences are given by:</p><p>where we used the expression in Lemma A.7. This can be considered as the same TTSA recursion with &#963; x = &#963; y = 0. Therefore, the remaining steps are equivalent to the pilot result with &#963; x = &#963; y = 0, and it leads to</p><p>where we define</p><p>This shows that (x 1 t , &#563;1 t ), (x 2 t , &#563;2 t ) converges exponentially fast with the noise coupling, which in turn means (x 1 t , y 1 t ) and (x 2 t , y 2 t ) couples exponentially fast since</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2.2 Distributional Convergence via Coupling</head><p>The steps here mostly follows the proof steps in <ref type="bibr">[24]</ref>, Appendix A.2.2. We first consider a sequence (&#958; 1 t , x 1 t , y 1 t ) t&#8805;0 that starts at (x 1 0 , y 1 0 , &#958; 1 0 ) &#8764; &#181; 0 sampled from some initial distribution &#181; 0 where &#958; 1 0 &#8764; &#960; and (x 1 0 , y 1 0 ) are statistically independent. Then, we similarly define the initial point distribution of the second sequence (x 2 -1 , y 2 -1 ) as the same as (x 1 0 , y 1 0 ) and set (x 2 0 , y 2 0 ) be the result of one-step stochastic recursion <ref type="bibr">(1)</ref>, where &#958; 2 -1 &#8764; P &#8224; (&#8226;|&#958; 1 0 ). Then we couple the Markovian states &#958; 1 t = &#958; 2 t for all t &#8805; 0. Now that we have</p><p>1 follws a stationary distribution &#960;, and &#958; 1 t = &#958; 2 t is coupled. Then by definition of Wasserstein distance (with the optimal coupling), using Lemma B.1, we get</p><p>and therefore (omitting superscript)</p><p>with &#945; &gt; 0. The probability space over &#926; &#215; R dx &#215; R dy equipped with W2 -norm is known to be a Polish space where every Cauchy sequence converges ( <ref type="bibr">[44]</ref>, Theorem 6.18). Furthermore, convergence in Wasserstein distance implies weak convergence ( <ref type="bibr">[44]</ref>, Theorem 6.9), hence weak convergence to some distribution &#181; &#8712; P 2 (&#926; &#215; R dx &#215; R dy ).</p><p>We take conditional expectation on &#958; &#8734;+1 = &#958;, we have the backward conditional probability &#958; &#8734; &#8764; P &#8224; (&#8226;|&#958; &#8734;+1 = &#958;).</p><p>Let T : &#926; &#215; R * &#8594; &#926; &#215; R * an unnormalized Markov operator over &#958;:</p><p>Using the above notation, we can rewrite the recursion as</p><p>Let &#928; = 1 &#960;, and note that &#928;{W ij } = 0 for all i, j &#8712; {1, 2}. We eventually want to characterize zx := E[x &#8734; -x * ] = &#960;{z x } = &#958; z x (&#958;)d&#960;(&#958;) and zy := &#960;{z y }. Since &#960;P &#8224; = &#960; by the time-reversing property of the geometrically mixing chain, this implies &#8710;z x + J 12 zy + &#960;{w x } = 0,</p><p>To further proceed, let &#948; x (&#958;) = z x (&#958;) -&#960;{z x }, &#948; y (&#958;) = z y (&#958;) -&#960;{z y }, and since (P &#8224; -&#928;){z x } = (P &#8224; -&#928;){&#948; x }, we can observe that</p><p>Then we note that</p><p>Plugging this back into (17) yields</p><p>In turn, we have</p><p>Rearranging ( <ref type="formula">18</ref>) yields</p><p>The operation (I -P &#8224; + &#928;) is invertible (see Corollary B.6), and thus we can invert the operator (I -P &#8224; + &#928;). Putting all relationships together leads us to the recursion:</p><p>where</p><p>are independent of the choice of &#945;, &#946;. Next, we bound the norm of &#948; x , &#948; y , d x , d y , and thus the norm of zx , zy .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2.5 Additional Preliminaries for Bounding Norms</head><p>Before we proceed, we define the notion of norms that we use in the proof. For vector-valued quantities, let us define &#8741;v&#8741; L 2 (&#960;) as</p><p>and for the Markov kernel T ,</p><p>For matrices, we use the conjugate norm-pair &#8741; &#8226; &#8741; 1 and &#8741; &#8226; &#8741; &#8734; = &#8741; &#8226; &#8741; op . Specifically, for matrix-valued quantities, we define &#8741;A&#8741; S 1 (&#960;) as</p><p>and</p><p>The following holder's inequality is crucial to obtain dimension-free bounds on variances: <ref type="bibr">4</ref> For Markov kernel T and conditional matrix A(&#958;), We have</p><p>where</p><p>Using the results from Markov chain literature, we have the following lemma:</p><p>Lemma B.5 (Proposition 22.3.5 in <ref type="bibr">[13]</ref>) Let P be a Markov Kernel on a Boral state-space &#926; with invariant probability &#960;. Under Assumption 2, we have</p><p>The following is the corollary:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2.6 Norm-Bounds for Stationary Bias</head><p>We show that &#8741;&#948; x &#8741; L2(&#960;) = O P (&#945;), &#8741;&#948; y &#8741; L2(&#960;) = O P (&#946;). First, we note that</p><p>where</p><p>The last inequality is because</p><p>by Theorem 3.2, and similarly, we can also show that &#8741;z x &#8741; L 2 (&#960;) = O P (&#945; + &#946; 2 ). Furthermore,</p><p>for the same problem-dependent constants C 1 , C 2 , C 3 as defined above. This concludes that for &#945; &#8810; &#946; &#8810; 1/ max(C 1 , C 2 ), we have</p><p>Similarly, we can show that</p><p>Similarly, we have &#8741; by 1 &#8741; 2 = O P (1) and bx i = O P (1) for i = 1, 2. We can plug this result back to <ref type="bibr">(19)</ref> to conclude the bias part of Theorem 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2.7 Dimension-Free Bounds for Variances</head><p>We note that the variance of x &#8734; is measured by</p><p>where the expectation is taken over the stationary distribution (x &#8734; , y &#8734; , &#958; &#8734; ) &#8764; &#181;, and thus we aim bound Tr(V(x &#8734; )). For y, it is sufficient to bound y &#8734; by O(&#946;). To see this, note that</p><p>To proceed, we also get the expression for &#931; xy :</p><p>where</p><p>and C &#8242; is appropriately defined. The steady-state equation is</p><p>and the system equation is</p><p>Combining these results, we can conclude that</p><p>Now plugging this back to <ref type="bibr">(24)</ref>, we have</p><p>Then using these results, from <ref type="bibr">(23)</ref>, we can derive that</p><p>Therefore, we can conclude that &#8741; &#931;x &#8741; 1 = O P (&#945;). Since &#8741; &#931;x &#8741; 1 = Tr( &#931;x ) = Tr(V(x &#8734; )), we obtain the last part of the theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2.8 Proof of Corollary 3.4</head><p>This is in fact a corollary of Lemma B.3. To see this, apply Lemma B.3 with &#181; 2 0 = &#181;, and then note that under optimal coupling between &#181; 0 and &#181;,</p><p>Similarly,</p><p>Then from Theorem 3.3, applying Tr(V(x &#8734; )) = O P (&#945;) and Tr(V(&#563; &#8734; )) = O P (&#946;), we have the lemma.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Tail Averaging and Extrapolation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3.1 Proof of Theorem 3.5</head><p>First, let us define V k , U k as:</p><p>For all k &#8805; t 0 &#8811; (&#945;&#181; x ) -1 log(1/(&#945;&#181; x )), with Theorem 3.4, we ensure that under an optimal coupling,</p><p>and similarly,</p><p>Slower Iterate: We want to analyze</p><p>where x &#8734; &#8764; &#181;(x). To show that this quantity is O(&#945;), it suffices to bound</p><p>Qx ] under the optimal coupling. Rewriting this term,</p><p>We first note that by Theorem 3.4, under the optimal coupling between x k and x &#8734; , we get</p><p>To proceed, we note that</p><p>To bound the second term, we first note that for any k &#8242; &gt; 0, we use an optimal coupling between x k+k &#8242; |F k and x &#8734; , and again apply Theorem 3.4:</p><p>Using Cauchy-Schwarz inequality, we have</p><p>where we used that V k = O P (&#945;) for all k &#8805; t 0 . Plugging this, we can conclude that</p><p>Faster Iterate: In this case, we first note that</p><p>where &#7929;k := 1 t-t0 t t &#8242; =t0 &#563;t . The second term is squared-bias in order O P (&#946; 2 ), and the third term inherits the error analysis from slower iterates. Thus, we focus on bounding the first term.</p><p>Following the same process for slower iterates, we first note that</p><p>For the first term, we invoke Corollary 3.4, under optimal coupling, we have</p><p>For the second term, with Corollary 3.4, we have</p><p>and again using Cauchy-Schwarz inequality, we can show that</p><p>On the other hand, we can apply Cauchy-Schwarz inequality in different ways:</p><p>Summarizing the results, we can conclude that</p><p>This concludes Theorem 3.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3.2 Proof of Corollary 3.6</head><p>We note that</p><p>Note that from Theorem 3.3,</p><p>The first and second terms in ( <ref type="formula">25</ref>) can be bounded by O P (1)/(t -t 0 ), following exactly same steps in the proof of Theorem 3.5. The result for faster iterates can also be derived similarly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Deferred Proofs</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Proof of Lemma 3.1</head><p>The stochastic approximation equation becomes</p><p>Using H(x * ) = 0, G(x, y * (x)) = 0, we can rewrite the recursion as <ref type="bibr">(5)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Proof of Lemma B.2</head><p>We can start with a coarse bound on &#8741;&#563; t+1 &#8741;</p><p>2 Qy : &#8741;&#563; t+1 &#8741; 2 Qy &#8804; &#8741;(I -&#946;J 22 )&#563; t &#8741; 2 Qy + &#945; 2 &#8741;J -1 22 J 21 (J 12 &#563;t + &#8710;x t + w x t )&#8741; 2 Qy + &#946; 2 &#8741;w y t &#8741; 2 Qy + 2&#945; &#10216;(I -&#946;J 22 )&#563; t , -J -1 22 J 21 (J 12 &#563;t + &#8710;x t + w x t )&#10217; Qy + 2&#946; &#10216;(I -&#946;J 22 )&#563; t , -w y t &#10217; Qy + 2&#945;&#946; &#10216;J -1 22 J 21 (J 12 &#563;t + &#8710;x t -w x t ), w y t &#10217; Qy &#8804; (1 -&#946;&#181; y /2)&#8741;&#563; t &#8741; 2 Qy + O P (&#946; 2 )&#8741;x t &#8741; 2 2 + O P (&#945;)&#963; 2 x + O P (&#946;)&#963; 2 y .</p><p>Thus, taking the square on both sides, we get</p><p>Given the above results, let c 2 &gt; 0 be another sufficiently large absolute constant. Now if t &#8804; c 2 log(&#946;/&#945;)/(&#945;&#181; x ), then we set &#964; = c 1 &#964; &#945; /4 such that t &gt; 4&#964; . With TV(&#181; 1) &#8804; exp(-&#945;&#181; x t/4), we have</p><p>On the other hand, if t &gt; c 2 log(&#946;/&#945;)/(&#945;&#181; x ), then we take &#964; = t/8. In this case, we instead invoke the MSE result in Theorem 3.2, which gives</p><p>Together with TV(&#181; 1 &#964; (&#958; 1 &#964; ), &#181; 2 &#964; (&#958; 2 &#964; )) &lt; &#961; t/8 &#8804; exp(-&#945;&#181; x t/4), we get the same conclusion that</p><p>The inequality for y in ( <ref type="formula">16</ref>) can also be similarly proven.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Proof of Technical Lemmas D.1 Proof of Lemma A.1</head><p>This result follows immediately from the fact that Tr(AB) &#8804; &#8741;AB&#8741; 1 , and H&#246;lder's inequality applied to matrix p-Schattern norm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Proof of Lemma A.2</head><p>By definition of &#8741;A&#8741; Q , we start from</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 Proof of Lemma A.3</head><p>By definition for &#10216;</p><p>Next, we observe that</p><p>Then since Qxx &#8868; is a rank-1 matrix, we have</p><p>Then, note that</p><p>yielding the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.4 Proof of Lemma A.5</head><p>This can be shown from the definition of Q-norm:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.5 Proof of Lemma A.6</head><p>Let &#960; &#964; = P &#958;t (&#8226;|F t-&#964; ) be a distribution over &#926;. By Assumption 2,</p><p>Furthermore, we know that &#926; W ij (&#958;)d&#960;(&#958;) = 0.</p><p>Thus,</p><p>The second inequality also follows similarly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.6 Proof of Lemma A.7</head><p>Note that x t = xt + x * and y t = &#563;t -J -1 22 J 21 xt + J -1 22 J 21 x * . Plugging these to w x t = W 11 (&#958; t )x t + W 12 (&#958; t )y t + u 1 (&#958; t ) and similarly to w y t yields the expressions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.7 Proof of Lemma A.8</head><p>By the recursion in <ref type="bibr">(5)</ref>,</p><p>Similarly, we have &#8741;&#563; t+1 &#8741; 2 &#8804; (1 + &#946;J max )&#8741;&#563; t &#8741; 2 + &#946;(J max &#954; y &#8741;x t &#8741; 2 + &#963; y ) + &#954; y &#945;(&#954; y J max &#8741;x t &#8741; 2 + J max &#8741;&#563; t &#8741; 2 + &#963; x ), &#8741;&#563; t+1 -&#563;t &#8741; 2 &#8804; &#946;(J max &#8741;&#563; t &#8741; 2 + J max &#954; y &#8741;x t &#8741; 2 + &#963; y ) + &#954; y &#945;(&#954; y J max &#8741;x t &#8741; 2 + J max &#8741;&#563; t &#8741; 2 + &#963; x ).</p><p>Adding two equations, &#954; y &#8741;x t+1 &#8741; 2 + &#8741;&#563; t+1 &#8741; 2 &#8804; (1 + 2&#946;J max )(&#954; y &#8741;x t &#8741; 2 + &#8741;&#563; t &#8741; 2 ) + &#946;&#963; y + &#945;&#954; y &#963; x .</p><p>3. For the last term, similarly, &#8741;E[W 11 (&#958; t )x t (&#563; t -&#563;t-&#964; ) &#8868; ]&#8741; 1 &#8804; W max E[&#8741;x t &#8741; 2 &#8741;&#563; t -&#563;t-&#964; &#8741; 2 ] &#8804; O(1)&#946;&#964; J 2 max &#8226; E[(&#954; y &#8741;x t &#8741; 2 + &#8741;&#563; t &#8741; 2 )&#8741;x t &#8741; 2 ] + O(1)&#964; J max E[&#8741;x t &#8741; 2 ](&#945;&#954; y &#963; x + &#946;&#963; y ) &#8804; O(1)&#946;&#964; J 2 max E[&#954; y &#8741;x t &#8741; 2 2 + &#8741;&#563; t &#8741; 2 2 ] + O(1)&#964; ((&#945; 2 /&#946;)&#954; y &#963; 2 x + &#946;&#963; 2 y ). Combining these inequalities, and given that &#946;&#964; &#8810; &#954; -2 y /J max in Assumption 4, we can conclude that &#8741;E[W 11 (&#958; t )x t &#563;&#8868; t ]&#8741; 1 &#8804; &#181; y 16&#954; y E[&#8741;&#563; t &#8741; 2 2 ] + O(1)&#946;&#964; J 2 max &#954; y E[&#8741;x t &#8741; 2 2 ] + O(1)&#964; (&#945; 2 /&#946;)&#954; y &#963; 2 x + &#946;&#963; 2 y . Similarly, we can also show that &#8741;E[W 12 (&#958; t )&#563; t &#563;&#8868; t ]&#8741; 1 &#8804; &#181; y 16&#954; y E[&#8741;&#563; t &#8741; 2 2 ] + O(1)&#946;&#964; J 2 max &#954; y E[&#8741;x t &#8741; 2 2 ] + O(1)&#964; ((&#945; 2 /&#946;)&#954; y &#963; 2 x + &#946;&#963; 2 y ). For the last one, we proceed as &#8741;E[u 1 (&#958; t )&#563; &#8868; t ]&#8741; 1 &#8804; &#8741;E[u 1 (&#958; t )&#563; &#8868; t-&#964; ]&#8741; 1 + &#8741;E[u 1 (&#958; t )(&#563; t -&#563;t-&#964; ) &#8868; ]&#8741; 1 &#8804; &#961; &#964; u max E[&#8741;&#563; t-&#964; &#8741; 2 ] + u max E[&#8741;&#563; t -&#563;t-&#964; &#8741; 2 ] &#8804; O(1)(&#961; &#964; + &#946;&#964; J max )u max E[&#954; y &#8741;x t &#8741; 2 + &#8741;&#563; t &#8741; 2 ] + &#964; u max (&#945;&#954; y &#963; x + &#946;&#963; y ) &#8804; O(1)&#946;&#964; J 2 max E[&#954; y &#8741;x t &#8741; 2 2 + &#8741;&#563; t &#8741; 2 2 ] + O(1)&#964; ((&#945; 2 /&#946;)&#954; y &#963; 2 x + &#946;&#963; 2 y ),</p><p>where in the last inequality, we use u max &#8804; &#963; y . Combining all the above inequalities yields the lemma.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.9 Proof of Lemma A.11</head><p>To begin with, we start with unfolding the expression as</p><p>E[w x t x&#8868; t ] = E[W 11 (&#958; t )x t x&#8868; t ] + E[W 12 (&#958; t )&#563; t x&#8868; t ] + E[u 1 (&#958; t )x &#8868; t ]. To proceed, we first note that E[W 11 (&#958; t )x t x&#8868; t ] = E[W 11 (&#958; t )x t-&#964; x&#8868; t-&#964; ] + E[W 11 (&#958; t )(x t -xt-&#964; )x &#8868; t-&#964; ] + E[W 11 (&#958; t )x t (x t -xt-&#964; ) &#8868; ].</p><p>For each term in the above, we have the following inequalities:</p><p>1. Using the mixing-time assumption, we can show that</p><p>2. For the next term, we apply Lemma A.8 and Corollary A.9:</p><p>1)&#945;&#964; J max W max &#8226; E[(&#954; y &#8741;x t &#8741; 2 + &#8741;&#563; t &#8741; 2 )&#8741;x t-&#964; &#8741; 2 ] + O(1)&#945;&#964; W max (&#963; x + &#946;J max &#963; y )E[&#8741;x t-&#964; &#8741; 2 ] &#8804; O(1) &#8226; &#945;&#964; J 2 max E[&#954; y &#8741;x t &#8741; 2 2 + &#8741;&#563; t &#8741; 2 2 + &#964; 2 &#945; 2 (&#963; 2 x + &#946; 2 J 2 max &#963; 2 y ) &#8804; &#181; x 32&#954; x E[&#8741;x t &#8741; 2 2 ] + O(1)&#945;&#964; J 2 max E[&#8741;&#563; t &#8741; 2 2 ] + (&#945;&#964; ) 3 (&#963; 2 x + &#946; 2 J 2 max &#963; 2 y ), where we use the condition that &#945;&#964; &#8810; 1/(J max &#954; 2 x ) and W max &#8804; J max . Combining these inequalities, and given that &#946;&#964; &#8810; &#954; -2 y /J max in Assumption 4, we can conclude that &#8741;E[W 11 (&#958; t )x t x&#8868; t ]&#8741; 1 &#8804; &#181; x 16&#954; x E[&#8741;x t &#8741; 2 2 ] + &#945;&#964; J 2 max E[&#8741;&#563; t &#8741; 2 2 ] + &#945;&#964; &#963; 2 x . We also need to check the cross term: E[W 12 (&#958; t )&#563; t x&#8868; t ] = E[W 12 (&#958; t )x t-&#964; &#563;&#8868; t-&#964; ] + E[W 12 (&#958; t )(&#563; t -&#563;t-&#964; )x &#8868; t-&#964; ] + E[W 12 (&#958; t )&#563; t (x t -xt-&#964; ) &#8868; ]. First term can be bounded similarly using the geometric mixing assumption. For the second term, &#8741;E[W 12 (&#958; t )(&#563; t -&#563;t-&#964; )x &#8868; t-&#964; ]&#8741; 1 &#8804; W max E[&#8741;(&#563; t -&#563;t-&#964; )x &#8868; t-&#964; &#8741; 1 ] &#8804; O(1)&#946;&#964; J 2 max &#8226; E[(&#954; y &#8741;x t &#8741; 2 + &#8741;&#563; t &#8741; 2 )&#8741;x t-&#964; &#8741; 2 ] + O(1)&#964; J max (&#945;&#954; y &#963; x + &#946;&#963; y )E[&#8741;x t-&#964; &#8741; 2 ] &#8804; &#181; x 32&#954; x E[&#8741;x t &#8741; 2 2 ] + O(1) &#946; 2 &#964; 2 J 4 max &#954; x &#181; x E[&#8741;&#563; t &#8741; 2 2 ] + O &#945; 2 &#964; 2 J 2 max &#954; 2 y &#954; x &#181; x &#963; 2 x + &#954; x &#946; 2 &#964; 2 J 2 max &#181; x &#963; 2 y . For the third term, similarly, we have &#8741;E[W 12 (&#958; t )&#563; t (x t -xt-&#964; ) &#8868; ]&#8741; 1 &#8804; W max E[&#8741;&#563; t (x t -xt-&#964; ) &#8868; &#8741; 1 ] &#8804; O(1)&#945;&#964; J 2 max &#8226; E[(&#954; y &#8741;x t &#8741; 2 + &#8741;&#563; t &#8741; 2 )&#8741;&#563; t &#8741; 2 ] + O(1)&#964; J max (&#945;&#963; x + &#946; 2 J max &#963; y )E[&#8741;&#563; t &#8741; 2 ] &#8804; &#181; x 32&#954; x E[&#8741;x t &#8741; 2 2 ] + O(1)&#964; J 2 max (&#945; + &#946; 2 J max )E[&#8741;&#563; t &#8741; 2 2 ] + O &#945;&#964; &#963; 2 x + &#946; 2 &#964; J max &#963; 2 y .</p><p>For the last one, we proceed as</p><p>&#8804; (&#961; &#964; + &#945;&#964; J max )u max E[&#954; y &#8741;x t &#8741; 2 + &#8741;&#563; t &#8741; 2 ] + &#964; u max &#945;&#963; x &#8804; &#945;&#964; J</p><p>2 max E[&#954; y &#8741;x t &#8741; 2 2 + &#8741;&#563; t &#8741; 2 2 ] + &#964; &#945;&#963; 2 x ,</p><p>where in the last inequality, we use u max &#8804; &#963; x . Combining all the above inequalities yields the lemma.</p></div></body>
		</text>
</TEI>
