<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Effectiveness of Constant Stepsize in Markovian LSA and Statistical Inference</title></titleStmt>
			<publicationStmt>
				<publisher>AAAI Press</publisher>
				<date>03/25/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10522285</idno>
					<idno type="doi">10.1609/aaai.v38i18.30028</idno>
					<title level='j'>Proceedings of the AAAI Conference on Artificial Intelligence</title>
<idno>2159-5399</idno>
<biblScope unit="volume">38</biblScope>
<biblScope unit="issue">18</biblScope>					

					<author>Dongyan Lucy Huo</author><author>Yudong Chen</author><author>Qiaomin Xie</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>In this paper, we study the effectiveness of using a constant stepsize in statistical inference via linear stochastic approximation (LSA) algorithms with Markovian data. After establishing a Central Limit Theorem (CLT), we outline an inference procedure that uses averaged LSA iterates to construct confidence intervals (CIs). Our procedure leverages the fast mixing property of constant-stepsize LSA for better covariance estimation and employs Richardson-Romberg (RR) extrapolation to reduce the bias induced by constant stepsize and Markovian data. We develop theoretical results for guiding stepsize selection in RR extrapolation, and identify several important settings where the bias provably vanishes even without extrapolation. We conduct extensive numerical experiments and compare against classical inference approaches. Our results show that using a constant stepsize enjoys easy hyperparameter tuning, fast convergence, and consistently better CI coverage, especially when data is limited.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Stochastic approximation (SA) algorithms use stochastic updates to iteratively approximate the solution to fixed-point equations. SA has wide applications, such as the stochastic gradient descent (SGD) algorithm for loss minimization and the Temporal Difference (TD) learning algorithm in reinforcement learning (RL) <ref type="bibr">(Sutton 1988)</ref>. Classical works on SA typically assume the stepsize sequence &#945; t is squaresummable and diminishing, i.e., t &#945; t = &#8734;, t &#945; 2 t &lt; &#8734;, under which asymptotic almost-sure convergence is well studied <ref type="bibr">(Robbins and Monro 1951;</ref><ref type="bibr">Blum 1954;</ref><ref type="bibr">Borkar and Meyn 2000)</ref>. Constant stepsize has gained popularity recently, particularly among practitioners, due to its fast initial convergence and easy hyperparameter tuning. A growing line of works studies the convergence properties of SA under constant stepsize, establishing upper bounds on the meansquared error (MSE) <ref type="bibr">(Lakshminarayanan and Szepesv&#225;ri 2018;</ref><ref type="bibr">Srikant and Ying 2019;</ref><ref type="bibr">Mou et al. 2020</ref><ref type="bibr">Mou et al. , 2021) )</ref> as well as weak convergence results <ref type="bibr">(Dieuleveut, Durmus, and Bach 2020;</ref><ref type="bibr">Yu et al. 2021;</ref><ref type="bibr">Huo, Chen, and Xie 2023a)</ref>.</p><p>Recent works have explored using SA and SGD iterates to perform statistical inference, e.g., constructing confidence intervals (CIs) around a point estimate <ref type="bibr">(Li et al. 2018;</ref><ref type="bibr">Chen et al. 2020;</ref><ref type="bibr">Li et al. 2022;</ref><ref type="bibr">Lee et al. 2022;</ref><ref type="bibr">Liu, Chen, and Shang 2023)</ref>. This approach is computationally cheap and scales well with the size and dimension of the dataset: SA updates are computed iteratively, without storing or multiple passes over the dataset. In comparison, classical inference techniques, such as bootstrapping, often require storing the entire dataset and repeating the estimation procedure, which may have prohibitive computational costs.</p><p>The aforementioned works on using SA for inference have focused on the diminishing stepsize paradigm, for which a mature convergence theory exists, ensuring the asymptotic correctness of the inference results. In contrast, constant-stepsize SA lacks last-iterate almost sure convergence: recent works have shown that the iterates converge only in distribution; moreover, the limit distribution may have a nonzero asymptotic bias due to the nonlinearity of the SA updates <ref type="bibr">(Dieuleveut, Durmus, and Bach 2020)</ref> or the underlying Markovian data <ref type="bibr">(Huo, Chen, and Xie 2023a)</ref>, and this bias cannot be eliminated by iterate averaging. Partly due to these considerations, inference with constant stepsize SA iterates has been largely overlooked in the literature.</p><p>We study the effectiveness of statistical inference using constant-stepsize SA iterates. We focus on linear stochastic approximation (LSA), i.e., &#952; t+1 = &#952; t + &#945;(A(x t )&#952; t + b(x t )), with a constant stepsize &#945; and Markovian data (x t ) t&#8805;0 . We first establish a Central Limit Theorem (CLT) for averaged Markovian LSA iterates. Built upon the CLT, we outline an inference procedure using averaged iterates and batch-mean covariance estimates. Our procedure leverages the fast mixing property of constant-stepsize updates for better covariance estimation and employs Richardson-Romberg (RR) extrapolation for bias reduction. We study two stepsize schemes in RR extrapolation. We further prove that with Markovian data, the asymptotic bias may vanish in several important settings, in which case inference with constantstepsize LSA iterates is effective even without RR extrapolation. We conduct extensive experiments to benchmark our procedure against conventional inference approaches. Our results demonstrate superior and robust performance of the constant stepsize paradigm, which enjoys fast convergence, good coverage properties, and easy parameter tuning.<ref type="foot">foot_0</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>Dating back to <ref type="bibr">Robbins and Monro (1951)</ref>, classical works on stochastic approximation typically assume i.i.d. data and a square-summable and diminishing stepsize sequence. Subsequent works propose what is now known as the Polyak-Ruppert iterate averaging <ref type="bibr">(Ruppert 1988;</ref><ref type="bibr">Polyak 1990</ref>) and establish a CLT for the averaged iterates <ref type="bibr">(Polyak and Juditsky 1992)</ref>. Recent work in <ref type="bibr">Borkar et al. (2023)</ref> establishes a CLT for scaled iterates for general contractive SA under diminishing stepsize and Markovian data.</p><p>Constant-stepsize SA and SGD have attracted growing attention recently. Several works study constant-stepsize LSA with i.i.d. data and establish finite-time MSE bounds and a CLT for the average iterates <ref type="bibr">(Lakshminarayanan and Szepesv&#225;ri 2018;</ref><ref type="bibr">Mou et al. 2020)</ref>. A parallel line of work studies SGD for both convex <ref type="bibr">(Dieuleveut, Durmus, and Bach 2020</ref>) and nonconvex (Yu et al. 2021) functions with i.i.d. data and identifies an asymptotic bias arising from the nonlinearity of the function. More recent works study LSA with Markovian data. The works in <ref type="bibr">Srikant and Ying (2019)</ref> and <ref type="bibr">Durmus et al. (2023)</ref> study convergence bounds for MSE, while the work in Huo, Chen, and Xie (2023a) establishes weak convergence of the LSA iterates and characterizes its asymptotic bias due to Markovian data.</p><p>Most related to this paper is a recent line of work on using SGD/SA iterates for statistical inference. The work in <ref type="bibr">Chen et al. (2020)</ref> considers SGD with i.i.d. data and strongly-convex functions and proposes two covariance matrix estimators. This result is generalized to &#966;-mixing data in <ref type="bibr">Liu, Chen, and Shang (2023)</ref>. The work in Zhu, <ref type="bibr">Chen, and Wu (2023)</ref> extends the batch-mean estimator in <ref type="bibr">Chen et al. (2020)</ref> to a fully online version. The work in <ref type="bibr">Lee et al. (2022)</ref> proposes a random scaling covariance estimator for robust online inference with SGD. Subsequent work in <ref type="bibr">Li, Liang, and Zhang (2023)</ref> investigates online statistical inference using nonlinear SA with Markovian data. The above works all consider diminishing stepsizes. The work in <ref type="bibr">Xie and Zhang (2022)</ref> extends the random scaling estimator for inference with SA under constant stepsizes but i.i.d. data. The work in <ref type="bibr">Li et al. (2018)</ref> studies statistical inference with SGD and i.i.d. data, using a small stepsize whose value remains constant throughout the iterations but scales inversely with the total number of iterations. In contrast, we consider constant stepsizes whose values are independent of the total number of iterations and thus substantially larger than the typical stepsize values in <ref type="bibr">Li et al. (2018)</ref>.</p><p>RR extrapolation is a classical technique from numerical analysis to improve approximation errors. It has been used in <ref type="bibr">Dieuleveut, Durmus, and Bach (2020)</ref> for SGD and Huo, Chen, and Xie (2023a) for LSA. See the survey by <ref type="bibr">Bach (2021)</ref> for the use of RR extrapolation in other data science and machine learning problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Problem Setup</head><p>In this section, we formally set up the problem and introduce the assumptions. Let (x t ) t&#8805;0 be a time-homogeneous stochastic process on a Borel state space X with stationary distribution &#960;. Define the target vector &#952; * as the solution to the steady-state equation</p><p>where A : X &#8594; R d&#215;d and b : X &#8594; R d are deterministic functions on X . To approximate &#952; * , we consider the following linear stochastic approximation iteraion:</p><p>where (&#945; t ) t&#8805;0 is the stepsize sequence. We focus on using a constant stepsize, i.e., &#945; t &#8801; &#945; for all t &#8805; 0. (We omit the superscript (&#945;) when the stepsize is clear from the context.) We emphasize that the constant stepsize considered here is independent of the total number of iterations. This is different from the work in <ref type="bibr">Li et al. (2018)</ref>, which pre-specifies the total number of iterations T and uses a small fixed stepsize of the form &#945; t = T -&#946; , &#8704;0 &#8804; t &#8804; T for some &#946; &gt; 0.</p><p>We make the following standard assumptions. Assumption 1. (x t ) t&#8805;0 is a uniformly ergodic Markov chain with transition kernel P and a unique stationary distribution &#960;.</p><p>Assumpiton 1 is common in the literature on Markovian SA <ref type="bibr">(Bhandari, Russo, and Singal 2021;</ref><ref type="bibr">Durmus et al. 2023;</ref><ref type="bibr">Huo, Chen, and Xie 2023a)</ref>. Uniform ergodicity ensures that the distribution of x t converges geometrically to &#960; from any initial distribution. For example, all irreducible, aperiodic, and finite state space Markov chains are uniformly ergodic. Assumption 2.</p><p>The Hurwitz condition is again standard and ensures the stability of a dynamic system <ref type="bibr">(Srikant and Ying 2019;</ref><ref type="bibr">Durmus et al. 2023;</ref><ref type="bibr">Huo, Chen, and Xie 2023a)</ref>. This assumption is satisfied in, e.g., SGD for minimizing strongly convex quadratics and the linear TD algorithm in RL.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Central Limit Theorem</head><p>Our first result is a CLT for averaged LSA iterates, which lays the theoretical foundation for the inference procedure developed later. To state the CLT, we recall a known result for constant-stepsize Markovian LSA: the data-iterate pair (x t , &#952; t ) converge weakly to a unique limit distribution, denoted by (x &#8734; , &#952; &#8734; ) &#8764; &#181; <ref type="bibr">(Huo, Chen, and Xie 2023a)</ref>. Theorem 1 (CLT). Under Assumptions 1-2, there exists a threshold &#945; 0 &#8712; (0, 1) such that for all &#945; &#8712; (0, &#945; 0 ), we have</p><p>where &#952;T :</p><p>The proof is deferred to the appendix. Theorem 1 extends existing CLT results, which focus on either LSA with i.i.d. data <ref type="bibr">(Mou et al. 2020;</ref><ref type="bibr">Xie and Zhang 2022)</ref> or SA with diminishing stepsize <ref type="bibr">(Borkar et al. 2023)</ref>. When using Markovian data (x t ) t&#8805;0 , the iterate sequence (&#952; t ) t&#8805;0 is no longer a Markov chain on its own. Instead, we need to consider the joint process (x t , &#952; t ) t&#8805;0 , which is a time-homogeneous Markov chain thanks to the use of a constant stepsize, and build the CLT accordingly. Moreover, a number of existing</p><p>The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)</p><p>Markov Chain CLT results require a one-step contraction property of the form W p,q (&#181;Q, &#957;Q) &#8804; &#947;W p,q (&#181;, &#957;), where Q is the Markovian transition kernel, &#947; &lt; 1 and W p,q is an appropriate Wasserstein distance between distributions (Xie and Zhang 2022). Our Markov chain (x t , &#952; t ) t&#8805;0 does not enjoy such a nice one-step contraction property, and hence proving our CLT requires additional work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Statistical Inference Procedure Using LSA</head><p>We next present a statistical inference procedure using the averaged LSA iterates with constant stepsize and Markovian data. This procedure can be combined with RR extrapolation to construct confidence intervals for the target vector &#952; * .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Asymptotic Bias and RR Extrapolation</head><p>It has been shown in <ref type="bibr">Huo, Chen, and Xie (2023a)</ref> </p><p>&#8734; ] is biased with respect to &#952; * and admits expansion: i) , where B (i) are vectors independent from the stepsize &#945;. Theorem 1 guarantees that the averaged iterates &#952;T are asymptotically normal, but E[&#952;</p><p>&#8734; ]-&#952; * scales with &#945;. To construct CIs with good coverage properties, it is important to reduce the asymptotic bias.</p><p>In light of the bias expansion, we can employ Richardson-Romberg (RR) extrapolation to reduce the asymptotic bias to a higher order polynomial of the stepsize &#945; <ref type="bibr">(Dieuleveut, Durmus, and Bach 2020;</ref><ref type="bibr">Bach 2021;</ref><ref type="bibr">Huo, Chen, and Xie 2023a)</ref>. Specifically, we run the LSA update (1) with M constant stepizes A = {&#945; 1 , . . . , &#945; M } and compute a linear combination of the resulting iterates:</p><p>We carefully choose the coefficients {h m } to satisfy</p><p>Using the aforementioned expansion, one sees that the bias</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inference Procedure</head><p>We now describe the inference procedure, which follows from the procedure in <ref type="bibr">Li et al. (2018)</ref> originally designed for i.i.d. data and a small stepsize.</p><p>Point Estimation and Batching Given a trajectory of (x t ) t&#8805;0 sampled from a Markov chain, we run LSA with constant stepsize &#945; and obtain iterates (&#952;</p><p>t=0 are considered as initial burn-in and are not used in the inference procedure. For the remaining iterates, we divide them equally into K batches of size n. Within each batch, we discard the first n 0 (&#8805; 0) iterates and compute the average of the remaining iterates:</p><p>(b+n)+n0 , . . . , &#952;</p><p>Hence, for the k-th batch, we compute the point estimator &#952;(&#945;)</p><p>l . As such, we obtain a total of K batch-mean estimators { &#952;(&#945;) k }, which will be used for statistical inference. Note that we only need to save the running average of each batch-mean estimator without the necessity to store the entire trajectory (&#952;</p><p>Before delving into the construction of CIs, we briefly remark on several design choices. The initial b iterates are considered as burn-in and are omitted, as these iterates are far away from stationarity, and thus may have substantial optimization errors. The first n 0 iterates of each batch are also discarded, to reduce the correlation of the remaining iterates across batches, i.e., in the order of exp(-&#945;n 0 ).</p><p>Confidence Interval Construction Now that the CLT for the average iterates of Markovian LSA with constant stepsizes has been established, we construct estimators for</p><p>For variance estimation, we adapt an estimator that has been considered in <ref type="bibr">Flegal and Jones (2010)</ref>; <ref type="bibr">Chen et al. (2020)</ref>, and <ref type="bibr">Xie and Zhang (2022)</ref> to our problem. Given { &#952;(&#945;) k }, the variance estimator is computed as</p><p>It has been shown in <ref type="bibr">Flegal and Jones (2010)</ref> that &#931; is a consistent estimator of &#931; * in Theorem 1 as n, K &#8594; &#8734;, Hence, for inference with LSA with stepsize &#945;, for q &#8712; (0, 1), we construct the</p><p>In subsequent experiments, we focus on 95% CIs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Combining With RR Extrapolation</head><p>Next, we apply RR extrapolation in addition to the abovedelineated procedure to construct confidence intervals that have better coverage properties of &#952; * .</p><p>To apply RR extrapolation, we first select a set of M distinct stepsizes, i.e., A = {&#945; 1 , &#945; 2 , . . . , &#945; M } and run LSA iterates with these stepsizes simultaneously by using the same underlying data stream (x t ) t&#8805;0 . For each trajectory of iterates (&#952; (&#945;m) t</p><p>) t&#8805;0 , we follow the inference procedure and obtain K batch-mean estimators { &#952;(&#945;m) k } K k=1 . To obtain the RR extrapolated estimator, we linearly combine the k-th estimates across the M trajectories and obtain</p><p>, with {h m } computed according to (2). We then conduct statistical inference using the iterates { &#952; A k } and build confidence intervals following the similar methodology described above for a single trajectory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theoretical Guarantees</head><p>Next, we provide additional theoretical analysis that helps us achieve better statistical inference performance with RR extrapolated iterates. Please refer to the appendix for the proofs of propositions in this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Stepsize Selection in RR Extrapolation</head><p>As we discussed, one way to improve the coverage probability of &#952; * is to reduce the bias via RR extrapolation. However, RR extrapolation does not come for free, as the coefficients {h m } solving (2) may blow up, thus resulting in large variance and offsetting the benefits of bias reduction. {h m } values are uniquely determined by inverting a Vandermonde matrix, which is infamous for being ill-conditioned when the "roots" {&#945; m } are positive real numbers. Therefore, when employing RR extrapolation, we need to carefully select M and {&#945; m } to maximize the benefits of bias reduction.</p><p>We study two stepsize selection schedules, namely geometric decaying and equidistant sequences. We assume that &#945; m decreases in value as m increases. We establish an upper bound to the variance of &#952; &#8734; in each stepsize regime, which would offer some insight and guidance on stepsize selection.</p><p>The geometric decay schedule is not a conventional choice in numerical analysis, the field from which RR extrapolation originates, but it is frequently employed in machine learning. Proposition 2. Given unique stepsizes A = {&#945; 1 , . . . , &#945; M }. Assume &#945; 1 &lt; 1 and &#945; m = &#945; 1 /c m-1 with c &#8805; 2. We observe the following properties.</p><p>) . It is noteworthy that the variance upper bound for geometric decaying stepsizes does not depend on the number of stepsizes M used in the extrapolation, suggesting that the variance will not blow up as M &#8594; &#8734;. Nonetheless, one problem with geometric decay is that the stepsize would decay too quickly to near zero as M increases. As such, RR iterates with a small constant stepsize mixes slowly and may need a much longer trajectory for the iterates to converge.</p><p>The equidistant decay schedule has been studied in numerical analysis <ref type="bibr">(Gautschi 1990</ref>). We study the behavior of a generic equidistant decay schedule in RR exploration.</p><p>M -1 for m = 1, . . . , M. We observe the following properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">|h</head><p>). When the stepsize sequence decays in equidistance, stepsizes do not decrease as quickly to zero based on the choice of a, b. However, the upper bound of variance now is of order M M , which suggests that the variance could blow up quickly as M increases, potentially offsetting the benefits from reduced bias.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Zero Bias Special Cases</head><p>As discussed earlier, RR extrapolation reduces the asymptotic bias and hence increases the asymptotic coverage probability of &#952; * based on the CI constructed for E[&#952; &#8734; ]. Hence, the Markovian underlying data and the presence of asymptotic bias should not discourage one from using constant stepsize LSA iterates for statistical inference.</p><p>In this section, we would like to highlight that Markovian data need not be a sufficient condition for the presence of asymptotic bias in LSA. That is, there are several commonly observed scenarios where no asymptotic bias is present, even when the underlying data is Markovian.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Independent Multiplicative Noise</head><p>In this scenario, we expand our Markov chain state space to incorporate an independent bounded zero-mean random variable, i.e., x t = (s t , &#958; t ), where (s t ) t&#8805;0 is the uniformly ergodic Markov chain and (&#958; t ) t&#8805;0 is uniformly bounded, i.e., &#958; t &#8804; u, and i.i.d. sampled from distribution &#958; with E[&#958;] = 0. Consider the following Markovian LSA updates,</p><p>where A(s t , &#958; t ) = &#256; + &#958; t and b(s t , &#958; t ) = b(s t ). It can be easily verified that the above LSA iteration satisfies the required Assumptions 1-2. Proposition 4. Consider the LSA iteration of (3), there exists some threshold &#945; 0 &#8712; (0, 1) such that &#8704;&#945; &#8712; [0, &#945; 0 ), (x t , &#952; t ) t&#8805;0 converges weakly to a unique stationary distribution. Moreover,</p><p>The above proposition implies that when the Markovian noise is only additive, the limiting expectation of the iterates converges to &#952; * . Hence, our CLT on E[&#952; &#8734; ] allows us to construct CI and perform statistical inference directly on &#952; * .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Independent Additive Noise in Linear Regression</head><p>We consider an independent additive noise setting in linear regression, which has been previously discussed in <ref type="bibr">Bresler et al. (2020)</ref> and <ref type="bibr">Huo, Chen, and Xie (2023a)</ref>.</p><p>In this linear regression problem, independent observations (s t ) t&#8805;0 are sequentially generated from a uniformly ergodic Markov chain with stationary distribution &#957; and E &#957; [s t s t ] is positive definite. y t = s t w * + t , where t is an i.i.d. zero-mean noise with variance &#963; 2 . The SGD iterates to estimate w * are updated as w t+1 = w t&#945;s t w t , s ty t .</p><p>Casting the problem under the LSA framework, we have w t+1 = w t + &#945;(A(x t )w t + b(x t )) with x t = (s t , t ), A(x t ) = -s t s t and b(x t ) = s t (s t w * + t ). It has been shown in Huo, Chen, and Xie (2023a) that LSA under constant stepsizes has zero asymptotic bias in this setup.</p><p>Realizable Linear-TD Learning in RL In this section, we specialize our discussion on linear-TD learning in RL.</p><p>TD learning algorithm (Sutton 1988) is an important algorithm in RL for policy evaluation. Potentially equipped with linear function approximation, it is a special case of LSA.</p><p>We model the linear-TD problem with a Markov Reward Problem (MRP) (S, P S , r, &#947;). The value function V :</p><p>, where &#966; : S &#8594; R d is a known feature map of finite rank d and &#952; is the unknown weight vector to be estimated. The feature map is normalized such that &#966; max := sup s&#8712;S &#966;(s) 2 &#8804; 1 &#8730; 1+&#947; . We consider the semi-simulator regime, in which the iterates are updated in the following fashion,</p><p>4) where (s t ) t&#8805;0 is a Markov chain and s next t is sampled independently from P S (s t , &#8226;) conditioned on s t . Proposition 5. Assuming (s t ) t&#8805;0 is a uniformly ergodic Markov chain on state space S with s 0 &#8764; &#960; S . Assuming the linear-TD is realizable, i.e., &#8707;v &#8712; R d such that V (s) = &#966;(s) v for all s &#8712; S. The linear-TD iterates of (4) converge without asymptotic bias, i.e.,</p><p>The realizability assumption is not particularly restricting, as there are a number of RL problems that satisfy this assumption, such as TD learning on tabular MDP or linear MDP. Hence, if we have such a structured linear-TD problem, we could use the iterates obtained in the semi-simulator setting to perform statistical inference on the weight vector &#952; estimation and subsequently the value function directly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Numerical Experiments</head><p>We conduct extensive numerical experiments to examine our proposed inference procedure and stepsize selection guidelines in RR extrapolation. We present our main results in this section. Additional sets of experiments and detailed experiment designs are included in the appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Inference Performance Comparison</head><p>We examine the empirical performance of our proposed inference procedure with constant stepsizes and RR extrapolation. We consider LSA problems in dimension d = 5 for a finite state, irreducible, and aperiodic Markov chain with N = 10 states. We generate the transition probability P , and the functions A and b randomly; see the appendix for details. We construct CIs using the LSA iterates. We examine the performance from three aspects, namely the &#8467; 2 error of the point estimate to the target vector, i.e., &#952;&#952; * 2 , the coordinate-wise CI width and coverage probabilities.</p><p>We mainly study under constant stepsizes &#945; = 0.2 and &#945; = 0.02. The two choices of constant stepsizes are of different scales, allowing us to demonstrate extreme effects in convergence and inference performance. RR extrapolation is conducted using the two constant stepsizes. For comparison, we also consider diminishing stepsizes with initial stepsize &#945; 0 being 0.2 and 0.02 respectively, and a diminishing stepsize &#945; t = &#945; 0 t -0.5 for t &#8805; 1. The diminishing rate t -0.5 is chosen as it is on the boundary of square-summable assumption and has often been observed with the best empirical performance among t -&#946; with &#946; &#8712; [0.5, 1].</p><p>Baseline Comparison for I.I.D. LSA We first conduct a cross-study to examine the performance of inference with constant and diminishing stepsize under i.i.d. data. We notice that constant stepsizes slightly outperform the 0.2t -0.5 diminishing stepsize. Please refer to the appendix for detailed experiment design and results.</p><p>Performance Comparison for Markovian LSA We examine 100 different LSA problems of the same dimension |X | = 10 and d = 5. For each LSA setup, the parameters (P, A, b) are generated randomly, and we simulate 100 trajectories (x t ) t&#8805;0 of length 10 5 , run LSA iterates with the above-described stepsize regimes, and perform inference. We summarize the distribution of the performance across the 100 cases in Table <ref type="table">1</ref>.</p><p>With Markovian data, the LSA with constant stepsize converges with nonzero asymptotic bias. The presence of bias is rather obvious when &#945; = 0.2, evident from the large &#8467; 2 error. Hence, CIs constructed with &#945; = 0.2 iterates have the worst coverage probabilities. Reducing the stepsize to &#945; = 0.02 will reduce the asymptotic bias, and hence CIs constructed with &#945; = 0.02 iterates have significantly better performance.</p><p>When the RR extrapolation technique is employed, the confidence intervals constructed enjoy the best coverage properties, with the smallest &#8467; 2 error, comparable CI width, and higher coverage probabilities of &#952; * . Moreover, the median coverage probability is around the targeted 95%. Table 2: Inference comparison of different batch numbers. &#8467; 2 and "CI" values are of unit 10 -3 . Both CI width and coverage probability are for the 1-st coordinate estimate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Batch Number Selection</head><p>We further inspect the impact of the number of batches used in inference, i.e., the value K in the confidence interval construction procedure. We focus on one specific LSA problem with |X | = 10 and d = 5</p><p>Markovian data. The trajectory length is set at 10 6 and the number of batches varies. each combination of and number of batches we run independent runs and record the mean and standard error in Table <ref type="table">2</ref>.</p><p>For constant stepsizes, as we compare across different rows, the mean estimates are at the same scale, and not influenced much by the choice of batch number. However, for diminishing stepsizes, as the batch number increases, the CI widths decrease quickly, which drastically reduces the coverage probability and implies an underestimation of the variance. As the stepsize decays in the diminishing stepsize sequence, the iterates become increasingly correlated, and hence a longer batch size is needed to overcome the strong correlation. Therefore, the number of batches used in diminishing stepsize regime cannot arbitrarily increase. In contrast, iterates under constant stepsize mixes at the same rate and is more robust to the number of batches used.</p><p>Trajectory Length We next investigate the impact of increasing trajectory length with a range of stepsizes to better visualize the trend. For each trajectory length, we fix the number of batches as T 0.3 as recommended in <ref type="bibr">Chen et al. (2020)</ref>. The statistics are in Table <ref type="table">3</ref>.</p><p>We observe that inference with RR iterates consistently presenting the best results. When the trajectory length is at 10 3 , the iterates under various stepsize regimes would still be distant from &#952; * , which explains the mediocre coverage of &#952; * . As the trajectory length increases, the iterates gradually approach stationarity. Hence, the inference performance deteriorates for &#945; = 0.2 with 0 coverage probability, as the iterates converge to E[&#952; &#8734; ], which is asymptotically biased from &#952; * . On the other hand, as the trajectory length is longer, the iterates under diminishing stepsize converge closer to &#952; * . Therefore, the inference performance with diminishing stepsize improves. Inference with RR extrapolated iterates generally gives satisfactory performance. Hence, we would like to note that inference with constant stepsize might be especially useful when the simulation trajectory length budget is limited, as diminishing stepsize iterates struggle to output good inference results.</p><p>Comparison Against Bootsrapping In all previous experiments, we are comparing inference using SA iterates under different stepsize regimes. Here, we compare to bootstrapping, a more classical approach to statistical inference. Detailed experimental design can be found in the appendix.</p><p>While constant stepsize together with RR obtains 95.2% coverage with &#8467; 2 error 1.0 &#215; 10 -5 and CI width 0.00134, bootstrapping achieves 93.4% coverage with &#8467; 2 error 2.0 &#215; 10 -3 and CI width 0.00411. The large &#8467; 2 error and wide confidence interval of bootstrapping suggest that it may not be able to handle correlated data efficiently.</p><p>Bootstrapping requires the entire data set to be stored, requiring O(T d ) memory, whereas inference with SA iterates can accommodate online data, hence O(d) memory. Inference with SA iterates only needs first-order information, while bootstrapping may need higher-order information, making it potentially computationally prohibitive.</p><p>Extending to Nonlinear SA Lastly, we examine the performance of our proposed technique in an example of logistic regression with unbounded Markovian data, to demonstrate that our proposed technique is robust in a wide range of settings, even if not currently addressed by our theory.</p><p>The categorical data (x t , y t ) arrives online, with x t sequentially sampled from a 2-dimensional Gaussian AR(1) process, and y t &#8764; Bernoulli( 1 1+e -w * x t )) with w * &#8712; R 2 being a random unit vector. This setting is non-Hurwitz, nonlinear, and with an unbounded underlying Markov chain, but our proposed inference procedure still exhibits similar performance as seen in Table <ref type="table">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RR Stepsize Selection</head><p>In this section, we numerically examine the impacts of different stepsize selection decisions in RR extrapolation.</p><p>Comparing Different Regimes We first compare the geometric and equidistant regimes. To ensure a fair comparison, we keep the range of the stepsize selection the same across the two regimes, so the difference would come solely from how the stepsize decays within the range. The range is determined by a dyadic decaying stepsize sequence, i.e. the smallest stepsize across the two regimes is fixed at &#945; M = &#945; 1 /2 (M -1) . Under this setup, when the RR order is 2, there is no difference between these two schedules, so we start the comparison from the extrapolation with 3 stepsizes.</p><p>As depicted in Figure <ref type="figure">1</ref>, when the order of extrapolation increases to around 5, the benefits of RR extrapolation have been mostly exploited. The difference between the two decaying schedules is more significant when the order is low.</p><p>Length Comparison table 0.2 0.15 0.1 0.05 0.02 RR 0.2 &#8730; t 0.02 &#8730; t 10 3 &#8467;2 2.84 (0.06) 2.55 (0.06) 2.48 (0.05) 2.35 (0.05) 2.28 (0.05) 2.28 (0.05) 2.36 (0.05) 4.28 (0.07) CI 4.16 (0.07) 3.91 (0.06) 3.81 (0.06) 3.68 (0.06) 3.71 (0.06) 3.76 (0.06) 3.10 (0.05) 1.01 (0.02) Cov 82.2 (1.7) 84.0 (1.6) 83.0 (1.7) 83.6 (1.7) 83.6 (1.7) 83.4 (1.7) 76.8 (1.9) 24.8 (1.9) 10 4 &#8467;2 1.15 (0.02) 0.99 (0.02) 0.89 (0.02) 0.79 (0.02) 0.73 (0.02) 0.72 (0.02) 0.73 (0.02) 1.14 (0.02) CI 1.44 (0.02) 1.38 (0.01) 1.34 (0.01) 1.31 (0.01) 1.29 (0.01) 1.30 (0.01) 1.12 (0.01) 0.67 (0.01) Cov 75.2 (1.9) 81.8 (1.7) 85.2 (1.6) 88.8 (1.4) 90.0 (1.3) 89.2 (1.4) 85.4 (1.6) 56.0 (2.2) 10 5 &#8467;2 0.82 (0.01) 0.61 (0.01) 0.44 (0.01) 0.29 (0.01) 0.24 (0.01) 0.23 (0.01) 0.22 (0.01) 0.24 (0.01) CI 0.46 (0.00) 0.43 (0.01) (0.01) 0.41 (0.00) 0.41 (0.00) 0.38 (0.00) 0.29 (0.00) Cov 0.04 (0.9) 22.2 (1.8) 56.4 (2.2) 83.2 (1.7) 88.2 (1.4) 91.2 (1.3) 90.0 (1.3) 79.2 (1.8) 10 6</p><p>&#8467;2 0.81 (0.00) 0.57 (0.00) 0.38 (0.00) 0.20 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) 0.00 (0.00) CI 0.15 (0.00) 0.15 (0.00) 0.14 (0.00) 0.14 (0.00) 0.13 (0.00) 0.13 (0.00) 0.13 (0.00) 0.11 (0.00) Cov 0 (0) 0 ( <ref type="formula">0</ref>) 0 (0) 19.0 (1.8) 80.2 (1.8) 95.2 (1.0) 92.8 (1.2) 90.2 (1.3) Table 4: Inference comparison of different trajectory lengths. &#8467; 2 and "CI" values are of unit 10 -2 . Both CI width and coverage probability are for the 1-st coordinate estimate. As shown in Figure <ref type="figure">2</ref>, the smaller b/(M -1) is, the worse the RR extrapolation performance is. This phenomenon is in line with Proposition 3, as b/(M -1) inversely scales with the variance upper bound. What is surprising is that when using small b/(M -1), increasing the number of the stepsizes used does not always reduce the &#8467; 2 error of the extrapolated iterates. This may be due to the proximity of stepsizes, leading to large coefficients {h m }, increasing the variance and offsetting the reduction in bias. Hence, this suggests that for RR extrapolation to be effective, stepsizes should not be too close to each other, but should explore a range of values.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>In this paper, we demonstrate the effectiveness of statistical inference with LSA iterates under Markovian data, constant stepsizes, and RR extrapolation. An immediate next step is to provide theoretical results to justify the validity of the CI constructed with Markovian LSA iterates with constant stepsize, as the CI is non-consistent. Further directions include: (a) develop an anytime variance estimator so that the proposed inference procedure can be fully online; (b) given a fixed simulation length, how one should decide the order of RR extrapolation to achieve decent inference performance.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>An extended paper with appendix can be found at Huo, Chen, and Xie (2023b).The Thirty-Eighth AAAI Conference on Artificial Intelligence </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>The Thirty-Eighth AAAI Conference on Artificial Intelligence </p></note>
		</body>
		</text>
</TEI>
