<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Model-Free Offline Reinforcement Learning with Enhanced Robustness</title></titleStmt>
			<publicationStmt>
				<publisher>The Thirteenth International Conference on Learning Representations</publisher>
				<date>04/24/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10591460</idno>
					<idno type="doi"></idno>
					
					<author>Chi Zhang</author><author>Farhat U Zain</author><author>George Atia</author><author>Yue Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
<abstract><ab><![CDATA[Offline reinforcement learning (RL) has gained considerable attention for its ability to learn policies from pre-collected data without real-time interaction, which makes it particularly useful for high-risk applications. However, due to its reliance on offline datasets, existing works inevitably introduce assumptions to ensure effective learning, which often lead to a trade-off between robustness to model mismatch and scalability to large environments. In this paper, we enhance both aspects with a novel double-pessimism principle, which conservatively estimates performance and accounts for both limited data and potential model mismatches, the two major reasons for the previous trade-off. We then propose a universal, model-free algorithm to learn a policy that is robust to potential environment mismatches, which enhances robustness in a scalable manner. Furthermore, we provide a sample complexity analysis of our algorithm when the mismatch is modeled by the ℓ_α-norm, which also theoretically demonstrates the efficiency of our method. Extensive experiments further demonstrate that our approach significantly improves robustness in a more scalable manner than existing methods.
PRELIMINARIES
FINITE-HORIZON MARKOV DECISION PROCESS (MDP). A finite-horizon MDP is represented by M = (S, A, H, P ≜ {P_h}_{h=1}^H, r ≜ {r_h}_{h=1}^H), where S and A are the finite state and action spaces of size S and A, respectively, and H is the horizon length. The uncertainty sets factorize as P_h = ⊗_{(s,a)∈S×A} P_{h,s,a} (finite-horizon) and P = ⊗_{(s,a)∈S×A} P_{s,a} (infinite-horizon). (3) The performance of a policy in an RMDP is evaluated based on its worst-case value function over all the instances in the uncertainty set. Specifically, the finite-horizon robust value functions V^π = {V^π_h}_{h=1}^H and the infinite-horizon robust value function V^π are defined with the infimum taken over the uncertainty set of transition kernels. For a given initial state distribution ρ ∈ Δ(S), we write the expected robust performance as V^π(ρ). The goal of an RMDP is to learn a policy that optimizes the worst-case performance, or equivalently, the robust value functions. Such a policy is referred to as an optimal robust policy: π* ≜ arg max_π V^π(ρ) (infinite-horizon).]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Traditional reinforcement learning (RL) <ref type="bibr">(Sutton &amp; Barto, 2018)</ref> optimizes an agent's performance through iterative trial-and-error interactions with the environment, and has shown significant success in many areas such as video games <ref type="bibr">(Wei et al., 2022;</ref><ref type="bibr">Liu et al., 2022a)</ref>. However, such an online learning scheme can be costly or unsafe in real-world applications. For instance, in domains including autonomous driving <ref type="bibr">(Kiran et al., 2021)</ref>, stock market trading <ref type="bibr">(Kabbani &amp; Duman, 2022)</ref>, and healthcare <ref type="bibr">(Yu et al., 2021)</ref>, poor decisions can have significant consequences, making extensive exploration impractical. To address these issues, offline RL has been developed <ref type="bibr">(Lange et al., 2012;</ref><ref type="bibr">Levine et al., 2020)</ref>, enabling agents to learn from pre-collected datasets and offering a more reliable learning framework.</p><p>Since offline RL relies heavily on pre-collected datasets, the quality of these datasets largely determines performance. It is hence unclear whether satisfactory performance can be achieved for complex problems with a relatively limited dataset. In this context, two key challenges in improving offline RL performance have been studied. The first is scalability: the ability to handle large-scale problems. Without real-time interaction, learning an effective policy for large-scale problems from a limited dataset, which may not fully cover the entire state-action space, can be challenging. 
Recent research has focused on improving scalability by adapting model-free algorithms <ref type="bibr">(Shi et al., 2022;</ref><ref type="bibr">Yan et al., 2022;</ref><ref type="bibr">Laroche et al., 2019;</ref><ref type="bibr">Fujimoto et al., 2019;</ref><ref type="bibr">Ghasemipour et al., 2021;</ref><ref type="bibr">Kumar et al., 2019;</ref><ref type="bibr">Wu et al., 2019;</ref><ref type="bibr">Siegel et al., 2020)</ref> and leveraging function approximation techniques <ref type="bibr">(Ross &amp; Bagnell, 2012;</ref><ref type="bibr">Liu et al., 2020;</ref><ref type="bibr">Xie et al., 2021a;</ref><ref type="bibr">Yin et al., 2021a;</ref><ref type="bibr">Xie &amp; Jiang, 2021;</ref><ref type="bibr">Jiang &amp; Huang, 2020)</ref>. However, due to the complexity of large environments, many of these approaches assume that the dataset sufficiently represents the full deployment environment, typically presuming that the deployment environment is identical to the one from which the data was collected.</p><p>However, this assumption can be too restrictive. Static datasets only capture the environment at the time of data collection, but real-world applications frequently face environmental uncertainty due to perturbations or non-stationarity. This mismatch between the data collection and deployment environments, commonly known as the sim-to-real gap <ref type="bibr">(Zhao et al., 2020)</ref>, can cause significant performance degradation during deployment. Therefore, it is crucial to enhance the robustness of offline RL to ensure that the learned policies can perform reliably in the presence of such uncertainties. A promising solution is to adapt robust RL frameworks <ref type="bibr">(Iyengar, 2005;</ref><ref type="bibr">Nilim &amp; El Ghaoui, 2004)</ref> to the offline setting, as explored recently in <ref type="bibr">(Shi &amp; Chi, 2022;</ref><ref type="bibr">Blanchet et al., 2023)</ref>. However, these methods often come at the cost of scalability. 
Due to their inherent structure, robust RL methods typically rely on dynamic programming, which requires knowledge of the full transition dynamics, and are predominantly model-based. This necessitates learning and storing a complete transition model, which is resource-intensive <ref type="bibr">(Zhang et al., 2021a)</ref> and limits scalability for large-scale problems.</p><p>Recognizing the limitations of current methods and the challenges posed by large-scale problems and model uncertainty, a trade-off between robustness and scalability becomes apparent: enhancing one typically comes at the expense of the other. This naturally leads to the following question:</p><p>Can we develop a unified framework that enhances both scalability and robustness in offline RL?</p><p>In this paper, we address this question by presenting a model-free algorithm to learn a policy that is both robust to model uncertainty and scalable to large-scale problems. Our method introduces a principle of double pessimism to simultaneously address two key sources of uncertainty: (1) the uncertainty arising from inaccurate estimations on under-explored datasets, and (2) the model mismatch between the data collection and deployment environments. We then propose a streamlined conceptual framework, design a model-free algorithm, and provide the first theoretical guarantee of convergence and robustness of our approach. Our contributions can be summarized as follows.</p><p>• A Double-Pessimism Principle for Offline RL with Model Mismatch. We begin by framing the challenge of enhancing robustness in offline RL within an offline robust RL framework, where an uncertainty set captures potential environmental mismatches. To solve offline robust RL in a scalable manner, we propose the double-pessimism principle that does not require transition kernel estimations. 
This principle maintains a conservative estimate of robust performance, obtained directly from the collected data without requiring model estimation. We then introduce the first model-free pessimistic robust Q-learning algorithm.</p><p>Our algorithm optimizes performance under model mismatch using an offline dataset, while offering greater memory efficiency and scalability than previous methods.</p><p>• First and Near-Optimal Model-Free Algorithm for Offline Robust RL. We provide a rigorous sample complexity analysis for our model-free double-pessimistic robust Q-learning algorithm under the widely used ℓ_α-norm uncertainty set. Our analysis shows that, given a dataset satisfying the partial coverage condition (to be introduced later), our algorithm can identify an optimal robust policy with near-optimal sample complexity, comparable to that of model-based offline robust RL and model-free offline non-robust RL. This represents the first sample complexity analysis for model-free robust offline RL, demonstrating its applicability to large-scale problems that require high data efficiency.</p><p>• Numerical Verification of Enhanced Robustness. We conduct extensive numerical experiments to demonstrate the improvements in robustness achieved by our algorithms in both simulated environments <ref type="bibr">(Archibald et al., 1995)</ref> and physics-based Classic Control problems <ref type="bibr">(Brockman et al., 2016)</ref>. In each case, our algorithm consistently outperforms existing methods in handling model uncertainty, showcasing its enhanced ability to maintain stable performance across a wide range of environmental perturbations. 
Moreover, our approach demonstrates superior scalability stemming directly from our model-free algorithm design, as shown by its effectiveness in solving more complex Classic Control problems with robustness guarantees, which have proven difficult or unsolvable for previous model-based robust methods.</p><p>The probability transition kernel P_h : S × A → Δ(S) and the reward function r_h : S × A → [0, 1] are defined at each step h (1 ≤ h ≤ H). At each step h, the agent starts in state s_h, takes action a_h, transitions to the next state s_{h+1} according to the transition kernel P_{h,s_h,a_h}, and receives a reward r_h(s_h, a_h). This process terminates after H steps, when the agent reaches state s_{H+1}.</p><p>A policy π = {π_h}_{h=1}^H defines the strategy for selecting actions in different states, where π_h : S → Δ(A) specifies the probability distribution over actions at step h. The performance of an agent following a policy π is measured by the value function V^{π,P} = {V^{π,P}_h}_{h=1}^H, where the expectation is taken over the trajectory {s_h, a_h, r_h}_{h=1}^H generated by executing the policy π and transitioning according to the transition kernel P: a_h ∼ π_h(·|s_h) and s_{h+1} ∼ P_{h,s_h,a_h}.</p></div>
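The finite-horizon value function defined above can be computed exactly by backward induction when the transition kernel is known. A minimal sketch in Python; the MDP instance and all numbers are invented for illustration:

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Backward induction for V^{pi,P}_h(s) = E[sum_{t=h}^H r_t(s_t, a_t)].

    P:  (H, S, A, S) transition kernels P_{h,s,a}
    r:  (H, S, A)    rewards in [0, 1]
    pi: (H, S, A)    action probabilities pi_h(a|s)
    Returns V of shape (H+1, S) with V[H] = 0 (terminal step).
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        # Q_h(s, a) = r_h(s, a) + P_{h,s,a} . V_{h+1}
        Q = r[h] + P[h] @ V[h + 1]        # shape (S, A)
        V[h] = (pi[h] * Q).sum(axis=1)    # average over pi_h(.|s)
    return V

# Tiny 2-state, 2-action, horizon-3 example (numbers are arbitrary).
rng = np.random.default_rng(0)
H, S, A = 3, 2, 2
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # each row sums to 1
r = rng.uniform(0, 1, size=(H, S, A))
pi = np.full((H, S, A), 1 / A)                  # uniform policy
V = evaluate_policy(P, r, pi)
print(V[0])   # value of each initial state over the full horizon
```

Since rewards lie in [0, 1], every entry of V is bounded by the horizon H.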
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">INFINITE-HORIZON MDP</head><p>An infinite-horizon MDP is defined as M = (S, A, P, r, γ), where both the transition kernel P and the reward function r are stationary and do not change over time. The discount factor γ &lt; 1 ensures the finiteness of the accumulated reward over an infinite horizon.</p><p>Due to its stationary nature, it suffices to consider only stationary policies π : S → Δ(A), which specify the action-selection probabilities over the action space. The value function V^{π,P} of a policy π with transition kernel P is defined as V^{π,P}(s) ≜ E[∑_{t=1}^∞ γ^{t-1} r(s_t, a_t) | s_1 = s]. (2)</p></div>
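For a stationary policy, the discounted value function satisfies the fixed point V = r_π + γ P_π V and can be solved in closed form. A small sketch; the MDP is randomly generated purely for illustration:

```python
import numpy as np

def evaluate_stationary_policy(P, r, pi, gamma):
    """Solve V = r_pi + gamma * P_pi V exactly via a linear system.

    P: (S, A, S) stationary kernel, r: (S, A), pi: (S, A), gamma in [0, 1).
    """
    S = P.shape[0]
    r_pi = (pi * r).sum(axis=1)              # policy-averaged reward, (S,)
    P_pi = np.einsum("sa,sat->st", pi, P)    # induced state chain, (S, S)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(0, 1, size=(S, A))
pi = np.full((S, A), 0.5)
V = evaluate_stationary_policy(P, r, pi, gamma)
print(V)   # each entry is bounded by 1/(1 - gamma) = 10
```

The bound 1/(1 − γ) on the value is exactly the "effective horizon" used later when comparing finite- and infinite-horizon complexities.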
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">ROBUST MDP</head><p>A finite-horizon robust MDP (RMDP) is specified by (S, A, H, P = {P_h}, r), and an infinite-horizon RMDP is denoted by (S, A, P, r, γ), where P is a set of transition kernels called the uncertainty set. At each step, the environment transitions to the next state following an arbitrary kernel belonging to the uncertainty set, instead of a fixed one as in non-robust MDPs. In this paper, we consider the (s, a)-rectangular uncertainty set <ref type="bibr">(Wiesemann et al., 2013)</ref>, where P is independently defined for each state-action pair, with ⊗ denoting the Cartesian product:</p></div>
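Under (s, a)-rectangularity, the adversary's inner minimization decouples across state-action pairs, so a robust Bellman backup reduces to a per-pair minimization. A hedged sketch where a finite candidate set stands in for a continuous uncertainty set (a simplification for illustration; the paper's sets are norm balls):

```python
import numpy as np

def robust_backup(P_candidates, r, V_next, gamma):
    """One robust Bellman backup under an (s,a)-rectangular set.

    P_candidates: (S, A, M, S) -- M candidate kernels per (s, a);
    rectangularity lets the adversary pick the minimizer independently
    for every (s, a) pair.
    """
    # sigma_{P_{s,a}}(V) = min over the M candidates of p . V
    worst = (P_candidates @ V_next).min(axis=2)   # (S, A)
    return r + gamma * worst

rng = np.random.default_rng(2)
S, A, M, gamma = 3, 2, 4, 0.9
P_candidates = rng.dirichlet(np.ones(S), size=(S, A, M))
r = rng.uniform(0, 1, size=(S, A))
V = rng.uniform(0, 1, size=S)
Q_robust = robust_backup(P_candidates, r, V, gamma)
Q_nominal = r + gamma * (P_candidates[:, :, 0] @ V)  # treat candidate 0 as nominal
print(Q_robust)
```

By construction the worst-case backup never exceeds the backup under any single candidate kernel, which is the conservatism the RMDP formulation buys.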
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">FORMULATION: ENHANCING ROBUSTNESS AND SCALABILITY</head><p>In this section, we develop our formulation, where we utilize RMDPs to formulate the offline RL problem under model mismatch.</p><p>In the offline setting, the dataset is collected under a fixed environment P (referred to as the nominal kernel) by executing some behavior policy μ. However, due to factors such as non-stationarity, unexpected perturbations, or adversarial attacks, the deployment environment may differ from P.</p><p>To account for this model deviation and improve robustness, we construct an uncertainty set by perturbing the nominal kernel and aim to learn the optimal robust policy. Specifically, following <ref type="bibr">(Xu &amp; Mannor, 2010;</ref><ref type="bibr">Xu et al., 2010;</ref><ref type="bibr">Derman et al., 2021;</ref><ref type="bibr">Kumar et al., 2023)</ref>, we define the uncertainty set (for each (s, a)-pair) for modeling environmental perturbations as:</p><p>for some sets Q_{h,s,a}, Q_{s,a} containing the possible model perturbations, and aim to learn the optimal robust policy for the corresponding RMDPs. This not only provides an optimized lower bound on performance when the deployment environment lies within the uncertainty set, but also improves robustness to model uncertainty <ref type="bibr">(Pinto et al., 2017)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">FINITE-HORIZON</head><p>In the finite-horizon setting, the dataset D consists of K episodes, each of length H. These episodes are independently generated by executing a behavior policy μ under the nominal kernel P:</p><p>where, for the k-th episode, a_h^k ∼ μ_h(·|s_h^k), s_{h+1}^k ∼ P_{h,s_h^k,a_h^k}, and the initial state s_1^k ∼ ρ. Since the dataset is collected by a fixed policy under a single nominal environment, there exists a distribution shift between the data distribution and the distribution induced by the optimal policy and the worst-case kernel. To guarantee that a provably efficient algorithm can be designed based on the dataset, we adopt a popular assumption on the distributional mismatch between the dataset distribution and the occupancy measure induced by the optimal policy π*, as in <ref type="bibr">(Shi &amp; Chi, 2022)</ref>.</p><p>Assumption 1 (Robust single-policy concentrability). The behavior policy μ satisfies</p><p>where d^π_{P,h} is the occupancy distribution induced by policy π and transition kernel P at step h.</p><p>In Assumption 1, we only require that the dataset covers the state-action pairs that are visited by the optimal policy, known as the partial coverage condition <ref type="bibr">(Rashidinejad et al., 2021)</ref>.</p><p>Our goal is then to learn an ϵ-optimal policy π for the RMDP with the uncertainty set defined as in equation 3 and equation 7, such that</p></div>
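The coverage ratio in Assumption 1 can be illustrated by computing occupancy distributions with a forward recursion. This simplified sketch checks coverage under the nominal kernel only (the robust assumption in the paper additionally ranges over kernels in the uncertainty set), and the policies are invented for illustration:

```python
import numpy as np

def occupancy(P, pi, rho, H):
    """d^{pi}_{P,h}(s, a): state-action occupancy via forward recursion."""
    S, A = pi.shape[1], pi.shape[2]
    d = np.zeros((H, S, A))
    state_dist = rho.copy()
    for h in range(H):
        d[h] = state_dist[:, None] * pi[h]               # (S, A)
        state_dist = np.einsum("sa,sat->t", d[h], P[h])  # push forward one step
    return d

rng = np.random.default_rng(3)
H, S, A = 3, 2, 2
P = rng.dirichlet(np.ones(S), size=(H, S, A))
rho = np.array([0.5, 0.5])
mu = np.full((H, S, A), 1 / A)                 # behavior policy
pi_star = np.zeros((H, S, A))
pi_star[:, :, 0] = 1.0                         # a deterministic "optimal" policy

d_mu = occupancy(P, mu, rho, H)
d_star = occupancy(P, pi_star, rho, H)
# Single-policy concentrability: sup over (h, s, a) of d^{pi*} / d^{mu},
# restricted to pairs the optimal policy actually visits.
mask = d_star > 0
C_star = (d_star[mask] / d_mu[mask]).max()
print(C_star)
```

Partial coverage only requires this ratio to be finite on the optimal policy's support; the behavior policy may ignore the rest of the state-action space entirely.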
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">INFINITE-HORIZON</head><p>In the infinite-horizon setting, the offline dataset contains a single trajectory of length T obtained by executing a behavior policy μ under the nominal kernel P:</p><p>where s_1 ∼ ρ, a_i ∼ μ(·|s_i), and s_{i+1} ∼ P_{s_i,a_i}. For the infinite-horizon setting, we adopt the following two assumptions on the behavior policy.</p><p>We first adopt the partial coverage assumption in <ref type="bibr">(Blanchet et al., 2023;</ref><ref type="bibr">Wang et al., 2024c)</ref>.</p><p>Assumption 2. The behavior policy μ satisfies</p><p>where d^π_P denotes the occupancy distribution induced by policy π and transition kernel P.</p><p>We make an additional assumption on the behavior policy as follows.</p><p>Assumption 3. The behavior policy μ is stationary, and the induced Markov chain under the nominal kernel is uniformly ergodic.</p><p>Remark 1. This assumption is commonly adopted in prior works <ref type="bibr">(Wang et al., 2020;</ref><ref type="bibr">Yan et al., 2022;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Wang &amp; Zou, 2020)</ref>, as it ensures that the dataset includes all state-action pairs covered by the behavior policy, provided the dataset size exceeds a certain threshold. This assumption is required because the dataset consists of a single Markovian trajectory. When the dataset contains i.i.d. samples from the occupancy distribution d^μ_P, as in <ref type="bibr">(Wang et al., 2024c;</ref><ref type="bibr">Li et al., 2022)</ref>, such an assumption can be removed.</p><p>Our goal is then to find an ϵ-optimal policy π through D for the RMDP with the uncertainty set defined in equation 3 and equation 8, such that</p></div>
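Uniform ergodicity is what makes a single Markovian trajectory informative: empirical visitation frequencies of the chain induced by the behavior policy converge to its stationary distribution, so every state covered by μ eventually appears. An illustrative sketch with a randomly generated chain (all numbers arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
mu = np.full((S, A), 0.5)
P_mu = np.einsum("sa,sat->st", mu, P)   # chain induced by the behavior policy

# Stationary distribution: left eigenvector of P_mu for eigenvalue 1.
w, v = np.linalg.eig(P_mu.T)
stat = np.real(v[:, np.argmax(np.real(w))])
stat /= stat.sum()

# A single Markovian trajectory of length T, as in the offline dataset above.
T, s = 20000, 0
counts = np.zeros(S)
for _ in range(T):
    counts[s] += 1
    a = rng.integers(A)                  # mu is uniform over actions
    s = rng.choice(S, p=P[s, a])
print(counts / T, stat)   # empirical visitation ~ stationary distribution
```

With i.i.d. samples from the occupancy distribution this mixing argument is unnecessary, which is exactly why Remark 1 notes the assumption can then be dropped.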
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DOUBLE-PESSIMISM PRINCIPLE</head><p>In this section, we introduce our model-free algorithm for learning an optimal robust policy from an offline dataset. As mentioned, two major challenges in offline RL stem from two sources of uncertainty: one arising from the limited and under-explored dataset, and the other from the mismatch between the data collection and target environments. We aim to develop a unified double-pessimism principle to address both.</p><p>As suggested by previous studies on offline <ref type="bibr">RL, e.g., (Rashidinejad et al., 2021;</ref><ref type="bibr">Li et al., 2022;</ref><ref type="bibr">Shi et al., 2022;</ref><ref type="bibr">Yan et al., 2022;</ref><ref type="bibr">Wang et al., 2024c)</ref>, the uncertainty arising from the dataset can be addressed using a single-pessimism principle. This involves introducing a penalty term b_n, which depends on the visitation frequency of each state-action pair, to penalize less frequently visited pairs. By doing so, we obtain a conservative estimate of the value function under the nominal kernel.</p><p>However, addressing the uncertainty arising from model mismatch is particularly challenging, especially with a model-free approach. Most previous robust RL studies require that the estimation of the worst-case transition, σ_P(V) ≜ min_{p∈P} p·V, be unbiased. This can be satisfied when the agent can freely generate data as needed (e.g., <ref type="bibr">(Wang et al., 2023d;</ref><ref type="bibr">Liu et al., 2022b;</ref><ref type="bibr">Wang et al., 2023c;</ref><ref type="bibr">b)</ref>), yet is impractical in offline settings. To address this issue, we argue that another pessimism principle can be adopted, and that learning a policy robust to model mismatch does not require an unbiased estimator or an accurate solution to the worst case. 
As long as the estimator provides a (not too pessimistic) lower bound on the worst case, it is sufficient to account for the uncertainty due to model mismatch and still learn a robust policy. We therefore propose a model-free estimator that lower bounds σ(V) to produce a conservative estimation, as follows.</p><p>Definition 1. For the uncertainty set P_{s,a}, a function κ is referred to as a model-mismatch penalty function if for any non-negative vector V and a sample s′ ∼ P_{s,a} from the nominal kernel,</p><p>A universal design of the penalty function κ is provided in Appendix C. Such a penalty function ensures that at each step, the updated estimate is a lower bound on the true worst-case scenario, resulting in a conservative estimation and enhanced robustness. We note that this additional pessimism may result in a more conservative policy, as the algorithm estimates the robust value function more pessimistically. However, we argue that as long as the pessimism level is not too large, the learned policy will not be too conservative, maintaining satisfactory performance while enhancing robustness. More importantly, the calculation of κ does not require any information about the model; it can be done in a data-driven, model-free fashion.</p><p>We then combine the two pessimism principles to develop our double-pessimism algorithm based on Q-learning. For each sample (s, a, s′), we update the Q-table by</p><p>As we will show later, such an update rule incorporating the double-pessimism principle ensures that our estimation is conservative, and can effectively tackle the uncertainty in offline robust RL. 
More importantly, such an update rule does not require any information on the transition model, and hence can be implemented in a model-free manner, making it more suitable for large-scale problems.</p><p>Based on this, we develop our model-free offline algorithms for both the finite- and infinite-horizon cases.</p><p>In the following sections, we present these algorithms and develop their sample complexity analysis.</p></div>
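The update rule described above can be sketched as follows. This is a hedged illustration, not the paper's exact algorithm: the step-size and bonus schedules are illustrative placeholder choices, and `kappa` stands for any penalty function satisfying Definition 1:

```python
import numpy as np

def double_pessimism_update(Q, V_next, s, a, r, s_next, n_visits,
                            kappa, c_b=1.0):
    """One sketch of a double-pessimistic Q update on sample (s, a, r, s'):

        Q(s,a) <- (1 - eta) Q(s,a) + eta * [r + V(s') - kappa(V) - b_n],

    where b_n shrinks with the visit count n (dataset pessimism) and
    kappa lower-bounds the worst-case transition (model pessimism).
    """
    n = n_visits[s, a]
    eta = 1.0 / (n + 1)                  # illustrative step size
    b = c_b / np.sqrt(n + 1)             # data-uncertainty penalty
    target = r + V_next[s_next] - kappa(s, a, V_next) - b
    Q[s, a] = (1 - eta) * Q[s, a] + eta * max(target, 0.0)
    n_visits[s, a] += 1
    return Q

S, A = 3, 2
Q = np.zeros((S, A))
V_next = np.ones(S)
n_visits = np.zeros((S, A), dtype=int)
no_penalty = lambda s, a, V: 0.0         # kappa = 0 recovers single pessimism
Q = double_pessimism_update(Q, V_next, 0, 1, 0.5, 2, n_visits, no_penalty)
print(Q[0, 1])   # 0.5 + 1 - 0 - 1 = 0.5 after the first visit
```

Note that only the sampled entry V(s′) is touched, never an inner product with an estimated kernel; this is the model-free property the text emphasizes.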
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">DOUBLE-PESSIMISM Q-LEARNING FOR FINITE-HORIZON MDPS</head><p>Adopting the double-pessimism principle, we propose our algorithm for finite-horizon MDPs.</p><p>Algorithm 1 Double-Pessimism Q-Learning for finite-horizon RMDPs.</p><p>Input: D, target success probability 1 - δ, uncertainty set radius R, penalty function κ Initialize:</p><p>In our algorithm, the term κ provides a conservative estimate of the worst-case performance within the uncertainty set, while the term b addresses the pessimism required by the limited dataset. We track the visitation count of each state-action pair and construct the penalty term b based on these counts. As the dataset visits a pair more frequently, the associated uncertainty decreases and b decreases. Remark 2. Our algorithm design is universal and works for any uncertainty set model, as long as we have a penalty function κ satisfying equation 15, which is provided in Appendix C. However, since κ for different models requires individual study, we mainly derive our theoretical analysis for the ℓ_α-norm models <ref type="bibr">(Kumar et al., 2023;</ref><ref type="bibr">Derman et al., 2021)</ref>:</p><p>We again emphasize that our double-pessimism principle and algorithm design can be extended to other uncertainty set models. We provide a detailed discussion on κ in Appendix C.</p><p>Next, we develop our theoretical results for ℓ_α-norm sets. We first show that equation 15 is satisfied by our design, and that the algorithm results in a conservative estimation of the robust value function. Lemma 1. For the ℓ_α-norm uncertainty set, set the penalty function κ as</p><p>where β = (1 - 1/α)^{-1} is the Hölder conjugate of α, and e = (1, 1, ..., 1) ∈ R^S. Then, equation 15 is satisfied. 
Moreover, it holds that for all (k, h, s)</p><p>The lemma provides a concrete construction of the penalty function for the ℓ_α-norm model. More importantly, our model-free estimator and algorithm result in pessimistic estimations of robust value functions, tackling both uncertainty sources. In our next result, we show that our double-pessimism principle is effective in learning the optimal robust policy from the mismatched offline dataset. Theorem 2. For the ℓ_α-norm uncertainty set and any δ ∈ (0, 1), suppose that the behavior policy μ satisfies Assumption 1. When T ≜ HK &gt; Õ(SC*), the policy π returned by Algorithm 1 satisfies</p><p>with probability at least 1 - δ. Here, π* is the optimal robust policy w.r.t. a (possibly) relaxed ℓ_α-norm uncertainty set (see Appendix C.2 for a detailed discussion).</p><p>) k for some constants C &gt; 0 and k ≥ 0, when T is sufficiently large.</p><p>Our algorithm is the first model-free algorithm for offline RL under model mismatch with a sub-optimality gap analysis. The sub-optimality gap we obtain in the previous result further implies that we can learn an ϵ-optimal policy as long as the size of the offline dataset T exceeds</p><p>Note that in the sample complexity, the second term, referred to as the burn-in cost, is a universal constant that does not depend on ϵ, while the first term asymptotically depends on ϵ. When ϵ becomes smaller, the first term dominates the overall complexity, resulting in an asymptotic complexity of</p><p>A more detailed discussion of the complexity is provided in Section 7.</p><p>Remark 3. When the radius R is small, it holds that E[V(s′) - κ_{s,a}(V)] = σ_{P_{s,a}}(V) (see Theorem 1 in <ref type="bibr">(Kumar et al., 2023)</ref>), hence Algorithm 1 converges to the optimal robust policy w.r.t. the original uncertainty set. 
For a general uncertainty set and corresponding penalty function κ, Algorithm 1 may converge to the optimal robust policy w.r.t. a relaxed uncertainty set, as the estimation may be inaccurate. However, robustness is still enhanced due to the additional pessimism. See Appendix C for further discussion.</p></div>
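The variational form behind the Lemma 1 penalty can be sketched numerically. Everything here is a hedged illustration: the grid search over the scalar shift and the constants are simplifications of the closed-form construction in the appendix, following the variational form attributed to Kumar et al. (2023):

```python
import numpy as np

def kappa_l_alpha(V, R, alpha):
    """Sketch of an l_alpha-norm penalty in the spirit of Lemma 1:

        kappa(V) = R * min_w || V - w * e ||_beta,
        beta = (1 - 1/alpha)^{-1}  (Hoelder conjugate, alpha > 1),

    a data-free correction that lower-bounds the worst-case transition.
    The 1-D grid search over w is purely for illustration.
    """
    beta = 1.0 / (1.0 - 1.0 / alpha)
    ws = np.linspace(V.min(), V.max(), 2001)
    norms = [np.linalg.norm(V - w, ord=beta) for w in ws]
    return R * min(norms)

V = np.array([0.0, 0.2, 0.9, 1.0])
k2 = kappa_l_alpha(V, R=0.1, alpha=2.0)
# For alpha = beta = 2, the minimizer is w = mean(V), so
# kappa = R * ||V - mean(V)||_2 exactly.
exact = 0.1 * np.linalg.norm(V - V.mean())
print(k2, exact)
```

Intuitively, the penalty measures how much the value vector varies around a constant shift: a constant V admits no adversarial reweighting, so its penalty is zero.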
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">DOUBLE-PESSIMISM Q-LEARNING FOR INFINITE-HORIZON MDPS</head><p>In this section, we present our algorithm design and analysis for offline RL with infinite-horizon MDPs. Due to space limitations and similarities in algorithm design, the algorithm is deferred to Algorithm 3 in Appendix E.1. The algorithm follows a similar design as the finite-horizon one, where the two terms κ and b represent conservative penalties for the double-pessimism principle. Again, our algorithm design is universal, but we develop the sample complexity results only for ℓ_α-norm models. Theorem 3. Consider the ℓ_α-norm uncertainty set and any δ ∈ (0, 1). Suppose that the behavior policy μ satisfies Assumption 2 and Assumption 3. Then, the policy π returned by Algorithm 3 satisfies</p><p>with probability at least 1 - δ.</p><p>An ϵ-optimal robust policy can be learned as long as the size of the offline dataset exceeds</p><p>This sample complexity matches the results of model-free offline non-robust RL <ref type="bibr">(Yan et al., 2022)</ref> without variance reduction techniques, which implies the near-optimality of our method. Compared to model-based offline robust RL <ref type="bibr">(Shi &amp; Chi, 2022;</ref><ref type="bibr">Blanchet et al., 2023)</ref>, our result matches theirs in terms of C*, S, and ϵ, but exhibits a higher-order dependence on (1 - γ). We argue that, in general, model-free algorithms tend to have lower memory requirements but incur higher sample complexity compared to model-based approaches. A more detailed discussion is provided in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">COMPARISON WITH PRIOR ARTS</head><p>In this section, we compare our work to the most closely related studies on tabular offline robust RL <ref type="bibr">(Shi &amp; Chi, 2022;</ref><ref type="bibr">Blanchet et al., 2023)</ref>. The results are summarized in Table <ref type="table">1</ref>, where we only include the infinite-horizon results. Compared to previous studies, our method offers improved memory and computational complexity, while maintaining comparable sample complexity.</p><p>Table <ref type="table">1</ref>: Comparison with offline robust RL works (columns: reference, memory complexity, sample complexity, computational complexity). <ref type="bibr">(Shi &amp; Chi, 2022)</ref> is for the KL-divergence set.</p><p>First, both related works are model-based, which involves estimating and storing the transition model {P̂_{s,a} : (s, a) ∈ S × A} ∈ R^{S²A}. This approach thus requires an additional memory of size O(S²A) to store the model, along with O(SA) space for the visitation counts of state-action pairs in the dataset. As a result, it becomes inefficient for large-scale problems or environments with complicated transition dynamics. In contrast, our model-free algorithm only requires O(SA) space for the visitation counts. This reduced memory complexity enables our model-free algorithm to scale effectively to large-scale or even continuous problems.</p><p>In terms of computational complexity, the most closely related work <ref type="bibr">(Blanchet et al., 2023)</ref> requires solving a non-rectangular RMDP, which is generally NP-hard <ref type="bibr">(Wiesemann et al., 2013)</ref>. In contrast, our algorithm can be implemented in polynomial time, which is much more practical. 
Compared to <ref type="bibr">(Shi &amp; Chi, 2022)</ref>, our algorithm still enjoys lower computational complexity: the update rule of the model-based approach requires computing the inner product P̂_{s,a}·V, whereas our model-free approach eliminates this computation and only requires a single vector entry V(s′). See Appendix C for a more detailed discussion.</p><p>In terms of sample complexity, both of our results match the ones for offline non-robust Q-learning without variance reduction, illustrating our data efficiency and near-optimality. Our result improves the dependence on S compared to <ref type="bibr">(Blanchet et al., 2023)</ref> under the ℓ_∞-norm uncertainty set, showing enhanced scalability to large-scale problems. On the other hand, it is a general observation that model-based methods tend to demonstrate better sample complexity in terms of (1 - γ) than model-free methods, especially when additional techniques like variance reduction are not employed. Such findings have been widely noted in various settings, for instance, when comparing robust RL with generative models (<ref type="bibr">(Wang et al., 2024a)</ref> vs. <ref type="bibr">(Shi et al., 2023)</ref>) and non-robust offline RL (<ref type="bibr">(Yan et al., 2022)</ref> vs. <ref type="bibr">(Li et al., 2022)</ref>).</p><p>On a side note, our result for the finite-horizon setting exhibits a higher-order dependence on H (where we set H = 1/(1 - γ) as the effective horizon in the infinite-horizon setting). 
This is due to the non-stationary environment inherent in the finite-horizon setting, which is also consistent with findings from previous studies, such as <ref type="bibr">(Shi &amp; Chi, 2022)</ref>.</p><p>To summarize, our approach addresses existing gaps in offline RL by enhancing robustness to model mismatch, reducing memory requirements, and providing adaptability to large-scale problems, establishing a state-of-the-art method in the field.</p></div>
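The memory argument in Section 7.1 can be made concrete with a back-of-the-envelope count of table entries. The `3 * S * A` model-free figure below is an assumed bookkeeping choice (Q-table, value-related table, visit counts), not the paper's exact accounting, and the problem sizes are arbitrary:

```python
# Model-based methods store an estimated kernel of size S^2 * A, while the
# model-free algorithm keeps only O(S*A)-sized tables.
for S, A in [(100, 10), (10_000, 100)]:
    model_based = S * S * A + S * A   # estimated kernel + visit counts
    model_free = 3 * S * A            # assumed: Q, value table, visit counts
    print(f"S={S}, A={A}: model-based {model_based:,} vs model-free {model_free:,}")
```

The gap is a factor of roughly S/3, which is why the model-free design matters precisely in the large-state regime the paper targets.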
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">OTHER RELATED WORKS</head><p>Offline RL without model mismatch. A significant body of offline RL work assumes identical data-collection and deployment environments. Building on this, many early works further rely on the global coverage assumption, where the behavior policy covers all state-action pairs <ref type="bibr">(Scherrer, 2014;</ref><ref type="bibr">Chen &amp; Jiang, 2019;</ref><ref type="bibr">Munos, 2005;</ref><ref type="bibr">Yin et al., 2021b;</ref><ref type="bibr">Yin &amp; Wang, 2021a;</ref><ref type="bibr">Jiang, 2019;</ref><ref type="bibr">Wang et al., 2019;</ref><ref type="bibr">Liao et al., 2020;</ref><ref type="bibr">Liu et al., 2019;</ref><ref type="bibr">Zhang et al., 2020;</ref><ref type="bibr">Uehara et al., 2020;</ref><ref type="bibr">Duan et al., 2020;</ref><ref type="bibr">Xie &amp; Jiang, 2020;</ref><ref type="bibr">Levine et al., 2020;</ref><ref type="bibr">Antos et al., 2007;</ref><ref type="bibr">Farahmand et al., 2010)</ref>. This assumption is often too restrictive and unrealistic, as it requires complete coverage of state-action pairs in historical data <ref type="bibr">(Gulcehre et al., 2020;</ref><ref type="bibr">Agarwal et al., 2020a;</ref><ref type="bibr">Fu et al., 2020)</ref>. A more practical partial coverage setting was later proposed, allowing learning from a less-explored dataset. 
Under partial coverage, the optimal policy can still be learned by incorporating the pessimism principle to handle dataset uncertainty <ref type="bibr">(Jin et al., 2021;</ref><ref type="bibr">Uehara &amp; Sun, 2021;</ref><ref type="bibr">Xie et al., 2021a;</ref><ref type="bibr">b;</ref><ref type="bibr">Rashidinejad et al., 2021;</ref><ref type="bibr">Zanette et al., 2021;</ref><ref type="bibr">Yin &amp; Wang, 2021b;</ref><ref type="bibr">Shi et al., 2022;</ref><ref type="bibr">Li et al., 2022;</ref><ref type="bibr">Zhan et al., 2022;</ref><ref type="bibr">Wang et al., 2023e;</ref><ref type="bibr">Kumar et al., 2020)</ref>. In contrast, we additionally consider potential model mismatches.</p><p>Robust RL. Robust RL <ref type="bibr">(Iyengar, 2005;</ref><ref type="bibr">Nilim &amp; El Ghaoui, 2004;</ref><ref type="bibr">Xu &amp; Mannor, 2010)</ref> aims to tackle the challenge of model mismatch in RL by optimizing the worst-case performance over an uncertainty set. Existing work focuses mainly on the online setting <ref type="bibr">(Wang &amp; Zou, 2021;</ref><ref type="bibr">2022;</ref><ref type="bibr">Wang et al., 2023a;</ref><ref type="bibr">Badrinath &amp; Kalathil, 2021;</ref><ref type="bibr">Dong et al., 2022;</ref><ref type="bibr">Lu et al., 2024;</ref><ref type="bibr">Liu &amp; Xu, 2024a)</ref> or on settings with a generative model <ref type="bibr">(Yang et al., 2021;</ref><ref type="bibr">Xu et al., 2023;</ref><ref type="bibr">Panaganti &amp; Kalathil, 2022;</ref><ref type="bibr">Shi et al., 2023;</ref><ref type="bibr">Wang et al., 2024a;</ref><ref type="bibr">b;</ref><ref type="bibr">2022)</ref>. 
Offline robust RL, apart from the two works mentioned above, either relies on strong assumptions such as global coverage or absorbing states <ref type="bibr">(Panaganti et al., 2022;</ref><ref type="bibr">Yang et al., 2021)</ref>, or employs fitted-type algorithm designs <ref type="bibr">(Yang et al., 2022;</ref><ref type="bibr">Panaganti et al., 2022;</ref><ref type="bibr">Liu et al., 2023)</ref>. More importantly, most of these methods are model-based, whereas we develop the first model-free algorithm for offline robust RL. Another line of research aims to improve robustness and scalability through function approximation <ref type="bibr">(Liu &amp; Xu, 2024b;</ref><ref type="bibr">Wang et al., 2024a;</ref><ref type="bibr">Ma et al., 2022</ref>); in contrast, we focus on the tabular setting to develop a more fundamental understanding of offline RL. A further line of robust RL aims to optimize performance in the nominal environment from a corrupted dataset collected in that same environment <ref type="bibr">(Yang et al., 2023;</ref><ref type="bibr">Zhang et al., 2021b;</ref><ref type="bibr">2022)</ref>, which differs from our setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">EXPERIMENTS</head><p>We use numerical experiments to demonstrate the advantages of our framework in terms of robustness. We consider two sets of environments: simulated MDPs with controllable transition dynamics and Classic Control environments. Additional experiments are provided in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">SIMULATION MDPS</head><p>We first evaluate the performance of our algorithm on the Garnet problem <ref type="bibr">(Archibald et al., 1995)</ref>, a randomly generated MDP G(a, b, c) with a states, b actions, and c branches (see Appendix A for a more detailed description). Both the nominal kernels and the reward functions are generated randomly. The uncertainty set is constructed using the ℓ∞-norm, with radius R_{s,a} ∈ [0.1, 0.5].</p><p>We first generate a dataset of size N from the nominal kernel and apply both our double-pessimism algorithm and the single-pessimism baseline <ref type="bibr">(Yan et al., 2022)</ref> to learn policies. We then compute the robust value functions of the learned policies and plot the difference between these values and the optimal robust value functions, referred to as the optimality gap, in Figure <ref type="figure">1</ref>. The results are averaged over 10 runs, with the maximum and minimum gaps shown as an envelope around the average value. The results show that our double-pessimism algorithm converges to the true optimal robust value as the dataset size increases, maintaining a lower optimality gap, while the single-pessimism approach results in a larger gap. These findings demonstrate that our double-pessimism principle significantly enhances robustness while remaining model-free and scalable.</p></div>
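A Garnet instance G(a, b, c) of the kind described above can be generated along the following lines. This is our own minimal sketch; the paper's exact construction (e.g., the reward distribution) may differ in details:

```python
import numpy as np

def make_garnet(n_states, n_actions, n_branch, seed=0):
    """Random Garnet MDP G(n_states, n_actions, n_branch): each (s, a)
    transitions to n_branch randomly chosen next states with random
    probabilities; rewards and uncertainty radii are drawn randomly."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            nxt = rng.choice(n_states, size=n_branch, replace=False)
            w = rng.random(n_branch)
            P[s, a, nxt] = w / w.sum()          # valid distribution over branches
    R = rng.random((n_states, n_actions))        # nominal rewards
    radius = rng.uniform(0.1, 0.5, (n_states, n_actions))  # R_{s,a} in [0.1, 0.5]
    return P, R, radius
```

Calling `make_garnet(20, 30, 20)` then yields an instance matching the smallest configuration used in the experiments.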
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">CLASSIC CONTROL PROBLEMS</head><p>To further demonstrate the improvements in both scalability and robustness offered by our approach, we consider more complex Classic Control tasks from OpenAI Gym <ref type="bibr">(Brockman et al., 2016)</ref>, specifically MountainCar and CartPole (results are shown in Figure <ref type="figure">4</ref> in the Appendix). The dynamics of these environments are indirectly controlled by their physical parameters, e.g., the length of the pole in CartPole, and the gravity and the force in MountainCar, and it is of interest to improve robustness against uncertainty in these parameters. Since such model mismatches are hard to model explicitly, model-based approaches become ineffective, yet our model-free method remains applicable and effective in these scenarios.</p><p>For each dataset generated under the nominal environment with the default parameters, we implemented our algorithm alongside the baseline <ref type="bibr">(Yan et al., 2022)</ref> to learn policies. To evaluate the robustness of the learned policies, we test their performance in modified environments with parameter perturbations <ref type="bibr">(Pinto et al., 2017;</ref><ref type="bibr">Wang &amp; Zou, 2021)</ref>, where we randomly perturb these parameters within the range [-τ, τ] over 800 trials. As shown in Fig. <ref type="figure">2</ref>, our double-pessimism algorithm maintains a higher average performance under environment perturbations, demonstrating superior robustness, which aligns with our theoretical findings. Moreover, given the large-scale and complex dynamics of these environments, which are difficult for model-based approaches, our model-free algorithm effectively addresses these challenges, further demonstrating the scalability of our method. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">CONCLUSION</head><p>We explored offline RL with a focus on improving scalability and robustness simultaneously. We framed the problem as offline robust RL and developed a model-free algorithm to optimize the worst-case performance within an uncertainty set accounting for possible model mismatch. To address two key challenges, namely the uncertainty arising from the under-explored dataset and the model mismatch between the data collection and deployment environments, we introduced a double-pessimism principle that conservatively estimates the agent's performance in a model-free manner. Building on this, we designed a universal model-free algorithm that eliminates the need for model estimation, adapts to various uncertainty sets, and scales to large problems. We further analyzed its performance for the widely studied ℓα-norm uncertainty set, showing the near-optimal data efficiency of our approach. Our approach significantly improves the robustness, scalability, and efficiency of offline RL compared to existing methods, pushing the boundaries of offline RL research.</p><p>Peng Liao, Zhengling Qi, and Susan Murphy. Batch policy learning in average reward Markov decision processes. arXiv preprint arXiv:2007.11771, 2020. Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural trust region/proximal policy optimization attains globally optimal policy. In Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019. Ruo-Ze Liu, Zhen-Jia Pang, Zhou-Yu Meng, Wenhai Wang, Yang Yu, and Tong Lu. On efficient reinforcement learning for full-length game of StarCraft II. Journal of Artificial Intelligence Research, 75:213-260, 2022a. Xiao-Yin Liu, Xiao-Hu Zhou, Guo-Tao Li, Hao Li, Mei-Jiang Gui, Tian-Yu Xiang, De-Xing Huang, and Zeng-Guang Hou. Micro: Model-based offline reinforcement learning with a conservative Bellman operator. arXiv preprint arXiv:2312.03991, 2023. Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. 
Provably good batch reinforcement learning without great exploration. arXiv preprint arXiv:2007.08202, 2020. Zhishuai Liu and Pan Xu. Distributionally robust off-dynamics reinforcement learning: Provable efficiency with linear function approximation. arXiv preprint arXiv:2402.15399, 2024a. Zhishuai Liu and Pan Xu. Minimax optimal and computationally efficient algorithms for distributionally robust offline reinforcement learning. arXiv preprint arXiv:2403.09621, 2024b. Zijian Liu, Qinxun Bai, Jose Blanchet, Perry Dong, Wei Xu, Zhengqing Zhou, and Zhengyuan Zhou. Distributionally robust Q-learning. In International Conference on Machine Learning, pp. 13623-13643. PMLR, 2022b. Miao Lu, Han Zhong, Tong Zhang, and Jose Blanchet. Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near-optimal algorithm. arXiv preprint arXiv:2404.03578, 2024. Xiaoteng Ma, Zhipeng Liang, Jose Blanchet, Mingwen Liu, Li Xia, Jiheng Zhang, Qianchuan Zhao, and Zhengyuan Zhou. Distributionally robust offline reinforcement learning with linear function approximation. arXiv preprint arXiv:2209.06620, 2022. R&#233;mi Munos. Error bounds for approximate value iteration. In Proceedings of the National Conference on Artificial Intelligence, volume 20, pp. 1006. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005. Arnab Nilim and Laurent El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 839-846, 2004. Kishan Panaganti and Dileep Kalathil. Sample complexity of robust reinforcement learning with a generative model. In International Conference on Artificial Intelligence and Statistics, pp. 9582-9602. PMLR, 2022. Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, and Mohammad Ghavamzadeh. Robust reinforcement learning using offline data. arXiv preprint arXiv:2208.05129, 2022. 
Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In Proc. International Conference on Machine Learning (ICML), pp. 2817-2826. PMLR, 2017. Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34:11702-11716, 2021. Stephane Ross and J Andrew Bagnell. Agnostic system identification for model-based reinforcement learning. arXiv preprint arXiv:1203.1007, 2012. Bruno Scherrer. Approximate policy iteration schemes: A comparison. In Proc. International Conference on Machine Learning (ICML), pp. 1314-1322, 2014.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>A EXPERIMENTAL SETUP OF SECTION 8</head></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 GARNET PROBLEMS</head><p>For simulated MDP environments, we implement Algorithm 3 on the Garnet problems G(20, 30, 20), G(30, 50, 30), and G(50, 100, 50). Here, the branch number denotes the number of states that can be reached after taking an action. The uncertainty radius R_{s,a} is drawn uniformly at random from [0.1, 0.5] for all state-action pairs. The true robust expected values for the Garnet problems, over a given state distribution, can be obtained via the model-based robust value iteration method. For each problem, we first generate a stochastic behavior policy with partial coverage over state-action pairs. To obtain a near-optimal stochastic behavior policy, we compute the Q-values for the nominal kernel and apply a softmax transformation to assign probabilities to all state-action pairs. The randomness (i.e., optimality) of the behavior policy is controlled via the temperature parameter t_b = 1. State-action pairs with probabilities P_{s,a} ≤ 0.03 (for G(20, 30, 20)), P_{s,a} ≤ 0.02 (for G(30, 50, 30)), and P_{s,a} ≤ 0.01 (for G(50, 100, 50)) are then excluded to achieve partial coverage. 
Finally, the non-zero entries are re-normalized to maintain valid probability distributions. By deploying the behavior policy on the nominal kernel, 10 datasets are generated at each dataset size from T = 1000 to T = 20000. We compare the double-pessimism method with the single-pessimism method of <ref type="bibr">(Yan et al., 2022)</ref>. We set γ = 0.95, C_b = 1 × 10^{-4}, and δ = 0.02.</p></div>
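The behavior-policy construction described above (softmax over nominal Q-values, thresholding of small probabilities for partial coverage, then re-normalization) can be sketched as follows; `q_values`, `temperature`, and `threshold` are our own illustrative names:

```python
import numpy as np

def partial_coverage_policy(q_values, temperature=1.0, threshold=0.03):
    """Softmax behavior policy over a (S, A) table of Q-values, with
    small-probability actions pruned to induce partial coverage and
    the remaining entries re-normalized per state."""
    logits = q_values / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)            # softmax
    probs[probs <= threshold] = 0.0                      # drop rarely-taken pairs
    probs /= probs.sum(axis=1, keepdims=True)            # re-normalize
    return probs
```

Since the largest softmax probability in each state always exceeds any threshold below 1/A, at least one action survives the pruning step, so the re-normalization is well defined.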
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 CLASSIC CONTROL PROBLEMS</head><p>Note that in the Classic Control problems, the underlying uncertain environments may not be modeled by our perturbation-based uncertainty set in equation 8, but we can still implement our algorithms to enhance robustness. We generate the dataset according to a random behavior policy and implement Algorithm 3 with the radius R = τ. In our experiments, we set γ = 0.95, C_b = 1 × 10^{-4}, and δ = 0.02. After a policy is learned, we test its performance in perturbed environments with the parameters randomly generated from [-τ, τ] over 800 trials, and plot the average performance.</p></div>
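The robustness-evaluation protocol above (random parameter perturbations in [-τ, τ] around the nominal values, averaged over many trials) can be sketched generically; `run_episode` is an assumed placeholder for a policy rollout that accepts a parameter dictionary and returns an episode return:

```python
import numpy as np

def evaluate_under_perturbation(run_episode, nominal_params, tau,
                                n_trials=800, seed=0):
    """Perturb each environment parameter uniformly in [-tau, tau]
    around its nominal value and average the episode returns."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_trials):
        perturbed = {k: v + rng.uniform(-tau, tau)
                     for k, v in nominal_params.items()}
        returns.append(run_episode(perturbed))
    return float(np.mean(returns))
```

In practice `run_episode` would construct the Gym environment (e.g., CartPole with the perturbed pole length) and roll out the learned policy; here it is left abstract.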
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B ADDITIONAL EXPERIMENT RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 COMPARISONS IN TABULAR ENVIRONMENTS</head><p>In this section, we include additional experiment results under three simulated environments. Specifically, we consider the Frozen-Lake and Taxi environments from OpenAI Gym <ref type="bibr">(Brockman et al., 2016)</ref>, and the American Option problem <ref type="bibr">(Panaganti et al., 2022;</ref><ref type="bibr">Shi &amp; Chi, 2022;</ref><ref type="bibr">Zhou et al., 2021)</ref>.</p><p>The transition dynamics of these environments can be directly controlled, and we construct ℓ∞-norm uncertainty sets centered at their nominal kernels. Similarly, we trained our double-pessimism Q-learning together with the single-pessimism baseline, and plotted the optimality gap between the learned and optimal robust value functions. As the results in Figure <ref type="figure">3</ref> show, our double-pessimism Q-learning effectively obtains the optimal robust policy, whereas the single-pessimism Q-learning only achieves sub-optimal performance. The results hence indicate that our additional pessimism effectively enhances robustness against model uncertainty, verifying our theoretical results.</p><p>Figure 3: (a) Frozen-Lake, (b) Taxi, (c) American Option.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 SCALABLE ALGORITHM WITH FUNCTION APPROXIMATION: DOUBLE-PESSIMISM CQL</head><p>In this section, we extend the evaluation of our double-pessimism framework to large-scale problems using function approximation techniques. The algorithms presented earlier (Algorithm 1, Algorithm 3), while model-free, are designed for tabular settings and require O(SA) memory for the Q-table, making them less efficient for large-scale applications. To improve scalability, replacing the Q-table with low-dimensional function approximations (e.g., neural networks) to reduce memory costs is a widely adopted approach. 
On the other hand, existing offline RL algorithms like Conservative Q-learning (CQL, (Kumar et al., 2020)) and Implicit Q-learning (IQL, (Kostrikov et al., 2021)), along with others (Ross &amp; Bagnell, 2012; Laroche et al., 2019; Fujimoto et al., 2019; Kumar et al., 2019; Agarwal et al., 2020b; Liu et al., 2020; Jin et al., 2021; Xie et al., 2021a; Yin et al., 2021a; Rashidinejad et al., 2021; Xie &amp; Jiang, 2021; Jiang &amp; Huang, 2020), have focused solely on offline RL without model mismatch, resulting in degraded performance when model mismatch is present.</p><p>Aiming to enhance both robustness and scalability, we design and evaluate a double-pessimism CQL algorithm, demonstrating that our framework is not limited to tabular settings but can also be integrated with function approximation or deep RL techniques, significantly improving robustness against model mismatch. Specifically, we employ the CQL method to impose pessimism on the limited dataset, and further incorporate an additional penalty term into the robust Bellman operator estimation to effectively mitigate model mismatch. Based on this construction, we can similarly design a double-pessimism CQL algorithm, from which enhanced robustness is expected. To validate the effectiveness of our double-pessimism principle, we compare our double-pessimism CQL with the vanilla single-pessimism CQL under CartPole from OpenAI Gym. The policy is trained in the nominal environment and evaluated in randomly perturbed environments (perturbation radius &#964; ) over 800 trials. The results, shown in Figure <ref type="figure">5</ref>, display the average performance as solid curves, with envelopes representing standard deviations.</p><p>As the results indicate, our double-pessimism CQL consistently outperforms the vanilla CQL in perturbed environments, demonstrating enhanced robustness. 
This experiment confirms the universal applicability of our double-pessimism framework in improving robustness, regardless of the specific algorithm used. It also highlights the scalability of our approach, which can be integrated with advanced deep offline RL algorithms for large-scale problems using function approximation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 ABLATION EXPERIMENTS</head><p>Our double-pessimism principle addresses two key challenges: the first component tackles the limited dataset coverage in offline RL to handle out-of-distribution issues, while the second addresses the model mismatch between the data generation and deployment environments.</p><p>In this section, we conduct ablation experiments to evaluate the effectiveness of this principle. Specifically, we compare four algorithms in an offline setting: vanilla Q-learning (with no pessimism), robust Q-learning (with model-mismatch pessimism only), offline non-robust Q-learning (with dataset pessimism only), and our proposed offline robust Q-learning (with double pessimism). The experiments are conducted on two Garnet problems, where we evaluate the robust value functions of the learned policies with respect to an uncertainty set defined by the ℓ∞-norm.</p><p>The results are shown in Figure <ref type="figure">6</ref>. The solid curve represents the average value across 10 independent runs, while the shaded area indicates the maximum and minimum values observed.</p><p>Our double-pessimism approach outperforms the other three algorithms, including those with a single source of pessimism, demonstrating the effectiveness of our framework. Furthermore, the single-pessimism methods achieve better performance than the vanilla algorithm with no pessimism, highlighting the benefits of incorporating pessimism in offline robust RL. However, both are ultimately outperformed by our double-pessimism method, underscoring the importance of addressing both sources of uncertainty through the double-pessimism principle. 
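The four ablation variants differ only in which pessimism terms enter the Q-target; a schematic sketch in our own notation (`kappa` for the model-mismatch penalty, `b` for the dataset penalty; both assumed non-negative):

```python
def q_target(r, v_next, gamma, kappa=0.0, b=0.0,
             use_mismatch_pessimism=True, use_dataset_pessimism=True):
    """Schematic Q-target for the four ablation variants:
    vanilla (False, False), robust-only (True, False),
    offline-only (False, True), double-pessimism (True, True)."""
    k = kappa if use_mismatch_pessimism else 0.0
    p = b if use_dataset_pessimism else 0.0
    return r + gamma * (v_next - k) - p
```

With positive penalties, the double-pessimism target is the most conservative of the four, each single-pessimism target lies in between, and the vanilla target is the largest, mirroring the ordering of conservatism discussed above.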
</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>C FURTHER DISCUSSION OF κ</head></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 A UNIVERSAL CONSTRUCTION OF κ</head><p>In this section, we discuss the design of the penalty function κ for universal uncertainty set models defined by some distribution divergence/distance function F(·‖·):</p><p>Note that this uncertainty set includes perturbed environments within a region centered around the nominal kernel, effectively modeling environmental uncertainty in practical applications. This is because, in practice, perturbed environments should not deviate significantly from the nominal kernel and should therefore fall within a defined region.</p><p>We first present the following theorem for a universal construction of the penalty function κ. Theorem 4. Let κ(V) be the optimal value of the following constrained problem:</p><p>Then, κ(V) satisfies equation 15, i.e.,</p><p>Proof. Note that the problem in equation 25 is equivalent to the problem</p><p>The proof is then straightforward by noting that P ⊂ Q, hence</p><p>Such a result provides a universal construction of the penalty function κ for the perturbation-based uncertainty set in equation 24. Note that κ(V) depends on P, which is unknown in practice, but any unbiased estimate of it is sufficient. To illustrate this and show the generality of our design, we develop a case study for the χ²-divergence uncertainty set in the following section.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 CASE STUDY: ℓα-NORM UNCERTAINTY SET</head><p>In this section, we provide a more detailed discussion of the ℓα-norm uncertainty set. As discussed, we consider the relaxed ℓα-norm uncertainty set:</p><p>𝒫_{s,a} = {P_{s,a} + q :</p><p>where we relax the condition P_{s,a} + q ≥ 0. Then the worst-case transition w.r.t. 𝒫 can be derived as</p><p>where</p><p>For popular choices of α, the optimization problem in equation 31 has a closed-form solution, specified in Table <ref type="table">2</ref> <ref type="bibr">(Kumar et al., 2023)</ref>. 
</p><p>Table 2: Penalty term for the ℓα-norm uncertainty set.</p><p>Note that for the three choices of α = 1, 2, ∞, the resulting penalty terms incur a computational complexity of O(S). When combined with our algorithm, this leads to an overall implementation complexity of O(SA) per step. In contrast, the model-based methods proposed in <ref type="bibr">(Shi &amp; Chi, 2022;</ref><ref type="bibr">Blanchet et al., 2023)</ref> have a computational cost of O(S²A) per step <ref type="bibr">(Kumar et al., 2023)</ref>, highlighting the superior efficiency and scalability of our approach.</p><p>Our algorithm and theoretical results then characterize the convergence to the optimal robust policy w.r.t. 𝒫. More importantly, when the uncertainty radius R is small, the relaxation is not active, i.e., the relaxed uncertainty set coincides with the original one <ref type="bibr">(Zhou et al., 2024)</ref>.</p></div>
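As a concrete illustration of the O(S) cost of these penalty terms, the sketch below computes the dual-norm penalties commonly derived for a relaxed ℓα-ball with the sum-to-zero constraint, namely R·min_ω ‖V − ω·1‖_β with β the Hölder conjugate of α (the half-span for α = 1, the mean-centered ℓ2 norm for α = 2, and the ℓ1 distance to the median for α = ∞). This is our own rendering; the exact constants in the paper's Table 2 may differ:

```python
import numpy as np

def l_alpha_penalty(V, radius, alpha):
    """Penalty radius * min_over_omega ||V - omega * 1||_beta, with beta
    the Hoelder conjugate of alpha; each case costs O(S)."""
    V = np.asarray(V, dtype=float)
    if alpha == 1:          # beta = inf: minimized at the midrange
        return radius * (V.max() - V.min()) / 2.0
    if alpha == 2:          # beta = 2: minimized at the mean
        return radius * np.linalg.norm(V - V.mean())
    if alpha == np.inf:     # beta = 1: minimized at the median
        return radius * np.abs(V - np.median(V)).sum()
    raise ValueError("alpha must be 1, 2, or inf")
```

Each branch touches every entry of V exactly once (plus a sort for the median), which is the O(S) per-(s, a) cost quoted in the text, as opposed to the O(S²) cost of recomputing a full transition row product.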
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3 CASE STUDY: χ² UNCERTAINTY SET</head><p>We adapt the construction we obtained to the widely used χ²-divergence as a case study. The design of κ for other uncertainty sets can be obtained in a similar way.</p><p>Specifically, the uncertainty set for the (s, a)-pair is</p><p>where</p><p>is the χ²-divergence. We aim to design a model-free function κ that serves as the penalty term to address the uncertainty from the model mismatch.</p><p>We first establish the following lemma. Lemma 5. The constrained problem</p><p>has the solution -√(R_{s,a} Var_{P_{s,a}}(V)).</p><p>(34)</p><p>Proof. To simplify the notation, we omit the subscript s, a from P_{s,a} and R_{s,a}. We note that if any entry P_i = 0, then any feasible q_i = 0, since otherwise the χ²-divergence would be infinite. Thus, we can simply ignore the i-th entry in this case and only consider the remaining ones. Hence, we assume P_i &gt; 0 for all i without loss of generality.</p><p>Note that the condition</p><p>hence the Lagrangian function L of the constrained problem is</p><p>From the KKT conditions <ref type="bibr">(Bertsekas, 1997)</ref>, the solution q* and the Lagrangian multipliers λ* and µ* must satisfy</p><p>We first show that if q* is the optimal solution, then D_{χ²}(q* + P‖P) &lt; R. To examine whether this statement always holds, we claim that there exists an optimal solution such that µ* = 0; then we have</p><p>and hence,</p><p>To show that this claim is impossible, we provide a counterexample demonstrating that µ* = 0 and</p><p>where p_1, p_2 ≠ 0. 
We have q_2 = -q_1; then</p><p>Hence,</p><p>Published as a conference paper at ICLR 2025</p><p>The optimal value of the optimization problem is</p><p>Obviously, the optimization problem does not attain an optimal solution, but only an infimum. There always exists a feasible solution q such that</p><p>which means that µ* ≠ 0 always holds.</p><p>Thus,</p><p>and hence,</p><p>where we use the constraints Σ_i q*_i = 0 and Σ_i (q*_i)²/P_i = R. Again, from equation 37, we have that</p><p>and hence,</p><p>Taking the sum over i implies that</p><p>and hence,</p><p>On the other hand, note that equation 37 further implies that</p><p>and hence,</p><p>Plugging in equation 49 implies that</p><p>Hence, from equation 45, the optimal value of the constrained problem is -√(R Var_P(V)), which completes the proof.</p><p>With the optimal solution to equation 33, we can then design the penalty function κ for the χ² uncertainty set defined in equation 32. First, we note that equation 33 is a relaxation of the support function over equation 32; therefore, the optimal value of equation 33 is no greater than σ_P(V) and thus serves as a pessimistic penalty for the model mismatch. We thus design the penalty function as</p><p>We note that in the model-free setting, it is straightforward to obtain an unbiased estimate of the inner variance term, which, however, requires more than one sample. Specifically, for n i.i.d. samples (s, a, s′_i), i = 1, ..., n, the model-free penalty function is defined as</p><p>where</p><p>. Such a penalty function satisfies condition equation 15 of the pessimism principle, and hence we can extend our model-free algorithms to the χ²-divergence model. We present the algorithm for the infinite-horizon setting in Algorithm 2. 
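The multi-sample penalty estimate described above can be sketched as follows, using the unbiased sample variance of the n ≥ 2 sampled next-state values V(s′_i); this is our own rendering, and, as the surrounding discussion notes, the square root makes the overall estimate of √(R·Var) biased even though the inner variance estimate is unbiased:

```python
import numpy as np

def chi2_penalty_estimate(v_samples, radius):
    """Model-free estimate sqrt(radius * sample_variance) of the
    chi-square penalty, from n >= 2 i.i.d. next-state values V(s');
    requires no knowledge of the transition kernel P_{s,a}."""
    v = np.asarray(v_samples, dtype=float)
    if v.size < 2:
        raise ValueError("need at least two samples to estimate the variance")
    var_hat = v.var(ddof=1)               # unbiased (n - 1) variance estimate
    return np.sqrt(radius * max(var_hat, 0.0))
```

With exactly n = 2 samples this reduces to √(R·(V(s′_1) − V(s′_2))²/2), matching the remark that two samples per step suffice for the χ²-divergence model.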
Different from Algorithm 3, for the χ²-divergence model we require two samples at each step to estimate κ. However, the estimation does not require any information about P_{s,a}, and hence Algorithm 2 is still model-free.</p><p>Algorithm 2: Double-Pessimism Q-Learning for infinite-horizon RMDPs with the χ²-divergence uncertainty set. Output: π(s) = arg max_{a∈A} Q_T(s, a), ∀s.</p><p>Note that in general the penalty function κ is biased; thus the algorithm may not converge to the optimal robust policy. However, robustness can still be enhanced due to the additional pessimism. We validate the effectiveness of our algorithm in optimizing performance under model mismatch in an offline setting through numerical experiments. Specifically, we implemented our algorithm alongside the baseline single-pessimism Q-learning algorithm <ref type="bibr">(Yan et al., 2022)</ref> on Garnet problems with varying parameters, and on three simulation environments: Frozen-Lake, Taxi, and American Option. Using datasets of different sizes, we computed the robust value function of the learned policy via dynamic programming <ref type="bibr">(Iyengar, 2005)</ref>, and plotted the results in Figure <ref type="figure">7</ref> and Figure <ref type="figure">8</ref>. Each curve represents the average over 10 independent runs, with the shaded region indicating the maximum and minimum values. As demonstrated in the results, our double-pessimism Q-learning significantly outperforms the single-pessimism approach, showcasing the robustness of our algorithm to model mismatch and confirming the efficacy of our double-pessimism design.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>D ANALYSIS OF THE FINITE-HORIZON SETTING</head></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1 NOTATION</head><p>Recall the learning rate defined by</p><p>for the n-th visit of a given state-action pair at a given time step h. 
We further adopt two sequences of related quantities, for integers N ≥ 0 and n ≥ 1, from <ref type="bibr">(Shi et al., 2022)</ref>:</p><p>It has been shown in <ref type="bibr">(Shi et al., 2022;</ref><ref type="bibr">Yan et al., 2022)</ref> that</p><p>We also introduce the following notation:</p><p>• N^k_h(s, a), or simply N^k_h: the number of episodes that have visited the state-action pair (s, a) at step h before the start of the k-th episode.</p><p>• k^n_h(s, a), or simply k^n: the index of the episode in which the state-action pair (s, a) is visited at step h for the n-th time. We adopt the convention that k^0 = 0.</p><p>• P^k_h ∈ {0, 1}^{1×S}: a row vector corresponding to the empirical transition at step h of the k-th episode, defined as</p><p>• π^k: the deterministic greedy policy at the beginning of the k-th episode.</p><p>• π: the final output of the algorithm, corresponding to π^{K+1} as defined above. For simplicity in our analysis, we treat π as π^K, which does not affect the result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 LEMMAS FOR THEOREM 2</head><p>In this section, we present the lemmas that are utilized in the proof of Theorem 2.</p><p>The first lemma demonstrates how our choice of the penalty term κ addresses the uncertainty arising from model mismatch.</p><p>Lemma 6. (Theorem 1 in <ref type="bibr">(Kumar et al., 2023</ref>)) Let 𝒫_{s,a} be the uncertainty set defined using the ℓα-norm. For any vector V, the following relationship holds:</p><p>where κ is defined as in equation 18.</p><p>The following lemma provides properties concerning the learning rates and is adapted from <ref type="bibr">(Jin et al., 2018;</ref><ref type="bibr">Li et al., 2021)</ref>.</p><p>Lemma 7 (Lemma 1 in <ref type="bibr">(Li et al., 2021)</ref>). For any integer N &gt; 0, the following properties hold:</p><p>The following lemmas concern the concentration properties of the sample generation.</p><p>The first lemma below is adapted from <ref type="bibr">Xie et al. (2021b, Lemma A.1)</ref>.</p><p>Lemma 8. (Lemma 8 in <ref type="bibr">(Shi et al., 2022)</ref>) Suppose N ∼ Binomial(n, p), where n ≥ 1 and p ∈ [0, 1]. For any δ ∈ (0, 1), we have</p><p>and</p><p>with probability at least 1 - 4δ.</p><p>The following theorem is a standard concentration inequality result.</p><p>Theorem 9 (Freedman's inequality <ref type="bibr">(Freedman, 1975)</ref>). Consider a filtration</p><p>and let E_k stand for the expectation conditioned on</p><p>for all k ≥ 1 for some quantity R &lt; ∞. We also define</p><p>In addition, suppose that W_n ≤ σ² holds deterministically for some given quantity σ² &lt; ∞. Then, for any positive integer m ≥ 1, with probability at least 1 - δ, one has</p><p>Freedman's inequality further implies several important results related to our problem. 
Lemma 10. Let</p><p>be a collection of vectors satisfying the following properties:</p><p>h is fully determined by the samples collected up to the end of the (h-1)-th step of the i-th episode;</p><p>For any positive integer N ≥ H, consider the following sequence:</p><p>With probability at least 1 - δ,</p><p>holds simultaneously for all (k, h, s, a, N). From equation 61b in Lemma 7, we have</p><p>we can apply Lemma 7 from <ref type="bibr">(Li et al., 2021)</ref> to obtain, with probability at least 1 - δ,</p><p>where the final line uses equation 61b from Lemma 7 again.</p><p>Lemma 11. Let</p><p>be a collection of vectors satisfying the following properties:</p><p>) is fully determined by the given state-action pair (s, a) and the samples collected up to the end of the (k-1)-th episode;</p><p>For any constant C_d ≥ 0, consider the following sequences:</p><p>Consider any δ ∈ (0, 1). Then, with probability at least 1 - δ,</p><p>hold simultaneously for all h ∈ [H].</p><p>Proof. The proof follows similarly from <ref type="bibr">(Shi et al., 2022)</ref>.</p><p>We then prove Lemma 1, showing the effectiveness of our double-pessimism principle, i.e., that our estimate is a conservative estimate of the robust value function.</p><p>Lemma 12. Consider any δ ∈ (0, 1), and suppose that c_b &gt; 0 is some sufficiently large constant. Then, with probability at least 1 - δ,</p><p>holds simultaneously for all (k, h, s).</p><p>Proof. Proof of inequality equation 71. We show it by invoking Lemma 10. Let</p><p>where the first equation is from Lemma 6. 
Hence applying Lemma 10 implies that with probability at least 1 -&#948;,</p><p>holds simultaneously for all (s, a, k, h)</p><p>holds simultaneously for all s, a, h, k &#8712; S &#215; A &#215; [H] &#215; [K], which follows directly from the property equation 61a in Lemma 7.</p><p>Combining the above, equation 74 and equation 76 hence imply that</p><p>Proof of inequality equation 72. Note that the second inequality of equation 72 is straightforward as</p><p>s) holds for any policy &#960;. As a consequence, it suffices to establish the first inequality of equation 72:</p><p>Define</p><p>We need to verify</p><p>Step 1: base case.</p><p>Let us begin with the base case when h + 1 = H + 1 for all episodes k &#8712; [K]. Recognizing the fact that V &#960; H+1 = V k H+1 = 0 for any &#960; and any k &#8712; [K], we directly arrive at</p><p>Step 2: induction. To justify equation 80 under the induction hypothesis equation 79, we decompose the difference term to obtain</p><p>where the last line holds since V h (s) has not been updated during episodes <ref type="formula">78</ref>). We shall prove that the right-hand side of equation 82 is non-negative by discussing the following two cases separately.</p><p>Case 1. Consider the case where</p><p>holds for all (k, h)</p><p>To continue, we turn to controlling a more general term </p><p>This relation combined with equation 106 allows us to express the difference between</p><p>Here, (a) invokes the robust Bellman equation</p><p>, owing to the induction hypothesis in equation 79 as well as the monotonicity of V h+1 in Lemma 12. Consequently, it follows from equation 85 that</p><p>(86) for all state-action pair (s, a), where the last inequality holds due to the bound in equation 71 in Lemma 12. 
Plugging the above result into equation 84 directly establishes that</p><p>which follows from the definition of k o (h) in equation 78 and the corresponding fact in equation 83.</p><p>We also make note of the fact that</p><p>which holds since V h (s) (and hence &#960; h (s)) has not been updated during episodes k o (h), k o (h) + 1, &#8226; &#8226; &#8226; , k -1 (in view of the definition equation 78). Combining the above two results, we can show that</p><p>where the final line can be verified using exactly the same argument as in the previous case to show equation 85 and then equation 87. Here, we omit the proof of this step for brevity.</p><p>To conclude, substituting the relations equation 87 and equation 90 in the above two cases back into equation 82, we arrive at</p><p>&#8805; 0 as desired in equation 80. This immediately completes the induction argument.</p><p>Lemma 13. With probability at least 1 -&#948;, it holds that</p><p>Proof. It is sufficient to show that</p><p>Define two auxiliary sequences {Y h,k } K k=1 and {Z h,k } K k=1 which are the empirical estimates of A h,k and B h,k , respectively. For any time step h in episode k, Y h,k and Z h,k are defined as follows</p><p>Note that</p><p>Here, (a) holds by replacing k n (s k h , a k h ) with l and gathering all terms that involve V &#8902; h+1 -V l h+1 ; in the last line, we have invoked the property</p><p>) together with the fact V &#8902; h+1 -V l h+1 &#8805; 0 (see Lemma 12), and have further replaced l with k. With the above relation in hand, in order to verify equation 93, we further decompose A h into several terms</p><p>where (a) follows from equation 94.</p><p>As a result, it remains to control</p><p>) separately in the following.</p><p>Step 1: controlling K k=1 (A h,k -Y h,k ). We shall first control this term by means of Lemma 11. Specifically, consider</p><p>which satisfies</p><p>Here, we use the fact that &#951;</p><p>N k h n = 1 (see equation 56 and equation 58). 
Then, applying Lemma 11 with equation 96, we have that, with probability at least 1 -δ, the following inequality holds:</p><p>where the last inequality is from</p><p>Step 2: controlling</p><p>and let us consider</p><p>(101) Similarly, in view of Lemma 11, we can show that with probability at least 1 -δ,</p><p>Step 3: putting all this together. Substituting the results in equation 98 and equation 102 back into equation 95 completes the proof of equation 93 as follows</p><p>This hence completes the proof.</p><p>Lemma 14.</p></div>
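<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, since the displayed bound of Theorem 9 is not reproduced above, we record the classical form of Freedman's inequality here; the refined version with the variance truncation parameter m used in the proofs follows <ref type="bibr">(Li et al., 2021)</ref>. With Y n := X 1 + • • • + X n for a martingale difference sequence {X k } satisfying X k ≤ R, and W n the sum of the conditional variances E k-1 [X 2 k ]:</p><p>

```latex
% Classical Freedman inequality (Freedman, 1975):
% for any \tau > 0 and \sigma^2 > 0,
\mathbb{P}\left\{ \exists\, n:\; Y_n \ge \tau \ \text{ and } \ W_n \le \sigma^2 \right\}
\;\le\; \exp\!\left( - \frac{\tau^2}{2\sigma^2 + 2R\tau/3} \right).
```
</p></div>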
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Denote the term ∑ K k=1 ∑ (s,a)∈S×A d π ⋆ P,h (s, a) η b n by I h . Consider any δ ∈ (0, 1). With probability at least 1 -δ, we have</p><p>where we recall that ι := log SAT δ .</p><p>Proof. The proof can be obtained by directly following the proof in <ref type="bibr">(Shi et al., 2022)</ref>, and is hence omitted here.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 PROOF OF THEOREM 2</head><p>We then proceed to the proof. Theorem 15. (Restatement of Theorem 2) Consider any &#948; &#8712; (0, 1). Suppose that the behavior policy &#181; satisfies Assumption 1. There exists some universal constant c a , such that if we set &#953; := log SAT &#948; and set T &gt; SC &#8902; &#953;, then the policy &#960; returned by Algorithm 1 satisfies</p><p>with probability at least 1 -&#948;.</p><p>Proof. For any state-action pair (s, a), according to the update rule specified in Algorithm 1, we have</p><p>where the first identity holds because k N k h denotes the most recent episode before k that visits (s, a) at step h, and the learning rate is defined as in equation 55. Note that k &gt; k N k h always holds. Applying the above relation recursively and using the notation defined in equation 56, we obtain</p><p>Applying Lemma 12, the optimality gap term equation 104 can be decomposed as follows</p><p>where (a) follows from Lemma 12 (i.e., V &#960; K 1 (s) &#8805; V K 1 (s) for all s &#8712; S), (b) results from the monotonicity property in Lemma 12, and the final equality holds because d &#960; &#8902; 1 (s) = &#961;(s). We then bound the right-hand side of equation 107. Since &#960; &#8902; is a deterministic policy, d &#960; &#8902; P,h (s) = d &#960; &#8902; P,h (s, &#960; &#8902; (s)). And from the fact that</p><p>for any h &#8712; [H], where the last identity holds because</p><p>To further bound the term <ref type="bibr">a)</ref> in equation 108, we first adapt equation 58 and have that</p><p>where the second line follows from the robust Bellman's optimality equation. Combining equation 106 and equation 110 implies that</p><p>where (a) is from Lemma 6 and the definition of</p><p>), and the last inequality follows from the fact</p><p>and equation 71 in Lemma 12. 
Plugging equation 112 into equation 108, we have that K k=1 s∈S</p><p>We then bound the last term on the right-hand side of equation 113. Applying Lemma 13 implies that K k=1 s∈S</p><p>Recursively applying equation 114 over the time steps</p><p>Finally, to bound the right-hand side of equation 115, we combine Lemma 14 and equation 107, which yields</p><p>for some sufficiently large constant c a &gt; 0, where the last inequality is valid as long as T &gt; SC ⋆ ι.</p><p>This hence completes the proof of Theorem 2.</p></div>
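<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a numerical sanity check on the Binomial concentration used throughout the proofs above (Lemma 8), the following sketch verifies a generic Bernstein-type deviation radius by Monte Carlo. The constants below are the textbook Bernstein ones, not the exact constants of <ref type="bibr">(Shi et al., 2022)</ref>, and all function and variable names are illustrative.</p><p>

```python
import math
import random

def bernstein_radius(n, p, delta):
    """Bernstein-type deviation radius for Binomial(n, p): with probability
    at least 1 - delta, |N - n*p| <= sqrt(2*n*p*(1-p)*L) + (2/3)*L,
    where L = log(2/delta)."""
    L = math.log(2.0 / delta)
    return math.sqrt(2.0 * n * p * (1.0 - p) * L) + (2.0 / 3.0) * L

def coverage(n, p, delta, trials, seed=0):
    """Fraction of Monte Carlo trials in which a Binomial(n, p) draw
    stays within the Bernstein radius of its mean n*p."""
    rng = random.Random(seed)
    r = bernstein_radius(n, p, delta)
    hits = 0
    for _ in range(trials):
        N = sum(rng.random() < p for _ in range(n))  # one Binomial(n, p) draw
        hits += abs(N - n * p) <= r
    return hits / trials

cov = coverage(n=500, p=0.3, delta=0.05, trials=2000)
```
</p><p>With these parameters the empirical coverage comfortably exceeds 1 -δ, mirroring the high-probability form of the guarantee in Lemma 8.</p></div>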
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E ANALYSIS OF THE INFINITE HORIZON SETTING</head><p>In this section, we present the analysis of infinite-horizon robust MDPs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.1 ALGORITHM FOR INFINITE HORIZON</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.2 NOTATION</head><p>The notation used in the proof for the infinite horizon setting is largely similar to that used in the finite horizon case. For any state s &#8712; S and action a &#8712; A, we define:</p><p>to be the (s, a)-th row of a probability transition matrix P &#8712; R SA&#215;S .</p><p>Algorithm 3 Double-Pessimism Q-Learning for infinite-horizon RMDPs.</p><p>Input: D, target success probability 1 -&#948;, uncertainty set radius R, &#915; = 4 1-&#947; log ST &#948; , penalty function &#954; Initialize: Q 0 (s, a) = 0, V 0 (s) = 0, n 0 (s, a) = 0, &#8704;s, a for t = 1, ..., T do Sample a sample (s t-1 , a t-1 , s t ) from D n t (s t-1 , a t-1 ) &#8592; n t-1 (s t-1 , a t-1 ) + 1; n t (s, a)</p><p>For any t &#8805; 0, we define P t &#8712; R SA&#215;S to be an empirical probability transition matrix, given by:</p><p>for all s, s &#8242; &#8712; S and a &#8712; A.</p><p>For any deterministic policy &#960;, we introduce two probability transition kernels: P &#960; : S &#8594; &#8710;(S) and P &#960; : S &#215; A &#8594; &#8710;(S &#215; A), defined as follows:</p><p>for any (s, a), (s &#8242; , a &#8242; ) &#8712; S &#215; A.</p><p>Additionally, we define &#961; &#960; &#8902; to be a distribution over S &#215; A such that:</p><p>For any sequence {a i } n2 i=n1 and two integers m 1 and m 2 , we define:</p><p>Lemma 16. (Lemma 4.1 in <ref type="bibr">(Jin et al., 2018)</ref>, Lemma 1 in <ref type="bibr">(Li et al., 2021)</ref>) Recall the learning rates are</p><p>where &#951; j = (&#915; + 1)/(&#915; + j). Then 1. For any integer t &#8805; 1, t i=1 &#951; t i = 1 and &#951; t 0 = 0. 2. For any integer t &#8805; 1 and any 1/2 &#8804; a &#8804; 1,</p></div>
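<div xmlns="http://www.tei-c.org/ns/1.0"><p>The learning-rate properties in Lemma 16 are easy to verify numerically. The sketch below (with illustrative choices of Γ and t, not values prescribed by the algorithm) checks that the unrolled weights η t i sum to one, that the weight η t 0 on the initialization vanishes because η 1 = 1, and that the recursive update Q ← (1 -η)Q + η · target reproduces the unrolled weighted average of the targets.</p><p>

```python
import random

# Learning rates from Algorithm 3 / Lemma 16: eta_j = (Gamma + 1) / (Gamma + j).
# GAMMA and t are illustrative choices, not values prescribed by the paper.
GAMMA = 8.0

def lr(j):
    return (GAMMA + 1.0) / (GAMMA + j)

def unrolled_weights(t):
    """Weights eta_i^t = eta_i * prod_{j=i+1}^t (1 - eta_j) from unrolling the update."""
    w = [lr(i) for i in range(1, t + 1)]
    for i in range(t):
        for j in range(i + 2, t + 1):
            w[i] *= 1.0 - lr(j)
    w0 = 1.0
    for j in range(1, t + 1):
        w0 *= 1.0 - lr(j)  # eta_1 = 1, so this product is exactly zero
    return w0, w

t = 50
w0, w = unrolled_weights(t)

# The recursive update must equal the unrolled weighted average of the targets,
# with the arbitrary initialization forgotten entirely (since w0 = 0).
random.seed(0)
targets = [random.random() for _ in range(t)]
q = 123.0  # arbitrary initialization
for i, y in enumerate(targets, start=1):
    eta = lr(i)
    q = (1.0 - eta) * q + eta * y
q_unrolled = sum(wi * yi for wi, yi in zip(w, targets))
```
</p><p>The vanishing initialization weight is exactly why the Q-estimates produced by the update rule are driven entirely by the observed targets, which is what the decompositions in the proofs above rely on.</p></div>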
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">For any integer</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">For any integer</head><p>We then present the following lemma to establish an upper bound on Q &#8902; -Q t , and simultaneously justify that the value function estimate V t is always a pessimistic view of V &#960;t (and hence V &#8902; ). Lemma 17. With probability exceeding 1 -&#948;, for all s &#8712; S and t &#8712; [T ], it holds that</p><p>where n = n t (s, &#960; &#8902; (s)) and we define</p><p>in addition, we also have</p><p>Proof. Proof of equation 121. Consider any given pair (s, a) &#8712; S &#215; A and denote n = n t (s, a), the total number of times that (s, a) has been visited prior to time t. Set k 0 = -1, and let</p><p>for each 1 &#8804; i &#8804; T . Clearly, each k i is a stopping time. In view of the update rule, we have</p><p>which together with the robust Bellman optimality equation gives</p><p>the proof of this claim ( <ref type="formula">127</ref>) is deferred to later. As a consequence, for every s &#8712; S and t &#8712; [T ], there exists j(t) &#8712; [t] such that</p><p>&#8805; min &#947; &#963; s,&#960; j(t) (s) (V &#960;t ) -&#963; s,&#960; j(t) (s) (V t ) , 0 .</p><p>Here, (a) and (b) hold since the update rule asserts that there must exist some j(t) &#8804; t such that</p><p>) and &#960; t (s) = &#960; j(t) (s); (c) utilizes (127); and (d) follows from the monotonicity of V t in t (by construction). By setting</p><p>we can deduce that</p><p>which together with the assumption 0 &lt; &#947; &lt; 1 immediately gives</p><p>for every s &#8712; S, we conclude the proof.</p><p>Now we show equation 127. First of all, if n t s, &#960; t (s) = 0, then for all j &#8712; [t], Q j s, &#960; t (s) = 0 since it is never updated; therefore, (127) holds true. From now on, we shall only focus on the case when n t s, &#960; t (s) &#8805; 1.</p><p>Consider any s &#8712; S, t &#8712; [T ] and j &#8712; [t]. For the moment, let us define {k i } T i=1 w.r.t. 
the state-action pair s, &#960; t (s) in the same way as (123). We can then repeat the argument in (124) to decompose</p><p>Lemma 18. (Lemma 3 in <ref type="bibr">(Yan et al., 2022)</ref>)With probability exceeding 1 -&#948;, we have</p><p>Lemma 19. (Lemma 5 in <ref type="bibr">(Yan et al., 2022)</ref>) We can construct an auxiliary set of random variables s i k , a i k : 1 &#8804; k &#8804; K -1 satisfying</p><p>&#8764; &#181; b , (130a)</p><p>and s i k , a i k is independent of (s t , a t ) : 0 &#8804; t &#8804; (k -1) &#964; + i .</p><p>Lemma 20. (Lemma 4 in <ref type="bibr">(Yan et al., 2022)</ref>) Let &#915; = 4 1-&#947; log ST &#948; for some 0 &lt; &#948; &lt; 1. For any vector with non-negative entries V &#8712; R d , we have</p><p>E.4 PROOF OF THEOREM 3</p><p>Following <ref type="bibr">(Yan et al., 2022)</ref>, we similarly define the following terms first:</p><p>&#952; j := &#947; 1 + 1 &#915; 3 j T t=1 s&#8712;S &#961;(P &#960; &#8902; ) j s, &#960; &#8902; (s) min &#946; nt(s,&#960; &#8902; (s)) s, &#960; &#8902; (s) , 1 1 -&#947; , &#958; j := &#947; 1 + 1 &#915; 3 j t mix (&#948;) t=1 &#961;(P &#960; &#8902; ) j , V &#8902; -V t + &#947; 1 + 1 &#915; 3 j+1 &#961;(P &#960; &#8902; ) j+1 , V &#8902; -V 0 , &#968; j := &#947; 1 + 1 &#915;</p><p>3 j T t=t mix (&#948;) s&#8712;S,a&#8712;A &#961; &#960; &#8902; (P &#960; &#8902; ) j (s, a) nt(s,a) i=1 &#951; nt(s,a) i P s,a V &#8902; -V ki(s,a) -1 + 1 &#915; &#961; &#960; &#8902; (P &#960; &#8902; ) j (s t , a t ) &#181; b (s t , a t ) nt(st,at) i=1 &#951; nt(st,at) i P st,at V &#8902; -V ki(st,at) , &#981; j := &#947; j+1 1 + 1 &#915; 3j+2 T t=0 1 (st,at)&#8712;I &#961; &#960; &#8902; (P &#960; &#8902; ) j (s t , a t ) &#181; b (s t , a t ) P st,at (V &#8902; -V t ) -1 + 1 &#915; s&#8712;S,a&#8712;A</p><p>where we recall the definition of I in equation 129.</p><p>We then proceed to the proof. Theorem 21. (Restatement of Theorem 3) Consider any &#948; &#8712; (0, 1). 
Suppose that the behavior policy &#181; satisfies Assumption 2. The policy &#960; returned by Algorithm 3 satisfies</p><p>with probability at least 1 -&#948;.</p><p>Proof. Note that</p><p>Here, (a) holds true according to Lemma 17; (b) follows from the monotonicity of V t in t (by construction); and (c) follows simply from the definition of &#945; 0 . We then turn attention to bounding &#945; 0 , towards which we observe that .</p><p>Here, the first identity holds since V &#8902; (s) = Q &#8902; s, &#960; &#8902; (s) and 0 &#8804; V &#8902; (s) -V t &#8804; 1/(1 -&#947;) for all s &#8712; S, the second line relies on the fact that V t (s) &#8805; max a Q t (s, a) &#8805; Q t (s, &#960; &#8902; (s)), while the last line invokes Lemma 17. With probability exceeding 1 -&#948;, the first term &#950; can be upper bounded by 1{(s t , a t ) &#8712; I} &#961; &#960; &#8902; (s t , a t ) &#181; b (s t , a t ) nt(st,at) i=1 &#951; nt(st,at) i P st,at V &#8902; -V ki(st,at) + &#968; 0 + (1 + R) t mix (&#948;) t=1 &#10216;&#961;, V &#8902; -V t &#10217; + &#952; 0 (b) &#8781; &#947; 1 + 1 &#915; T t=t mix (&#948;) 1{(s t , a t ) &#8712; I} &#961; &#960; &#8902; (s t , a t ) &#181; b (s t , a t ) &#63723; &#63725; n T (st,at) j=nt(st,at) &#951; j nt(st,at) &#63734; &#63736; P st,at (V &#8902; -V t ) + &#968; 0 + (1 + R) t mix (&#948;) t=1 &#10216;&#961;, V &#8902; -V t &#10217; + &#952; 0 (c) &#8804; &#947; 1 + 1 &#915; 2 T t=0 1{(s t , a t ) &#8712; I} &#961; &#960; &#8902; (s t , a t ) &#181; b (s t , a t ) P st,at (V &#8902; -V t ) + &#968; 0</p><p>where we remind the reader of our notation &#961; &#960; &#8902; in equation 119. Here, (a) is valid (i.e., &#961;(s t , a t )/&#181; b (s, a) is well defined for t &#8805; t mix (&#948;)) due to Lemma 18; (b) holds by grouping the terms in the previous line; and (c) utilizes Lemma 16 and the property that V &#8902; &#8805; V t (cf. Lemma 17). Therefore, we arrive at</p><p>where we have used the definition of &#958; 0 . 
Repeating the same argument, we reach α j ≤ α j+1 + ξ j + θ j + ψ j + ϕ j for all j ≥ 1. This in turn allows us to conclude that α 0 ≤ lim sup j→∞ α j + ∑ ∞ j=0 ξ j + ∑ ∞ j=0 θ j + ∑ ∞ j=0 ψ j + ∑ ∞ j=0 ϕ j , where we define α := lim sup j→∞ α j , ξ := ∑ ∞ j=0 ξ j , θ := ∑ ∞ j=0 θ j , ψ := ∑ ∞ j=0 ψ j , and ϕ := ∑ ∞ j=0 ϕ j .</p><p>We will then bound the terms α, ξ, θ, ψ and ϕ separately in the subsequent steps. Our proofs are similar to the ones in <ref type="bibr">(Yan et al., 2022)</ref>, hence we omit the repeated parts.</p><p>Bounding α. The bound is similar to <ref type="bibr">(Yan et al., 2022)</ref>. It is first observed that α = lim sup</p><p>Bounding ξ. By utilizing (131), it holds that</p><p>Bounding θ. Following <ref type="bibr">(Yan et al., 2022)</ref>, we have that, for any (s, a) ∈ S × A, this equation exactly matches Step 2.4 in <ref type="bibr">(Yan et al., 2022)</ref>; hence the remaining proof follows similarly and is omitted here. Specifically, we have that</p><p>Bounding ϕ. Similar to <ref type="bibr">(Yan et al., 2022)</ref>, we can employ an analogous argument to show that ϕ can be bounded as</p><p>Now, plugging in the bounds on α, ξ, θ, ψ and ϕ further implies that</p><p>(1 -γ) 3 log 2 T δ .</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>for any (h, k, s) ∈ [H] × [K] × S, which denotes the index of the latest episode -before the end of the (k -1)-th episode -in which V h (s) has been updated. We abbreviate k o (h, k, s) as k o (h) whenever it is clear from the context. We utilize an induction approach to show this. Assume that V k ′ Γ (s) ≤ V π k ′ Γ (s) for all (k ′ , Γ, s) ∈ [k -1] × [H + 1] × S, (79a)</p></note>
		</body>
		</text>
</TEI>
