<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Distributionally Robust Q-Learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022 3rd Quarter (CY)</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10413185</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno>2640-3498</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zijian Liu</author><author>Qinxun Bai</author><author>Jose H. Blanchet</author><author>Perry Dong</author><author>Wei Xu</author><author>Zhengqing Zhou</author><author>Zhengyuan Zhou</author><author>Kamalika Chaudhuri</author><author>Stefanie Jegelka</author><author>Le Song</author><author>Csaba Szepesvari</author><author>Gang Niu</author><author>Sivan Sabato</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Reinforcement learning (RL) has demonstrated remarkable achievements in simulated environments. However, carrying this success to real environments requires the important attribute of robustness, which the existing RL algorithms often lack as they assume that the future deployment environment is the same as the training environment (i.e. simulator) in which the policy is learned. This assumption often does not hold due to the discrepancy between the simulator and the real environment and, as a result, and hence renders the learned policy fragile when deployed. In this paper, we propose a novel distributionally robust Q-learning algorithm that learns the best policy in the worst distributional perturbation of the environment. Our algorithm first transforms the infinite-dimensional learning problem (since the environment MDP perturbation lies in an infinite-dimensional space) into a finite-dimensional dual problem and subsequently uses a multi-level Monte-Carlo scheme to approximate the dual value using samples from the simulator. Despite the complexity, we show that the resulting distributionally robust Q-learning algorithm asymptotically converges to optimal worst-case policy, thus making it robust to future environment changes. Simulation results further demonstrate its strong empirical robustness.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Reinforcement learning (RL) has demonstrated remarkable empirical success in simulated environments, with applications spanning domains such as robotics (Kober et al.,   1 New York University, Stern School of Business 2 Horizon Robotics Inc., CA, USA 3 Department of Management Science and Engineering, Stanford University, CA, USA 4 Department of Electrical Engineering and Computer Sciences, UC Berkeley, CA, USA 5 Arena Technologies. Correspondence to: Zhengyuan Zhou &lt;zzhou@stern.nyu.edu&gt;.</p><p>Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).</p><p>2013; <ref type="bibr">Gu et al., 2017)</ref>, computer vision <ref type="bibr">(Sadeghi &amp; Levine, 2016;</ref><ref type="bibr">Huang et al., 2017)</ref>, finance <ref type="bibr">(Li et al., 2009;</ref><ref type="bibr">Choi et al., 2009;</ref><ref type="bibr">Deng et al., 2017)</ref> and games <ref type="bibr">(Silver et al., 2016;</ref><ref type="bibr">2018)</ref>. More recently, reinforcement learning has also been applied to economic domains such as personalized promotions in revenue management, inventory control in supply chains and scheduling in queueing networks. However, carrying this success to real environments requires an attribute that is often missing in the existing literature on policy learning: robustness, because existing RL algorithms often make the implicit assumption that the environment in which the policy is trained will be the same as the environment in which the policy will be deployed. In other words, a policy is evaluated in the same environment from which the training data has been generated.</p><p>While the above assumption is arguably a good starting point 1 for developing a rigorous understanding of algorithmic performance of policy learning, it does not capture the complexity of real-world applications because the discrepancy between training and deployment environments is often common and hard to anticipate (and hence account for) in advance. Two common sources of discrepancies are:</p><p>1. Simulator model mis-specification. In many reinforcement learning (RL) applications, a policy is often first trained in a simulator before being deployed in a real environment. However, simulator models often cannot capture the full complexity of the real environment, and hence will be mis-specified. Further, it is hard to know exactly how the real environment differs from the simulator (otherwise, the simulator would have been augmented/modified to account for that). 2. Environment shifts. The underlying environment may shift due to either non-stationarity or a different deployment environment (for the same task). As an example of the latter, a personalized content recommendation 1 Much like supervised learning was first developed under the same assumption, before adversarial training was recognized as a valuable tool to make empirical predictions robust; see, for instance, <ref type="bibr">(Sinha et al., 2018;</ref><ref type="bibr">Goodfellow et al., 2014;</ref><ref type="bibr">Ganin et al., 2016;</ref><ref type="bibr">Zhang et al., 2019;</ref><ref type="bibr">Tram&#232;r et al., 2017)</ref>. The topic of domain adaptation <ref type="bibr">(Shimodaira, 2000;</ref><ref type="bibr">Saerens et al., 2002;</ref><ref type="bibr">Ben-David et al., 2010;</ref><ref type="bibr">Courty et al., 2016;</ref><ref type="bibr">Ganin et al., 2016;</ref><ref type="bibr">Sun et al., 2020;</ref><ref type="bibr">Arjovsky et al., 2019;</ref><ref type="bibr">Wang &amp; Deng, 2018;</ref><ref type="bibr">Sagawa et al., 2019)</ref> is another approach to address covariate shifts in supervised learning, but the new domain is often assumed to be known. engine (learned from existing user browsing data collected in one region or market) may need to be deployed in a different region or market, with different population characteristics. Another example occurs in robotics, where a robot trained to perform certain maneuvers (such as walking <ref type="bibr">(Schulman et al., 2013)</ref> or folding laundry <ref type="bibr">(Maitin-Shepard et al., 2010)</ref>) in an environment can fail catastrophically <ref type="bibr">(Drew, 2015)</ref> in a slightly different environment, where the terrain landscape (in walking) is slightly altered or the laundry object (in laundry folding) is positioned differently.</p><p>As a result, to learn an effective policy in practice, one must be cognizant of such discrepancies and take them into account during the training stage in order to learn a policy that is robust to the unknown environment changes that cannot be avoided and are difficult to know in advance. Whereas distributionally robust learning in the simpler supervised learning setting has been extensively studied in the past decade (see Section 1.1), an emerging literature has only recently been initiated to study this problem in the RL context <ref type="bibr">(Si et al., 2020b;</ref><ref type="bibr">Zhou et al., 2021;</ref><ref type="bibr">Yang et al., 2021;</ref><ref type="bibr">Kishan &amp; Kalathil, 2021)</ref>. In particular, a common and natural formulation adopted therein is to consider distributional shifts for rewards and/or transition probabilities (defined via different divergence measures or distances between probability distributions), and to learn from data (collected from some generative model) the best policy under the worstcase distributional shift, which thus carries a certain level of robustness to the unknown environment shifts.</p><p>Currently, all the existing distributionally robust policy learning algorithms mentioned above <ref type="bibr">(Si et al., 2020b;</ref><ref type="bibr">a;</ref><ref type="bibr">Zhou et al., 2021;</ref><ref type="bibr">Yang et al., 2021;</ref><ref type="bibr">Kishan &amp; Kalathil, 2021)</ref> are model-based, which estimate the underlying MDP first before provisioning some policy from it. Although model-based methods are often more sampleefficient and easier to analyze, their drawbacks are also wellunderstood <ref type="bibr">(Sutton &amp; Barto, 2018;</ref><ref type="bibr">Franc &#184;ois-Lavet et al., 2018)</ref>: they are computationally intensive; they require more memory to store MDP models and often do not generalize well to non-tabular RL settings. These issues limit the practical applicability of model-based algorithms, which stand in contrast to model-free algorithms that learn to select actions without first learning an MDP model. Such methods are often more computationally efficient, have less storage overhead, and better generalize to RL with function approximation. In particular, Q-learning <ref type="bibr">(Watkins &amp; Dayan, 1992)</ref>, as the prototypical model-free learning algorithm, has widely been both studied theoretically and deployed in practical applications. However, Q-learning is not robust (as demonstrated in our simulations), and the policy learned by Q-learning in one environment can perform poorly in another under a worst-case shift (with bounded magnitude). As such, we are naturally led to the following research ques-tion:</p><p>Can we design a variant of Q-Learning that is distributionally robust?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Related Work</head><p>Robustness has been studied in several settings that are related to (but different from) our investigation. For instance, a rich literature has explored distributionally robust learning and optimization in supervised learning <ref type="bibr">(Bertsimas &amp; Sim, 2004;</ref><ref type="bibr">Delage &amp; Ye, 2010;</ref><ref type="bibr">Hu &amp; Hong, 2013;</ref><ref type="bibr">Shafieezadeh-Abadeh et al., 2015;</ref><ref type="bibr">Bayraksan &amp; Love, 2015;</ref><ref type="bibr">Gao &amp; Kleywegt, 2016;</ref><ref type="bibr">Namkoong &amp; Duchi, 2016;</ref><ref type="bibr">Duchi et al., 2016;</ref><ref type="bibr">Staib &amp; Jegelka, 2017;</ref><ref type="bibr">Shapiro, 2017;</ref><ref type="bibr">Lam &amp; Zhou, 2017;</ref><ref type="bibr">Chen et al., 2018;</ref><ref type="bibr">Volpi et al., 2018;</ref><ref type="bibr">Lee &amp; Raginsky, 2018;</ref><ref type="bibr">Nguyen et al., 2018;</ref><ref type="bibr">Yang, 2020;</ref><ref type="bibr">Mohajerin Esfahani &amp; Kuhn, 2018;</ref><ref type="bibr">Zhao &amp; Jiang, 2017;</ref><ref type="bibr">Abadeh et al., 2018;</ref><ref type="bibr">Zhao &amp; Guan, 2018;</ref><ref type="bibr">Gao et al., 2018;</ref><ref type="bibr">Ghosh &amp; Lam, 2019;</ref><ref type="bibr">Blanchet &amp; Murthy, 2019;</ref><ref type="bibr">Duchi &amp; Namkoong, 2018;</ref><ref type="bibr">Lam, 2019;</ref><ref type="bibr">Duchi et al., 2019)</ref>, where the underlying testing distribution is still the same as the training distribution, and the learner merely uses the distributional uncertainty (around the empirical distribution) to guard against over-generalization due to lack of data. As such, the statistical learning results in this area focus on the setting where the distributional uncertainty decreases with the sample size (and as such, is part of the algorithm rather than the environment); further, it has been well recognized that under certain conditions, this approach is formally equivalent to regularization <ref type="bibr">(Duchi &amp; Namkoong, 2019;</ref><ref type="bibr">Duchi et al., 2016;</ref><ref type="bibr">Gao et al., 2017;</ref><ref type="bibr">Shafieezadeh-Abadeh et al., 2019)</ref>, which prevents over-fitting in the small data regime.</p><p>On the other hand, learning predictive rules in testing distributions that are different from -and often perturbations oftraining distributions have also been studied in <ref type="bibr">(Sinha et al., 2018;</ref><ref type="bibr">Goodfellow et al., 2014;</ref><ref type="bibr">Ganin et al., 2016;</ref><ref type="bibr">Zhang et al., 2019;</ref><ref type="bibr">Tram&#232;r et al., 2017)</ref>, where the training procedure is robustified by first perturbing the original dataset with some synthesized noise before solving a empirical risk minimization problem, which has been observed to work well in a robust manner.</p><p>Going beyond the supervised learning setting, a related area to ours is distributionally robust Markov decision processes (MDPs) <ref type="bibr">(Gonz&#225;lez-Trejo et al., 2002;</ref><ref type="bibr">Iyengar, 2005;</ref><ref type="bibr">Xu &amp; Mannor, 2010;</ref><ref type="bibr">El Ghaoui &amp; Nilim, 2005;</ref><ref type="bibr">Wiesemann et al., 2013;</ref><ref type="bibr">Wolff et al., 2012;</ref><ref type="bibr">Mannor et al., 2016;</ref><ref type="bibr">Morimoto &amp; Doya, 2005;</ref><ref type="bibr">Yang, 2020)</ref>. This line of work has studied the known environment MDP setting (hence no learning) and has mainly focused on the computational issues 2 . Closest to our work is the recent distributionally robust policy learning work already mentioned <ref type="bibr">(Si et al., 2020b;</ref><ref type="bibr">Zhou et al., 2021;</ref><ref type="bibr">Yang et al., 2021;</ref><ref type="bibr">Kishan &amp; Kalathil, 2021)</ref>: <ref type="bibr">(Si et al., 2020b)</ref> studies the special case of distributionally robust policy learning in contextual bandits (i.e. horizon is 1), while the other three study the infinite-horizon discounted RL setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Our Contributions</head><p>We design a distributionally robust Q-learning algorithm that has two features beyond the standard Q-learning algorithm. The first feature lies in the new values that the algorithm aims to learn: instead of Q-values, the algorithm now aims to estimate the distributionally robust Q-values. We achieve this by leveraging strong duality to transform the distributionally robust Q-values (an infinite-dimensional object since the distributional uncertainty set is infinitedimensional) into a finite-dimensional quantity, also known as the distributionally robust Bellman operator. Second, we design a novel multi-level Monte-Carlo estimator to unbiasedly estimate the distributionally robust Bellman operator. Through a careful analysis of our estimator's bias and variance (see Theorem 3.7 and Theorem 3.8), we show that the distributionally robust Q-learning algorithm asymptotically converges to optimal distributionally robust policy (Theorem 3.10). As such, our results provide an initial affirmative answer to the open question raised above, and the convergence result provides a distributionally robust counterpart to the well-known asymptotic convergence of Q-learning <ref type="bibr">(Jaakkola et al., 1994)</ref>. Finally, we provide simulation results to demonstrate the robustness of the policy learned by the proposed distributionally-robust Q-learning algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">Comparison with Existing Work</head><p>The distributionally robust Bellman equation was obtained before in <ref type="bibr">(Iyengar, 2005)</ref>. However, this is merely a tool for us to solve the problem in the dual space. Our main contribution is to design a distributionally robust Q-learning algorithm based on it. <ref type="bibr">(Xu &amp; Mannor, 2010)</ref> does not propose any new algorithm for learning a distributionally robust policy but only proved some properties of distributionally robust MDP. <ref type="bibr">(Yang, 2020)</ref> considers the Wasserstein distance and uses a model-based algorithm that is totally different from ours. Our algorithm is the first model-free algorithm ever developed on this problem. Prior to our work, it is not clear at all whether Q-learning can be made robust, since all the existing algorithms in this area are model-based.</p><p>computation can be done efficiently in polynomial time, although depending on how such uncertainty sets are specified, some of the proposed algorithms require oracle access to solving an infinitedimensional problem, and hence are infeasible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Distributionally Robust Policy Learning</head><p>with a Simulator</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Standard Policy Learning</head><p>Let M = (S, A, P, R, &#947;) be a tabular RL environment, where S and A are finite state space and action space respectively, R : S &#215; A &#8594; P(R &#8805;0 ) (the set of random variables that are supported on R &#8805;0 ) is the randomized reward function, P = {p s,a (&#8226;)} (s,a)&#8712;S&#215;A is the transition model, and &#947; &#8712; (0, 1) is the discount factor. We assume that the transition is Markovian, i.e., at each state s &#8712; S, if action a &#8712; A is chosen, then the subsequent state is determined by the conditional distribution p s,a (&#8226;) = p(&#8226;|s, a). The decision maker will therefore receive a randomized reward r(s, a). The value function V &#960; (s) provides the expected cumulative discounted reward under the policy &#960; with initial state s &#8712; S,</p><p>Hence, the optimal value function is:</p><p>where &#928; denotes the class of random policies. It is well known that the optimal value function is the solution of the following Bellman's equation:</p><p>From here we define the optimal Q-function, Q * as</p><p>Throughout this paper, we impose the following assumption on the rewards. Assumption 2.1. (Bounded rewards) For any (s, a) &#8712; S &#215; A, r(s, a) &#8712; [0, R max ].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">A Distributionally Robust Formulation</head><p>In reality, the transition model P and rewards R in M are subjected to change over time, this motivates us to learn a policy that is robust to certain perturbation in the environment. In particular, we consider the setting of distributionally robust RL, where both transition probabilities and rewards might be perturbed w.r.t. the Kullback-Leibler (KL) divergence D KL (P &#8741;Q) = log dP dQ dP whenever P &#8810; Q (P is absolutely continuous with respect to Q). Remark 2.2. We pick KL divergence simply because it is one common divergence measure that is easy to analyze (given the already complicated nature of the problem). Our results can also be generalized to f -divergence.</p><p>In the original environment, let P 0 = {p 0 s,a } (s,a)&#8712;S&#215;A be the transition probabilities, &#957; 0 be the joint distribution of {r(s, a)} (s,a)&#8712;S&#215;A , where r(s, a) &#8764; &#957; 0 s,a (marginal distribution with respect to (s, a)). To construct the distributional uncertainty setting, for each (s, a) &#8712; S &#215; A, we define robust KL balls that are centered at p 0 s,a and &#957; 0 s,a by</p><p>Here &#948; &gt; 0 is the level of distributional robustness, and &#8710; |S| stands for the |S|-1 dimensional probability simplex. Now we are able to build the uncertainty set P(&#948;) by the Cartesian product of P s,a (&#948;) for each (s, a) &#8712; S &#215; A. This type of uncertainty set is called (s, a)-rectangular set in standard literature <ref type="bibr">(Wiesemann et al., 2013)</ref>. Similarly we define R(&#948;) by the Cartesian product of R s,a (&#948;) for each (s, a) &#8712; S &#215; A. In the distributionally robust framework, the adversarial player is assumed to pick the worst-case transition model and rewards that minimize the expected cumulative discounted reward. To be clear, we define the distributionally robust value function as follows.</p><p>Definition 2.3. Given &#948; &gt; 0 and policy &#960; &#8712; &#928;, the distributionally robust value function V rob,&#960; &#948; is defined as:</p><p>(1)</p><p>Following the definition, V rob,&#960; &#948; measures the quality of a policy &#960; by computing its performance in the worst-case environment among the set of all possible environments that perturb the original transition P 0 and reward distribution &#957; 0 under a &#948;-KL ball. The optimal distributionally robust value function is therefore defined by</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Strong Duality</head><p>Following the well known results in <ref type="bibr">(Iyengar, 2005;</ref><ref type="bibr">Xu &amp; Mannor, 2010)</ref>, we can write down the distributionally robust dynamic programing for the distributionally robust value function V rob,&#960; &#948; in Equation (1) as follows:</p><p>(2)</p><p>Moreover, we can write down the distributionally robust Bellman's equation for the optimal value function as follows:</p><p>(3)</p><p>Note that both Equation (2) and Equation ( <ref type="formula">3</ref>) are in general computationally intractable since they involve infinite dimensional optimization. To address this issue, we introduce the following strong duality lemma from distributionally robust optimization under KL-perturbation.</p><p>Lemma 2.4 ( <ref type="bibr">(Hu &amp; Hong, 2013)</ref>, Theorem 1). Suppose H(X) has finite moment generating function in the neighborhood of zero. Then for any &#948; &gt; 0, sup</p><p>By Lemma 2.4, we can transform the Equation (2) to the following equation.</p><p>As a direct consequence of the Equation (4) (note that the size of &#928; is finite in the tabular setting, see Theorem 3.2 in <ref type="bibr">(Iyengar, 2005)</ref> for a standard proof), the optimal distributionally robust value function V rob, * &#948; in fact satisfies the following distributionally robust Bellman's equation.</p><p>-&#946;&#948; .</p><p>(5)</p><p>The Q-learning algorithm determines the optimal Qfunction using point samples. Let &#960; be some random policy such that P(&#960;(s) = a) &gt; 0 for all state-action pairs (s, a) &#8712; S &#215; A. Suppose at time t, we draw a sample (s t , a t , r t , s &#8242; t ) from the environment according to the policy &#960;. Then, Q-learning uses the following update rules</p><p>where the step-sizes &#945; t will be properly chosen. The key reason why Q-learning works is that both r t and</p><p>Therefore we can use the stochastic approximation theorem to show that Q t converges to Q * with careful choice of step-sizes. For more details, please read <ref type="bibr">(Melo, 2001;</ref><ref type="bibr">Even-Dar et al., 2003)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Distributionally Robust Q-Learning</head><p>From Equation (3) and Equation ( <ref type="formula">5</ref>), we know the optimal distributionally robust Q</p><p>By analogy with the Bellman Operator, we can define the &#948;-distributionally robust Bellman Operator as follows:</p><p>Definition 3.1. Given &#948; &gt; 0 and Q &#8712; R S&#215;A , the &#948;distributionally robust Bellman Operator T rob &#948; : R S&#215;A &#8594; R S&#215;A is defined as</p><p>Remark 3.2. We define the &#948;-distributionally robust Bellman Operator by using its dual form. However, it may be more convenient to use the primal form</p><p>Our definition implies that Q rob, * &#948; is a fixed point of T rob &#948; . Now suppose we have a simulator that allows us to sample (r, s &#8242; ) from (&#957; 0 s,a , p 0 s,a ), can we come up with a nice unbiased estimator of T rob &#948; (Q)(s, a)? The plug-in estimator of T rob &#948; (Q)(s, a) is in fact biased (because of the non-linearity). For instance, if we just take one sample, say (s, a, r, s &#8242; ). The corresponding plug-in estimator of</p><p>To address this issue, we propose a new estimator of T rob &#948; by introducing the multilevel Monte-Carlo method <ref type="bibr">(Blanchet &amp; Glynn, 2015;</ref><ref type="bibr">Blanchet et al., 2019)</ref>. The formal description of our estimator is defined as follows:</p><p>Definition 3.3. Given &#948; &gt; 0, &#949; &#8712; (0, 0.5) and Q &#8712; R S&#215;A , the &#948;-distributionally robust estimator T rob &#948;,&#949; is defined as:</p><p>For R rob &#948; (s, a) and T rob &#948; (s, a), we firstly sample N &#8712; N from the distribution P(N = n) = p n = &#949;(1 -&#949;) n , then use the simulator to draw 2 N +1 samples (r i , s &#8242; i ) from (&#957; 0 s,a , p 0 s,a ). Finally we define</p><p>where</p><p>and</p><p>Using the estimator T rob &#948;,&#949; , our Distributionally Robust Q-Learning algorithm is summarized in Algorithm 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 Distributionally Robust Q-Learning</head><p>Input: Uncertainty radius &#948; &gt; 0, parameter &#949; &#8712; (0, 0.5).</p><p>) through Equation (11) and Equation ( <ref type="formula">12</ref>) respectively.</p><p>Remark 3.4. In Algorithm 1, we can choose any &#945; t which satisfies the Robbins-Monro Condition, i.e.,</p><p>&#948; is an exogenous variable that quantifies the level of conservatism, which we consider as pre-determined (and not part of the algorithm). That said, it can also be from past datasets if they are collected from different environments, in which case the shift can be estimated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Theoretical Guarantee</head><p>First of all, we introduce Lemma 3.6 which plays a critical role in the proof of convergence of Algorithm 1.</p><p>Lemma 3.6. Given &#948; &gt; 0, 0 &lt; &#947; &lt; 1, T rob &#948; is a &#947;contraction map w.r.t. the infinity norm.</p><p>By applying the primal form of T rob &#948; (Equation ( <ref type="formula">7</ref>)), we have</p><p>Consider any &#1013; &gt; 0, suppose p(&#1013;) &#8712; P s,a (&#948;) makes</p><p>Combining the previous part, we have &#8704;&#1013; &gt; 0,</p><p>which implies</p><p>By a similar argument, we also have</p><p>Hence, there is</p><p>Note that the above result is true for any (s, a) &#8712; S &#215; A, which implies</p><p>Next, we give the key theorem of our estimator T rob &#948;,&#949; . Theorem 3.7. Given &#948; &gt; 0. If Assumption 2.1 holds, then for any &#949; &#8712; (0, 0.5), T rob &#948;,&#949; is an unbiased estimator of T rob &#948; , i.e., &#8704;Q &#8712; R S&#215;A , (s, a) &#8712; S &#215; A,</p><p>Due to the page limitation, we defer the whole proof of Theorem 3.7 into Appendix B and give a proof outline here. By the law of total probability, we first can get</p><p>One key technical point helping us to move on is to understand</p><p>s,a e -r(s,a)/&#945; , and &#946; * defined in a similar way. By employing different events defined in Appendix B, we finally can derive</p><p>which completes the unbiasedness of R rob &#948; (s, a). Similar result will give us the unbiasedness of T rob &#948; (Q)(s, a).</p><p>Besides, we introduce another key property of T rob &#948;,&#949; . Theorem 3.8. Given &#948; &gt; 0, &#949; &#8712; (0, 0.5). If Assumption 2.1 holds, then there exists a constant C(&#948;, &#949;, &#957; 0 , P 0 ) &gt; 0 such that for any Q &#8712; R S&#215;A and (s, a) &#8712; S &#215; A,</p><p>The proof of Theorem 3.8 is also deferred to Appendix C. Remark 3.9. For Theorem 3.8, our setting is different from <ref type="bibr">(Blanchet &amp; Glynn, 2015;</ref><ref type="bibr">Blanchet et al., 2019)</ref>, which requires a careful analysis and a different proof strategy.</p><p>One key difference is that expectation terms in Equation ( <ref type="formula">6</ref>) are inside the log function. However, in <ref type="bibr">(Blanchet &amp; Glynn, 2015;</ref><ref type="bibr">Blanchet et al., 2019)</ref>, expectations are always at the most outside. The non-linearity of the log function makes the analysis more difficult. Another technical point is we need to understand the optimizers of T rob &#948; , i.e.,</p><p>where proof techniques are different for &#945; * , &#946; * = 0 and &#945; * , &#946; * &#824; = 0.</p><p>Finally, combing previous results, we are able to establish the following convergence guarantee for Algorithm 1.</p><p>Theorem 3.10. Given &#948; &gt; 0. If Assumption 2.1 holds, then for any &#949; &#8712; (0, 0.5), Q rob &#948;,t in Algorithm 1 will converge to Q rob, * &#948; as t &#8594; &#8734;.</p><p>Proof. Following the proof in <ref type="bibr">(Melo, 2001;</ref><ref type="bibr">Even-Dar et al., 2003)</ref>, we can rewrite the update rule of Algorithm 1 as</p><p>where</p><p>. Combining Lemma 3.6, Theorem 3.7 and the fact that</p><p>Next, by Theorem 3.8, we have</p><p>Also note that &#945; t satisfies the Robbins-Monro Condition, these intermediate results then yield the convergence of Algorithm 1 by Theorem 1 of <ref type="bibr">(Jaakkola et al., 1994)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Numerical Experiments</head><p>We use a supply chain model to test Algorithm 1 in our numerical experiments. In the supply chain model, the state space In our experiments, due to the limit computation resources, we fix n = 10. Besides, we take h = 1, p = 2, k = 3 and set the discount factor &#947; = 0.9 with starting from s 1 = 0.</p><p>By using standard Q-learning, one can find the non distributionally robust, optimal policy &#960; * for the non perturbed problem, i.e., &#948; = 0, satisfying</p><p>For the distributionally robust setting, we can think the probability distribution of d t is no longer a uniform distribution anymore, which is quite reasonable in reality.</p><p>In the simulation, we set &#948; = 1 as the perturbation parameter. At the k-th step of Algorithm 1, we set the learning rate &#945; k be 1 1+(1-&#947;)(k-1) to satisfy the Robbins-Monro Condition. We will treat Algorithm 1 as having converged when the infinity norm of the difference between the updated value and the old value is no greater than tolerance = 0.05.</p><p>For the parameter &#949; used in our estimator, we consider &#949; &#8712; {0.49, 0.499.0.5, 0.6}. Note that our Theorem 3.10 only ensures the convergence for &#949; &#8712; (0, 0.5), but we will test some values out of this range. Besides, the reason why we only choose {0.49, 0.499} rather than adding some other smaller &#949; &#8712; (0.0.5) is that from the construction of our estimator, we can find smaller &#949; will lead to a huge number of samples with high probability, to avoid this problem, we only test &#949; close to 0.5. To reduce the effect of variance, for every &#949;, we will run 5 times and use the averaged value as the final output Q rob &#948;,&#949; . Finally, we use Q rob &#948;,&#949; to find the greedy policy</p><p>for every &#949;. The output greedy policy &#960; rob &#948;,&#949; for every &#949; is listed in Table <ref type="table">1</ref>. We can see all policies are the same as</p><p>In Table <ref type="table">2</ref>, we list the averaged number of iterations and averaged number of samples used for every &#949;. Combing our results in Tables <ref type="table">1</ref> and<ref type="table">2</ref>, it shows that Algorithm 1 may converge with even less samples when &#949; / &#8712; (0, 0.5) . Now we define a perturbed uniform distribution with parameter m and b as follows:</p><p>With the perturbed distribution, we test our distributionally robuast policy &#960; rob &#948; and non distributionally robuast policy &#960; * for b &#8712; {1, 1.5, 2, 2.5} (Note that b can not be too large, otherwise there will exist some pair (s, a) such that D KL (p s,a &#8741;p 0 s,a ) &gt; &#948; or D KL (&#957; s,a &#8741;&#957; 0 s,a ) &gt; &#948;) and every m &#8712; [n -1]. We report the total cost averaged over 2000 runs for different b in Figures <ref type="figure">1 to 4</ref>.</p><p>With varying test probabilities, our distributionally robust policy performs better in worst cases when the probability distribution of demand is centered in {5, 6, 7} instead of the uniform distribution, demonstrating again the effectiveness of our proposed distributionally robust formulations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this paper, we proposed a novel unbiased estimator of the distributionally robust Bellman Operator. By using the  Several open problems suggest itself. First, although asymptotic convergence is desirable, it would also be interesting to obtain finite-sample guarantees for distributionally robust Q-learning algorithm. We believe such a result would require a completely different line of analysis, whose scope goes significantly beyond this paper. Second, this paper focused on the infinite-horizon discounted RL setting. A much more challenging but also highly useful setting to consider is the (infinite horizon) average reward RL <ref type="bibr">(Dong et al., 2021)</ref>. RL in this setting is less explored, and distributionally robust policy learning in this setting poses significant technical challenges. Finally, another important direction of research is to generalize the results to high-dimensional state settings <ref type="bibr">(Ren &amp; Zhou, 2020)</ref>, where the intrinsic dimension is low. Data efficiency in such settings will be of particular importance. We look forward to these problems being addressed by the emerging distributionally robust reinforcement learning community.</p><p>First of all, we introduce two ancillary concentration inequalities. Lemma A.1 ( <ref type="bibr">(Fournier &amp; Guillin, 2015)</ref>, Concentration inequality for Wasserstein distance ). For &#181; &#8712; P(R), we consider an i.i.d. sequence (X k ) k&#8805;1 of &#181;-distributed random variables and, for all n &#8805; 1, the empirical measure</p><p>Assume that there exists &#947; &gt; 0 such that E 2,&#947; (&#181;) := R exp(&#947;|x| 2 )&#181;(dx) &lt; &#8734;. Then for all n &#8805; 1, all x &gt; 0,</p><p>where the Wasserstein distance W(&#181; n , &#181;) is defined by</p><p>|x -y|&#960;(dx, dy) , and the positive constant C and c depends only on &#947; and E 2,&#947; (&#181;). Lemma A.2 (Hoeffding's inequality).</p><p>Then for every t &gt; 0,</p><p>B. Missing Proof of Theorem 3.7</p><p>Proof. Fix a pair (s, a) &#8712; S &#215; A, we first prove R rob &#948; (s, a) defined in Equation ( <ref type="formula">9</ref>) is an unbiased estimator of sup &#945;&#8805;0 -&#945; log E &#957; 0 s,a e -r(s,a)/&#945; -&#945;&#948; which forms the first part of our &#948;-distributionally robust Bellman Operator (see Equation ( <ref type="formula">6</ref>)). Let g(&#945;) = -&#945; log E &#957; 0 s,a e -r(s,a)/&#945; -&#945;&#948;.</p><p>) is defined by using the subset of 2 N samples in which the index of every element is odd (resp. even). We use g 2 N to represent the function defined by replacing &#957; 0 s,a by &#957; 2 N in g and &#945; * 2 N = argmax &#945;&#8805;0 g 2 N (&#945;). Similarly, we also have</p><p>We only need to show</p><p>Based on our Assumption 2.1, we can give an upper bound of &#945; * by observing</p><p>By a similar argument, &#945; * 2 n &#8804; R max /&#948; also holds. Besides, from above derivation, we can find</p><p>holds by a similar argument. Starting from here, we split the proof into two cases: &#945; * = 0 and &#945; * &gt; 0.</p><p>&#8226; &#945; * = 0. By proposition 2 in <ref type="bibr">(Hu &amp; Hong, 2013)</ref>, &#945; * = 0 implies &#955; := &#957; 0 s,a (r(s, a) = ess inf r(s, a)) &gt; 0. We first introduce the following two events:</p><p>Note that we have the following result for G 2 n :</p><p>By Lemma 4 in <ref type="bibr">(Zhou et al., 2021)</ref>, &#8704;&#1013; &gt; 0, there exists a constant N (&#1013;, &#948;), such that &#8704;n &#8805; N (&#1013;, &#948;),</p><p>where the first inequality is by Equation ( <ref type="formula">14</ref>) and Equation ( <ref type="formula">15</ref>). Hence we know</p><p>Using above result, Equation ( <ref type="formula">13</ref>) holds immediately in this case.</p><p>&#8226; &#945; * &gt; 0. We define the following event in this case,</p><p>By Lemma 4 in <ref type="bibr">(Zhou et al., 2021)</ref>, &#8704;&#1013; &gt; 0, there exists a constant N &#8242; (&#1013;), such that once n &#8805; N &#8242; (&#1013;), we have</p><p>Observe that E &#957; 0 s,a e -r(s,a)/&#945; &#8805; e -Rmax/&#945; and E &#957; 2 n e -r(s,a)/&#945; &#8805; e -Rmax/&#945; hold, combining the Lipschitz property of log x when x is bounded below, we know</p><p>Then we have</p><p>For any &#945; &#8712; &#945; * 2 , Rmax &#948; , the function e -x/&#945; is a Lipschitz function on [0, &#8734;), and the Lipschitz constant is bounded by 2/&#945; * . Hence, by the dual representation of Wasserstein distance, we have</p><p>where the Wasserstein distance W(&#957; 0 s,a , &#957; 2 n ) is defined by</p><p>|x -y|d&#960;(x, y) .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Now we know</head><p>By Lemma A.1 (note that we assume r is bounded, so the condition for Lemma A.1 is satisfied automatically), there exists c, C &gt; 0 such that</p><p>Note that we choose &#1013; &gt; 0 arbitrarily, so there is</p><p>Hence we know Equation ( <ref type="formula">13</ref>) holds in this case.</p><p>Now we can conclude that R rob &#948; (s, a) is an unbiased estimator of sup &#945;&#8805;0 -&#945; log E &#957; 0 s,a e -r(s,a)/&#945; -&#945;&#948; .</p><p>From here we give the proof that T rob &#948; (Q)(s, a) defined in Equation ( <ref type="formula">10</ref>) is an unbiased estimator of sup &#946;&#8805;0 -&#946; log E p 0 s,a e -max b&#8712;A Q(s &#8242; ,b)/&#946; -&#946;&#948; which constructs the remaining part of our &#948;-distributionally robust Bellman Operator. Given Q &#8712; R S&#215;A and (s, a) &#8712; S &#215; A. We define V (s &#8242; ) = max b&#8712;A Q(s &#8242; , b) for any s &#8242; &#8712; S and</p><p>) is defined by using the subset of 2 N samples in which the index of every element is odd (resp. even). We use f 2 N to represent the function defined by replacing p 0 s,a by p 2 N in f and &#946; * 2 N = argmax &#946;&#8805;0 f 2 N (&#946;). Similarly, we also have</p><p>. Following the similar proof for g used in the previous part, we will have</p><p>Hence we only need to show</p><p>We can observe that both &#946; * and &#946; * 2 n are no bigger than 2&#8741;Q&#8741; &#8734; /&#948;. For &#946; * , this is because</p><p>If we apply the similar argument to &#946; * 2 n , we can get the same result. Besides, from the above derivation, we can see</p><p>also holds with a similar reason. Now we define the following two key events,</p><p>Note the facts P [E n 2 ] = 1 and</p><p>where the last inequality is right because of Lemma A.2. With the above proposition, we can start to show that Equation ( <ref type="formula">17</ref>) is true, we consider the following decomposition</p><p>, where the last inequality holds due to Equation (18) and Equation ( <ref type="formula">19</ref>). Now denote min s &#8242; &#8712;S,p 0 s,a (s &#8242; )&gt;0 V (s &#8242; ) as v. Note that we have shown 0 &#8804; &#946; * , &#946; * 2 n &#8804; 2&#8741;Q&#8741; &#8734; /&#948;. Observe that under T 2 n and E 2 n , &#8704;&#946; &#8712; [0, 2&#8741;Q&#8741; &#8734; /&#948;], there are</p><p>and</p><p>Then under T 2 n and E 2 n , we know</p><p>The third inequality is right is by the Lipschitz property for log x when x is bounded from below. Finally we have</p><p>where the second inequality is by Lemma A.2. Hence we have</p><p>This is enough to conclude</p><p>Note that Equation (21) implies that Equation ( <ref type="formula">17</ref>) is true. Now we complete our partial proof, i.e., T rob &#948; (Q)(s, a) is an unbiased estimator of sup &#946;&#8805;0 -&#946; log E p 0 s,a e -max b&#8712;A Q(s &#8242; ,b)/&#946; -&#946;&#948; .</p><p>Finally we complete the proof that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Missing Proof of Theorem 3.8</head><p>Proof. Following the same notations defined in the proof of Theorem 3.7 (cf. Appendix B), we have already known for any &#949; &#8712; (0, 0.5), Q &#8712; R S&#215;A and (s, a) &#8712; S &#215; A,</p><p>By the definition of variance we have</p><p>We first analysis the term Var[ R rob &#948; (s, a)] by noticing</p><p>Like the proof of Theorem 3.7, we also consider following two cases</p><p>&#8226; &#945; * = 0. By Proposition 2 in <ref type="bibr">(Hu &amp; Hong, 2013)</ref>, &#945; * = 0 if and only if &#955; := &#957; 0 s,a (r(s, a) = ess inf r(s, a)) &gt; 0 and &#955; &#8805; e -&#948; . Since &#948; is chosen by us, we can ignore the edge case &#955; = e -&#948; by introducing randomness on &#948;. Now we define the following three events: <ref type="bibr">(Hu &amp; Hong, 2013</ref>) implies that we have &#945; * <ref type="bibr">, a)</ref>. The same argument can be applied to the subscript 2 n E and 2 n . Besides, note that</p><p>Note the following two bounds</p><p>The first bound is due to Lemma A.2 again. Finally we have</p><p>Then if we choose p n = &#949;(1 -&#949;) n , we know</p><p>Note that &#955; depends on (s, a). However, the number of pair (s, a) is finite, hence we can find a uniform constant bound K 1 (&#948;, &#949;, &#957; 0 ) &#8805; K &#8242; 1 (&#955;, &#948;, &#949;, &#957; 0 ) for any (s, a) &#8712; S &#215; A which makes &#945; * = 0.</p><p>&#8226; &#945; * &gt; 0. In this case, we follow the way proposed by <ref type="bibr">(Zhou et al., 2021)</ref>. Define</p><p>where &#945; = &#945; * /2 and &#945; = R max /&#948;. Besides, we introduce the following event</p><p>Note that by Lemma 4 in <ref type="bibr">(Zhou et al., 2021)</ref>, under where the last inequality holds by a similar argument for Equation ( <ref type="formula">16</ref>). Using a similar calculation used in the proof of Theorem 3.7, we can find</p><p>Besides, by Lemma 4 in <ref type="bibr">(Zhou et al., 2021)</ref>, we know P[F c 2 n ] = O(e -2 n ), Besides, note p n = &#949;(1-&#949;) n for &#949; &#8712; (0, 0.5). Since we are in the tabular setting ,we can find a constant K 2 (&#948;, &#949;, &#957; 0 ) &gt; K &#8242; 2 (&#945; * , &#948;, &#949;, &#957; 0 ) for any (s, a) which makes &#945; * &gt; 0.</p><p>By the previous two cases, we know E[( R rob &#948; (s, a) -g(&#945; * )) 2 ] &#8804; max(K 1 (&#948;, &#949;, &#957; 0 ), K 2 (&#948;, &#949;, &#957; 0 )). Now we start to analysis Var[ T rob &#948; (Q)(s, a)] = E[( T rob &#948; (Q)(s, a) -f (&#946; * )) 2 ]. By the similar approach, we can find</p><p>Use the same notations we defined in proof of Theorem 3.7, we can find 2 n ] = 0 from the proof of Theorem 3.7. Note that p n = &#949;(1 -&#949;) n for &#949; &#8712; (0, 0.5), we have</p><p>Now we can find a uniform bound K 3 (&#948;, &#949;, P 0 ) &gt; 0 such that</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Thues we can see</head><p>Var[ T rob &#948;,&#949; (Q)(s, a)] &#8804; 2 max(K 1 (&#948;, &#949;, &#957; 0 ), K 2 (&#948;, &#949;, &#957; 0 )) + 2&#947; 2 K 3 (&#948;, &#949;, P 0 )(1 + &#8741;Q&#8741; 2 &#8734; ) &#8804; C(&#948;, &#949;, &#957; 0 , P 0 )(1 + &#8741;Q&#8741; 2 &#8734; ), where C(&#948;, &#949;, &#957; 0 , P 0 ) &gt; 0 is some uniform constant. Hence we complete the proof.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>For instance, they have characterized various types of uncertainty sets, and have shown that for almost all of them, the optimal distributionally robust policy is NP-hard to compute. For rectangular uncertainty sets (to which our formulation belongs), such</p></note>
		</body>
		</text>
</TEI>
