<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Policy Gradient using Weak Derivatives for Reinforcement Learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10161725</idno>
					<idno type="doi">10.1109/CDC40024.2019.9029403</idno>
					<title level='j'>2019 IEEE 58th Conference on Decision and Control</title>

					<author>Sujay Bhatt</author><author>Alec Koppel</author><author>Vikram Krishnamurthy</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of score-function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and is shown to be O(1/ k); (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than those obtained using the popular score-function approach. Experiments on OpenAI gym pendulum environment illustrate the superior performance of the proposed algorithm.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Reinforcement Learning (RL) is a form of implicit stochastic adaptive control in which the optimal control policy is estimated without directly estimating the underlying model. This paper considers reinforcement learning for an infinite horizon discounted cost, continuous state Markov decision process (MDP). In an MDP, actions affect the Markovian state dynamics and result in rewards for the agent. The objective is to find a map from states to actions, also known as a policy, that accumulates the largest expected return. There are many approaches to estimating a policy: policy iteration and Q-learning <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref> (which operate in "value" space <ref type="bibr">[3]</ref>), and policy gradients <ref type="bibr">[4]</ref>, <ref type="bibr">[5]</ref> (which operate in policy space); see <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>.</p><p>Recently, policy-gradient algorithms have gained popularity due to their ability to address complex real-world RL problems with continuous state-action spaces. Given a parametrized policy space, usually designed to incorporate domain knowledge, policy-gradient algorithms update the policy parameters along an estimated ascent direction of the expected return. Depending on whether the expected reward (value function) is concave or non-concave, the parameters converge to a global optimum or merely a stationary point; for a comprehensive survey see <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>.</p><p>Typically, to compute the ascent direction in policy search <ref type="bibr">[10]</ref>, one employs the Policy Gradient Theorem <ref type="bibr">[7]</ref> to write the gradient as the product of two factors: the Q-function <ref type="foot">1</ref> and the score function (a likelihood ratio). 
This score function approach has yielded numerous policy search techniques <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>, <ref type="bibr">[7]</ref>, although the resulting gradient estimates suffer from high variance: the score function is a martingale, and so for a Markov process its variance is O(N) for N measurements. In pursuit of reduced variance, we propose replacing the score function with weak derivatives; see <ref type="bibr">[14]</ref> for a textbook treatment. <ref type="foot">2</ref> Weak and measure-valued derivatives have been used for real-time reinforcement learning of constrained average cost MDPs (with finite action spaces) in <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>. These papers derive constant step size policy gradient algorithms and show, analytically and via numerical examples, that substantial variance reduction can be achieved compared to the score function method; moreover, the optimal (randomized) policy can be tracked over time as the unknown constrained MDP parameters evolve.</p><p>In comparison to <ref type="bibr">[16]</ref>, <ref type="bibr">[20]</ref>, this paper considers off-line (decreasing step size) reinforcement learning for continuous state, continuous action, infinite horizon discounted cost MDPs when the underlying system can be simulated using statistically independent trials with different policies. To estimate the Q-function in the policy gradient <ref type="bibr">[7]</ref>, we use Monte Carlo roll-outs with random path lengths akin to <ref type="bibr">[21]</ref>, motivated by the fact that obtaining unbiased estimates of the continuous state-action Q-function in the infinite horizon case is otherwise challenging. 
The product of these terms yields a valid estimate of the overall policy gradient, as in <ref type="bibr">[7]</ref>.</p><p>Our main results are: 1) A decreasing step size policy gradient algorithm using the Jordan decomposition of the policy gradient. We establish that the resulting algorithm, named Policy Gradient with Jordan Decomposition (PG-JD), yields unbiased estimates of the gradient of the reward function. 2) We establish that the PG-JD algorithm converges to a stationary point of the parametrized value function almost surely under decreasing step-sizes. 3) We derive the iteration (and sample<ref type="foot">foot_2</ref>) complexity, which is O(1/√k), where k is the time step. This shows that the convergence rate is similar to that of stochastic gradient methods in non-convex settings. 4) We upper-bound the expected variance of the gradient estimates obtained using the PG-JD algorithm, which is shown to be lower than that of score function methods using Monte Carlo roll-outs with random path lengths, for common policy parametrizations.</p><p>The setup and problem formulation are discussed in Sec. II. The new policy gradient theorem using weak derivatives (Jordan decomposition) is derived in Sec. III. The algorithm to compute the stochastic gradient and the policy parameter update is given in Sec. IV. Convergence analysis of the stochastic gradient ascent algorithm and its statistical properties are derived in Sec. V. Numerical studies on the OpenAI gym pendulum environment are discussed in Sec. VI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. PROBLEM FORMULATION AND POLICY SEARCH</head><p>The problem of reinforcement learning is considered in the framework of a Markov Decision Process, defined as a tuple (X, A, T, r, γ) consisting of: the state space X ⊆ R p , a subset of Euclidean space with elements x ∈ X; the action space A ⊆ R q , a subset of Euclidean space with elements a ∈ A; the transition law T, a probability density function T(·|a, x) ∈ P(X) that assigns a next-state distribution to each action a taken in state x, where P(X) denotes the set of all probability measures on X; the reward function r(x, a), a real-valued function on the product space X × A; and the discount factor γ ∈ (0, 1), a parameter that scales the importance of future rewards.</p><p>A stochastic Markov policy μ = {μ k } is defined as a sequence of transition probabilities from X to A such that μ k (D(x)|x) = 1 for each x ∈ X and k = 0, 1, · · · . Here D maps each x ∈ X to the set of available actions D(x). Let Σ denote the class of stochastic Markov policies.</p><p>For an initial state x 0 and a stochastic Markov policy μ ∈ Σ, define the expected reward function</p><p>For an initial state x 0 and a Markov policy μ ∈ Σ, using the Ionescu-Tulcea theorem <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, define P x0 μ as</p><p>Here μ 0 ∈ P(X) is an atomic measure with μ 0 (x 0 ) = 1. The expectation E x0 μ in (1) is with respect to P x0 μ in (2). 
Our goal is to find the policy μ that maximizes the long-term reward accumulation, or value:</p><p>For the infinite horizon problem (3), it is sufficient <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>, <ref type="bibr">[23]</ref>, <ref type="bibr">[26]</ref> to restrict the class Σ of policies to the class Σ s ⊂ Σ of stationary stochastic Markov policies. A stationary stochastic Markov policy μ(= {μ}) ∈ Σ s is defined as a transition probability from X to A such that μ(D(x)|x) = 1 for each x ∈ X. To solve (3) we resort to direct policy search over the space of continuous stationary policies. It is convenient to parametrize the stationary policy μ(·|·) as μ θ (·|·) for θ ∈ Θ ⊆ R d , d ∈ N, and search over the space of θ. For example, consider the Gaussian policy μ θ (·|x) = N(θ ⊤ φ(x), σ 2 ). Here the function φ(·) is commonly referred to as the feature map and σ denotes the standard deviation. With a slight abuse of notation, the problem (3) can be reformulated in terms of finding a parameter vector θ to satisfy:</p><p>Here E x0 μ θ is the expectation with respect to the measure induced as in <ref type="bibr">(2)</ref> by the policy μ θ = {μ θ } and initial state x 0 .</p></div>
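As a concrete illustration of this parametrization, the following is a minimal Python sketch of sampling from μ θ (·|x) = N(θ ⊤ φ(x), σ 2 ); the feature map, dimensions, and numeric values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Minimal sketch of the parametrized Gaussian policy mu_theta(.|x) = N(theta^T phi(x), sigma^2).
# The feature map phi and the dimensions below are illustrative assumptions.

def phi(x):
    """Identity feature map, as used later in the pendulum experiment."""
    return x

def sample_action(theta, x, sigma=1.0, rng=None):
    """Draw a ~ mu_theta(.|x): a Gaussian centred at theta^T phi(x)."""
    rng = np.random.default_rng() if rng is None else rng
    mean = float(theta @ phi(x))
    return rng.normal(mean, sigma)

theta = np.array([0.5, -0.2, 0.1])
x = np.array([1.0, 0.0, 2.0])
a = sample_action(theta, x)
```

Policy search then reduces to ascending the value function in the finite-dimensional parameter θ.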
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. POLICY GRADIENT THEOREM VIA HAHN-JORDAN</head><p>The foundation of any valid policy search technique is a valid ascent direction on the value function with respect to the policy parameters. Classically, one may derive that the policy gradient decomposes into two factors: the action-value (Q) function and the score function <ref type="bibr">[4]</ref>. Here we establish that one may obviate the need for the "log trick" that gives rise to the score function by employing measure-valued differentiation, namely the Jordan decomposition of signed measures <ref type="bibr">[15]</ref>. To begin, define the Q-function as</p><p>The weak derivative of the signed measure ∇μ θ (·|x) using the Jordan decomposition is given as</p><p>Here the decomposed positive and negative component measures μ ⊕ θ (·|x) and μ ⊖ θ (·|x) are orthogonal in L 2 (see Example 1 below). The ergodic measure associated with the transition kernel T (·|x 0 , a 0 ) and policy</p><p>. Using this measure-valued (weak) derivative representation of the policy, we can write the gradient of the value function with respect to the policy parameters θ in an unusual way, given in the following theorem.</p><p>Result 1. [15] [Hahn Decomposition] Let μ be a finite signed measure on the measurable space (Ω, F). There exists a disjoint partition of the set Ω into Ω + and Ω -such that Ω = Ω + ∪ Ω -, μ(A) ≥ 0, ∀A ⊂ Ω + , and μ(B) ≤ 0, ∀B ⊂ Ω -.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Result 2. [15]</head><p>[Jordan Decomposition] Every finite signed measure μ has a unique decomposition into a difference μ = μ + − μ -of two finite non-negative measures μ + and μ -such that, for any Hahn decomposition (Ω + , Ω -) of μ, μ + (A) = μ(A ∩ Ω + ) and μ -(A) = −μ(A ∩ Ω -) for every measurable set A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 1. (Jordan Decomposition for Policy Gradients)</head><p>The policy gradient using the Jordan decomposition takes the form</p><p>where g(θ, x) is a normalizing constant that ensures μ ⊕ and μ ⊖ are valid measures.</p><p>Discussion: The proof can be found in <ref type="bibr">[27]</ref>. Theorem 1 is the policy gradient theorem using weak derivatives, specifically the Jordan decomposition. In Theorem 1, note that the Q-functions in the two expectations are the same, indicating that the model is unaffected by the measure decomposition; only the induced measures differ. The expression for the gradient in <ref type="bibr">(7)</ref> is a difference of two expectations and, unlike the method of score functions, it obviates the need for a score function term. Intuitively, this allows us to avoid computing the logarithm of the policy, which may amplify uninformative parts of the state-action space and needlessly increase the variance, and instead yields a sharp "perceptron-like" behavior. In subsequent sections we establish that this representation may indeed reduce variance, but the reduction intrinsically depends on the policy parameterization. Note that g(θ, x), for a given parameter θ and state x, is a constant, which makes the stochastic gradient easier to compute in Algorithm 2. Before continuing, we present a representative example; see <ref type="bibr">[14]</ref> for several more.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Example 1.</head><p>Consider a Gaussian policy μ θ (·|x) = N(θ ⊤ φ(x), σ 2 ), where the mean of the Gaussian distribution is modulated by the optimization parameter. The Jordan decomposition of the Gaussian policy can be derived as follows:</p><p>Here we may glean the normalizing constant g(θ, x) = φ(x)/(σ√(2π)),</p><p>and the positive and negative component measures are</p><p>Observe that μ ⊕ θ (·|x) and μ ⊖ θ (·|x) define Rayleigh<ref type="foot">foot_4</ref> policies. They are orthogonal in the sense that μ ⊕ θ (·|x) is supported on χ<ref type="foot">6</ref>(a &gt; θ ⊤ φ(x)) and μ ⊖ θ (·|x) is supported on χ(a &lt; θ ⊤ φ(x)).</p></div>
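Example 1 can be checked numerically. The sketch below assumes the decomposition stated above: the derivative of N(m, σ 2 ) with respect to its mean m is (1/(σ√(2π)))·(μ ⊕ − μ ⊖ ), where μ ⊕ and μ ⊖ are Rayleigh distributions shifted above and below m; the test function f(a) = a 2 is an illustrative stand-in for the Q-function.

```python
import numpy as np

# Sketch of Example 1: the weak derivative of a Gaussian N(m, sigma^2) with
# respect to its mean m decomposes (Jordan decomposition) as
#   d/dm N(m, sigma^2) = c * (mu_plus - mu_minus),  c = 1/(sigma*sqrt(2*pi)),
# where mu_plus is m + Rayleigh(sigma) and mu_minus is m - Rayleigh(sigma),
# i.e., the two Rayleigh policies supported above and below the mean.

def wd_derivative_estimate(f, m, sigma, n, rng):
    """Estimate d/dm E_{a~N(m,sigma^2)}[f(a)] via the Rayleigh decomposition."""
    r_plus = rng.rayleigh(scale=sigma, size=n)   # samples feeding mu_plus
    r_minus = rng.rayleigh(scale=sigma, size=n)  # samples feeding mu_minus
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))     # normalizing constant g
    return c * (np.mean(f(m + r_plus)) - np.mean(f(m - r_minus)))

# Sanity check against a closed form: for f(a) = a^2, d/dm E[a^2] = 2m.
rng = np.random.default_rng(0)
est = wd_derivative_estimate(lambda a: a**2, m=1.5, sigma=1.0, n=400000, rng=rng)
```

For m = 1.5 the estimate should concentrate near 2m = 3, without any score function appearing in the computation.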
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. POLICY SEARCH VIA JORDAN DECOMPOSITION</head><p>In order to develop a policy search method based on Theorem 1, we need unbiased samples of both factors inside the expectation in <ref type="bibr">(7)</ref>. We first focus on the latter factor, the Q-function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Estimating the Action-Value</head><p>The estimation of the Q-function is carried out using Monte Carlo roll-outs of random path lengths, similar to <ref type="bibr">[21]</ref>. Here the random length is a geometric random variable with parameter γ, the discount factor of the reinforcement learning problem. Specifically, we simulate T ∼ Geom(1 − γ) and then simulate state-action pairs according to the positive and negative induced policies π ⊕ and π ⊖ . Over this time horizon, we collect the rewards of the two different trajectories.</p><p>More specifically, from a given starting state x 0 , a (real) trajectory is simulated to update the policy parameters θ. At each epoch k of the parameter update θ k , the simulator (modeled as (S(= X), A, T, r, γ)) is called twice to simulate two different (phantom <ref type="foot">7</ref> ) trajectories. These trajectories correspond to the random Monte Carlo roll-outs used to estimate the Q-functions under two different policies, the positive and negative policy measures, and hence the stochastic gradient of the expected reward function. Let T denote a geometrically distributed random variable, T ∼ Geom(1 − γ), where γ is the discount factor, and let the path-wise cost be the accumulated reward along the roll-out.</p><p>Discussion: Algorithm 2 together with Algorithm 1 constitutes the stochastic gradient algorithm used to update the policy parameters. The simulation consists of a single (real) trajectory to update the parameters and multiple phantom simulations to estimate the gradient of the expected reward function. The two phantom trajectories correspond to different policies, not different models, and start from the system's state on the real trajectory. 
The stochastic gradient computation is summarized in three steps. For a fixed initial state: (i) simulate two phantom initial actions from the measures obtained via the Jordan decomposition, i.e., μ ⊖ θ k (·|s ⊖ 0 ) and μ ⊕ θ k (·|s ⊕ 0 ); (ii) simulate a geometric random variable T k ; and (iii) perform Monte Carlo roll-outs of length T k − 1 (i.e., simulate and feed actions to the simulator and collect the rewards) using the system policy derived from the old parameters. The merit of using these random horizons for estimating the Q-function, as summarized in Algorithm 1, is that one may establish that the estimate is unbiased in the infinite-horizon discounted case, as we summarize in the following theorem.</p></div>
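The random-horizon roll-out can be sketched as follows; the `step` simulator callback and the degenerate constant-reward chain used in the sanity check are illustrative assumptions, not the paper's environment.

```python
import numpy as np

# Sketch of the random-horizon Monte Carlo roll-out (Algorithm 1 in the paper).
# With T ~ Geom(1-gamma) on {1, 2, ...} we have P(T > t) = gamma^t, so the plain
# (undiscounted) sum of the first T rewards is an unbiased estimate of the
# discounted return: E[sum_{t=0}^{T-1} r_t] = sum_t gamma^t E[r_t].
# The `step` callback standing in for the simulator is an assumption.

def rollout_q_estimate(step, x0, a0, gamma, rng):
    """One unbiased sample of Q(x0, a0) via a geometric-length roll-out."""
    T = rng.geometric(1.0 - gamma)  # support {1, 2, ...}
    total, x, a = 0.0, x0, a0
    for _ in range(T):
        x, a, r = step(x, a)        # simulator: next state, next action, reward
        total += r
    return total

# Check on a degenerate chain with constant reward 1: Q = 1/(1-gamma).
rng = np.random.default_rng(0)
step = lambda x, a: (x, a, 1.0)
gamma = 0.9
est = np.mean([rollout_q_estimate(step, 0.0, 0.0, gamma, rng) for _ in range(100000)])
```

With constant reward 1 the estimator reduces to T itself, whose mean 1/(1 − γ) equals the true discounted return, illustrating the unbiasedness claim.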
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 2. For a geometric r.v. T, let the approximate state-action value function (Q-function) be defined by</head><p>Here T denotes a geometrically distributed random variable. Then the estimate is unbiased.</p><p>The proof can be found in <ref type="bibr">[27]</ref>. Now that we may obtain unbiased samples of the action-value function, we shift focus to computing the stochastic gradients needed for policy search based on the Jordan decomposition (Theorem 1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Stochastic Gradient Algorithm</head><p>With the estimation of the action-value function addressed, we now discuss how to sample the former factor: the signed measure gradients. Specifically, Theorem 1 can be used to compute the gradient given access to an oracle/simulator that generates state-action-reward triples. It is well known that one only needs gradient estimates that are unbiased in expectation to ensure convergence of the iterates to a stationary point <ref type="bibr">[7]</ref>. This results in a modification of the gradient expression, as in the REINFORCE algorithm <ref type="bibr">[11]</ref>, <ref type="bibr">[7]</ref>, yielding a stochastic gradient for computing the optimal policy of the reinforcement learning problem. Let E T denote the expectation with respect to the geometric distribution.</p><p>Using Theorem 2 and Fubini's theorem <ref type="bibr">[28]</ref>, the gradient in <ref type="bibr">(7)</ref> can be rewritten to make it implementable on a simulator:</p><p>We have from Theorem 2 and ( <ref type="formula">14</ref>),</p><p>Here the initial state is simulated from the ergodic measure, x 0 ∼ π μ θ (x), and the policies that simulate the two trajectories are:</p><p>Here the initial actions are simulated from the decomposed measures and the parametrized policy is used for the remainder of the trajectory simulation. Expression <ref type="bibr">(15)</ref> is the (stochastic) gradient estimate for a random path length T and ( <ref type="formula">16</ref>) is the (stochastic) gradient estimate.</p><p>Algorithm 2: Policy Gradient with Jordan Decomposition (PG-JD)</p><p>Input: System state x k+1 , parameter vector θ k , and continuous random policy μ θ k .</p><p>Output: Parameter θ k+1 and next system input a k+1 ∼ μ θ k+1 .</p><p>Step 1. Simulate T k ∼ Geom(1 − γ), i.e., P (T k = t) = (1 − γ)γ t . 
Define the initial conditions s ⊕ 0 , s ⊖ 0 = x k+1 . Define μ ⊕ as the policy for trajectory 1 and μ ⊖ as the policy for trajectory 2.</p><p>Step 2. Simulate the two trajectories using a realization T z . Using the estimates ( <ref type="formula">16</ref>), which are computable via Algorithm 1 by estimating the Q-function with respect to the signed measures, we may write an iterative stochastic gradient method to optimize θ with respect to the value function as</p><p>The overall policy search routine is summarized as Algorithm 2. Its convergence and variance properties are discussed in the following section.</p></div>
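The steps above can be sketched end-to-end on a toy one-state problem. The environment (fixed state x = 1, reward −(a − 1) 2 ), the step-size constant, and the horizon handling are our illustrative assumptions; only the structure (two phantom roll-outs with Rayleigh initial actions and a scaled difference of Q estimates) follows Algorithm 2.

```python
import numpy as np

# Hedged sketch of one PG-JD update (Algorithm 2) on a toy one-state problem:
# x is fixed at 1, reward r(a) = -(a - 1)^2, Gaussian policy N(theta*x, 1).
# The environment and step-size constant are illustrative assumptions.

def pg_jd_step(theta, k, gamma, rng, x=1.0, sigma=1.0):
    c = x / (sigma * np.sqrt(2.0 * np.pi))       # g(theta, x) for a linear mean
    T = rng.geometric(1.0 - gamma)               # shared random horizon
    a_plus = theta * x + rng.rayleigh(sigma)     # initial action from mu_plus
    a_minus = theta * x - rng.rayleigh(sigma)    # initial action from mu_minus
    q_plus = -(a_plus - 1.0) ** 2
    q_minus = -(a_minus - 1.0) ** 2
    for _ in range(T - 1):                       # remaining steps follow mu_theta
        q_plus += -(rng.normal(theta * x, sigma) - 1.0) ** 2
        q_minus += -(rng.normal(theta * x, sigma) - 1.0) ** 2
    grad_hat = c * (q_plus - q_minus)            # stochastic gradient estimate
    return theta + 0.1 * (k + 1) ** -0.5 * grad_hat   # ascent with step k^(-1/2)

rng = np.random.default_rng(0)
theta = 0.0
for k in range(5000):
    theta = pg_jd_step(theta, k, gamma=0.9, rng=rng)
# theta should approach 1, the maximizer of E[-(a - 1)^2]
```

Because the state never changes in this toy problem, the roll-out noise largely cancels in the difference of the two phantom Q estimates, and the iterates drift toward the maximizer θ = 1.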
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONVERGENCE, COMPLEXITY, &amp; VARIANCE ANALYSIS</head><p>In this section, we discuss a few properties of the stochastic gradient ascent algorithm derived using weak derivatives, namely, convergence, the iteration complexity, sample complexity, and the variance of the resulting gradient estimates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Convergence Analysis</head><p>We now analyze the convergence of the PG-JD algorithm (Algorithm 2), establishing that the stochastic gradient estimates obtained from the algorithm are unbiased estimates of the true gradient, and that the parameter sequence <ref type="bibr">(17)</ref> converges almost surely to a stationary point of the value function (4). To do so, some assumptions are required, which we state next.</p><p>1) Assumptions: (i) The reward function<ref type="foot">foot_7</ref> r(x, a) is bounded Lipschitz, i.e.,</p><p>ψ-irreducible, positive Harris recurrent, and geometrically ergodic. (iv) The continuous policy μ θ (a|x) is Lipschitz, i.e.,</p><p>for all θ ∈ Θ, and n, m &gt; 0. Assumptions (i)-(iii) are model assumptions, whereas Assumptions (iv)-(vi) impose restrictions on how the algorithm behaves. Assumption (i) is standard, and tied to the learnability of the problem. Assumption (ii) is a continuity assumption on the transition law that is easily satisfied by most physical systems. Assumption (iii) ensures that for every policy μ θ there exists a unique invariant (stationary) measure and that the Markov chain reaches stationarity geometrically fast; see <ref type="bibr">[30]</ref>. All the results hold without the transition law being geometrically ergodic; assuming geometric ergodicity makes simulating from the ergodic measure (in Algorithm 2, Sec. IV) more meaningful. Regarding the algorithmic conditions: Assumptions (iv)-(v) are standard in stochastic gradient methods; see <ref type="bibr">[31]</ref>. Assumption (vi) says that the stochastic gradient is always bounded by the true gradient, which can grow unbounded with θ; this ensures that the martingale noise of the stochastic gradient is bounded by the true gradient; see <ref type="bibr">[31]</ref>.</p><p>Proposition 1. 
Under Assumption (i), the expected cost J(θ) in the reinforcement learning problem ( <ref type="formula">4</ref>) is a bounded real-valued function, i.e.,</p><p>The following result establishes that the stochastic gradient estimates so obtained are representative of the true gradient. Theorem 3. The stochastic gradient obtained in ( <ref type="formula">16</ref>) is an unbiased estimate of the true gradient ∇J(θ), i.e., its expectation equals ∇J(θ).</p><p>Discussion: The proof can be found in <ref type="bibr">[27]</ref>. Theorem 3 says that the estimates of the stochastic gradient are unbiased in expectation. This is required to ensure the almost sure convergence of the iterates to a stationary point <ref type="bibr">[7]</ref>.</p><p>Discussion: The proof can be found in <ref type="bibr">[27]</ref>. The expected cost function J(θ), under the model assumptions, is continuous and L-Lipschitz; see <ref type="bibr">[32, Chapter 7]</ref> and <ref type="bibr">[29]</ref>. (As in <ref type="bibr">[29]</ref>, K(υ, ν) denotes the Kantorovich distance between probability distributions υ and ν.) Theorem 4 says that the sequence of iterates {θ k } converges to θ * with probability one, and since J(θ) is a continuous function, J(θ k ) converges to J(θ * ) with probability one. The gradient (which can be unbounded) at the iterates {θ k } is such that ∇J(θ * ) = 0 with probability one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Sample Complexity</head><p>In this section, we consider the convergence rate of the PG-JD algorithm. We choose the step-size at iteration k to be k −b for some parameter b ∈ (0, 1). Since the optimization of J(θ) is generally non-convex, we consider the convergence rate in terms of a metric of non-stationarity, namely the squared norm of the gradient ∇J(θ) 2 . The following theorem considers a step-size that diminishes more slowly than Assumption (v) allows, which yields a O(1/√k) rate for the decrease of the expected squared gradient norm E ∇J(θ k ) 2 . Theorem 5. Let {θ k } k≥0 be the sequence of parameters of the policy μ θ k generated by Algorithm 2. Let the step-size be k −b for b ∈ (0, 1) and Δ = min{ε, η} for some ε, η &gt; 0. Let</p><p>denote the number of iteration steps for the norm of the expected cost gradient to come within the error neighborhood. Then, under Assumptions (i)-(iv), (vi),</p><p>where optimizing the complexity bound over b gives b = 1/2. Therefore, K Δ = O(Δ −2 ).</p><p>Discussion: See <ref type="bibr">[27]</ref> for the proof. Theorem 5 characterizes the iteration complexity, a measure of the number of iteration steps the algorithm requires to settle on a stationary point of the value function. The iteration complexity is O(1/√k), showing that the convergence rate is similar to that of stochastic gradient methods in non-convex settings. We emphasize that, despite the re-invention of such proofs in machine learning, much more general analyses have been developed in the literature. For example, Proposition 5, pg. 294 in <ref type="bibr">[33]</ref> gives general L q error bounds for q ≥ 2 in the presence of Markovian noise (whereas our setting is i.i.d. noise). Our intention is to identify these rates in the context and language of modern RL. Corollary 6. 
Let γ denote the discount factor and K Δ the iteration complexity. The average sample complexity M Δ γ using Algorithm 2 is given as:</p><p>Discussion: The proof can be found in <ref type="bibr">[27]</ref>. Corollary 6 characterizes the sample complexity, a measure of the expected total number of actions and states realized. The higher the discount factor γ, the longer the two (random) Monte Carlo roll-outs (trajectories) that need to be simulated, and hence the higher the sample complexity. Together, the complexity results (Theorem 5 and Corollary 6) provide an estimate of the duration and expected number of simulations required to learn a stationary solution of the reinforcement learning task considered.</p></div>
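A back-of-envelope reading of this bookkeeping: each iteration runs two phantom roll-outs of mean length 1/(1 − γ). The formulas below are our interpretation of Theorem 5 and Corollary 6 with constants suppressed, not a quotation of them.

```python
# Sketch of the sample-complexity bookkeeping suggested by Corollary 6.
# Each iteration runs two phantom roll-outs of mean length E[T] = 1/(1-gamma),
# so the expected number of simulated transitions after K iterations is about
# 2*K/(1-gamma). Constants are suppressed; this is our reading, not a quote.

def expected_samples(K, gamma):
    """Expected simulator transitions after K PG-JD iterations."""
    return 2.0 * K / (1.0 - gamma)

def iterations_needed(delta):
    """K_Delta = O(delta^-2) with the optimal step-size exponent b = 1/2."""
    return delta ** -2
```

This makes explicit how a discount factor closer to 1 inflates the sample complexity through longer roll-outs even when the iteration count is unchanged.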
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Variance Analysis</head><p>In this section, we analyze the variance of the stochastic gradient estimates obtained using weak derivatives and using the score function. Since the Q-function estimation in the computation of the gradient is performed using random Monte Carlo roll-outs as in <ref type="bibr">[21]</ref>, the stochastic gradient is a function of the geometric random variable T that characterizes the roll-out (trajectory) length. To compare the two methods (weak derivatives and score function), we consider the expected variance of the gradient estimates. The proof of Theorem 7 can be found in <ref type="bibr">[27]</ref>; the proof of Theorem 8 is similar and hence omitted.</p><p>Theorem 7. The expected variance of the gradient estimates ∇J obtained using weak derivatives is given as:</p><p>where</p><p>Theorem 8. The expected variance of the gradient estimates ∇J, if the score function is used instead of weak derivatives, is given as:</p><p>where G SF = E (x,a)∼μ θ (a|x) ∇μ θ (a|x) 2 .</p><p>Corollary 9. For the Gaussian policy μ θ (·|x) = N(θ ⊤ φ(x), σ 2 ), we have</p><p>Hence, the maximum expected variance of the gradient estimates using weak derivatives is smaller than that obtained using the score function method.</p><p>Fig. 1 (caption): The discounted return Σ k γ k r(x k , a k ) is evaluated over 50 trajectories with γ = 0.97. The discounted return is higher on average using Monte Carlo PG-JD than PG-SF, which can be attributed to the algorithm iterates converging to a "better" stationary point due to the smaller variance of the gradient estimates.</p></div>
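The variance comparison can be probed empirically in a one-step Gaussian case; the test function f, the mean m, and σ below are illustrative stand-ins, and the score-function estimate uses the standard REINFORCE form, not the paper's exact expressions.

```python
import numpy as np

# Empirical probe of the comparison in Theorems 7-8: the variance of a one-step
# estimate of d/dm E[f(a)], a ~ N(m, sigma^2), using the score function versus
# the weak-derivative (Rayleigh) form. f, m, sigma are illustrative choices.

def sf_estimates(f, m, sigma, n, rng):
    a = rng.normal(m, sigma, size=n)
    return f(a) * (a - m) / sigma**2             # REINFORCE-style estimates

def wd_estimates(f, m, sigma, n, rng):
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return c * (f(m + rng.rayleigh(sigma, size=n)) - f(m - rng.rayleigh(sigma, size=n)))

rng = np.random.default_rng(0)
f = lambda a: -(a - 1.0) ** 2                    # a stand-in "Q-function"
sf = sf_estimates(f, 3.0, 1.0, 500000, rng)
wd = wd_estimates(f, 3.0, 1.0, 500000, rng)
# Both are unbiased for d/dm E[f(a)] = -2(m - 1) = -4; their variances differ.
```

In this instance the empirical variance of the weak-derivative estimates is markedly smaller than that of the score-function estimates, consistent with Corollary 9.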
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. NUMERICAL STUDIES</head><p>In this section, we present a simple experiment using the PG-JD algorithm on the Pendulum environment in OpenAI gym <ref type="bibr">[34]</ref>. The performance is compared with Monte Carlo Policy Gradient using the Score Function (PG-SF), which is akin to REINFORCE <ref type="bibr">[35]</ref> with random roll-out horizons; see Fig. <ref type="figure">1</ref>. In the simulation environment, the pendulum starts at a random position, and the goal is to swing it up so that it stays upright. The environment state is a vector of dimension three, x k = (cos(ϑ k ), sin(ϑ k ), ϑ̇ k ), where ϑ k is the angle between the pendulum and the upright direction, and ϑ̇ k is the derivative of ϑ k . The action a k is a one-dimensional scalar, modified using a tanh function, that represents the joint effort. The received reward r(x k , a k ) is given as</p><p>which lies in [−16.2734, 0]; ϑ k is normalized between [−π, π] and a k lies in [−2, 2]. The transition dynamics are specified by Newton's second law of motion. We use a Gaussian policy π θ , parameterized as π θ (·|x) = N(θ T φ(x), σ 2 ), where σ = 1.0 and φ(x)(= x) is the feature vector. 
The policy is stationary (time-homogeneous), which is well known <ref type="bibr">[6]</ref> to be sufficient for infinite or random horizon discounted MDP problems.</p><p>Observe that the discounted return is higher on average using PG-JD as opposed to PG-SF, which may be attributable to the variance-reduced properties of the policy gradient estimates using signed measures as compared with the score function.</p><p>Remark: For common parametrizations of the mean of the Gaussian policy <ref type="bibr">[12]</ref>, for example the linear parametrization θ T φ(s), the score function (a − θ T φ(s))φ(s)/σ 2 is unbounded with respect to θ. This results in convergence issues for policy gradient algorithms with unbounded θ and unbounded state spaces. Using the Jordan decomposition, however, even with a linear parametrization and unboundedness, convergence of the policy gradient algorithm is ensured owing to the absence of an explicit function of θ in the gradient estimate.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The Q-function is also known as the state-action value function<ref type="bibr">[7]</ref>. It gives the expected return for a choice of action in a given state.</p></note>
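For reference, the pendulum reward used in Sec. VI is, to our reading, the standard OpenAI gym quadratic cost; the exact coefficients below are our reconstruction, consistent with the stated range [−16.2734, 0], not a quotation from the paper.

```python
import numpy as np

# Reconstruction of the gym Pendulum reward (an assumption, consistent with the
# stated range [-16.2734, 0]):
#   r(x_k, a_k) = -(angle^2 + 0.1 * angular_velocity^2 + 0.001 * action^2)
# with the angle normalized to [-pi, pi], angular velocity in [-8, 8], and the
# action (torque) in [-2, 2].

def pendulum_reward(angle, angular_velocity, action):
    return -(angle**2 + 0.1 * angular_velocity**2 + 0.001 * action**2)

worst = pendulum_reward(np.pi, 8.0, 2.0)   # most-negative achievable reward
best = pendulum_reward(0.0, 0.0, 0.0)      # upright, at rest, no torque
```

Evaluating at the corners of the state-action box recovers approximately the stated reward range.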
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>The Hahn-Jordan decomposition<ref type="bibr">[15]</ref> of signed measures is a specific type of weak derivative form; it expresses the derivative of a measure as the weighted difference of two orthogonal measures. For example, the gradient of a Gaussian policy<ref type="bibr">[12]</ref> can be expressed as a (scaled) difference of two Rayleigh policies.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Iteration complexity is a measure of the number of changes of the unknown parameter. Sample complexity includes the additional simulations required to estimate the continuous state-action Q-function using Monte Carlo roll-out with random path lengths.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>The probability density function of the Rayleigh distribution is f(x) = (x/σ 2 ) exp(−x 2 /(2σ 2 )), x ≥ 0.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>&#967;(&#8226;) denotes the indicator function.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>Here the word "phantom" is used to refer to the actions on the simulator.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_7"><p>Let the product space X × A be equipped with the taxi-cab norm: d X×A ((x 1 , a 1 ), (x 2 , a 2 )) = d X (x 1 , x 2 ) + d A (a 1 , a 2 ) ∀(x 1 , x 2 , a 1 , a 2 ) ∈ X 2 × A 2 , where d (·) denotes the corresponding metric on the Euclidean space.</p></note>
		</body>
		</text>
</TEI>
