Title: Policy Gradient using Weak Derivatives for Reinforcement Learning
This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of the score function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and shown to be O(1/√k); (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than that of the estimates obtained using the popular score-function approach. Experiments on the OpenAI Gym pendulum environment illustrate the superior performance of the proposed algorithm.
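For orientation, the two gradient representations referred to in the abstract can be written, up to discount-dependent normalization constants and per parameter coordinate, as in the sketch below. The notation (ρ_{π_θ} for the state-occupancy measure, c_θ and π_θ^± for the Jordan decomposition of the measure-valued derivative of the policy) is generic and not necessarily the paper's own.

```latex
% Score-function (likelihood-ratio) form of the Policy Gradient Theorem
\nabla_\theta V(\theta)
  \propto \mathbb{E}_{s \sim \rho_{\pi_\theta},\, a \sim \pi_\theta(\cdot \mid s)}
    \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a) \right]

% Weak (measure-valued) derivative form: decompose the derivative of the policy measure as
\nabla_\theta \pi_\theta(\cdot \mid s)
  = c_\theta(s) \left( \pi_\theta^{+}(\cdot \mid s) - \pi_\theta^{-}(\cdot \mid s) \right),
% which yields the alternative representation
\nabla_\theta V(\theta)
  \propto \mathbb{E}_{s \sim \rho_{\pi_\theta}} \left[ c_\theta(s)
      \left( \mathbb{E}_{a^{+} \sim \pi_\theta^{+}}\!\left[ Q^{\pi_\theta}(s,a^{+}) \right]
           - \mathbb{E}_{a^{-} \sim \pi_\theta^{-}}\!\left[ Q^{\pi_\theta}(s,a^{-}) \right] \right) \right]
```

In the second form, a single-sample gradient estimate draws one action from each of π_θ^+ and π_θ^- and differences the corresponding Q-values, rather than weighting a single Q-value by the score function; the variance comparison in item (iv) contrasts these two estimators.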
Award ID(s):
1714180
PAR ID:
10161725
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2019 IEEE 58th Conference on Decision and Control
Page Range / eLocation ID:
5531 to 5537
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper provides a finite-sample analysis of a passive stochastic gradient Langevin dynamics (PSGLD) algorithm. This algorithm is designed to achieve adaptive inverse reinforcement learning (IRL). Adaptive IRL aims to estimate the cost function of a forward learner performing a stochastic gradient algorithm (e.g., policy gradient reinforcement learning) by observing its estimates in real time. The PSGLD algorithm is considered passive because it incorporates noisy gradients provided by an external stochastic gradient algorithm (the forward learner), over which it has no control. The PSGLD algorithm acts as a randomized sampler to achieve adaptive IRL by reconstructing the forward learner's cost function nonparametrically from the stationary measure of a Langevin diffusion. This paper analyzes the non-asymptotic (finite-sample) performance: we provide explicit bounds on the 2-Wasserstein distance between the PSGLD algorithm's sample measure and the stationary measure encoding the cost function, and we provide guarantees for a kernel density estimation scheme which reconstructs the cost function from empirical samples. Our analysis uses tools from the study of Markov diffusion operators. The derived bounds have both practical and theoretical significance: they provide finite-time guarantees for an adaptive IRL mechanism, and they substantially generalize the analytical framework of a line of research in passive stochastic gradient algorithms.
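To make the "passive" aspect concrete, here is a minimal Python sketch of a kernel-weighted Langevin update that consumes gradient evaluations supplied by an external forward learner; the function name, the Gaussian kernel, and all constants are illustrative assumptions rather than the exact recursion analyzed in the paper.

```python
import numpy as np

def passive_sgld_step(theta, x_k, g_k, step, beta=1.0, bandwidth=0.1,
                      rng=np.random.default_rng()):
    """One hypothetical passive SGLD update (a sketch, not the paper's exact recursion).

    theta : current iterate of the passive (IRL) sampler
    x_k   : point at which the forward learner evaluated its stochastic gradient
    g_k   : noisy gradient supplied by the forward learner at x_k (not under our control)
    """
    # Kernel weight: the external gradient is only informative near theta,
    # so it is downweighted when x_k is far from theta (passive smoothing).
    k_weight = np.exp(-np.sum((x_k - theta) ** 2) / (2.0 * bandwidth ** 2))

    # Langevin step: drift along the kernel-weighted external gradient plus
    # injected Gaussian noise, so the iterates sample a Gibbs-type measure.
    noise = rng.normal(size=theta.shape)
    return theta - step * k_weight * g_k + np.sqrt(2.0 * step / beta) * noise
```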
  2. Reinforcement learning (RL) tackles sequential decision-making problems by creating agents that interact with their environment. However, existing algorithms often view these problems as static, focusing on point estimates for model parameters that maximize expected rewards while neglecting the stochastic dynamics of agent-environment interactions and the critical role of uncertainty quantification. Our research leverages the Kalman filtering paradigm to introduce a novel and scalable sampling algorithm called Langevinized Kalman Temporal-Difference (LKTD) for deep reinforcement learning. This algorithm, grounded in Stochastic Gradient Markov Chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by the LKTD algorithm converge to a stationary distribution. This convergence not only enables us to quantify uncertainties associated with the value function and model parameters but also allows us to monitor these uncertainties during policy updates throughout the training phase. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.
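As an illustration of the uncertainty quantification that posterior samples of network parameters make possible, the generic Python sketch below computes a posterior mean and a simple credible interval for a value estimate; the value_fn interface and the 95% interval are assumptions for illustration, not part of the LKTD algorithm itself.

```python
import numpy as np

def value_uncertainty(param_samples, value_fn, state):
    """Summarize value-function uncertainty from posterior parameter samples (generic sketch).

    param_samples : iterable of parameter vectors drawn by an SGMCMC-style sampler
    value_fn      : callable value_fn(params, state) -> scalar value estimate (assumed interface)
    """
    values = np.array([value_fn(p, state) for p in param_samples])
    # Posterior mean and a 95% credible interval for the value at this state.
    return values.mean(), np.percentile(values, [2.5, 97.5])
```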
  3. Astolfi, Alessandro (Ed.)
    Q-learning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. In the original tabular formulation, the goal is to compute exactly a solution to the discounted-cost optimality equation, and thereby obtain the optimal policy for a Markov Decision Process. The goal today is more modest: obtain an approximate solution within a prescribed function class. The standard algorithms are based on the same architecture as formulated in the 1980s, with the goal of finding a value function approximation that solves the so-called projected Bellman equation. While reinforcement learning has been an active research area for over four decades, there is little theory providing conditions for convergence of these Q-learning algorithms, or even existence of a solution to this equation. The purpose of this paper is to show that a solution to the projected Bellman equation does exist, provided the function class is linear and the input used for training is a form of epsilon-greedy policy with sufficiently small epsilon. Moreover, under these conditions it is shown that the Q-learning algorithm is stable, in terms of bounded parameter estimates. Convergence remains one of many open topics for research. 
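For a concrete anchor to the setting described above, here is a minimal Python sketch of Q-learning with a linear function class, Q_theta(s, a) = theta @ phi(s, a), trained with epsilon-greedy exploration; it illustrates the standard recursion whose limit points relate to the projected Bellman equation and is not the paper's algorithm or analysis.

```python
import numpy as np

def q_learning_linear_step(theta, phi, s, a, r, s_next, actions,
                           alpha, gamma, eps, rng=np.random.default_rng()):
    """One Q-learning update for a linear function class (illustrative sketch).

    phi     : feature map, phi(state, action) -> np.ndarray of shape (d,)
    actions : finite list of actions available in s_next
    """
    # Temporal-difference error against the greedy bootstrap target.
    q_next = np.array([theta @ phi(s_next, b) for b in actions])
    td_error = r + gamma * np.max(q_next) - theta @ phi(s, a)

    # Stochastic-approximation update of the parameter vector.
    theta = theta + alpha * td_error * phi(s, a)

    # Epsilon-greedy choice of the next action (the exploration used for training data).
    if rng.random() < eps:
        a_next = actions[rng.integers(len(actions))]
    else:
        a_next = actions[int(np.argmax(q_next))]
    return theta, a_next
```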
  4. Melo, S. F.; Fang, F. (Eds.)
    Existing risk-averse reinforcement learning approaches still face several challenges, including the lack of a global optimality guarantee and the necessity of learning from long-term consecutive trajectories. Long-term consecutive trajectories are prone to involve visits to hazardous states, which is a major concern in the risk-averse setting. This paper proposes Transition-based vOlatility-controlled Policy Search (TOPS), a novel algorithm that solves risk-averse problems by learning from transitions. We prove that, in the over-parameterized neural network regime, our algorithm finds a globally optimal policy at a sublinear rate with proximal policy optimization and natural policy gradient. The convergence rate is comparable to that of state-of-the-art risk-neutral policy-search methods. The algorithm is evaluated on challenging MuJoCo robot simulation tasks under the mean-variance evaluation metric. Both the theoretical analysis and the experimental results demonstrate that TOPS achieves state-of-the-art performance among existing risk-averse policy search methods.
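For reference, the mean-variance evaluation metric mentioned above is commonly taken to be the mean return penalized by the variance of returns over evaluation rollouts; the short Python sketch below computes this quantity, with the risk weight risk_coeff an assumed illustrative parameter rather than a value from the paper.

```python
import numpy as np

def mean_variance_score(episode_returns, risk_coeff=1.0):
    """Generic mean-variance criterion: mean return minus a weighted return variance."""
    returns = np.asarray(episode_returns, dtype=float)
    return returns.mean() - risk_coeff * returns.var()
```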
  5. In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived assuming a perfect Markov decision process (MDP) model. In "Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes," A. Bennett and N. Kallus tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, they consider estimating the value of a given target policy in an unknown POMDP, given observations of trajectories generated by a different and unknown policy, which may depend on the unobserved states. They consider both when the target policy value can be identified from the observed data and, given identification, how best to estimate it. Both problems are addressed by extending the framework of proximal causal inference to POMDP settings, using sequences of so-called bridge functions. This results in a novel framework for off-policy evaluation in POMDPs that they term proximal reinforcement learning, which they validate in various empirical settings.