The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mis- matches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an ε-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.
Accelerating Model Free Reinforcement Learning with Imperfect Model Knowledge in Dynamic Spectrum Access
Current studies that apply reinforcement learning (RL) to dynamic spectrum access (DSA) problems in wireless communications systems are mainly focusing on model-free RL. However, in practice model-free RL requires large number of samples to achieve good performance making it impractical in real time applications such as DSA. Combining model-free and model-based RL can potentially reduce the sample complexity while achieving similar level of performance as model-free RL as long as the learned model is accurate enough. However, in complex environment the learned model is never perfect. In this paper we combine model-free and model-based reinforcement learning, introduce an algorithm that can work with an imperfectly learned model to accelerate the model-free reinforcement learning. Results show our algorithm achieves higher sample efficiency than standard model-free RL algorithm and Dyna algorithm (a standard algorithm that integrating model-based and model-free RL) with much lower computation complexity than the Dyna algorithm. For the extreme case where the learned model is highly inaccurate, the Dyna algorithm performs even worse than the model-free RL algorithm while our algorithm can still outperform the model-free RL algorithm.
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- IEEE Internet of Things Journal
- Page Range or eLocation-ID:
- 1 to 1
- Sponsoring Org:
- National Science Foundation
More Like this
Reinforcement learning-based real-time control of coastal urban stormwater systems to mitigate flooding and improve water qualityReal-time control of stormwater systems can reduce flooding and improve water quality. Current industry real-time control strategies use simple rules based on water quantity parameters at a local scale. However, system-level control methods that also incorporate observations of water quality could provide improved control and performance. Therefore, the objective of this research is to evaluate the impact of local and system-level control approaches on flooding and sediment-related water quality in a stormwater system within the flood-prone coastal city of Norfolk, Virginia, USA. Deep reinforcement learning (RL), an emerging machine learning technique, is used to learn system-level control policies that attempt to balance flood mitigation and treatment of sediment. RL is compared to the conventional stormwater system and two methods of local-scale rule-based control: (i) industry standard predictive rule-based control with a fixed detention time and (ii) rules based on water quality observations. For the studied system, both methods of rule-based control improved water quality compared to the passive system, but increased total system flooding due to uncoordinated releases of stormwater. An RL agent learned controls that maintained target pond levels while reducing total system flooding by 4% compared to the passive system. When pre-trained from the RL agent that learnedmore »
Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning can, at least in theory, directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the the POLICY COVER GUIDED POLICY GRADIENT (PC- PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case L infinity assumptions; these include approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in bothmore »
Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample inefficient. A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal. However, designing appropriate shaping rewards is known to be difficult as well as time-consuming. In this work, we address this problem by using natural language instructions to perform reward shaping. We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent. These intermediate language-based rewards can seamlessly be integrated into any standard reinforcement learning algorithm. We experiment with Montezuma’s Revenge from the Atari Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60 % more often on average, compared to learning without language.
PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-trainingConveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher’s preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent’s past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard rewardmore »