The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an ε-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.
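The max-min formulation described above can be illustrated with a minimal sketch of robust value iteration for the total variation case only. This is not the algorithm proposed in the abstract: it assumes the nominal model is already known (the robust dynamic programming setting), and the function names, array shapes, and the radius parameter `rho` are assumptions made for illustration. The inner minimization uses the standard greedy argument for a total variation ball: shift up to `rho` probability mass from the highest-value next states to the lowest-value one.

```python
import numpy as np

def worst_case_expectation_tv(p0, v, rho):
    """Minimize p @ v over distributions p with TV(p, p0) <= rho (assumed radius)."""
    p = np.asarray(p0, dtype=float).copy()
    lo = int(np.argmin(v))
    budget = min(rho, 1.0 - p[lo])      # mass that can still be moved to the worst state
    p[lo] += budget
    for s in np.argsort(v)[::-1]:       # remove mass from the highest-value states first
        if budget <= 0:
            break
        if s == lo:
            continue
        take = min(p[s], budget)
        p[s] -= take
        budget -= take
    return float(p @ v)

def robust_value_iteration(P0, R, gamma, rho, iters=500):
    """P0 has shape (S, A, S) (nominal model), R has shape (S, A)."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    for _ in range(iters):
        q = np.array([[R[s, a] + gamma * worst_case_expectation_tv(P0[s, a], v, rho)
                       for a in range(n_actions)]
                      for s in range(n_states)])
        v = q.max(axis=1)               # outer max over actions, inner min over models
    return v, q.argmax(axis=1)
```

With `rho = 0` the sketch reduces to ordinary value iteration on the nominal model; increasing `rho` can only lower the computed values, since the adversary has a larger set of models to choose from.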
Accelerating Model Free Reinforcement Learning with Imperfect Model Knowledge in Dynamic Spectrum Access
Current studies that apply reinforcement learning (RL) to dynamic spectrum access (DSA) problems in wireless communications systems mainly focus on model-free RL. In practice, however, model-free RL requires a large number of samples to achieve good performance, making it impractical in real-time applications such as DSA. Combining model-free and model-based RL can potentially reduce the sample complexity while achieving a similar level of performance as model-free RL, as long as the learned model is accurate enough. However, in complex environments the learned model is never perfect. In this paper we combine model-free and model-based reinforcement learning and introduce an algorithm that can work with an imperfectly learned model to accelerate model-free reinforcement learning. Results show that our algorithm achieves higher sample efficiency than both a standard model-free RL algorithm and the Dyna algorithm (a standard algorithm that integrates model-based and model-free RL), with much lower computational complexity than the Dyna algorithm. In the extreme case where the learned model is highly inaccurate, the Dyna algorithm performs even worse than the model-free RL algorithm, while our algorithm still outperforms the model-free RL algorithm.
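For readers unfamiliar with the Dyna baseline mentioned above, the following is a minimal sketch of classic tabular Dyna-Q, which interleaves model-free Q-learning updates on real transitions with extra planning updates replayed from a learned model. It does not reproduce the algorithm proposed in the abstract. The toy environment `chain_step`, the deterministic table model, and all hyperparameter values are assumptions chosen for illustration.

```python
import random

def chain_step(state, action, n_states=5):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left."""
    nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    done = nxt == n_states - 1
    return nxt, (1.0 if done else 0.0), done

def dyna_q(step, n_states, n_actions, start=0, episodes=50, max_steps=200,
           planning_steps=10, alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    model = {}  # (state, action) -> (reward, next_state, done), learned from experience

    def update(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])

    for _episode in range(episodes):
        s = start
        for _t in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:  # greedy action with random tie-breaking
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            s2, r, done = step(s, a)
            update(s, a, r, s2, done)        # model-free update from the real transition
            model[(s, a)] = (r, s2, done)    # model learning
            for _k in range(planning_steps): # planning: replay transitions from the model
                (ps, pa), (pr, ps2, pdone) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2, pdone)
            s = s2
            if done:
                break
    return Q
```

Calling `dyna_q(chain_step, 5, 2)` returns the learned Q-table. Each real step triggers `planning_steps` additional updates, which is the source of both Dyna's sample efficiency and its extra computation; if the stored model were wrong, those planning updates would propagate the error into the Q-table.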
- Publication Date:
- NSF-PAR ID:
- Journal Name: IEEE Internet of Things Journal
- Page Range or eLocation-ID: 1 to 1
- Sponsoring Org: National Science Foundation
More Like this
Reinforcement learning-based real-time control of coastal urban stormwater systems to mitigate flooding and improve water quality
Real-time control of stormwater systems can reduce flooding and improve water quality. Current industry real-time control strategies use simple rules based on water quantity parameters at a local scale. However, system-level control methods that also incorporate observations of water quality could provide improved control and performance. Therefore, the objective of this research is to evaluate the impact of local and system-level control approaches on flooding and sediment-related water quality in a stormwater system within the flood-prone coastal city of Norfolk, Virginia, USA. Deep reinforcement learning (RL), an emerging machine learning technique, is used to learn system-level control policies that attempt to balance flood mitigation and treatment of sediment. RL is compared to the conventional stormwater system and two methods of local-scale rule-based control: (i) industry standard predictive rule-based control with a fixed detention time and (ii) rules based on water quality observations. For the studied system, both methods of rule-based control improved water quality compared to the passive system, but increased total system flooding due to uncoordinated releases of stormwater. An RL agent learned controls that maintained target pond levels while reducing total system flooding by 4% compared to the passive system. When pre-trained from the RL agent that learned…
Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning can, at least in theory, directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the POLICY COVER GUIDED POLICY GRADIENT (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst case L infinity assumptions; these include approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both…
Reinforcement Learning (RL) agents in the real world must satisfy safety constraints in addition to maximizing a reward objective. Model-based RL algorithms hold promise for reducing unsafe real-world actions: they may synthesize policies that obey all constraints using simulated samples from a learned model. However, imperfect models can result in real-world constraint violations even for actions that are predicted to satisfy all constraints. We propose Conservative and Adaptive Penalty (CAP), a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives. First, CAP inflates predicted costs using an uncertainty-based penalty. Theoretically, we show that policies that satisfy this conservative cost constraint are guaranteed to also be feasible in the true environment. We further show that this guarantees the safety of all intermediate solutions during RL training. Further, CAP adaptively tunes this penalty during training using true cost feedback from the environment. We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state and image-based environments. Our results demonstrate substantial gains in sample-efficiency while incurring fewer violations than prior safe RL algorithms. Code is available at: https://github.com/Redrew/CAP
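The two ingredients described above, inflating the predicted cost by an uncertainty-based penalty and adapting the penalty weight from true cost feedback, can be sketched as follows. This is an illustrative sketch, not the implementation in the linked repository: the function names, the scalar weight `kappa`, and the simple integral-style update with learning rate `lr` are all assumptions.

```python
def inflated_cost(predicted_cost, uncertainty, kappa):
    """Conservative cost: model-predicted cost plus a penalty scaled by model uncertainty."""
    return predicted_cost + kappa * uncertainty

def update_kappa(kappa, observed_cost, cost_budget, lr=0.1):
    """Assumed update rule: raise kappa when the true cost exceeds the budget,
    lower it when there is slack, and never let it go below zero."""
    return max(0.0, kappa + lr * (observed_cost - cost_budget))
```

The design intent is that a larger `kappa` makes the planner more cautious where the learned model is uncertain, while the feedback update keeps the penalty from staying needlessly large once real-world costs fall below the budget.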
Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample inefficient. A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal. However, designing appropriate shaping rewards is known to be difficult as well as time-consuming. In this work, we address this problem by using natural language instructions to perform reward shaping. We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent. These intermediate language-based rewards can seamlessly be integrated into any standard reinforcement learning algorithm. We experiment with Montezuma’s Revenge from the Atari Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared to learning without language.