Title: Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning
Reinforcement Learning (RL) agents in the real world must satisfy safety constraints in addition to maximizing a reward objective. Model-based RL algorithms hold promise for reducing unsafe real-world actions: they may synthesize policies that obey all constraints using simulated samples from a learned model. However, imperfect models can result in real-world constraint violations even for actions that are predicted to satisfy all constraints. We propose Conservative and Adaptive Penalty (CAP), a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives. First, CAP inflates predicted costs using an uncertainty-based penalty. Theoretically, we show that policies satisfying this conservative cost constraint are guaranteed to also be feasible in the true environment; this in turn guarantees the safety of all intermediate solutions during RL training. Further, CAP adaptively tunes this penalty during training using true cost feedback from the environment. We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state- and image-based environments. Our results demonstrate substantial gains in sample efficiency while incurring fewer violations than prior safe RL algorithms. Code is available at: https://github.com/Redrew/CAP
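The mechanism described in the abstract, a predicted cost inflated by an uncertainty-scaled penalty whose weight is adapted from true cost feedback, can be pictured with a short sketch. This is a minimal illustration under our own assumptions, not the released CAP implementation: the uncertainty estimate, the penalty weight kappa, its update rule, and the cost limit below are all placeholders.

```python
import numpy as np

class ConservativeAdaptivePenalty:
    """Toy sketch of a CAP-style penalty: inflate predicted costs by an
    uncertainty term, and adapt the penalty weight from true cost feedback."""

    def __init__(self, cost_limit: float, kappa: float = 1.0, lr: float = 0.05):
        self.cost_limit = cost_limit   # budget on expected cumulative cost (assumed)
        self.kappa = kappa             # penalty weight (assumed initial value)
        self.lr = lr                   # adaptation step size (assumed)

    def penalized_cost(self, predicted_cost: np.ndarray,
                       model_uncertainty: np.ndarray) -> np.ndarray:
        # Conservative cost used inside model rollouts:
        #   c_hat(s, a) = c_model(s, a) + kappa * u(s, a),
        # where u is some uncertainty estimate (e.g., ensemble disagreement).
        return predicted_cost + self.kappa * model_uncertainty

    def update(self, observed_episode_cost: float) -> None:
        # If the real environment reports costs above the budget, become more
        # conservative (increase kappa); if well below, relax the penalty.
        self.kappa = max(0.0, self.kappa + self.lr *
                         (observed_episode_cost - self.cost_limit))


# Usage on dummy data: penalize a batch of model-predicted costs,
# then adapt kappa after observing the true episode cost.
penalty = ConservativeAdaptivePenalty(cost_limit=25.0)
c_model = np.array([0.1, 0.0, 0.3])
u = np.array([0.05, 0.2, 0.01])              # e.g., std-dev across an ensemble
c_conservative = penalty.penalized_cost(c_model, u)
penalty.update(observed_episode_cost=30.0)   # over budget -> kappa grows
```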
Award ID(s): 1918839
NSF-PAR ID: 10403560
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
Volume: 36
Issue: 5
ISSN: 2159-5399
Page Range / eLocation ID: 5404 to 5412
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Safe reinforcement learning is extremely challenging: not only must the agent explore an unknown environment, it must also do so while ensuring that no safety constraints are violated. We formulate this safe reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function, where we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm, Doubly Optimistic and Pessimistic Exploration (DOPE), and show that it achieves an objective regret of $\tilde{O}(|\mathcal{S}|\sqrt{|\mathcal{A}| K})$ without violating the safety constraints during learning, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, and $K$ is the number of learning episodes. Our key idea is to combine a reward bonus for exploration (optimism) with a conservative constraint (pessimism), in addition to the standard optimistic model-based exploration. DOPE not only improves the objective regret bound but also shows a significant empirical performance improvement over earlier optimism-pessimism approaches.
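A count-based toy can illustrate the optimism-pessimism idea: the same visitation bonus is added to empirical rewards (optimism) and to empirical costs (pessimism), so a planner explores while tightening the constraint for poorly known state-action pairs. The bonus form, the coefficient beta, and the tabular setup are assumptions for illustration, not the DOPE algorithm as analyzed in the paper.

```python
import numpy as np

def visitation_bonus(visit_counts: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Count-based bonus b(s, a) = beta / sqrt(max(1, n(s, a))).

    In an optimism-pessimism scheme the same bonus is added to empirical
    rewards (optimism: encourages exploration) and to empirical costs
    (pessimism: tightens the safety constraint), so rarely visited
    state-action pairs look both promising and risky until tried safely.
    """
    return beta / np.sqrt(np.maximum(1, visit_counts))

# Toy tabular example: 3 states x 2 actions.
counts = np.array([[1, 10], [4, 1], [25, 100]])
r_hat = np.random.rand(3, 2)              # empirical mean rewards
c_hat = np.random.rand(3, 2) * 0.1        # empirical mean costs
b = visitation_bonus(counts)
r_optimistic = r_hat + b                  # fed to the planner's reward objective
c_pessimistic = np.clip(c_hat + b, 0, 1)  # fed to the planner's cost constraint
```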
  2. Safe reinforcement learning (RL) with assured satisfaction of hard state constraints during training has recently received considerable attention. Safety filters, e.g., those based on control barrier functions (CBFs), provide a promising route to safe RL by modifying the unsafe actions of an RL agent on the fly. Existing safety filter-based approaches typically involve learning the uncertain dynamics and quantifying the learned model error, which leads to conservative filters until a large amount of data has been collected to learn a good model, thereby preventing efficient exploration. This paper presents a method for safe and efficient RL using disturbance observers (DOBs) and control barrier functions (CBFs). Unlike most existing safe RL methods that deal with hard state constraints, our method does not involve model learning; instead, it leverages DOBs to accurately estimate the pointwise value of the uncertainty, which is then incorporated into a robust CBF condition to generate safe actions. The DOB-based CBF can be used as a safety filter with model-free RL algorithms, minimally modifying the actions of an RL agent whenever necessary to ensure safety throughout the learning process. Simulation results on a unicycle and a 2D quadrotor demonstrate that the proposed method outperforms a state-of-the-art safe RL algorithm that uses CBFs and Gaussian process-based model learning, in terms of safety violation rate as well as sample and computational efficiency.
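The safety-filter idea, minimally modifying the agent's action so that a robust CBF condition incorporating a disturbance estimate still holds, can be sketched as a projection onto a half-space. The single affine constraint, the linear class-K term alpha*h, and the disturbance estimate d_hat below are simplifying assumptions of this sketch; the paper's DOB design and robust CBF condition are more involved.

```python
import numpy as np

def cbf_safety_filter(u_rl, grad_h, f, g, h, d_hat, alpha=1.0):
    """Minimally modify the RL action so a robust CBF condition holds.

    For control-affine dynamics x_dot = f(x) + g(x) u + d with barrier h(x) >= 0,
    we require  grad_h . (f + g u + d_hat) + alpha * h >= 0,
    which is a single affine constraint  a . u + b >= 0  in u.
    The minimum-norm correction of u_rl has the closed form below.
    """
    a = grad_h @ g                        # coefficient of u in the constraint
    b = grad_h @ (f + d_hat) + alpha * h  # constant term (uses disturbance estimate)
    violation = -(a @ u_rl + b)           # > 0 means u_rl would violate the condition
    if violation <= 0 or np.allclose(a, 0):
        return u_rl                       # already safe; leave the action untouched
    return u_rl + (violation / (a @ a)) * a  # project onto the safe half-space

# Toy 2D example with made-up dynamics and disturbance estimate.
u_safe = cbf_safety_filter(
    u_rl=np.array([1.0, -2.0]),
    grad_h=np.array([0.0, 1.0]),
    f=np.array([0.0, -0.5]),
    g=np.eye(2),
    h=0.2,
    d_hat=np.array([0.0, -0.1]),
)
```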
  3. Cyber-physical systems (CPS) are required to satisfy safety constraints in various application domains such as robotics, industrial manufacturing systems, and power systems. Faults and cyber attacks have been shown to cause safety violations, which can damage the system and endanger human lives. Resilient architectures have been proposed to ensure the safety of CPS under such faults and attacks via methodologies including redundancy and restarting from safe operating conditions. The existing resilient architectures for CPS use different mechanisms to guarantee safety, and currently there is no common framework to compare them. Moreover, the analysis and design undertaken for CPS employing one architecture are not readily extendable to another. In this article, we propose a timing-based framework for CPS employing various resilient architectures and develop a common methodology for safety analysis and for the computation of control policies and design parameters. Using the insight that the cyber subsystem operates in one out of a finite number of statuses, we first develop a hybrid system model that captures CPS adopting any of these architectures. Based on the hybrid system, we formulate the problem of jointly computing control policies and the associated timing parameters so that the CPS satisfies a given safety constraint, and we derive sufficient conditions for the solution. Utilizing the derived conditions, we provide an algorithm to compute control policies and the timing parameters relevant to the employed architecture. We also note that our solution applies to a wide class of CPS with polynomial dynamics and allows the incorporation of new architectures. We verify the proposed framework through a case study on adaptive cruise control of vehicles.
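The timing-based viewpoint can be illustrated with a deliberately simple toy: a scalar state that contracts while the cyber subsystem is in a nominal status and may grow while it is compromised or restarting, so safety reduces to bounding off-nominal dwell times. The dynamics, the status set, and the dwell-time bound below are illustrative assumptions, not the article's framework for polynomial dynamics.

```python
import numpy as np

# Toy timing analysis: x_dot = -x in NOMINAL status, x_dot = +x otherwise
# (worst case while compromised or restarting). Safety constraint: x <= X_MAX.
X_MAX = 10.0

def max_offnominal_dwell(x_now: float) -> float:
    """Longest time the cyber subsystem may stay off-nominal before the state
    can exceed X_MAX, since x grows at most like x_now * exp(t)."""
    return float(np.log(X_MAX / x_now)) if x_now > 0 else np.inf

def simulate(statuses, dwell_times, x0=1.0):
    """Propagate the toy hybrid system through a sequence of
    (status, dwell time) pairs and report whether safety held throughout."""
    x = x0
    for status, tau in zip(statuses, dwell_times):
        rate = -1.0 if status == "NOMINAL" else +1.0
        x *= np.exp(rate * tau)
        if x > X_MAX:
            return x, False
    return x, True

# A 1.5 s restart is tolerable here only because the preceding nominal interval
# contracted the state enough: ln(X_MAX / x) budgets the off-nominal dwell time.
x_final, safe = simulate(["NOMINAL", "RESTARTING", "NOMINAL"], [2.0, 1.5, 1.0])
```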

     
  4. Safe reinforcement learning (RL) has recently been employed to train a control policy that maximizes the task reward while satisfying safety constraints in a simulated secure cyber-physical environment. However, the vulnerability of safe RL has barely been studied in an adversarial setting. We argue that understanding the safety vulnerability of learned control policies is essential to achieving true safety in the physical world. To fill this research gap, we first formally define the adversarial safe RL problem and show that the optimal policies are vulnerable under observation perturbations. Then, we propose novel safety violation attacks that induce unsafe behaviors using adversarial models trained with reversed safety constraints. Finally, we show both theoretically and experimentally that our method is more effective at inducing safety violations than existing adversarial RL works, which merely seek to decrease the task reward rather than to violate safety constraints.
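One way to picture an attack driven by a reversed safety objective is a single signed-gradient step on the observation that increases a learned cost critic rather than decreasing the reward. The cost critic, the perturbation budget epsilon, and the FGSM-style update below are assumptions made for illustration; they are not the attack models proposed in the paper.

```python
import torch

def safety_violation_perturbation(obs, policy, cost_critic, epsilon=0.05):
    """FGSM-style observation attack: one signed-gradient step that increases
    the predicted safety cost of the action the agent would take, instead of
    decreasing its reward (the usual adversarial-RL objective)."""
    obs_adv = obs.clone().detach().requires_grad_(True)
    action = policy(obs_adv)                       # agent acts on the perturbed obs
    predicted_cost = cost_critic(obs_adv, action).sum()
    predicted_cost.backward()                      # gradient of cost w.r.t. observation
    with torch.no_grad():
        obs_adv = obs_adv + epsilon * obs_adv.grad.sign()
    return obs_adv.detach()

# Toy stand-ins for a trained policy and cost critic (placeholders, not learned).
policy = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.Tanh())
cost_critic = lambda o, a: torch.nn.functional.softplus(o.sum(-1) + a.sum(-1))
obs = torch.zeros(1, 4)
obs_attacked = safety_violation_perturbation(obs, policy, cost_critic)
```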