

Title: Robust Deep Reinforcement Learning through Adversarial Loss
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent’s inputs, which raises concerns about deploying such agents in the real world. To address this issue, we propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against lp-norm bounded adversarial attacks. Our framework is compatible with popular deep reinforcement learning algorithms and we demonstrate its performance with deep Q-learning, A3C and PPO. We experiment on three deep RL benchmarks (Atari, MuJoCo and ProcGen) to show the effectiveness of our robust training algorithm. Our RADIAL-RL agents consistently outperform prior methods when tested against attacks of varying strength and are more computationally efficient to train. In addition, we propose a new evaluation method called Greedy Worst-Case Reward (GWC) to measure attack agnostic robustness of deep RL agents. We show that GWC can be evaluated efficiently and is a good estimate of the reward under the worst possible sequence of adversarial attacks. All code used for our experiments is available at https://github.com/tuomaso/radial_rl_v2.
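
To make the GWC idea concrete, below is a minimal Python sketch of greedy worst-case evaluation. It assumes a bounds oracle q_bounds(obs, eps) returning per-action lower and upper bounds on Q-values under any l-infinity perturbation of size eps (e.g., from interval bound propagation over the Q-network), and the classic Gym step API; it is an illustration, not the released code.

    import numpy as np

    def gwc_episode(env, q_bounds, eps, max_steps=10_000):
        """Greedily follow the worst action an eps-bounded adversary could induce."""
        obs, total_reward = env.reset(), 0.0
        for _ in range(max_steps):
            lower, upper = q_bounds(obs, eps)  # per-action bounds under the eps-ball
            # Actions the attacker could make look optimal: their upper bound
            # reaches past the best guaranteed (lower) bound of any action.
            feasible = np.flatnonzero(upper >= lower.max())
            # Greedy worst case: among candidates, take the lowest-value action.
            action = int(feasible[np.argmin(lower[feasible])])
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        return total_reward

Greedily following the worst inducible action at each step approximates the reward under the worst possible attack sequence without enumerating the exponentially many sequences.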
Award ID(s):
2107189
NSF-PAR ID:
10336941
Author(s) / Creator(s):
Date Published:
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Robustness of Deep Reinforcement Learning (DRL) algorithms to adversarial attacks in real-world applications, such as those deployed in cyber-physical systems (CPS), is of increasing concern. Numerous studies have investigated the mechanisms of attacks on the RL agent's state space. Nonetheless, attacks on the RL agent's action space (corresponding to actuators in engineering systems) are equally perverse, but such attacks are relatively less studied in the ML literature. In this work, we first frame the problem as an optimization problem of minimizing the cumulative reward of an RL agent with decoupled constraints as the attack budget. We propose the white-box Myopic Action Space (MAS) attack algorithm that distributes the attacks across the action space dimensions. Next, we reformulate the optimization problem above with the same objective function, but with a temporally coupled constraint on the attack budget to take into account the approximated dynamics of the agent. This leads to the white-box Look-ahead Action Space (LAS) attack algorithm that distributes the attacks across the action and temporal dimensions. Our results show that, using the same amount of resources, the LAS attack deteriorates the agent's performance significantly more than the MAS attack. This reveals the possibility that, with limited resources, an adversary can exploit the agent's dynamics to malevolently craft attacks that cause the agent to fail. Additionally, we leverage these attack strategies as a tool to gain insights into the potential vulnerabilities of DRL agents.
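    To illustrate the myopic variant, here is a hedged Python sketch of a single-step action-space perturbation in the spirit of MAS: gradient descent on a differentiable critic, projected onto an l2 attack budget. The critic q_net, the budget, and the optimizer settings are placeholders, not the authors' implementation.

        import torch

        def mas_perturbation(q_net, state, action, budget, steps=10, lr=0.1):
            """Find delta minimizing Q(s, a + delta) with ||delta||_2 <= budget."""
            delta = torch.zeros_like(action, requires_grad=True)
            opt = torch.optim.Adam([delta], lr=lr)
            for _ in range(steps):
                opt.zero_grad()
                loss = q_net(state, action + delta).sum()  # push expected return down
                loss.backward()
                opt.step()
                with torch.no_grad():                      # project onto the budget
                    norm = delta.norm()
                    if norm > budget:
                        delta.mul_(budget / norm)
            return delta.detach()

    The LAS variant would instead couple these per-step budgets across a look-ahead horizon, which is what lets it exploit the agent's dynamics.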
  2. Safe reinforcement learning (RL) has recently been employed to train a control policy that maximizes the task reward while satisfying safety constraints in a simulated secure cyber-physical environment. However, the vulnerability of safe RL has barely been studied in an adversarial setting. We argue that understanding the safety vulnerability of learned control policies is essential to achieving true safety in the physical world. To fill this research gap, we first formally define the adversarial safe RL problem and show that the optimal policies are vulnerable under observation perturbations. Then, we propose novel safety violation attacks that induce unsafe behaviors using adversarial models trained with reversed safety constraints. Finally, both theoretically and experimentally, we show that our method is more effective at violating safety than existing adversarial RL works, which merely seek to decrease the task reward rather than violate safety constraints.
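    A rough sketch of such a safety-violation attack, phrased as projected gradient ascent on a cost critic over an l-infinity observation ball; policy and cost_critic are hypothetical differentiable modules standing in for the paper's adversarial models.

        import torch

        def safety_attack(policy, cost_critic, obs, eps, steps=10):
            """l-infinity-bounded observation perturbation maximizing predicted cost."""
            adv = obs.clone().detach().requires_grad_(True)
            step_size = 2.0 * eps / steps
            for _ in range(steps):
                cost = cost_critic(adv, policy(adv)).sum()  # predicted safety violation
                grad, = torch.autograd.grad(cost, adv)
                with torch.no_grad():
                    adv += step_size * grad.sign()                 # ascent step
                    adv.copy_(obs + (adv - obs).clamp(-eps, eps))  # project to eps-ball
            return adv.detach()

    The key difference from reward-minimizing attacks is the objective: ascent on predicted constraint violation rather than descent on task reward.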
  3. We develop a framework to learn bio-inspired foraging policies using human data. We conduct an experiment where humans are virtually immersed in an open-field foraging environment and are trained to collect the highest amount of rewards. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood estimation is used to train Neural Networks (NN) that map observed states to human decisions. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL) using the on-policy Proximal Policy Optimization (PPO) algorithm, which shows better stability than other algorithms and can steadily improve the policies pre-trained with IL. We show that the combination of IL and RL matches human performance and that the artificial agents trained with our approach can quickly adapt to reward distribution shift. We finally show that good performance and robustness to reward distribution shift strongly depend on combining allocentric information with an egocentric representation of the environment.
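    The imitation stage can be pictured as a short maximum-likelihood behavior-cloning loop; the network sizes and the stand-in data below are assumptions for illustration only.

        import torch
        import torch.nn as nn

        OBS_DIM, N_ACTIONS = 16, 4            # placeholder sizes, not the paper's
        policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(),
                               nn.Linear(64, N_ACTIONS))   # logits over actions
        opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
        nll = nn.CrossEntropyLoss()           # equivalent to -log pi(a|s)

        # Stand-in for recorded human trajectories (random here, just to run):
        human_dataloader = [(torch.randn(32, OBS_DIM),
                             torch.randint(0, N_ACTIONS, (32,)))]

        for states, human_actions in human_dataloader:
            opt.zero_grad()
            loss = nll(policy(states), human_actions)  # maximum-likelihood objective
            loss.backward()
            opt.step()

    The cloned policy would then initialize PPO for on-policy refinement against the foraging reward.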
  4. Due to repetitive trial-and-error interactions between agents and a fixed traffic environment during policy learning, existing Reinforcement Learning (RL)-based Traffic Signal Control (TSC) methods greatly suffer from long RL training times and poor adaptability of RL agents to other, more complex traffic environments. To address these problems, we propose a novel Adversarial Inverse Reinforcement Learning (AIRL)-based pre-training method named InitLight, which enables effective initial model generation for TSC agents. Unlike traditional RL-based TSC approaches that train a large number of agents simultaneously for a specific multi-intersection environment, InitLight pretrains only one single initial model based on multiple single-intersection environments together with their expert trajectories. Since the reward function learned by InitLight can recover ground-truth TSC rewards for different intersections at optimality, the pre-trained agent can be deployed at intersections of any traffic environment as an initial model to accelerate subsequent overall global RL training. Comprehensive experimental results show that the initial model generated by InitLight not only significantly accelerates convergence with far fewer episodes, but also exhibits superior generalization ability to accommodate various kinds of complex traffic environments.
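    The AIRL machinery underlying such pretraining can be sketched through its discriminator, D(s, a) = exp(f(s, a)) / (exp(f(s, a)) + pi(a|s)), whose logit is simply f - log pi; f_net and log_pi below are placeholders, and this is not the InitLight code.

        import torch
        import torch.nn.functional as F

        def airl_logit(f_net, log_pi, state, action):
            """Logit of D(s, a); sigmoid of this classifies expert vs. policy."""
            return f_net(state, action) - log_pi(state, action)

        def airl_discriminator_loss(f_net, log_pi, expert_batch, policy_batch):
            exp_logit = airl_logit(f_net, log_pi, *expert_batch)
            pol_logit = airl_logit(f_net, log_pi, *policy_batch)
            # Expert transitions are labeled 1, policy transitions 0.
            return (F.binary_cross_entropy_with_logits(exp_logit, torch.ones_like(exp_logit))
                    + F.binary_cross_entropy_with_logits(pol_logit, torch.zeros_like(pol_logit)))

    At optimality f recovers the reward up to shaping, which is what lets one pretrained model transfer across intersections.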
  5. In this work, we propose an energy-adaptive monitoring system for a solar sensor-based smart animal farm (e.g., cattle). The proposed smart farm system aims to maintain high-quality monitoring services by solar sensors with limited and fluctuating energy against a full set of cyberattack behaviors, including false data injection, message dropping, and protocol non-compliance. We leverage Subjective Logic (SL) as the belief model to consider different types of uncertainty in opinions about sensed data. We develop two Deep Reinforcement Learning (DRL) schemes leveraging the design concept of uncertainty maximization in SL for DRL agents running on gateways to collect high-quality sensed data with low uncertainty and high freshness. We assess the performance of the proposed energy-adaptive smart farm system in terms of accumulated reward, monitoring error, system overload, and battery maintenance level. We compare the performance of the two DRL schemes developed (i.e., multi-agent deep Q-learning, MADQN, and multi-agent proximal policy optimization, MAPPO) with greedy and random baseline schemes in choosing the set of sensed data to be updated, aiming to collect high-quality sensed data and achieve resilience against attacks. Our experiments demonstrate that MAPPO with the uncertainty maximization technique outperforms its counterparts.
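    The Subjective Logic opinion behind the uncertainty-aware selection reduces to a small formula: with r positive and s negative pieces of evidence and prior weight W = 2, belief, disbelief, and uncertainty are r, s, and W each divided by r + s + W, summing to one. A tiny sketch of the standard binomial opinion (variable names are ours):

        def sl_opinion(r, s, W=2.0):
            """Binomial Subjective Logic opinion from positive/negative evidence."""
            total = r + s + W
            return r / total, s / total, W / total  # belief, disbelief, uncertainty

        # e.g., a sensor with 8 consistent and 2 inconsistent readings:
        b, d, u = sl_opinion(8, 2)   # about (0.67, 0.17, 0.17)

    An agent driven by uncertainty maximization would prioritize updating sensors whose opinions carry high u and stale readings.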