For Reinforcement Learning (RL) to be useful in real systems, it is crucial that RL agents be robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which includes (i) state attacks, (ii) observation attacks (a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show that the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP, since it is not the true environment but a higher-level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time, or by learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, so such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.
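As a concrete illustration of the planning claim, the sketch below runs tabular value iteration on a tiny, hypothetical attacker MDP. The transition tensor `P` and reward matrix `R` are invented toy values, not the paper's meta-MDP construction; the point is only that once the attack problem is cast as an MDP, standard polynomial-time planning recovers an optimal (attack) policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Plan in a tabular MDP.

    P: (S, A, S) transition tensor, R: (S, A) reward matrix.
    Returns the optimal values and a greedy (deterministic) policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * P @ V            # (S, A) action values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Hypothetical 2-state, 2-action toy MDP standing in for a meta-MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, pi = value_iteration(P, R)
```

Here `pi[s]` is the action an optimal attacker would play in meta-state `s`; the same routine applies to any finite MDP the attacked interaction induces.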
NSF-PAR ID: 10387287
Publisher / Repository: AAAI Press
Date Published:
Journal Name: Proceedings of the 36th AAAI Conference on Artificial Intelligence
Volume: 36
Issue: 5
ISSN: 2159-5399
Page Range / eLocation ID: 5260-5267
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


Pedagogical planners can provide adaptive support to students in narrative-centered learning environments by dynamically scaffolding student learning and tailoring problem scenarios. Reinforcement learning (RL) is frequently used for pedagogical planning in narrative-centered learning environments. However, RL-based pedagogical planning raises significant challenges due to the scarcity of data for training RL policies. Most prior work has relied on limited-size datasets and offline RL techniques for policy learning. Unfortunately, offline RL techniques do not support on-demand exploration and evaluation, which can adversely impact the quality of induced policies. To address the limitations of data scarcity and offline RL, we propose INSIGHT, an online RL framework for training data-driven pedagogical policies that optimize student learning in narrative-centered learning environments. The INSIGHT framework consists of three components: a narrative-centered learning environment simulator, a simulated student agent, and an RL-based pedagogical planner agent, which uses a reward metric that is associated with effective student learning processes. The framework enables the generation of synthetic data for on-demand exploration and evaluation of RL-based pedagogical planning. We have implemented INSIGHT with OpenAI Gym for a narrative-centered learning environment testbed with rule-based simulated student agents and a deep Q-learning-based pedagogical planner. Our results show that online deep RL algorithms can induce near-optimal pedagogical policies in the INSIGHT framework, while offline deep RL algorithms only find suboptimal policies even with large amounts of data.
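A minimal sketch of the online loop the abstract describes: a simulated environment generates synthetic episodes on demand, and a Q-learning agent improves its policy from them. The `ToyStudentEnv` below is a hypothetical stand-in for the INSIGHT testbed (its dynamics and reward are invented), and tabular Q-learning stands in for the deep Q-learning planner.

```python
import random

class ToyStudentEnv:
    """Gym-style interface: reset() -> state, step(a) -> (state, reward, done)."""
    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        # Hypothetical dynamics: action 1 ("scaffold") yields learning gains.
        reward = 1.0 if action == 1 else 0.0
        return self.t, reward, self.t >= 5

def q_learning(env, n_states=6, n_actions=2, episodes=500,
               alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy exploration over the simulated episodes.
            a = rng.randrange(n_actions) if rng.random() < eps else max(
                range(n_actions), key=lambda a: Q[s][a])
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

Q = q_learning(ToyStudentEnv())
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(5)]
```

Because the simulator can generate as many episodes as needed, the agent gets the on-demand exploration that a fixed offline dataset cannot provide.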

This paper presents a framework for learning the reward function underlying high-level sequential tasks from demonstrations. The purpose of reward learning, in the context of learning from demonstration (LfD), is to generate policies that mimic the demonstrator's policies, thereby enabling imitation learning. We focus on a human-robot interaction (HRI) domain where the goal is to learn and model structured interactions between a human and a robot. Such interactions can be modeled as a partially observable Markov decision process (POMDP), where the partial observability is caused by uncertainties associated with the ways humans respond to different stimuli. The key challenge in finding a good policy in such a POMDP is determining the reward function that was observed by the demonstrator. Existing inverse reinforcement learning (IRL) methods for POMDPs are computationally very expensive, and the problem is not well understood. In comparison, IRL algorithms for Markov decision processes (MDPs) are well defined and computationally efficient. We propose an approach to reward function learning for high-level sequential tasks from human demonstrations, where the core idea is to reduce the underlying POMDP to an MDP and apply any efficient MDP-IRL algorithm. Our extensive experiments suggest that the reward function learned this way generates POMDP policies that mimic the policies of the demonstrator well.
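One simple way to realize the reduction idea, sketched here with invented numbers: relabel each demonstrated observation with its maximum a posteriori (MAP) underlying state, turning POMDP demonstrations into MDP demonstrations that a standard MDP-IRL algorithm can consume. The `obs_model` and `prior` below are hypothetical, and the paper's actual reduction may differ.

```python
import numpy as np

def reduce_demo(observations, obs_model, prior):
    """Relabel observations with MAP states.

    obs_model[s][o] = P(o | s); prior[s] = P(s).
    Returns the most likely underlying state for each observation.
    """
    posterior = prior[:, None] * obs_model        # P(s) P(o|s), shape (S, O)
    return [int(np.argmax(posterior[:, o])) for o in observations]

# Hypothetical 2-state, 2-observation noise model.
obs_model = np.array([[0.8, 0.2],
                      [0.3, 0.7]])
prior = np.array([0.5, 0.5])
states = reduce_demo([0, 1, 1, 0], obs_model, prior)
```

After relabeling, the (state, action) trajectories look like ordinary MDP demonstrations, so any efficient MDP-IRL solver can be applied to them unchanged.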

Dasgupta, Sanjoy; Mandt, Stephan; Li, Yingzhen (Eds.)
Average-reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e., the long-run average reward) of continuing tasks, where there is often no natural way to identify a discount factor. However, existing average-reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $Q$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $Q$-learning, while we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $\epsilon$-close to optimal, (ii) estimate the optimal average reward with $\epsilon$-accuracy, and (iii) estimate the bias function (similar to the $Q$-function in the discounted case) with $\epsilon$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$ samples, (ii) with $\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$ samples, and (iii) with $\tilde{O}(\frac{SA B}{\epsilon^9})$ samples, where $t_\mathrm{mix}$ is the mixing time and $B > 0$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $S$, $A$, $t_{\mathrm{mix}}$, and $\epsilon$ for a feasible variant of $Q$-learning. That said, the sample complexity bounds have substantial room for improvement, which we leave for future work. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input.
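A toy sketch of the adaptive-discount idea (the schedule, inner loop, and MDP below are invented, not the paper's): run discounted value updates in phases while pushing the discount factor gamma toward 1, so that the scaled values (1 - gamma) * V increasingly approximate the optimal long-run average reward.

```python
import numpy as np

def adaptive_discount_values(P, R, phases=6, sweeps=2000):
    """Approximate the optimal average reward via an increasing discount.

    P: (S, A, S) transitions, R: (S, A) rewards. Each phase raises gamma
    toward 1 and runs discounted Bellman sweeps from the previous values.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for k in range(phases):
        gamma = 1.0 - 0.5 ** (k + 1)          # hypothetical schedule -> 1
        for _ in range(sweeps):                # discounted planning sweeps
            V = (R + gamma * P @ V).max(axis=1)
        rho = (1.0 - gamma) * V                # estimate of average reward
    return rho

# Hypothetical 2-state, 2-action MDP: action 1 reaches and holds the state
# paying reward 1, so the optimal average reward is 1 from every state.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.5],
              [0.0, 1.0]])
rho = adaptive_discount_values(P, R)
```

This mirrors the abstract's key mechanism in miniature: no mixing time is supplied, and the growing effective horizon does the work of approximating the average-reward objective.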
