 Award ID(s):
 1836948
 NSFPAR ID:
 10331367
 Date Published:
 Journal Name:
 Neural Information Processing Systems
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

In multiagent reinforcement learning (MARL), it is challenging for a collection of agents to learn complex temporally extended tasks. The difficulties lie in computational complexity and how to learn the highlevel ideas behind reward functions. We study the graphbased Markov Decision Process (MDP), where the dynamics of neighboring agents are coupled. To learn complex temporally extended tasks, we use a reward machine (RM) to encode each agent’s task and expose reward function internal structures. RM has the capacity to describe highlevel knowledge and encode nonMarkovian reward functions. We propose a decentralized learning algorithm to tackle computational complexity, called decentralized graphbased reinforcement learning using reward machines (DGRM), that equips each agent with a localized policy, allowing agents to make decisions independently based on the information available to the agents. DGRM uses the actorcritic structure, and we introduce the tabular Qfunction for discrete state problems. We show that the dependency of the Qfunction on other agents decreases exponentially as the distance between them increases. To further improve efficiency, we also propose the deep DGRM algorithm, using deep neural networks to approximate the Qfunction and policy function to solve largescale or continuous state problems. The effectiveness of the proposed DGRM algorithm is evaluated by three case studies, two wireless communication case studies with independent and dependent reward functions, respectively, and COVID19 pandemic mitigation. Experimental results show that local information is sufficient for DGRM and agents can accomplish complex tasks with the help of RM. DGRM improves the global accumulated reward by 119% compared to the baseline in the case of COVID19 pandemic mitigation.more » « less

We consider a dynamic Bayesian persuasion setting where a single longlived sender persuades a stream of ``shortlived'' agents (receivers) by sharing information about a payoffrelevant state. The state transitions are Markovian and the sender seeks to maximize the longrun average reward by committing to a (possibly historydependent) signaling mechanism. While most previous studies of Markov persuasion consider exogenous agent beliefs that are independent of the chain, we study a more natural variant with endogenous agent beliefs that depend on the chain's realized history. A key challenge to analyze such settings is to model the agents' partial knowledge about the history information. We analyze a Markov persuasion process (MPP) under various information models that differ in the amount of information the receivers have about the history of the process. Specifically, we formulate a general partialinformation model where each receiver observes the history with an l period lag. Our technical contribution start with analyzing two benchmark models, i.e., the fullhistory information model and the nohistory information model. We establish an ordering of the sender's payoff as a function of the informativeness of agent's information model (with nohistory as the least informative), and develop efficient algorithms to compute optimal solutions for these two benchmarks. For general l, we present the technical challenges in finding an optimal signaling mechanism, where even determining the right dependency on the history becomes difficult. To bypass the difficulties, we use a robustness framework to design a "simple" \emph{historyindependent} signaling mechanism that approximately achieves optimal payoff when l is reasonably large.more » « less

Designing reward functions is a difficult task in AI and robotics. The complex task of directly specifying all the desirable behaviors a robot needs to optimize often proves challenging for humans. A popular solution is to learn reward functions using expert demonstrations. This approach, however, is fraught with many challenges. Some methods require heavily structured models, for example, reward functions that are linear in some predefined set of features, while others adopt less structured reward functions that may necessitate tremendous amounts of data. Moreover, it is difficult for humans to provide demonstrations on robots with high degrees of freedom, or even quantifying reward values for given trajectories. To address these challenges, we present a preferencebased learning approach, where human feedback is in the form of comparisons between trajectories. We do not assume highly constrained structures on the reward function. Instead, we employ a Gaussian process to model the reward function and propose a mathematical formulation to actively fit the model using only human preferences. Our approach enables us to tackle both inflexibility and datainefficiency problems within a preferencebased learning framework. We further analyze our algorithm in comparison to several baselines on reward optimization, where the goal is to find the optimal robot trajectory in a dataefficient way instead of learning the reward function for every possible trajectory. Our results in three different simulation experiments and a user study show our approach can efficiently learn expressive reward functions for robotic tasks, and outperform the baselines in both reward learning and reward optimization.

This paper presents a framework to learn the reward function underlying highlevel sequential tasks from demonstrations. The purpose of reward learning, in the context of learning from demonstration (LfD), is to generate policies that mimic the demonstrator’s policies, thereby enabling imitation learning. We focus on a humanrobot interaction(HRI) domain where the goal is to learn and model structured interactions between a human and a robot. Such interactions can be modeled as a partially observable Markov decision process (POMDP) where the partial observability is caused by uncertainties associated with the ways humans respond to different stimuli. The key challenge in finding a good policy in such a POMDP is determining the reward function that was observed by the demonstrator. Existing inverse reinforcement learning(IRL) methods for POMDPs are computationally very expensive and the problem is not well understood. In comparison, IRL algorithms for Markov decision process (MDP) are well defined and computationally efficient. We propose an approach of reward function learning for highlevel sequential tasks from human demonstrations where the core idea is to reduce the underlying POMDP to an MDP and apply any efficient MDPIRL algorithm. Our extensive experiments suggest that the reward function learned this way generates POMDP policies that mimic the policies of the demonstrator well.more » « less

We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcementlearning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions’ truth values. In order to address such uncertainties, the algorithm maintains a probabilistic estimate about the truth value of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions’ truth value. Finally, the algorithm uses a Qlearning procedure for the states of the hypothesis reward machine to determine an optimal policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task.more » « less