We study the problem of inverse reinforcement learning (IRL), where the learning agent recovers a reward function using expert demonstrations. Most existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs). The algorithm addresses several limitations of existing techniques that do not take the information asymmetry between the expert and the learner into account. First, it adopts causal entropy as the measure of the likelihood of the expert demonstrations, as opposed to the entropy used in most existing IRL techniques, and avoids a common source of algorithmic complexity. Second, it incorporates task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori in addition to the demonstrations and may reduce the information asymmetry. Nevertheless, the resulting formulation is still nonconvex due to the intrinsic nonconvexity of the so-called forward problem, i.e., computing an optimal policy given a reward function, in POMDPs. We address this nonconvexity through sequential convex programming and introduce several extensions to solve the forward problem in a scalable manner. This scalability allows computing policies that incorporate memory, at the expense of added computational cost, and that outperform memoryless policies. We demonstrate that, even with severely limited data, the algorithm learns reward functions and policies that satisfy the task and induce behavior similar to the expert's by leveraging the side information and incorporating memory into the policy.
This content will become publicly available on May 30, 2024
Reinforcement Learning with Depreciating Assets
A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate an equational characterization of optimality in this context, establish that optimal values and policies can be computed efficiently, and develop a model-free reinforcement learning approach to obtain optimal policies.
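The depreciation rule described in the abstract can be illustrated with a minimal sketch. It only models how held assets lose value over time; the function names and the choice to report the total held value after the last payment are assumptions made for illustration, not the paper's objective or algorithm.

```python
def depreciated_holdings(rewards, gamma):
    """Total value of all assets held right after the last payment.

    Hypothetical helper: rewards[t] is the reward received at step t, and
    every asset already held is scaled by gamma once per later time step.
    """
    held = 0.0
    for r in rewards:
        # Assets already in possession depreciate by one step,
        # then the newly received reward is added at full value.
        held = gamma * held + r
    return held


def depreciated_holdings_closed_form(rewards, gamma):
    """Same quantity as an explicit sum: sum_t gamma**(T-1-t) * rewards[t]."""
    T = len(rewards)
    return sum(gamma ** (T - 1 - t) * r for t, r in enumerate(rewards))
```

The recursive and closed-form versions agree because each reward is multiplied by the discount factor once for every step that has passed since it was received.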
 Award ID(s):
 2146563
 NSFPAR ID:
 10419683
 Date Published:
 Journal Name:
 Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (AAMAS '23)
 Page Range / eLocation ID:
 2628–2630
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Due to information asymmetry, finding optimal policies for Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) is hard, with the complexity growing doubly exponentially in the horizon length. The challenge increases greatly in the multi-agent reinforcement learning (MARL) setting where the transition probabilities, observation kernel, and reward function are unknown. Here, we develop a general compression framework with approximate common and private state representations, based on which decentralized policies can be constructed. We derive the optimality gap of executing dynamic programming (DP) with the approximate states in terms of the approximation error parameters and the remaining time steps. When the compression is exact (no error), the resulting DP is equivalent to the one in existing work. Our general framework generalizes a number of methods proposed in the literature. The results shed light on designing practically useful deep-MARL network structures under the "centralized learning distributed execution" scheme.

Meila, Marina; Zhang, Tong (Ed.)
Transfer in reinforcement learning is based on the idea that it is possible to use what is learned in one task to improve the learning process in another task. For transfer between tasks which share transition dynamics but differ in reward function, successor features have been shown to be a useful representation which allows for efficient computation of action-value functions for previously learned policies in new tasks. These functions induce policies in the new tasks, so an agent may not need to learn a new policy for each new task it encounters, especially if it is allowed some amount of suboptimality in those tasks. We present new bounds for the performance of optimal policies in a new task, as well as an approach to use these bounds to decide, when presented with a new task, whether to use cached policies or learn a new policy.
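How successor features yield action-value functions for cached policies in a new task can be sketched as follows. The sketch assumes the standard setup from the successor-features literature, in which rewards are linear in some features with a task-specific weight vector `w`. The array layout, the function names, and the greedy selection over cached policies (generalized policy improvement) are illustrative assumptions, not the bounds or the decision procedure this abstract proposes.

```python
import numpy as np


def q_from_successor_features(psi, w):
    """Action values of cached policies in a new task.

    psi: array of shape (n_policies, n_states, n_actions, d) holding the
         successor features of each previously learned policy (assumed layout).
    w:   array of shape (d,), the reward weights of the new task (assumed
         linear reward model).
    Returns an array of shape (n_policies, n_states, n_actions).
    """
    return psi @ w


def gpi_action(psi, w, state):
    """Pick the action with the highest value under any cached policy."""
    q = q_from_successor_features(psi, w)
    best_over_policies = q[:, state, :].max(axis=0)  # shape (n_actions,)
    return int(np.argmax(best_over_policies))
```

Because the dynamics are shared across tasks, only `w` changes for a new task, so evaluating every cached policy costs one matrix-vector product rather than a fresh round of learning.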

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works derive the information-theoretical lower bounds based on different data-coverage assumptions, and their upper bounds are expressed by the covering coefficients, which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches $O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right)$. We also prove an information-theoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.

Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the \emph{number of episodes} it takes to provably discover a policy whose value is eps-near to the optimal value, where the value is measured by the \emph{normalized} cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon, a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only \emph{logarithmically} with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long-horizon RL is no more difficult than short-horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an eps-net for near-optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class and enjoys a sample complexity that scales logarithmically with the cardinality of the given policy class. Both may be of independent interest.