Meta reinforcement learning (Meta-RL) is an approach wherein the experience gained from solving a variety of tasks is distilled into a meta-policy. The meta-policy, when adapted over only a small number of steps (or even a single step), is able to perform near-optimally on a new, related task. However, a major challenge in applying this approach to real-world problems is that such problems are often associated with sparse reward functions that only indicate whether a task is completed partially or fully. We consider the situation where some data, possibly generated by a sub-optimal agent, is available for each task. We then develop a class of algorithms, entitled Enhanced Meta-RL using Demonstrations (EMRLD), that exploit this information, even if sub-optimal, to obtain guidance during training. We show how EMRLD jointly utilizes RL and supervised learning over the offline data to generate a meta-policy that demonstrates monotone performance improvements. We also develop a warm-started variant called EMRLD-WS that is particularly efficient for sub-optimal demonstration data. Finally, we show that our EMRLD algorithms significantly outperform existing approaches in a variety of sparse reward environments, including that of a mobile robot.
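The abstract describes jointly using RL and supervised learning over offline demonstration data. A minimal sketch of how such a joint objective might look on a toy single-state problem, combining a policy-gradient surrogate with a behavior-cloning term (all names, constants, and the `bc_weight` knob are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def joint_loss(theta, demo_actions, on_policy_actions, sparse_return, bc_weight=0.5):
    """Hypothetical joint objective: a policy-gradient surrogate on the
    (sparse) on-policy return plus a supervised behavior-cloning term on
    the offline demonstration actions."""
    probs = softmax(theta)
    # RL surrogate: return-weighted negative log-likelihood of executed actions
    rl_term = -sparse_return * sum(np.log(probs[a]) for a in on_policy_actions)
    # Supervised term: cross-entropy against the demonstration actions
    bc_term = -sum(np.log(probs[a]) for a in demo_actions) / len(demo_actions)
    return rl_term + bc_weight * bc_term

def numeric_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (f(tp) - f(tm)) / (2 * eps)
    return g

# Toy task with 3 actions; demonstrations favor action 2.
theta = np.zeros(3)
demos = [2, 2, 2]
for _ in range(200):
    g = numeric_grad(lambda t: joint_loss(t, demos, [2], 1.0), theta)
    theta -= 0.1 * g

action_probs = softmax(theta)
```

Even when the on-policy reward signal is sparse, the supervised term keeps supplying gradient from the demonstrations, which is the intuition the abstract points to.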
Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration
A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine-grained feedback means that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame, because of the large number of exploration actions the policy has to perform before it gets any useful feedback to learn from. In this work, we address this challenging problem by developing an algorithm that exploits offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step that uses the offline demonstration data. The key idea is that by obtaining guidance from, rather than imitating, the offline data, LOGO orients its policy in the manner of the sub-optimal policy, while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm, including a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored states. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
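The "guidance, not imitation" idea above can be sketched as a policy-improvement surrogate plus a KL penalty toward a behavior policy estimated from demonstrations, with the penalty weight decayed over training. This is a toy illustration under assumed names and constants (`pi_b`, `delta`, the decay rate), not the paper's actual implementation:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Behavior policy estimated from (sub-optimal) demonstrations -- illustrative.
pi_b = np.array([0.1, 0.7, 0.2])

def guided_objective(theta, advantages, delta):
    """Sketch of guided improvement: maximize expected advantage while a
    decaying KL(pi_theta || pi_b) term orients -- but does not bind --
    the policy toward the demonstrations."""
    p = softmax(theta)
    improvement = float(p @ advantages)
    return -improvement + delta * kl(p, pi_b)  # minimize this

def numeric_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (f(tp) - f(tm)) / (2 * eps)
    return g

theta = np.zeros(3)
adv = np.array([0.0, 0.0, 1.0])  # the true optimum is action 2, unlike pi_b
for step in range(300):
    delta = 0.99 ** step  # decaying guidance weight
    g = numeric_grad(lambda t: guided_objective(t, adv, delta), theta)
    theta -= 0.1 * g

final_policy = softmax(theta)
```

Because the guidance weight decays, the policy is initially pulled toward the sub-optimal behavior policy but eventually concentrates on the truly optimal action, matching the "learn beyond and approach optimality" claim.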
 Award ID(s):
 2045783
 NSF-PAR ID:
 10327543
 Date Published:
 Journal Name:
 International Conference on Learning Representations (ICLR)
 Sponsoring Org:
 National Science Foundation
More Like this


Despite the potential of reinforcement learning (RL) for building general-purpose robotic systems, training RL agents to solve robotics tasks still remains challenging due to the difficulty of exploration in purely continuous action spaces. Addressing this problem is an active area of research with the majority of focus on improving RL methods via better optimization or more efficient exploration. An alternate but important component to consider improving is the interface of the RL algorithm with the robot. In this work, we manually specify a library of robot action primitives (RAPS), parameterized with arguments that are learned by an RL policy. These parameterized primitives are expressive, simple to implement, enable efficient exploration and can be transferred across robots, tasks and environments. We perform a thorough empirical study across challenging tasks in three distinct domains with image input and a sparse terminal reward. We find that our simple change to the action interface substantially improves both the learning efficiency and task performance irrespective of the underlying RL algorithm, significantly outperforming prior methods which learn skills from offline expert data. Code and videos at https://mihdalal.github.io/raps/
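The action-interface idea above, a library of primitives whose continuous arguments are chosen by the policy, can be sketched as follows. The primitives (`lift`, `push`) and their behavior are hypothetical stand-ins, not the actual RAPS library:

```python
def lift(height):
    """Hypothetical 'lift' primitive: emit 5 low-level vertical commands
    whose displacements sum to the requested height."""
    return [height / 5.0] * 5

def push(dx, dy):
    """Hypothetical 'push' primitive: emit 5 planar displacement commands."""
    return [(dx / 5.0, dy / 5.0)] * 5

# A library of parameterized primitives in the spirit of RAPS; the RL
# policy would output the index plus the continuous argument vector.
PRIMITIVES = [lift, push]

def execute(primitive_idx, args):
    """Turn one high-level action (primitive index + arguments) into a
    sequence of low-level control steps."""
    return PRIMITIVES[primitive_idx](*args)

steps = execute(0, (1.0,))  # one 'lift by 1.0' action becomes 5 control steps
```

A single policy decision thus spans several environment steps, which is what makes exploration more efficient than acting directly in the continuous low-level action space.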

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data coming from a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds based on different data-coverage assumptions, and their upper bounds are expressed by covering coefficients which lack an explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an information-theoretic lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ whenever $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d_h(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound}, since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single-policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
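The pessimism principle behind APVI-style bounds, penalize value estimates of poorly-covered state-actions by a term shrinking like $1/\sqrt{n}$, can be illustrated in one step. This is a generic sketch of pessimistic estimation, not the APVI algorithm itself; the constant `c` and the data are illustrative:

```python
import numpy as np

def pessimistic_values(counts, reward_sums, c=1.0):
    """One-step pessimistic estimate: empirical mean reward minus an
    uncertainty penalty proportional to 1/sqrt(n), so rarely-visited
    actions are undervalued rather than over-trusted."""
    n = np.maximum(counts, 1)
    mean = reward_sums / n
    penalty = c / np.sqrt(n)
    return mean - penalty

counts = np.array([100.0, 2.0])      # action 0 is well covered, action 1 barely
reward_sums = np.array([60.0, 1.6])  # empirical means: 0.6 and 0.8
values = pessimistic_values(counts, reward_sums)
```

Although action 1 has the higher empirical mean, its large uncertainty penalty makes the pessimistic estimate prefer the well-covered action 0, which is exactly why the bound scales with the inverse coverage $1/d^\mu_h(s_h,a_h)$.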

Reinforcement learning (RL) is designed to learn from experience. It solves sequential decision problems by optimizing reward and punishment through experimentation with distinct actions in an environment. Unlike supervised learning models, RL lacks static input-output mappings and the objective of minimizing a vector error. However, to find an optimal strategy, it is crucial to learn both from continuous feedback on training data and from the offline rules of past experiences, with no explicit dependence on online samples. In this paper, we present a study of a multi-agent RL framework which involves a Critic in semi-offline mode criticizing over an online Actor-Critic network, namely, the Critic-over-Actor-Critic (CoAC) model, in finding an optimal treatment plan for ICU patients as well as an optimal strategy in a combative battle game. For further validation, we also examine the model in the adversarial assignment.

Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher's preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent's past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions.
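Learning a reward model from pairwise clip preferences, as described above, is commonly done with a Bradley-Terry-style objective. A minimal sketch under assumed names and a linear reward model (the features, learning rate, and training data are all illustrative, not from the paper):

```python
import numpy as np

def preference_loss(w, clip_a, clip_b, prefer_a):
    """Bradley-Terry-style preference loss: the learned reward is a
    linear function of per-step features, and the probability that the
    teacher prefers clip A is a logistic function of the difference in
    predicted returns."""
    ret_a = sum(float(w @ f) for f in clip_a)
    ret_b = sum(float(w @ f) for f in clip_b)
    p_a = 1.0 / (1.0 + np.exp(ret_b - ret_a))
    return -np.log(p_a if prefer_a else 1.0 - p_a)

def numeric_grad(f, w, eps=1e-5):
    g = np.zeros_like(w)
    for i in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (f(wp) - f(wm)) / (2 * eps)
    return g

# Toy data: the teacher prefers clips whose steps have a large first feature.
good_clip = [np.array([1.0, 0.0])] * 3
bad_clip = [np.array([0.0, 1.0])] * 3
w = np.zeros(2)
for _ in range(100):
    g = numeric_grad(lambda v: preference_loss(v, good_clip, bad_clip, True), w)
    w -= 0.5 * g
```

The fitted reward model ranks the preferred clip above the other; in the full method this model is then used as the reward signal for an off-policy agent, with past experience relabeled whenever the model changes.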