Despite the potential of reinforcement learning (RL) for building generalpurpose robotic systems, training RL agents to solve robotics tasks still remains challenging due to the difficulty of exploration in purely continuous action spaces. Addressing this problem is an active area of research with the majority of focus on improving RL methods via better optimization or more efficient exploration. An alternate but important component to consider improving is the interface of the RL algorithm with the robot. In this work, we manually specify a library of robot action primitives (RAPS), parameterized with arguments that are learned by an RL policy. These parameterized primitives are expressive, simple to implement, enable efficient exploration and can be transferred across robots, tasks and environments. We perform a thorough empirical study across challenging tasks in three distinct domains with image input and a sparse terminal reward. We find that our simple change to the action interface substantially improves both the learning efficiency and task performance irrespective of the underlying RL algorithm, significantly outperforming prior methods which learn skills from offline expert data. Code and videos at https://mihdalal.github.io/raps/
This content will become publicly available on April 1, 2023
Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration
A major challenge in realworld reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine grain feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a suboptimal behavior policy for faster and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from  not imitating  the offline data, LOGO orients its policy in the manner of the suboptimal policy, while yet being able to learn beyond and approach optimality. We provide a theoretical analysis of our algorithm, more »
 Award ID(s):
 2045783
 Publication Date:
 NSFPAR ID:
 10327543
 Journal Name:
 International Conference on Learning Representations (ICLR)
 Sponsoring Org:
 National Science Foundation
More Like this


We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a rewardmaximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the informationtheoretical lower bounds based on different datacoverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an informationtheoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is a optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal stateaction probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform datacoverage assumption, horizonfree setting, single policy concentrability, and the tight problemdependent results. Later, we extend the result to the \emph{assumptionfree} regime (where we make no assumption on $ \mu$) and obtain the assumptionfree intrinsicmore »

Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Humanintheloop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample and feedbackefficient. We present an offpolicy, interactive RL algorithm that capitalizes on the strengths of both feedback and offpolicy learning. Specifically, we learn a reward model by actively querying a teacher’s preferences between two clips of behavior and use it to train an agent. To enable offpolicy learning, we relabel all the agent’s past experience when its reward model changes. We additionally show that pretraining our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by humanintheloop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize realtime human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard rewardmore »

Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Humanintheloop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample and feedbackefficient. We present an offpolicy, interactive RL algorithm that capitalizes on the strengths of both feedback and offpolicy learning. Specifically, we learn a reward model by actively querying a teacher’s preferences between two clips of behavior and use it to train an agent. To enable offpolicy learning, we relabel all the agent’s past experience when its reward model changes. We additionally show that pretraining our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by humanintheloop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize realtime human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard rewardmore »

Network slicing is a promising technology that allows mobile network operators to efficiently serve various emerging use cases in 5G. It is challenging to optimize the utilization of network infrastructures while guaranteeing the performance of network slices according to service level agreements (SLAs). To solve this problem, we propose SafeSlicing that introduces a new constraintaware deep reinforcement learning (CaDRL) algorithm to learn the optimal resource orchestration policy within two steps, i.e., offline training in a simulated environment and online learning with the real network system. On optimizing the resource orchestration, we incorporate the constraints on the statistical performance of slices in the reward function using Lagrangian multipliers and solve the Lagrangian relaxed problem via a policy network. To satisfy the constraints on the system capacity, we design a constraint network to map the latent actions generated from the policy network to the orchestration actions such that the total resources allocated to network slices do not exceed the system capacity. We prototype SafeSlicing on an endtoend testbed developed by using OpenAirInterface LTE, OpenDayLightbased SDN, and CUDA GPU computing platform. The experimental results show that SafeSlicing reduces more than 20% resource usage while meeting SLAs of network slices as compared with othermore »