<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>State Entropy Maximization with Random Encoders for Efficient Exploration</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021 July</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10300403</idno>
					<idno type="doi"></idno>
					<title level='j'>International Conference on Machine Learning</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Younggyo Seo</author><author>Lili Chen</author><author>Jinwoo Shin</author><author>Honglak Lee</author><author>Pieter Abbeel</author><author>Kimin Lee</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher’s preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent’s past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Deep reinforcement learning (RL) has emerged as a powerful method whereby agents learn complex behaviors on their own through trial and error <ref type="bibr">(Kohl &amp; Stone, 2004;</ref><ref type="bibr">Kober &amp; Peters, 2011;</ref><ref type="bibr">Kober et al., 2013;</ref><ref type="bibr">Silver et al., 2017;</ref><ref type="bibr">Andrychowicz et al., 2020;</ref><ref type="bibr">Kalashnikov et al., 2018;</ref><ref type="bibr">Vinyals et al., 2019)</ref>. Scaling RL to many applications, however, is still precluded by a number of challenges. One such challenge lies in providing a suitable reward function.<note place="foot">Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).</note></p><p>For example, while it may be desirable to provide sparse rewards out of ease, they are often insufficient to train successful RL agents. Thus, to provide adequately dense signal, real-world problems may require extensive instrumentation, such as accelerometers to detect door opening <ref type="bibr">(Yahya et al., 2017)</ref>, thermal cameras to detect pouring <ref type="bibr">(Schenck &amp; Fox, 2017)</ref> or motion capture for object tracking <ref type="bibr">(Kormushev et al., 2010;</ref><ref type="bibr">Akkaya et al., 2019;</ref><ref type="bibr">Peng et al., 2020)</ref>. Despite these costly measures, it may still be difficult to construct a suitable reward function due to reward exploitation. That is, RL algorithms often discover ways to achieve high returns by unexpected, unintended means. In general, there is nuance in how we might want agents to behave, such as obeying social norms, that is difficult to account for and communicate effectively through an engineered reward function <ref type="bibr">(Amodei et al., 2016;</ref><ref type="bibr">Shah et al., 2019;</ref><ref type="bibr">Turner et al., 2020)</ref>. 
A popular way to avoid reward engineering is through imitation learning, during which a learner distills information about its objectives or tries to directly follow an expert <ref type="bibr">(Schaal, 1997;</ref><ref type="bibr">Ng et al., 2000;</ref><ref type="bibr">Abbeel &amp; Ng, 2004;</ref><ref type="bibr">Argall et al., 2009)</ref>. While imitation learning is a powerful tool, suitable demonstrations are often prohibitively expensive to obtain in practice <ref type="bibr">(Calinon et al., 2009;</ref><ref type="bibr">Pastor et al., 2011;</ref><ref type="bibr">Akgun et al., 2012;</ref><ref type="bibr">Zhang et al., 2018)</ref>.</p><p>In contrast, humans often learn fairly autonomously, relying on occasional external feedback from a teacher. Part of what makes a teacher effective is their ability to interactively guide students according to their progress, providing corrective or increasingly advanced instructions as needed. Such an interactive learning process is also alluring for artificial agents since the agent's behavior can naturally be tailored to one's preference (avoiding reward exploitation) without requiring extensive engineering. This approach is only feasible if the feedback is both practical for a human to provide and sufficiently high-bandwidth. As such, human-in-the-loop (HiL) RL <ref type="bibr">(Knox &amp; Stone, 2009;</ref><ref type="bibr">Christiano et al., 2017;</ref><ref type="bibr">MacGlashan et al., 2017)</ref> has not yet been widely adopted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure <ref type="figure">1</ref>. Illustration of our method. First, the agent engages in unsupervised pre-training during which it is encouraged to visit a diverse set of states so its queries can provide more meaningful signal than on randomly collected experience (left). Then, a teacher provides preferences between two clips of behavior, and we learn a reward model based on them. The agent is updated to maximize the expected return under the model. We also relabel all its past experiences with this model to maximize their utilization to update the policy (right).</p><p>In this work, we aim to substantially reduce the amount of human effort required for HiL learning. To this end, we present PEBBLE: unsupervised PrE-training and preference-Based learning via relaBeLing Experience, a feedback-efficient RL algorithm by which learning is largely autonomous and supplemented by a practical number of binary labels (i.e., preferences) provided by a supervisor. Our method relies on two main, synergistic ingredients: unsupervised pre-training and off-policy learning (see Figure <ref type="figure">1</ref>). For generality, we do not assume the agent is privy to rewards from its environment. Instead, we first allow the agent to explore using only intrinsic motivation <ref type="bibr">(Oudeyer et al., 2007;</ref><ref type="bibr">Schmidhuber, 2010)</ref> to diversify its experience and produce coherent behaviors. Collecting a breadth of experiences enables the teacher to provide more meaningful feedback, as compared to feedback on data collected in an indeliberate manner. The supervisor then steps in to teach the agent by expressing their preferences between pairs of clips of the agent's behavior <ref type="bibr">(Christiano et al., 2017)</ref>. 
The agent distills this information into a reward model and uses RL to optimize this inferred reward function.</p><p>Leveraging unsupervised pre-training increases the efficiency of the teacher's initial feedback; however, RL requires a large enough number of samples such that supervising the learning process is still quite expensive for humans. It is thus especially critical to enable off-policy algorithms that can reuse data to maximize the agent's, and thereby the human's, efficiency. However, on-policy methods have typically been used thus far for HiL RL because of their ability to mitigate the effects of non-stationarity in reward caused by online learning. We show that by simply relabeling all of the agent's past experience every time the reward model is updated, we can make use and reuse of all the agent's collected experience to improve sample and feedback efficiency by a large margin. Source code and videos are available at <ref type="url">https://sites.google.com/view/icml21pebble</ref>.</p><p>We summarize the main contributions of PEBBLE:</p><p>&#8226; For the first time, we show that unsupervised pre-training and off-policy learning can significantly improve the sample- and feedback-efficiency of HiL RL.</p><p>&#8226; PEBBLE outperforms prior preference-based RL baselines on complex locomotion and robotic manipulation tasks from DeepMind Control Suite (DMControl; <ref type="bibr">Tassa et al. 2018;</ref><ref type="bibr">2020)</ref> and Meta-world <ref type="bibr">(Yu et al., 2020)</ref>.</p><p>&#8226; We demonstrate that PEBBLE can very efficiently learn behaviors for which a typical reward function is difficult to engineer.</p><p>&#8226; We also show that PEBBLE can avoid reward exploitation, leading to more desirable behaviors compared to an agent trained with respect to an engineered reward function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Learning from human feedback. Several works have successfully utilized feedback from real humans to train agents where it is assumed that the feedback is available at all times <ref type="bibr">(Pilarski et al., 2011;</ref><ref type="bibr">MacGlashan et al., 2017;</ref><ref type="bibr">Arumugam et al., 2019)</ref>. Due to this high feedback frequency, these approaches are difficult to scale to more complex learning problems that require substantial agent experience.</p><p>Better suited to learning in complex domains is to learn a reward model so the agent can learn without a supervisor's perpetual presence. One simple yet effective direction in reward learning is to train a classifier that recognizes task success and use it as the basis for a reward function <ref type="bibr">(Pinto &amp; Gupta, 2016;</ref><ref type="bibr">Levine et al., 2018;</ref><ref type="bibr">Fu et al., 2018;</ref><ref type="bibr">Xie et al., 2018)</ref>. Positive examples may be designated or reinforced through human feedback <ref type="bibr">(Zhang et al., 2019;</ref><ref type="bibr">Singh et al., 2019;</ref><ref type="bibr">Smith et al., 2020)</ref>. Another promising direction has focused on simply training a reward model via regression using unbounded real-valued feedback <ref type="bibr">(Knox &amp; Stone, 2009;</ref><ref type="bibr">Warnell et al., 2018)</ref>, but this has been challenging to scale because it is difficult for humans to reliably provide a particular utility value for certain behaviors of the RL agent.</p><p>Much easier for humans is to make relative judgments, i.e., comparing behaviors as better or worse. 
Preference-based learning is thus an attractive alternative because the supervision is easy to provide yet information-rich <ref type="bibr">(Akrour et al., 2011;</ref><ref type="bibr">Pilarski et al., 2011;</ref><ref type="bibr">Akrour et al., 2012;</ref><ref type="bibr">Wilson et al., 2012;</ref><ref type="bibr">Sugiyama et al., 2012;</ref><ref type="bibr">Wirth &amp; F&#252;rnkranz, 2013;</ref><ref type="bibr">Wirth et al., 2016;</ref><ref type="bibr">Sadigh et al., 2017;</ref><ref type="bibr">Biyik &amp; Sadigh, 2018;</ref><ref type="bibr">Leike et al., 2018;</ref><ref type="bibr">Biyik et al., 2020)</ref>. <ref type="bibr">Christiano et al. (2017)</ref> scaled preference-based learning to utilize modern deep learning techniques: they learn a reward function, modeled with deep neural networks, that is consistent with the observed preferences and use it to optimize an agent using RL. They choose on-policy RL methods <ref type="bibr">(Schulman et al., 2015;</ref><ref type="bibr">Mnih et al., 2016)</ref> since they are more robust to the non-stationarity in rewards caused by online learning. Although they demonstrate that preference-based learning provides a fairly efficient (requiring feedback on less than 1% of the agent's experience) means of distilling information from feedback, they rely on notoriously sample-inefficient on-policy RL, so a large burden can still be placed on the human. Subsequent works have aimed to improve the efficiency of this method by introducing additional forms of feedback such as demonstrations <ref type="bibr">(Ibarz et al., 2018)</ref> or non-binary rankings <ref type="bibr">(Cao et al., 2020)</ref>. Our proposed approach similarly focuses on developing a more sample- and feedback-efficient preference-based RL algorithm, but without adding any further forms of supervision. Instead, we enable off-policy learning as well as utilize unsupervised pre-training to substantially improve efficiency.</p><p>Unsupervised pre-training for RL. 
Unsupervised pre-training has been studied for extracting strong behavioral priors that can be utilized to solve downstream tasks efficiently in the context of RL <ref type="bibr">(Daniel et al., 2016;</ref><ref type="bibr">Florensa et al., 2018;</ref><ref type="bibr">Achiam et al., 2018;</ref><ref type="bibr">Eysenbach et al., 2019;</ref><ref type="bibr">Sharma et al., 2020)</ref>. Specifically, agents are encouraged to expand the boundary of seen states by maximizing various intrinsic rewards, such as prediction errors <ref type="bibr">(Houthooft et al., 2016;</ref><ref type="bibr">Pathak et al., 2017;</ref><ref type="bibr">Burda et al., 2019)</ref>, count-based state novelty <ref type="bibr">(Bellemare et al., 2016;</ref><ref type="bibr">Tang et al., 2017;</ref><ref type="bibr">Ostrovski et al., 2017)</ref>, mutual information <ref type="bibr">(Eysenbach et al., 2019)</ref> and state entropy <ref type="bibr">(Hazan et al., 2019;</ref><ref type="bibr">Lee et al., 2019;</ref><ref type="bibr">Hao &amp; Pieter, 2021)</ref>. Such unsupervised pre-training methods allow learning diverse behaviors without extrinsic rewards, effectively facilitating accelerated learning of downstream tasks. In this work, we show that unsupervised pre-training enables a teacher to provide more meaningful signal by showing them a diverse set of behaviors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Preliminaries</head><p>Reinforcement learning. We consider a standard RL framework where an agent interacts with an environment in discrete time. Formally, at each timestep t, the agent receives a state s_t from the environment and chooses an action a_t based on its policy &#960;. The environment returns a reward r_t and the agent transitions to the next state s_{t+1}.</p><p>The return <formula notation="TeX">R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}</formula> is the discounted sum of rewards from timestep t with discount factor &#947; &#8712; [0, 1). RL then maximizes the expected return from each state s_t.</p><p>Soft Actor-Critic. SAC <ref type="bibr">(Haarnoja et al., 2018)</ref> is an off-policy actor-critic method based on the maximum entropy RL framework (Ziebart, 2010), which encourages exploration and greater robustness to noise by maximizing a weighted objective of the reward and the policy entropy. To update the parameters, SAC alternates between a soft policy evaluation and a soft policy improvement. At the soft policy evaluation step, a soft Q-function, which is modeled as a neural network with parameters &#952;, is updated by minimizing the following soft Bellman residual:</p><formula notation="TeX">L^{\text{SAC}}_{\text{critic}} = \mathbb{E}_{\tau_t \sim B} \Big[ \big( Q_\theta(s_t, a_t) - r_t - \gamma \bar{V}(s_{t+1}) \big)^2 \Big], \quad \text{with } \bar{V}(s_t) = \mathbb{E}_{a_t \sim \pi_\phi} \big[ Q_{\bar{\theta}}(s_t, a_t) - \alpha \log \pi_\phi(a_t | s_t) \big], \tag{1}</formula><p>where &#964;_t = (s_t, a_t, s_{t+1}, r_t) is a transition, B is a replay buffer, <formula notation="TeX">\bar{\theta}</formula> are the delayed parameters, and &#945; is a temperature parameter. At the soft policy improvement step, the policy &#960;_&#966; is updated by minimizing the following objective:</p><formula notation="TeX">L^{\text{SAC}}_{\text{act}} = \mathbb{E}_{s_t \sim B, a_t \sim \pi_\phi} \big[ \alpha \log \pi_\phi(a_t | s_t) - Q_\theta(s_t, a_t) \big]. \tag{2}</formula><p>SAC enjoys good sample-efficiency relative to its on-policy counterparts by reusing its past experiences. However, for the same reason, SAC is not robust to a non-stationary reward function.</p><p>Reward learning from preferences. We follow the basic framework for learning a reward function r_&#968; from preferences in which the function is trained to be consistent with observed human feedback <ref type="bibr">(Wilson et al., 2012;</ref><ref type="bibr">Christiano et al., 2017)</ref>. In this framework, a segment &#963; is a sequence of observations and actions {s_k, a_k, ..., s_{k+H}, a_{k+H}}. We elicit preferences y for segments &#963;^0 and &#963;^1, where y is a distribution indicating which segment a human prefers, i.e., y &#8712; {(0, 1), (1, 0), (0.5, 0.5)}. The judgment is recorded in a dataset D as a triple (&#963;^0, &#963;^1, y). By following the Bradley-Terry model <ref type="bibr">(Bradley &amp; Terry, 1952)</ref>, we model a preference predictor using the reward function r_&#968; as follows:</p><formula notation="TeX">P_\psi[\sigma^1 \succ \sigma^0] = \frac{\exp \sum_t r_\psi(s_t^1, a_t^1)}{\sum_{i \in \{0, 1\}} \exp \sum_t r_\psi(s_t^i, a_t^i)}, \tag{3}</formula><p>where <formula notation="TeX">\sigma^i \succ \sigma^j</formula> denotes the event that segment i is preferable to segment j. Intuitively, this can be interpreted as assuming the probability of preferring a segment depends exponentially on the sum over the segment of an underlying reward function. While r_&#968; is not a binary classifier, learning r_&#968; amounts to binary classification with labels y provided by a supervisor. Concretely, the reward function, modeled as a neural network with parameters &#968;, is updated by minimizing the following loss:</p><formula notation="TeX">L^{\text{Reward}} = - \mathbb{E}_{(\sigma^0, \sigma^1, y) \sim D} \Big[ y(0) \log P_\psi[\sigma^0 \succ \sigma^1] + y(1) \log P_\psi[\sigma^1 \succ \sigma^0] \Big]. \tag{4}</formula></div>
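<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the reward-learning objective described in this section, the following Python sketch evaluates a Bradley-Terry preference probability from summed segment rewards and the resulting cross-entropy loss. It is a minimal sketch under stated assumptions, not the authors' implementation: the names segment_return, preference_prob and preference_loss are hypothetical, the reward is an arbitrary callable rather than a neural network, and a label y = (1, 0) is assumed to mean that the first segment is preferred.</p>
<eg><![CDATA[
```python
import math


def segment_return(reward_fn, segment):
    """Sum a per-step reward over a segment given as (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in segment)


def preference_prob(reward_fn, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    ret_a = segment_return(reward_fn, seg_a)
    ret_b = segment_return(reward_fn, seg_b)
    # Subtract the larger return before exponentiating for numerical stability.
    shift = max(ret_a, ret_b)
    exp_a = math.exp(ret_a - shift)
    exp_b = math.exp(ret_b - shift)
    return exp_a / (exp_a + exp_b)


def preference_loss(reward_fn, dataset, eps=1e-12):
    """Mean cross-entropy over triples (seg0, seg1, y).

    Assumption: y = (1, 0) means seg0 is preferred, y = (0, 1) means seg1
    is preferred, and y = (0.5, 0.5) marks the two as equally preferable.
    """
    total = 0.0
    for seg0, seg1, y in dataset:
        p0 = preference_prob(reward_fn, seg0, seg1)
        p1 = 1.0 - p0
        # eps guards against log(0) when a prediction saturates.
        total -= y[0] * math.log(p0 + eps) + y[1] * math.log(p1 + eps)
    return total / len(dataset)


# Toy usage with a hypothetical hand-written reward (not a learned model).
toy_reward = lambda s, a: s + a
good = [(1.0, 0.5), (1.0, 0.5)]
bad = [(0.0, 0.0), (0.0, 0.0)]
print(preference_prob(toy_reward, good, bad))
print(preference_loss(toy_reward, [(good, bad, (1.0, 0.0))]))
```
]]></eg>
<p>In practice the reward would be a differentiable model trained by gradient descent on this loss; the sketch only evaluates the loss for a fixed reward.</p></div>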
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">PEBBLE</head><p>In this section, we present PEBBLE: unsupervised PrE-training and preference-Based learning via relaBeLing Experience, an off-policy actor-critic algorithm for HiL RL. Formally, we consider a policy &#960;_&#966;, Q-function Q_&#952; and reward function r_&#968;, which are updated by the following processes (see Algorithm 2 for the full procedure):</p><p>&#8226; Step 0 (unsupervised pre-training): We pre-train the policy &#960;_&#966; only using intrinsic motivation to explore and collect diverse experiences (see Section 4.1).</p><p>&#8226; Step 1 (reward learning): We learn a reward function r_&#968; that can lead to the desired behavior by getting feedback from a teacher (see Section 4.2).</p><p>&#8226; Step 2 (agent learning): We update the policy &#960;_&#966; and Q-function Q_&#952; using an off-policy RL algorithm with relabeling to mitigate the effects of a non-stationary reward function r_&#968; (see Section 4.3).</p><p>&#8226; Repeat Step 1 and Step 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Accelerating Learning via Unsupervised Pre-training</head><p>In our setting, we assume the agent is given feedback in the form of preferences between segments. In the beginning of training, though, a naive agent executes a random policy, which provides neither good state coverage nor coherent behaviors. The agent's queries are thus quite limited and likely difficult for human teachers to judge. As a result, it requires many samples (and thus queries) for these methods to show initial progress. Recent work has addressed this issue by means of providing demonstrations; however, this is not ideal since these are notoriously hard to procure <ref type="bibr">(Ibarz et al., 2018)</ref>. Instead, our insight is to produce informative queries at the start of training by utilizing unsupervised pre-training to collect diverse samples solely through intrinsic motivation <ref type="bibr">(Oudeyer et al., 2007;</ref><ref type="bibr">Schmidhuber, 2010)</ref>.</p><p>Specifically, we encourage our agent to visit a wider range of states by using the state entropy <formula notation="TeX">\mathcal{H}(s) = -\mathbb{E}_{s \sim p(s)}[\log p(s)]</formula> as an intrinsic reward <ref type="bibr">(Hazan et al., 2019;</ref><ref type="bibr">Lee et al., 2019;</ref><ref type="bibr">Hao &amp; Pieter, 2021;</ref><ref type="bibr">Seo et al., 2021)</ref>. By updating the agent to maximize the sum of expected intrinsic rewards, it can efficiently explore an environment and learn how to generate diverse behaviors. However, this intrinsic reward is intractable to compute in most settings. To handle this issue, we employ a simplified version of the particle-based entropy estimator <ref type="bibr">(Beirlant et al., 1997;</ref><ref type="bibr">Singh et al., 2003)</ref> (see the supplementary material for more details):</p><formula notation="TeX">\hat{\mathcal{H}}(s) \propto \sum_i \log \big( \lVert s_i - s_i^k \rVert \big),</formula><p>where <formula notation="TeX">\hat{\mathcal{H}}</formula> denotes the particle-based entropy estimator and s_i^k is the k-th nearest neighbor (k-NN) of s_i. This implies that maximizing the distance between a state and its nearest neighbor increases the overall state entropy. Inspired by this, we define the intrinsic reward of the current state s_t as the distance between s_t and its k-th nearest neighbor by following the idea of <ref type="bibr">Hao &amp; Pieter (2021)</ref> that treats each transition as a particle:</p><formula notation="TeX">r^{\text{int}}(s_t) = \log \big( \lVert s_t - s_t^k \rVert \big). \tag{5}</formula><p>In our experiments, we compute k-NN distances between a sample and all samples in the replay buffer and normalize the intrinsic reward by dividing it by a running estimate of the standard deviation. The full procedure of unsupervised pre-training is summarized in Algorithm 1.</p><p>Algorithm 1 EXPLORE: Unsupervised exploration</p><p>Initialize parameters of Q_&#952; and &#960;_&#966; and a replay buffer B &#8592; &#8709;</p><p>for each iteration do</p><p>for each timestep t do</p><p>Collect s_{t+1} by taking a_t &#8764; &#960;_&#966;(a_t|s_t)</p><p>Compute intrinsic reward r^int_t &#8592; r^int(s_t) as in (5)</p><p>Store transitions B &#8592; B &#8746; {(s_t, a_t, s_{t+1}, r^int_t)}</p><p>end for</p><p>for each gradient step do</p><p>Sample minibatch {(s_j, a_j, s_{j+1}, r^int_j)}_{j=1}^{B} &#8764; B</p><p>Optimize L^SAC_critic in (1) and L^SAC_act in (2) with respect to &#952; and &#966;</p><p>end for</p><p>end for</p><p>return B, &#960;_&#966;</p></div>
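<div xmlns="http://www.tei-c.org/ns/1.0"><p>The k-NN intrinsic reward and its normalization by a running standard deviation can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the helper names (knn_distance, intrinsic_reward, RunningStd, normalized_intrinsic_reward) are hypothetical, the brute-force neighbor search stands in for whatever batched computation an actual implementation uses, the logarithm of the distance is taken in keeping with the log-distance form of particle-based entropy estimators, and the small eps offset is an added assumption to avoid log(0).</p>
<eg><![CDATA[
```python
import math


def knn_distance(query, particles, k):
    """Euclidean distance from query to its k-th nearest neighbor in particles."""
    distances = sorted(math.dist(query, p) for p in particles)
    return distances[k - 1]


def intrinsic_reward(state, buffer_states, k=5, eps=1e-6):
    """Log-distance to the k-th nearest neighbor; larger for novel states.

    Assumptions: buffer_states excludes the query itself, and eps is added
    only to avoid log(0) when the query duplicates a stored state.
    """
    return math.log(knn_distance(state, buffer_states, k) + eps)


class RunningStd:
    """Welford-style running estimate of the standard deviation."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until two samples are seen; the floor avoids
        # division by zero when all rewards so far are identical.
        if self.count < 2:
            return 1.0
        return max(math.sqrt(self.m2 / self.count), 1e-8)


def normalized_intrinsic_reward(state, buffer_states, stats, k=5):
    """Intrinsic reward divided by the running standard deviation."""
    raw = intrinsic_reward(state, buffer_states, k)
    stats.update(raw)
    return raw / stats.std
```
]]></eg>
<p>A state far from everything in the buffer receives a larger reward than one surrounded by stored states, which is what pushes the pre-trained policy toward unvisited regions.</p></div>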
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Selecting Informative Queries</head><p>As previously mentioned, we learn our reward function by modeling the probability that a teacher prefers one sampled segment over another as proportional to the exponentiated sum of rewards over the segment (see Section 3). Ideally, one should solicit preferences so as to maximize expected value of information (EVOI; Savage 1972): the improvement of an agent caused by optimizing with respect to the resulting reward model <ref type="bibr">(Viappiani, 2012;</ref><ref type="bibr">Akrour et al., 2012)</ref>. Computing the EVOI is intractable since it involves taking an expectation over all possible trajectories induced by the updated policy. To handle this issue, several approximations have been explored by prior works to sample queries that are likely to change the reward model <ref type="bibr">(Daniel et al., 2014;</ref><ref type="bibr">Christiano et al., 2017;</ref><ref type="bibr">Ibarz et al., 2018)</ref>. In this work, we consider the sampling schemes employed by <ref type="bibr">Christiano et al. (2017)</ref>: (1) uniform sampling and (2) ensemble-based sampling, which selects pairs of segments with high variance across ensemble reward models. We explore an additional third method, entropy-based sampling, which seeks to disambiguate pairs of segments nearest the decision boundary. That is, we sample a large batch of segment pairs and select pairs that maximize H(P_&#968;). We evaluate the effects of these sampling methods in Section 5.</p><p>Algorithm 2 PEBBLE</p><p>for each iteration do</p><p>if iteration % K == 0 then</p><p>for m in 1 . . . M do</p><p>(&#963;^0, &#963;^1) &#8764; SAMPLE() (see Section 4.2)</p><p>Query instructor for y</p><p>Store preference D &#8592; D &#8746; {(&#963;^0, &#963;^1, y)}</p><p>end for</p><p>for each gradient step do</p><p>Sample minibatch {(&#963;^0, &#963;^1, y)_j}_{j=1}^{D} &#8764; D</p><p>Optimize L^Reward in (4) with respect to &#968;</p><p>end for</p><p>Relabel entire replay buffer B using r_&#968;</p><p>end if</p><p>for each timestep t do</p><p>Collect s_{t+1} by taking a_t &#8764; &#960;_&#966;(a_t|s_t)</p><p>Store transitions B &#8592; B &#8746; {(s_t, a_t, s_{t+1}, r_&#968;(s_t))}</p><p>end for</p><p>for each gradient step do</p><p>Sample random minibatch {&#964;_j}_{j=1}^{B} &#8764; B</p><p>Optimize L^SAC_critic in (1) and L^SAC_act in (2) with respect to &#952; and &#966;, respectively</p><p>end for</p><p>end for</p></div>
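<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two uncertainty-based query selection schemes discussed above can be sketched as a single scoring function over a batch of candidate segment pairs. The sketch below is hedged in several ways: select_queries, pref_prob and binary_entropy are hypothetical names, each ensemble member is assumed to be a callable that maps a segment to its predicted return, and computing the entropy on the ensemble-averaged preference probability is an assumption of this example rather than something the text specifies.</p>
<eg><![CDATA[
```python
import math


def pref_prob(ret_a, ret_b):
    """Bradley-Terry probability that a is preferred, from two segment returns."""
    # sigmoid(x) written via tanh so that large differences cannot overflow.
    return 0.5 * (1.0 + math.tanh(0.5 * (ret_a - ret_b)))


def binary_entropy(p):
    """Entropy (in nats) of the two-outcome distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def select_queries(candidate_pairs, ensemble, num_queries, scheme="entropy"):
    """Return the num_queries highest-scoring (seg0, seg1) pairs.

    candidate_pairs: list of (seg0, seg1) tuples sampled from the buffer.
    ensemble: list of callables, each mapping a segment to a predicted return
              (an assumed interface for this sketch).
    scheme: "entropy" scores pairs near the decision boundary;
            "disagreement" scores pairs by variance across ensemble members.
    """
    scored = []
    for seg0, seg1 in candidate_pairs:
        probs = [pref_prob(model(seg0), model(seg1)) for model in ensemble]
        mean_p = sum(probs) / len(probs)
        if scheme == "entropy":
            score = binary_entropy(mean_p)
        elif scheme == "disagreement":
            score = sum((p - mean_p) ** 2 for p in probs) / len(probs)
        else:
            raise ValueError("unknown sampling scheme: " + scheme)
        scored.append((score, (seg0, seg1)))
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:num_queries]]
```
]]></eg>
<p>Uniform sampling corresponds to skipping the scoring step and drawing pairs at random, so it needs no reward-model evaluation at query time.</p></div>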
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Using Off-policy RL with Non-Stationary Reward</head><p>Once we learn a reward function r_&#968;, we can update the policy &#960;_&#966; and Q-function Q_&#952; using any RL algorithm. A caveat is that the reward function r_&#968; may be non-stationary because we update it during training. <ref type="bibr">Christiano et al. (2017)</ref> used on-policy RL algorithms, TRPO <ref type="bibr">(Schulman et al., 2015)</ref> and A2C <ref type="bibr">(Mnih et al., 2016)</ref>, to address this issue. However, their poor sample-efficiency leads to poor feedback-efficiency of the overall HiL method. In this work, we use an off-policy RL algorithm, which provides for sample-efficient learning by reusing past experiences that are stored in the replay buffer. However, the learning process of off-policy RL algorithms can be unstable because previous experiences in the replay buffer are labeled with previously learned rewards. To handle this issue, we relabel all of the agent's past experience every time we update the reward model. We find that this simple technique stabilizes the learning process and provides large gains in performance (see Figure <ref type="figure">5</ref>(a) for supporting results). The full procedure of PEBBLE is summarized in Algorithm 2.</p></div>
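<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relabeling step can be illustrated with a small replay buffer that keeps reward labels apart from the transitions, so that every label can be recomputed whenever the reward model changes. This is a sketch under assumptions, not the released implementation: the ReplayBuffer class and its method names are hypothetical, and the reward model is taken to be a callable of state and action.</p>
<eg><![CDATA[
```python
class ReplayBuffer:
    """Minimal replay buffer that stores rewards separately from transitions."""

    def __init__(self):
        self.transitions = []  # (state, action, next_state) triples
        self.rewards = []      # reward labels, overwritten on relabel

    def add(self, state, action, next_state, reward_fn):
        """Store a transition labeled with the reward model current at collection time."""
        self.transitions.append((state, action, next_state))
        self.rewards.append(reward_fn(state, action))

    def relabel(self, reward_fn):
        """Recompute every stored reward with the updated reward model."""
        self.rewards = [reward_fn(s, a) for s, a, _ in self.transitions]

    def sample_all(self):
        """Return all transitions as (state, action, next_state, reward) tuples."""
        return [(s, a, s2, r)
                for (s, a, s2), r in zip(self.transitions, self.rewards)]


# Toy usage: two hypothetical reward models before and after an update.
buf = ReplayBuffer()
old_model = lambda s, a: 0.0
new_model = lambda s, a: s + 2.0 * a
buf.add(1.0, 0.5, 1.5, old_model)
buf.add(2.0, -1.0, 1.0, old_model)
buf.relabel(new_model)  # all stored rewards now agree with new_model
```
]]></eg>
<p>After relabeling, every minibatch drawn for the off-policy update is consistent with the latest reward model, at the cost of one pass over the buffer per reward update.</p></div>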
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>We design our experiments to investigate the following:</p><p>1. How does PEBBLE compare to existing methods in terms of sample and feedback efficiency?</p><p>2. What is the contribution of each of the proposed techniques in PEBBLE?</p><p>3. Can PEBBLE learn novel behaviors for which a typical reward function is difficult to engineer?</p><p>4. Can PEBBLE mitigate the effects of reward exploitation?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Setups</head><p>We evaluate PEBBLE on several continuous control tasks involving locomotion and robotic manipulation from DeepMind Control Suite (DMControl; <ref type="bibr">Tassa et al. 2018;</ref><ref type="bibr">2020)</ref> and Meta-world <ref type="bibr">(Yu et al., 2020)</ref>. In order to verify the efficacy of our method, we first focus on having an agent solve a range of tasks without being able to directly observe the ground truth reward function. Instead, similar to <ref type="bibr">Christiano et al. (2017)</ref> and <ref type="bibr">Ibarz et al. (2018)</ref>, the agent learns to perform a task only by getting feedback from a scripted teacher that provides preferences between trajectory segments according to the true, underlying task reward. Because this scripted teacher's preferences are immediately generated by a ground truth reward, we are able to evaluate the agent quantitatively by measuring the true average return and do more rapid experiments. For all experiments, we report the mean and standard deviation across ten runs.</p><p>We also run experiments with actual human trainers (the authors) to show the benefits of human-in-the-loop RL. First, we show that human trainers can teach novel behaviors (e.g., waving a leg), which are not defined in original benchmarks. Second, we show that agents trained with the hand-engineered rewards from benchmarks can perform the task in an undesirable way (i.e., the agent exploits a misspecified reward function), while agents trained using human feedback can perform the same task in the desired way. For all experiments, each trajectory segment is presented to the human as a 1-second video clip, and a maximum of one hour of human time is required.</p><p>For evaluation, we compare to <ref type="bibr">Christiano et al. (2017)</ref>, which is the current state-of-the-art approach using the same type of feedback. 
The primary differences in our method are (1) the introduction of unsupervised pre-training, (2) the accommodation of off-policy RL, and (3) entropy-based sampling. We re-implemented <ref type="bibr">Christiano et al. (2017)</ref> using the state-of-the-art on-policy RL algorithm: PPO <ref type="bibr">(Schulman et al., 2017)</ref>. We use the same reward learning framework and ensemble disagreement-based sampling as they proposed. We refer to this baseline as Preference PPO.</p><p>As an upper bound, since we evaluate against the task reward function, we also compare to SAC <ref type="bibr">(Haarnoja et al., 2018)</ref> and PPO using the same ground truth reward. For our method, we pre-train an agent for 10K timesteps and include these pre-training steps in all learning curves. We do not alter any hyperparameters of the original SAC algorithm and use an ensemble of three reward models. Unless stated otherwise, we use entropy-based sampling. More experimental details including model architectures, sampling schemes, and reward learning are in the supplementary material.</p></div>
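<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scripted teacher used in place of a human for the benchmark experiments can be sketched as a comparison of ground-truth segment returns. The function below is a hypothetical illustration, not the evaluation code: scripted_teacher and tie_tolerance are invented names, the label convention (1, 0) for "first segment preferred" is assumed, and returning (0.5, 0.5) on ties is a choice made for this sketch.</p>
<eg><![CDATA[
```python
def scripted_teacher(seg0, seg1, true_reward, tie_tolerance=0.0):
    """Return a preference label from the ground-truth return of each segment.

    seg0, seg1: sequences of (state, action) pairs.
    true_reward: callable giving the task reward for one (state, action) pair.
    Assumed label convention: (1, 0) prefers seg0, (0, 1) prefers seg1,
    and (0.5, 0.5) marks the segments as equally preferable.
    """
    ret0 = sum(true_reward(s, a) for s, a in seg0)
    ret1 = sum(true_reward(s, a) for s, a in seg1)
    if abs(ret0 - ret1) <= tie_tolerance:
        return (0.5, 0.5)
    return (1.0, 0.0) if ret0 > ret1 else (0.0, 1.0)


# Toy usage with a hypothetical task reward that favors large states.
task_reward = lambda s, a: s
fast = [(2.0, 0.0), (3.0, 0.0)]
slow = [(0.5, 0.0), (0.5, 0.0)]
label = scripted_teacher(fast, slow, task_reward)
```
]]></eg>
<p>Because such a teacher is deterministic and instantaneous, the same query budget can be replayed across many seeds, which is what makes the quantitative comparisons in this section practical.</p></div>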
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Benchmark Tasks with Unobserved Rewards</head><p>Locomotion tasks from DMControl. Figure <ref type="figure">3</ref> shows the learning curves of PEBBLE with 1400, 700 or 400 pieces of feedback<ref type="foot">foot_0</ref> and that of Preference PPO with 2100 or 1400 pieces of feedback on three complex environments: Cheetah-run, Walker-walk and Quadruped-walk. Note that we explicitly give Preference PPO an advantage by providing it with more feedback. We find that given a budget of 1400 queries, PEBBLE (green) reaches the same performance as SAC (pink) while Preference PPO (purple) is unable to match PPO (black). That PEBBLE requires less feedback than Preference PPO to match its respective oracle performance corroborates that PEBBLE is indeed more feedback-efficient. These results demonstrate that PEBBLE can enable the agent to solve the tasks without directly observing the ground truth reward function.</p><p>For further analysis, we incorporated our pre-training with Preference PPO (red) and find that it improves performance for Quadruped and Walker. We emphasize that our insight of using pre-training is able to improve both methods in terms of feedback-efficiency and asymptotic performance, but PEBBLE is uniquely positioned to benefit as it is able to utilize unsupervised experience for policy learning.</p><p>Robotic manipulation tasks from Meta-world. One application area in which HiL methods could have significant real-world impact is robotic manipulation, since learning often requires extensive engineering in the real world <ref type="bibr">(Yahya et al., 2017;</ref><ref type="bibr">Schenck &amp; Fox, 2017;</ref><ref type="bibr">Kormushev et al., 2010;</ref><ref type="bibr">Rusu et al., 2017;</ref><ref type="bibr">Akkaya et al., 2019;</ref><ref type="bibr">Peng et al., 2020)</ref>. 
However, the common approach is to perform goal-conditioned learning with classifiers <ref type="bibr">(Singh et al., 2019)</ref>, which can only capture limited information about what goal states are, and not about how they can be achieved. To study how we can utilize preference-based learning to perform more complex skills, we also consider six tasks covering a range of fundamental robotic manipulation skills from Meta-world (see Figure <ref type="figure">2</ref>). As shown in Figure <ref type="figure">4</ref>, PEBBLE matches the performance of SAC using the ground truth reward and outperforms Preference PPO, given comparable (and more) feedback, on every task. By demonstrating the applicability of PEBBLE to learning a variety of robotic manipulation tasks, we believe that we are taking an important step towards anyone (non-experts included) being able to teach robots in real-world settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Ablation Study</head><p>Contribution of each technique. In order to evaluate the individual effects of each technique in PEBBLE, we incrementally apply unsupervised pre-training and relabeling.</p><p>Figure <ref type="figure">5</ref>(a) shows the learning curves of PEBBLE with 1400 queries on Quadruped-walk. First, we remark that relabeling significantly improves performance because it makes the agent robust to changes in its reward model. Additionally utilizing unsupervised pre-training further improves both the sample-efficiency and the asymptotic performance of PEBBLE, because showing diverse behaviors to a teacher can induce a better-shaped reward. This shows that PEBBLE's key ingredients are fruitfully wed, and their unique combination is crucial to our method's success.</p><p>Effects of sampling schemes. We also analyze the effects of different sampling schemes to select queries. Figure <ref type="figure">5(b)</ref> shows the learning curves of PEBBLE with three different sampling schemes on Quadruped-walk: uniform sampling, disagreement sampling and entropy sampling. For this complex domain, we find that the uncertainty-based sampling schemes (using ensemble disagreement or entropy) are superior to the naive uniform sampling scheme. However, we note that they did not lead to extra gains on relatively simple environments, like Walker and Cheetah, similar to observations from <ref type="bibr">Ibarz et al. (2018)</ref> (see the supplementary material for more results).</p><p>Comparison with step-wise feedback. We also measure the performance of PEBBLE while varying the length of segments. Figure <ref type="figure">5</ref>(c) shows that feedback from longer segments (green curve) provides a more meaningful signal than step-wise feedback (red curve). We believe that this is because longer segments can provide more context in reward learning.</p></div>
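<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the comparison between segment-level and step-wise feedback concrete, the following is a minimal sketch, not the implementation used in the experiments. It assumes a Bradley-Terry style preference model in which each segment is scored by the sum of its predicted per-step rewards; the function names <code>preference_prob</code> and <code>preference_loss</code> and the label convention are hypothetical and chosen only for illustration. Step-wise feedback corresponds to segments of length one.</p>
<p><![CDATA[
```python
import math
from typing import Sequence


def preference_prob(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """Probability that segment A is preferred over segment B.

    Assumption (not taken from this section): a Bradley-Terry style model,
    where each segment is scored by the sum of its predicted per-step
    rewards and the two scores are compared through a logistic function.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    # Numerically stable logistic of the score difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a: Sequence[float], rewards_b: Sequence[float],
                    label: float) -> float:
    """Cross-entropy between the predicted preference and a teacher label.

    Hypothetical label convention: 1.0 means A is preferred, 0.0 means B
    is preferred, 0.5 means both segments are equally preferable.
    """
    p = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


# A per-step reward gap of 0.1 gives a weak signal for a single step
# but accumulates over a 50-step segment.
step_wise = preference_prob([0.1], [0.0])
segment = preference_prob([0.1] * 50, [0.0] * 50)
```
]]></p>
<p>Under these assumptions, a per-step reward gap of 0.1 yields a preference probability of about 0.52 for a single step and above 0.99 for a 50-step segment. This is one possible reading of why longer segments carry more context; it is not a claim about the exact mechanism behind Figure 5(c).</p></div>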
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Human Experiments</head><p>Novel behaviors. We show that agents can perform various novel behaviors based on human feedback using PEBBLE in Figure <ref type="figure">6</ref>. Specifically, we demonstrate (a) the Cart agent swinging a pole (using 50 queries), (b) the Quadruped agent waving a front leg (using 200 queries), and (c) the Hopper performing a backflip (using 50 queries). We note that the human is indeed able to guide the agent in a controlled way, as evidenced by training the same agent to perform several variations of the same task (e.g., waving different legs or spinning in opposite directions). The videos of all behaviors and examples of selected queries are available in the supplementary material.</p><p>Reward exploitation. One concern in utilizing hand-engineered rewards is that an agent can exploit unexpected sources of reward, leading to unintended behaviors. Indeed, we find that the Walker agent learns to walk using only one leg, even though it achieves the maximum score, as shown in Figure <ref type="figure">7</ref>(b). However, using 200 human queries, we were able to train the Walker to walk in a more natural, human-like manner (using both legs), as shown in Figure <ref type="figure">7</ref>(a). This result clearly shows the advantage of HiL RL in avoiding reward exploitation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion</head><p>In this work, we present PEBBLE, a feedback-efficient algorithm for HiL RL. By leveraging unsupervised pre-training and off-policy learning, we show that the sample- and feedback-efficiency of HiL RL can be significantly improved and that this framework can be applied to tasks of higher complexity than those considered by previous methods, including a variety of locomotion and robotic manipulation skills. Additionally, we demonstrate that PEBBLE can learn novel behaviors and avoid reward exploitation, leading to more desirable behaviors compared to an agent trained with respect to an engineered reward function. We believe that, by making preference-based learning more tractable, PEBBLE may help broaden the impact of RL beyond settings in which experts can carefully craft reward functions to those in which laypeople can likewise utilize the advances of robot learning in the real world.</p><p>Figures <ref type="figure">8</ref> and <ref type="figure">9</ref> show the learning curves of PEBBLE with various sampling schemes. For Quadruped, we find that the uncertainty-based sampling schemes (using ensemble disagreement or entropy) are superior to the naive uniform sampling scheme. However, they did not lead to extra gains on relatively simple environments, like Walker and Cheetah, similar to observations from <ref type="bibr">Ibarz et al. (2018)</ref>. Similarly, on the robotic manipulation tasks, we find little difference in performance for simpler tasks (Drawer Close, Window Open), whereas the uncertainty-based sampling schemes generally fare better on the other environments. </p></div>
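<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three sampling schemes compared above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code behind Figures 8 and 9: each candidate pair of segments is represented by the preference probabilities predicted by the members of a reward-model ensemble, disagreement is measured as the standard deviation of those probabilities, and entropy is computed on their mean. The names <code>disagreement_score</code>, <code>entropy_score</code> and <code>select_queries</code> are hypothetical.</p>
<p><![CDATA[
```python
import math
import random
from typing import List, Optional, Sequence


def disagreement_score(member_probs: Sequence[float]) -> float:
    """Standard deviation, across ensemble members, of P(segment A preferred)."""
    mean = sum(member_probs) / len(member_probs)
    var = sum((p - mean) ** 2 for p in member_probs) / len(member_probs)
    return math.sqrt(var)


def entropy_score(member_probs: Sequence[float]) -> float:
    """Binary entropy of the ensemble-averaged preference probability."""
    p = sum(member_probs) / len(member_probs)
    entropy = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # treat 0 * log(0) as 0
            entropy -= q * math.log(q)
    return entropy


def select_queries(candidates: Sequence[Sequence[float]], n_queries: int,
                   scheme: str = "disagreement",
                   rng: Optional[random.Random] = None) -> List[int]:
    """Return the indices of the candidate pairs to show to the teacher.

    candidates[i] holds one predicted preference probability per ensemble
    member for the i-th candidate pair of segments (an assumed layout).
    """
    if scheme == "uniform":
        rng = rng or random.Random()
        return rng.sample(range(len(candidates)), n_queries)
    score = {"disagreement": disagreement_score, "entropy": entropy_score}[scheme]
    # Rank candidates from most to least uncertain and keep the top ones.
    ranked = sorted(range(len(candidates)),
                    key=lambda i: score(candidates[i]), reverse=True)
    return ranked[:n_queries]


# Three candidate pairs scored by a three-member ensemble.
candidates = [[0.9, 0.9, 0.9], [0.1, 0.9, 0.5], [0.5, 0.5, 0.5]]
most_disputed = select_queries(candidates, 1, scheme="disagreement")
```
]]></p>
<p>In this toy example the disagreement scheme selects the second pair, on which the ensemble members differ most, whereas the entropy scheme ranks the second and third pairs equally because both have an average preference of 0.5. In practice the candidates would be a large batch of segment pairs drawn from the agent's experience.</p></div>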
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Examples of Selected Queries</head><p>Figures <ref type="bibr">10, 11 and 12</ref> show some examples of the queries selected to teach the agents.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>One piece of feedback corresponds to one preference query.</p></note>
		</body>
		</text>
</TEI>
