<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Better Than Diverse Demonstrators: Reward Decomposition From Suboptimal and Heterogeneous Demonstrations</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>07/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10611671</idno>
					<idno type="doi">10.1109/LRA.2025.3572771</idno>
					<title level='j'>IEEE Robotics and Automation Letters</title>
<idno>2377-3774</idno>
<biblScope unit="volume">10</biblScope>
<biblScope unit="issue">7</biblScope>					

					<author>Chunyue Xue</author><author>Letian Chen</author><author>Matthew Gombolay</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Inverse Reinforcement Learning (IRL) typically involves inferring a reward function from expert demonstrations to enable agents to imitate the demonstrated behavior. However, real-world settings often provide suboptimal and heterogeneous demonstrations, in which human demonstrators use diverse strategies and imperfect actions. Yet, we are unaware of any prior work that simultaneously addresses the challenges of IRL when demonstrations are both heterogeneous and suboptimal. In this work, we propose a novel approach, REPRESENT (Reward dEcomPosition fRom hEterogeneous Suboptimal dEmoNstraTion), that disentangles the latent intrinsic task reward and the strategy-specific reward from suboptimal and diverse strategies. Our method learns to identify a shared task reward component that generalizes across varying demonstrator preferences while also modeling distinct strategy-specific rewards. By decomposing the common task reward across varied demonstrations, REPRESENT extracts the core objectives shared by all strategies, enabling the agent to perform better than the demonstrators while preserving individual strategy preferences. We validate our approach on three robotic domains, showing a higher correlation with the true task reward and improved policy performance compared to baselines. These results suggest that REPRESENT can effectively handle suboptimality and heterogeneity, providing a solution for real-world LfD applications to better learn from demonstrations varied in quality and strategy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>As robots increasingly enter complex environments, traditional robotics methods requiring expert programming for each task become impractical. Learning from Demonstration (LfD), which enables robots to acquire skills directly from human demonstrations, addresses this scalability issue and has been successfully applied to manufacturing <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, healthcare <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, and autonomous driving <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>. However, demonstrations often come from non-experts with varying skills and preferences <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, making them both suboptimal and heterogeneous:</p><p>1) Suboptimality: Typical users provide lower-quality demonstrations than experts due to limitations in experience or ability [?], <ref type="bibr">[9]</ref>, hindering policy learning in traditional LfD <ref type="bibr">[7]</ref>. 2) Heterogeneity: Demonstrators exhibit varying preferences or habits when performing the same task <ref type="bibr">[10]</ref>, causing ambiguity and inconsistency in naive IRL-based policy learning <ref type="bibr">[11]</ref>. Collecting higher-quality or more demonstrations is often impractical due to limited expert availability and high costs. For example, household robots usually learn from caregivers or family members who provide imperfect demonstrations with different habits. Thus, demonstrations are inherently suboptimal and heterogeneous.</p><p>Manuscript received December 6, 2024; revised March 13, 2025; accepted <ref type="bibr">May 11, 2025.</ref> This paper was recommended for publication by Editor Aleksandra Faust upon evaluation of the Associate Editor and Reviewers' comments. This work was supported by NIH 1R01HL157457 and NSF IIS-2340177. The authors are with the School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA 30332 ({cxue43, letian.chen, matthew.gombolay}@gatech.edu). Digital Object Identifier (DOI): see top of this page.</p><p>An effective algorithm addressing both challenges simultaneously is necessary for practical real-world LfD applications. We depict the problem setup, with its dual requirements, in Fig. <ref type="figure">2</ref>. While previous work has separately addressed these issues, our experiments indicate that prior methods struggle when suboptimality and heterogeneity coexist.</p><p>We propose REPRESENT (Reward dEcomPosition fRom hEterogeneous Suboptimal dEmoNstraTion), a novel method to tackle both challenges simultaneously. We consider a scenario with limited, imperfect demonstrations from multiple non-expert users with diverse preferences. Unlike existing methods, REPRESENT extracts a shared latent task reward from suboptimal demonstrations while preserving strategy-specific rewards. This design allows our framework to surpass demonstrator performance by effectively leveraging diverse, imperfect demonstrations. Our contributions include:</p><p>&#8226; Introducing REPRESENT, a novel framework that disentangles shared task rewards from strategy-specific elements, effectively handling suboptimal demonstrations while preserving unique strategies.</p><p>&#8226; Demonstrating performance improvements (up to 300%) over prior work across three robotic domains.</p><p>Fig. <ref type="figure">2</ref>: Human demonstrations (circles) reflect different suboptimal strategies for performing a task, each constrained by individual limitations (ability boundary). The underlying optimization space is unknown; REPRESENT aims to disentangle these distinct strategies and extract a shared task reward to achieve more optimal policies (triangles).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>LfD and Inverse Reinforcement Learning (IRL) -LfD <ref type="bibr">[7]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref> has been a widely studied area of reinforcement learning in recent years; it allows robots to learn new skills from human teaching without requiring programming expertise. Behavioral Cloning (BC) <ref type="bibr">[14]</ref> learns policies through supervised learning by directly mapping states to actions using expert demonstration data, but it can suffer from distribution shift and limited generalization. BC-RNN is a variant of BC with a Recurrent Neural Network (RNN) policy network, which captures temporal correlations in decision-making and is useful for multi-modal learning. Within LfD, IRL methods aim to infer the agent's reward function from a set of observed trajectories. Adversarial Inverse Reinforcement Learning (AIRL) <ref type="bibr">[15]</ref> and Maximum-entropy IRL (MaxEnt-IRL) <ref type="bibr">[16]</ref> are two popular probabilistic IRL algorithms. These methods work well when the given demonstrations are heterogeneous and (near-)optimal, but they generally struggle to produce significantly better policies when the demonstrations are heterogeneous and suboptimal. In contrast, our method addresses these limitations and enables effective learning even when the demonstrations are suboptimal and exhibit diverse strategies.</p><p>Learning from Suboptimal Demonstrations -Recently, methods have been introduced to leverage noisy and imperfect data. Some approaches require a mixed dataset of suboptimal and expert demonstrations, where suboptimal data serves as auxiliary input <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>. However, these methods still depend on the availability of costly expert data. 
Other methods, such as <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>, can work solely with suboptimal data. Among these, Self-Supervised Reward Regression (SSRR) <ref type="bibr">[23]</ref> and D-REX <ref type="bibr">[24]</ref> have shown promise in learning from suboptimal demonstrations. D-REX utilizes BC along with a ranking approach to infer the desired reward function from suboptimal demonstrations. However, D-REX suffers from an inductive bias introduced by the ranking formulation and can be brittle to covariate shift, leading to inaccurate reward estimates. In contrast, SSRR leverages a low-pass filter based on the performance-noise relationship. Although SSRR significantly improves robustness and accuracy on noisy inputs, it cannot model multi-strategy demonstrations.</p><p>Multi-Task and Multi-Strategy Reward Learning -Learning from heterogeneous demonstrations is another key challenge in LfD, often addressed in multi-task learning. Approaches like Distral <ref type="bibr">[25]</ref> and CARE <ref type="bibr">[26]</ref> capture shared behaviors across tasks by either distilling common policies or creating task-specific representations. However, these methods focus on handling multiple tasks rather than learning diverse strategies for the same task. In contrast, methods like Multi-Style Reward Distillation (MSRD) <ref type="bibr">[27]</ref> aim to disentangle shared task rewards and strategy-specific rewards but require a complex distillation-regularization structure. Fast Lifelong Adaptive IRL (FLAIR) <ref type="bibr">[28]</ref> builds on MSRD to enable rapid adaptation by maintaining a library of strategy prototypes. 
Nevertheless, these approaches are limited by the performance of the demonstrations: the learned task and strategy rewards diverge substantially from the true latent reward function when the input demonstrations are not optimal.</p><p>In contrast to previous approaches that address only suboptimality or only heterogeneity, our approach, REPRESENT, constructs a robust and efficient method that handles both challenges simultaneously. REPRESENT disentangles the shared task reward and strategy-specific rewards, allowing it to improve learning efficiency by leveraging heterogeneous demonstrations while denoising suboptimal actions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PRELIMINARIES</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Markov Decision Process</head><p>The reinforcement learning problem can be modeled as a Markov Decision Process (MDP), denoted as a 6-tuple (S, A, R, T, γ, ρ_0), where S and A are the state space and action space. R(s, a) is the reward for taking action a in state s, sometimes simplified as R(s). T(s, a, s′) is the probability of transitioning from state s to state s′ after taking action a. ρ_0 is the initial state distribution. A policy, π(a|s), is the probability of an agent taking action a in state s. The aim of RL is to maximize the expected cumulative discounted reward, E_π[∑_{t=0}^{T} γ^t R(s_t, a_t)]. IRL can also be modeled as an MDP, but with an unknown reward function. As a type of LfD, the inputs to IRL are demonstrations from humans or robots, denoted by τ = {(s_t, a_t)}.</p></div>
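As an illustrative sketch (not part of the paper's implementation), the quantity the RL objective maximizes, E_π[∑_{t=0}^{T} γ^t R(s_t, a_t)], can be computed for a single finite trajectory as follows; the function name is our own:

```python
# Hedged sketch: the discounted return of one finite trajectory, i.e. the
# inner sum of the RL objective E_pi[sum_t gamma^t R(s_t, a_t)].
# `rewards` is the per-step reward sequence R(s_0, a_0), R(s_1, a_1), ...
def discounted_return(rewards, gamma=0.99):
    """Accumulate gamma^t * r_t over one trajectory."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g
```

Averaging this quantity over trajectories sampled from a policy approximates the expectation that RL maximizes.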
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Reward Function Decomposition</head><p>We propose a decomposition structure for reward functions based on MSRD <ref type="bibr">[27]</ref>. A strategy-combined reward, R_i, learned from Strategy-i demonstrations is a linear combination of a general task reward, R_0, and a specific strategy-only reward, as given by Eq. 1, where α_i is a hyperparameter representing the relative weight of the strategy-only reward.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Self-Supervised Reward Regression</head><p>SSRR <ref type="bibr">[23]</ref> is an IRL method for learning from suboptimal demonstrations; we partially adopt and improve upon this work.</p><p>Phase I -SSRR generates a set of noisy trajectories from the initial policy learned using suboptimal demonstrations. On each set of suboptimal demonstrations with Strategy i, we train an IRL model, Noisy Adversarial Inverse Reinforcement Learning (Noisy-AIRL), to obtain an initial reward function R_i and an initial policy π_i via a GAN (Generative Adversarial Network)-like structure, serving as the starting point for generating the self-supervised data. Different levels of uniform random noise are injected into the initial policy π_i to generate noisy policies and create behaviors at diverse performance levels. For each noisy policy, a corresponding set of trajectories is generated, where each trajectory consists of the noise parameter η_i, a sequence of states {s_t^i} and actions {a_t^i}, and the corresponding initial rewards r_t^i = R_i(s_t^i, a_t^i). This process results in a self-supervised dataset that spans different levels of policy performance, providing a rich set of training examples for learning the noise-performance relationship.</p><p>Phase II -SSRR then focuses on characterizing the relationship between the injected noise and the performance of the corresponding trajectories. 
We fit a sigmoid function to capture the relationship between the noise level and the performance of each noisy policy. The fitting uses the cumulative reward estimates from the initial reward function R_i. The sigmoid function is defined as σ(η_i) = c / (1 + e^{-k(η_i - x_0)}) + y_0, where η_i is the noise parameter and c, k, x_0, y_0 are the sigmoid parameters. This curve serves as a low-pass filter, yielding a smooth noise-performance curve, shown in Fig. <ref type="figure">3</ref> in the "Noise-reward curve fitting" block. By using the noise-performance curve, SSRR smooths the estimated rewards from the initial policy and corrects for the biases induced by noisy trajectories. This regularized reward estimate helps capture the overall reward more accurately, even under varying levels of suboptimality.</p><p>To denoise and learn from the suboptimal demonstrations, we adopt these two phases before our novel decomposition reward regression architecture.</p></div>
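The fitted sigmoid above can be restated as a small function; this is our own restatement of the Phase II model, not the authors' code, and in practice the parameters c, k, x_0, y_0 would be obtained by curve fitting:

```python
import math

# Illustrative sketch of SSRR's noise-performance model:
#   sigma(eta) = c / (1 + exp(-k * (eta - x0))) + y0.
# Parameter names follow the text; values are fit to the self-supervised data.
def noise_performance(eta, c, k, x0, y0):
    """Predicted cumulative reward of a policy with noise level eta."""
    return c / (1.0 + math.exp(-k * (eta - x0))) + y0
```

At eta = x0 the curve sits at its midpoint c/2 + y0; with a negative slope parameter k, predicted performance decreases monotonically as more noise is injected, matching the low-pass-filter role described above.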
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. METHOD</head><p>We present REPRESENT (Reward dEcomPosition fRom hEterogeneous Suboptimal dEmoNstraTion), a novel IRL framework (Fig. <ref type="figure">3</ref>) that learns reward functions by disentangling shared task rewards from strategy-specific components in suboptimal, heterogeneous demonstrations. REPRESENT introduces (1) a reward network that separates task and strategy-specific rewards, and (2) a custom loss function that isolates task from strategy rewards. Our approach effectively captures complex demonstration data, enabling policies to outperform diverse, non-expert demonstrators. In the following sections, we provide details on the problem setup, the reward network design, the custom loss function, and how our framework learns from diverse data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Problem Setup</head><p>In our setting, we assume the given demonstrations are distinguished by their providers and that each provider is associated with one specific strategy. We denote the demonstrations from User i with Strategy i as D_i = {τ_1^i, τ_2^i, ..., τ_N^i}, and the combined full dataset with M strategies as D = {D_1, D_2, ..., D_M}, where i ∈ {1, 2, ..., M} is the strategy index, N is the number of trajectories per strategy, and M is the number of different demonstrators/strategies.</p><p>Our objective is to learn a strategy-combined reward function R_i for each strategy from this set of demonstration trajectories and to distill a disentangled task reward R_0 that extracts the commonalities of the distinct strategy-combined reward functions and mimics the environment reward, together with each strategy-only reward function. The policies acquired using these task or strategy-combined reward functions can achieve performance exceeding that of the demonstrators.</p></div>
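The dataset layout above can be sketched in a few lines; this is a minimal illustration of the notation, with `trajectory_factory` as a placeholder for however trajectories are actually collected:

```python
# Sketch of the problem-setup notation: M strategies, each providing N
# trajectories of (state, action) pairs, D = {D_1, ..., D_M}.
# `trajectory_factory(i, j)` is a hypothetical stand-in that returns
# trajectory tau_j^i for strategy i.
def make_dataset(M, N, trajectory_factory):
    """Build D as a dict mapping strategy index i to its N trajectories."""
    return {i: [trajectory_factory(i, j) for j in range(N)]
            for i in range(1, M + 1)}
```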
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Reward Regression with Distillation</head><p>In our method, the first two phases handle the suboptimality of each set of same-strategy demonstrations by estimating noise-performance curves. Phase 1 generates self-supervised data through noisy policies, starting with an initial reward learned from the suboptimal demonstrations of each strategy. In Phase 2, we characterize the relationship between noise and performance by fitting a noise-performance curve for each strategy, using a sigmoid function to smooth the reward estimates.</p><p>The key innovation in our approach is Phase 3: a multi-component reward structure that separately models task and strategy rewards. This structure is necessary to learn from both suboptimal and heterogeneous demonstrations, as it enables us to capture the shared task objectives and the distinct behaviors generated by diverse strategies, consolidating common task reward information into the shared R_0 while preserving strategy-specific nuances. Jointly employing all the noise-performance curves from the first two phases, we regress the combined reward function of each strategy, parameterized by trajectory states and actions, with two distinct components: a Shared Task Reward Network and Strategy-Specific Reward Networks.</p><p>For the Shared Task Reward Network, we introduce a shared reward network R_0 that captures the intrinsic task reward common to all demonstrations. This network is designed to learn the core task-related rewards that are invariant to individual strategies. For the Strategy-Specific Reward Networks, in parallel, we use a separate strategy reward network R_i for each distinct strategy i. These networks model the individual preferences or heuristics exhibited by different demonstrators. Unlike prior work that assumes a single reward structure, our method allows for multiple strategy rewards, which capture variations in human performance.</p><p>Fig. <ref type="figure">3</ref>: Using the Hopper domain as an example, for M different strategies, we have M substructures to obtain the noise-reward curves, which are jointly trained to learn the task reward and strategy-specific rewards. The learned rewards are then used for task and strategy policy training.</p></div>
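A minimal sketch of the linear decomposition of Eq. 1 follows; the callables stand in for the task and strategy-only networks and are placeholders, not the paper's implementation:

```python
# Hedged sketch of Eq. 1: the strategy-combined reward
#   R_i(s, a) = R_0(s, a) + alpha_i * R_i'(s, a),
# where `task_reward` plays R_0, `strategy_reward` plays the strategy-only
# term R_i', and `alpha` is the hyperparameter weighting it.
def strategy_combined_reward(task_reward, strategy_reward, alpha):
    """Return a callable computing the combined reward R_i(s, a)."""
    def combined(s, a):
        return task_reward(s, a) + alpha * strategy_reward(s, a)
    return combined
```

In the full method, R_0 is shared across all M strategy-combined rewards, which is what forces it to absorb the common task component.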
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Loss Function</head><p>To ensure the disentanglement of task and strategy rewards, we design a custom loss function. The overall loss function is given in Equation <ref type="formula">2</ref>, where λ_SSRR, λ_reg, λ_BCD are weighting parameters that balance the different components, Θ_i = {θ_0, θ_i} are the full parameters of the strategy-combined reward network for strategy i, θ_0 are the parameters of the task reward network, and θ_i are the parameters of the strategy-specific reward network.</p><p>The objective includes 1) a performance loss adapted from SSRR, L_SSRR, which aligns the complete strategy-combined reward network with the regularized reward estimates obtained in Phase 2; 2) an L2-regularization loss L_reg on the strategy-only reward's output to push R_{θ_i} toward 0 and thus keep R_i close to R_0, i.e., prioritizing optimization of the intrinsic task reward, according to Eq. 1; and 3) a Between-Class Discrimination (BCD) loss from FLAIR <ref type="bibr">[28]</ref> to increase the strategy rewards' discriminability between different strategies. The BCD loss encourages Strategy i's reward to assign lower rewards to demonstrations τ_m from other strategies, which pushes the strategy rewards toward dissimilar, archetypal strategies that cover a diverse range of behaviors and make it easier to discriminate and assign strategy labels among those behaviors. We note that there exists a trade-off between the L2-regularization loss and the BCD loss, which we tune empirically.</p><p>Once the task and strategy rewards are learned, we apply a standard reinforcement learning algorithm (e.g., Soft Actor-Critic (SAC) <ref type="bibr">[29]</ref>) to obtain a policy π* that maximizes the strategy-combined reward R_i or the task reward R_0.</p></div>
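The overall objective in Eq. 2 can be sketched as a weighted sum of the three terms above. This is a hedged restatement under the assumption that the SSRR and BCD sub-losses arrive as precomputed scalars; the function and weight defaults are illustrative, not the paper's values:

```python
# Hedged sketch of the Eq. 2 objective:
#   L = lam_SSRR * L_SSRR + lam_reg * L_reg + lam_BCD * L_BCD,
# where L_reg is an L2 penalty on the strategy-only reward outputs
# (driving each strategy-only term toward 0, hence R_i toward R_0).
# `l_ssrr` and `l_bcd` are assumed precomputed; weights are invented defaults.
def represent_loss(l_ssrr, strategy_outputs, l_bcd,
                   lam_ssrr=1.0, lam_reg=0.1, lam_bcd=0.1):
    l_reg = sum(r * r for r in strategy_outputs)  # L2 on strategy-only rewards
    return lam_ssrr * l_ssrr + lam_reg * l_reg + lam_bcd * l_bcd
```

The trade-off noted in the text shows up directly here: raising lam_reg shrinks the strategy-only rewards toward zero, while raising lam_bcd pushes them apart across strategies.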
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Algorithm</head><p>REPRESENT is summarized in Algorithm 1. Starting with a heterogeneous and suboptimal demonstration dataset D, we initialize the task network θ_0 and strategy-specific networks θ_i. Phase 1 (lines 3-6) applies Noisy-AIRL to learn from the distinct and suboptimal demonstrations. For each strategy's dataset D_i, we infer the initial latent reward function R_i. Phase 2 (lines 7-11) involves sampling noisy trajectories at various noise levels to infer a noise-performance curve.</p><p>Phase 3 (lines 12-16) performs reward regression and decomposition jointly by updating θ_0 and θ_i iteratively using the curve and the loss in Eq. 2. This step ensures θ_0 generalizes across strategies while each θ_i captures strategy-specific details. Upon convergence, the final output is the task network θ_0 and each strategy-specific network θ_i, representing the distilled reward functions across strategies.</p><p>Algorithm 1 REPRESENT
1: Input: Heterogeneous and suboptimal demonstration dataset D = {D_1, D_2, ..., D_M}, where D_i = {τ_1^i, τ_2^i, ..., τ_N^i}, i ∈ {1, 2, ..., M}
2: Initialize: θ_0, and θ_i for each strategy i, ∀i ∈ {1, 2, ..., M}
3: Phase 1: Apply Noisy-AIRL to handle heterogeneous and suboptimal demonstrations
4: for each strategy demonstration set D_i ∈ D do
5:   Infer latent reward function R_i using Noisy-AIRL
6: end for
7: Phase 2: Sample noisy trajectories and construct noise-performance curves
8: for each noise level η_j do
9:   Sample noisy trajectories {τ_j} with noise level η_j
10:   Generate the noise-performance curve based on {τ_j}
11: end for
12: Phase 3: Joint reward regression and decomposition using strategy-specific and task networks
13: while not converged do
14:   Compute the reward regression using θ_i and θ_0
15:   Update parameters θ_i and θ_0 with Eq. 2
16: end while
17: Output: θ_0, and θ_i for each strategy i</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENTAL SETUP</head><p>In this section, we describe the experimental settings, including the benchmarks and baselines we choose and the demonstration generation process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Benchmark and Baseline</head><p>We test our approach on four robotic tasks based on OpenAI Gym <ref type="bibr">[30]</ref>, Gymnasium-Robotics <ref type="bibr">[31]</ref>, and MuJoCo <ref type="bibr">[32]</ref>: Hopper-v3, HalfCheetah-v3, DoubleInvertedPendulum-v2, and Franka Kitchen. To fit our setting, we remove the environments' termination conditions to gain flexibility in behaviors such as the crawling actions of Hopper. In our experiments, we utilize demonstrations with two different strategies. We collect demonstrations for Franka Kitchen using human teleoperation and generate synthetic demonstrations for the other three tasks.</p><p>As baselines, we choose SSRR, MSRD, D-REX, BC, and BC-RNN, and analyze them with different sets of input demonstrations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Synthetic Demonstration Generation</head><p>To generate suboptimal, heterogeneous demonstrations, we leverage Diversity is All You Need (DIAYN) <ref type="bibr">[33]</ref> with an early-stopping mechanism and generate demonstrations with two different strategies. DIAYN trains a discriminator to differentiate policies based on latent variables z and utilizes the pseudo-reward defined in Eq. 6, where q_φ(z|s) is the posterior probability of z given state s, and p(z) is the prior.</p><p>This objective encourages learning distinct behaviors but disregards task objectives. Thus, we integrate a linear combination of task rewards, as in Eq. 1, and diversity rewards to balance task performance and behavior diversity. We employ early stopping while training DIAYN with environment rewards, halting before the policies reach optimal performance in order to create suboptimality. This follows standard practice <ref type="bibr">[23]</ref> and produces a set of suboptimal policies that generate diverse, user-like demonstrations at non-expert skill levels.</p><p>Moreover, to create a test set for reward correlation analysis, we utilize different trained DIAYN agents with varying numbers of training steps to produce new test datasets, which provide trajectories with performance ranging from low to high. Additional information about the demonstrations we use for the experiments is shown in Table <ref type="table">II</ref> together with the experiment results.</p></div>
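The DIAYN pseudo-reward the text refers to (Eq. 6) has the standard form r(s, z) = log q_φ(z|s) − log p(z); the following is our own restatement of that formula, not code from the paper:

```python
import math

# Sketch of the DIAYN pseudo-reward: r(s, z) = log q_phi(z|s) - log p(z),
# where q_z_given_s is the discriminator's posterior probability of skill z
# given state s, and p_z is the (typically uniform) skill prior.
def diayn_pseudo_reward(q_z_given_s, p_z):
    return math.log(q_z_given_s) - math.log(p_z)
```

With a uniform prior, the reward is positive exactly when the discriminator identifies the skill better than chance, which is what drives the skills toward visiting distinguishable states.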
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Human Demonstration Collection</head><p>We modify the original Gymnasium-Robotics environment FrankaKitchen-v1 as one of our experiment domains and collect human demonstrations from experimenters via keyboard teleoperation. The goal of the domain is to use the Franka robot arm to move the kettle to the top-left burner. We use two strategies to push the kettle: directly pushing its body or holding its handle. The reward per step is the negative norm of the difference between the kettle's position and its goal position.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. RESULT</head><p>In this section, we present the advantages of our method in learning from heterogeneous and suboptimal demonstrations over prior works that address only one of the two challenges.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Research Questions</head><p>We test two main hypotheses. H1: the reward learned by our method has a higher correlation with the ground-truth reward than the baselines. H2: an RL policy trained with our learned reward achieves better-than-demonstrator performance and outperforms policies trained with baseline-method reward functions. Beyond the main hypotheses, we also conduct additional studies on data efficiency and loss function ablation.</p><p>TABLE I: Learned reward correlation coefficients with the ground-truth reward. Reported results are mean&#177;standard deviation over five seeds; columns are Hopper, HalfCheetah, and DoubleInvertedPendulum.
Task Reward Function - Ours: 0.949&#177;0.002 | 0.929&#177;0.003 | 0.999&#177;0.000; SSRR: 0.578&#177;0.057 | 0.821&#177;0.058 | -0.269&#177;0.789; MSRD: 0.551&#177;0.071 | 0.709&#177;0.069 | -0.303&#177;0.750; D-REX: 0.649&#177;0.058 | 0.779&#177;0.067 | -0.393&#177;0.782.
Strategy 1 Reward Function - Ours: 0.938&#177;0.003 | 0.930&#177;0.003 | 0.999&#177;0.000; SSRR: 0.933&#177;0.017 | 0.913&#177;0.038 | 0.434&#177;0.825; MSRD: 0.653&#177;0.045 | 0.627&#177;0.018 | -0.361&#177;0.668.
Strategy 2 Reward Function - Ours: 0.957&#177;0.004 | 0.928&#177;0.004 | 0.999&#177;0.000; SSRR: 0.916&#177;0.031 | 0.912&#177;0.024 | 0.946&#177;0.074; MSRD: 0.704&#177;0.142 | 0.656&#177;0.107 | -0.269&#177;0.703.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Metrics</head><p>We select two evaluation metrics computed on different datasets. First, we use the learned reward's correlation coefficient with the ground-truth reward on the training and test sets. Note that SSRR and D-REX do not have a task reward component, so we use their overall reward functions directly for correlation evaluation. Second, we measure the ground-truth rewards of RL policies trained with the learned reward functions. High correlation for the strategy reward functions indicates that each function not only captures the demonstrators' characteristics but also stays aligned with the underlying task reward and does not drift from the task goal.</p></div>
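The first metric is a standard Pearson correlation between learned and ground-truth rewards; the following is our own restatement of that computation for a set of trajectory rewards, not the paper's evaluation code:

```python
import math

# Sketch of the correlation metric: the Pearson correlation coefficient
# between learned reward estimates `xs` and ground-truth rewards `ys`
# over a set of evaluation trajectories.
def pearson_correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1 means the learned reward ranks trajectories the same way the ground-truth reward does, which is the property Table I reports.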
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Correlation Result</head><p>Table I reports the correlation between learned and ground-truth rewards across three domains. Our method consistently achieves higher correlations than the baseline approaches. Because our method disentangles latent and strategy-specific rewards from suboptimal, diverse data, it produces three distinct reward functions. For ours and MSRD, which both model a task reward explicitly, we report the task reward correlation along with the two strategy-combined reward correlations. For SSRR and D-REX, which lack explicit task reward modeling, we train on the combined strategy demonstrations and treat the result as the task reward prediction to ensure a fair comparison across methods handling multi-strategy data.</p><p>As indicated by the results in Table <ref type="table">I</ref>, together with Fig. <ref type="figure">4</ref>, the correlation coefficients of our method with ground-truth rewards are consistently higher across all evaluation settings than those of the baseline methods, which supports our first hypothesis H1. The results highlight that even when confronted with imperfect and heterogeneous demonstrations, REPRESENT achieves stronger alignment with ground-truth rewards than existing methods such as SSRR, MSRD, and D-REX, demonstrating its ability to accurately infer the reward function. The results support the hypothesis that decomposed rewards lead to more accurate learning outcomes, validating our method's effectiveness in complex, multi-strategy, and suboptimal environments.</p><p>Notably, combining data from multiple strategies yields lower correlation in SSRR than treating each strategy separately. This suggests that increasing dataset size with diverse strategies can introduce ambiguity that disrupts suboptimality-based methods, ultimately degrading reward modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Policy Performance Result</head><p>Table II presents the highest ground-truth reward achieved by the final policy trained using different reward functions in three benchmark domains, averaged over five trials. The results demonstrate our method's advantage in learning higher-performance policies.</p><p>The RL algorithm we use to train policies with the learned reward functions for REPRESENT, SSRR, and D-REX is Soft Actor-Critic (SAC) <ref type="bibr">[29]</ref>. For each domain, we report the results for the two distinct strategies and the policy learned from our method's task reward. The goal is to evaluate whether the learned reward functions can guide policies to achieve higher ground-truth performance, thereby validating the effectiveness of the reward inference approach. To display the improvement more intuitively, we include the final policy rewards as a percentage of the rewards of the initial demonstrations. For the policy trained with the task reward, we use the average reward of all strategy demonstrations as the denominator. Since none of SSRR, MSRD, or D-REX can be directly trained for a task-reward policy, we compare our method only with the input demonstrations for the policy learned from the task reward. Our approach consistently performs better than the baseline methods and attains higher performance than the demonstrators in all testing domains. In the Hopper-v3 environment, REPRESENT achieves a 10.8% better task reward than SSRR and 14.5% better than MSRD for Strategy 1, and 85.1% better than SSRR and 100.5% better than MSRD for Strategy 2. For the policy trained with the overall task reward, our method reaches 1692.14, exceeding the baselines including D-REX. The results are similar for the HalfCheetah-v3 domain. 
Our approach still shows superior performance, while SSRR fails to obtain positive rewards for one strategy.</p><p>In the DoubleInvertedPendulum-v2 domain, which is relatively simple and easy to learn, one of the baselines, SSRR, attains slightly lower but comparable optimal-level performance relative to our method. However, REPRESENT displays advantages in learning efficiency. As shown in Fig. <ref type="figure">5</ref>, our method converges to the same or a higher reward level with fewer training steps for each strategy, showing a faster learning process than SSRR. For Strategy 1, our method is more stable during training. For Strategy 2, both our method and SSRR exhibit significant instability around 19,000 and 26,000 iterations, leading to a temporary drop in performance.</p><p>Additionally, we train BC and BC-RNN policies directly on the combined dataset of the two strategies' demonstrations. As shown in Table <ref type="table">II</ref>, BC only achieves performance similar to the input demonstrations. In contrast, BC-RNN, despite its ability to handle multimodality, yields lower performance, likely due to the increased model complexity and additional parameters of RNNs when demonstration data is diverse and limited. Therefore, with suboptimal and heterogeneous data, BC and its variant BC-RNN are limited by the performance of the input demonstrations and struggle to produce a policy that outperforms them.</p><figure type="table"><head>TABLE II: The average cumulative rewards and percentage improvement over input demonstrations for baselines.</head><table><row><cell cols="5">Policy Learned from Strategy 1 Reward Function</cell></row><row role="label"><cell/><cell>Demo</cell><cell>Ours</cell><cell>SSRR</cell><cell>MSRD</cell></row><row><cell>Hopper-v3</cell><cell>1197.93</cell><cell>1316.01 (110%)</cell><cell>1187.50 (99%)</cell><cell>1149.11 (96%)</cell></row><row><cell>HalfCheetah-v3</cell><cell>1243.13</cell><cell>3930.14 (316%)</cell><cell>-256.09 (-21%)</cell><cell>1201.08 (97%)</cell></row><row><cell>DoubleInvertedPendulum-v2</cell><cell>861.84</cell><cell>9350.52 (1085%)</cell><cell>9290.24 (1078%)</cell><cell>29.67 (3%)</cell></row><row><cell cols="5">Policy Learned from Strategy 2 Reward Function</cell></row><row role="label"><cell/><cell>Demo</cell><cell>Ours</cell><cell>SSRR</cell><cell>MSRD</cell></row><row><cell>Hopper-v3</cell><cell>1587.98</cell><cell>2893.47 (182%)</cell><cell>1563.45 (98%)</cell><cell>1443.17 (91%)</cell></row><row><cell>HalfCheetah-v3</cell><cell>1172.71</cell><cell>3860.52 (329%)</cell><cell>2013.79 (172%)</cell><cell>1410.34 (120%)</cell></row><row><cell>DoubleInvertedPendulum-v2</cell><cell>1217.65</cell><cell>9350.47 (768%)</cell><cell>9315.36 (765%)</cell><cell>29.82 (2%)</cell></row><row><cell cols="6">Policy Learned from Task Reward Function / Policy Learned with Merged Data</cell></row><row role="label"><cell/><cell>Demo</cell><cell>Ours</cell><cell>D-REX</cell><cell>BC</cell><cell>BC-RNN</cell></row><row><cell>Hopper-v3</cell><cell>1392.96</cell><cell>1692.14 (121%)</cell><cell>0.91 (0%)</cell><cell>1102.19 (79%)</cell><cell>187.20 (13%)</cell></row><row><cell>HalfCheetah-v3</cell><cell>1207.92</cell><cell>3540.06 (293%)</cell><cell>3148.64 (261%)</cell><cell>960.12 (79%)</cell><cell>799.44 (66%)</cell></row><row><cell>DoubleInvertedPendulum-v2</cell><cell>1039.75</cell><cell>9350.50 (899%)</cell><cell>83.00 (7%)</cell><cell>722.88 (70%)</cell><cell>129.96 (12%)</cell></row></table></figure><p>Our results demonstrate that REPRESENT outperforms baseline methods in final policy performance using learned reward functions, supporting hypothesis H2. Its multi-component architecture more effectively captures reward structures and integrates information across diverse strategies, which is critical when demonstrators exhibit varied, non-expert behavior. Additionally, REPRESENT reaches comparable or superior performance earlier in training, indicating enhanced efficiency.</p></div>
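The percentages in Table II are the final-policy return expressed as a percentage of the demonstration return, with the task-reward row using the mean over all strategies' demonstrations as its denominator. A minimal sketch of this bookkeeping (our own illustration, not the authors' code), using values from the table:

```python
def improvement_pct(policy_return, demo_return):
    """Final-policy return as a rounded percentage of the demonstration
    return, e.g. 110 means the policy slightly beats the demonstrations."""
    return round(100 * policy_return / demo_return)

# Hopper-v3, Strategy 1 (values from Table II):
print(improvement_pct(1316.01, 1197.93))  # 110

# Task-reward row: the denominator is the mean return over both
# strategies' demonstrations, which reproduces the Demo column (1392.96).
demo_task = (1197.93 + 1587.98) / 2
print(improvement_pct(1692.14, demo_task))  # 121
```

Values above 100 indicate better-than-demonstrator performance, which is the criterion the paper uses throughout this section.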
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Ablation Study</head><p>We conduct an ablation study in Hopper-v3 to assess the contributions of the SSRR, regularization, and BCD losses in Eq. 2 (Table <ref type="table">III</ref>). Our full model achieves the highest task performance and produces distinct behavioral strategies. Removing the regularization term improves task reward optimization but sacrifices diversity, while removing the BCD loss degrades both optimality and diversity.</p><figure type="table"><head>TABLE III: Ablation Study in Hopper-v3</head><table><row role="label"><cell>Function</cell><cell>Strategy 1</cell><cell>Strategy 2</cell><cell>Task</cell></row><row><cell>Ours</cell><cell>1316.01</cell><cell>2893.47</cell><cell>1692.14</cell></row><row><cell>-Regularization Loss</cell><cell>1090.03</cell><cell>2269.25</cell><cell>1785.19</cell></row><row><cell>-BCD Loss</cell><cell>1214.00</cell><cell>2000.38</cell><cell>1633.13</cell></row><row><cell>SSRR</cell><cell>1187.50</cell><cell>1563.45</cell><cell>N/A</cell></row></table></figure></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Human Demonstration Experiment: Franka Kitchen</head><p>Utilizing the demonstrations from Section V-C, we benchmark REPRESENT and BC, with the results presented in Table <ref type="table">IV</ref>. The REPRESENT framework outperforms BC and the original demonstrations, although some strategies still show room for improvement due to the challenges posed by larger observation and action spaces.</p><figure type="table"><head>TABLE IV: Results for FrankaKitchen. Higher is better.</head><table><row role="label"><cell/><cell>Demo</cell><cell>REPRESENT (Strategy)</cell><cell>REPRESENT (Task)</cell><cell>BC</cell></row><row><cell>Strat. 1</cell><cell>-122.86</cell><cell>-121.94</cell><cell>-106.48</cell><cell>-467.95</cell></row><row><cell>Strat. 2</cell><cell>-174.46</cell><cell>-169.20</cell><cell>-106.48</cell><cell>-528.68</cell></row></table></figure></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. LIMITATIONS AND FUTURE WORK</head><p>While REPRESENT shows promising results, two key limitations offer opportunities for future work:</p><p>&#8226; Real-World Deployment: Evaluations were conducted only in simulation. Future efforts should deploy REPRESENT on physical robots to evaluate its robustness against real-world challenges such as noise, sensor errors, and hardware constraints. &#8226; Scalability: Our current approach assumes a finite number of distinct strategies. As a first step, we extend the two-strategy analysis from Section VI to consider how REPRESENT could leverage three distinct strategies. Results are shown in Table V, indicating promising strategy- and task-reward learning in HalfCheetah-v3. 
Future research should further investigate how to scale to settings with larger diversity. Future work should also investigate mechanisms to relax the assumption of known strategy labels, e.g., by inferring those labels [28].</p><figure type="table"><head>TABLE V: Average cumulative rewards of policies learning from three different strategies in HalfCheetah-v3.</head><table><row role="label"><cell/><cell>Demo.</cell><cell>REPRESENT</cell></row><row><cell>Strat. 1</cell><cell>1243.13</cell><cell>2279.29</cell></row><row><cell>Strat. 2</cell><cell>1172.71</cell><cell>3220.52</cell></row><row><cell>Strat. 3</cell><cell>1181.35</cell><cell>2664.76</cell></row><row><cell>Task</cell><cell>1199.06</cell><cell>2901.24</cell></row></table></figure></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSION</head><p>We propose a novel framework that infers reward functions from suboptimal, heterogeneous demonstrations by separating shared task rewards from strategy-specific elements. By integrating methods for learning from suboptimal (i.e., SSRR) and heterogeneous (i.e., MSRD) demonstrations, our approach efficiently learns from imperfect demonstrations, outperforming existing methods in simulated environments and achieving better-than-demonstrator performance.</p></div></body>
		</text>
</TEI>
