<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Hierarchical Equivariant Policy via Frame Transfer</title></titleStmt>
			<publicationStmt>
				<publisher>International Conference on Machine Learning</publisher>
				<date>2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10652223</idno>
					<idno type="doi"></idno>
					
					<author>Haibo Zhao</author><author>Dian Wang</author><author>Yizhe Zhu</author><author>Xupeng Zhu</author><author>Owen Howell</author><author>Linfeng Zhao</author><author>Yaoyao Qian</author><author>Robin Walters</author><author>Robert Platt</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Recent advances in hierarchical policy learning highlight the advantages of decomposing systems into high-level and low-level agents, enabling efficient long-horizon reasoning and precise fine-grained control. However, the interface between these hierarchy levels remains underexplored, and existing hierarchical methods often ignore domain symmetry, resulting in the need for extensive demonstrations to achieve robust performance. To address these issues, we propose Hierarchical Equivariant Policy (HEP), a novel hierarchical policy framework. We propose a frame transfer interface for hierarchical policy learning, which uses the high-level agent's output as a coordinate frame for the low-level agent, providing a strong inductive bias while retaining flexibility. Additionally, we integrate domain symmetries into both levels and theoretically demonstrate the system's overall equivariance. HEP achieves state-of-the-art performance in complex robotic manipulation tasks, demonstrating significant improvements in both simulation and real-world settings.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Learning-based approaches have emerged as a powerful paradigm for developing control policies in sequential decision-making tasks, such as robotic manipulation. By leveraging data-driven methods, policy learning provides a scalable framework for addressing tasks with complex dynamics and high-dimensional observation spaces. Recent advancements in end-to-end policy learning <ref type="bibr">(Zhao et al., 2023;</ref><ref type="bibr">Chi et al., 2023)</ref> have shown promising results in mapping raw sensory inputs to low-level actions such as end-effector trajectories. While these methods exhibit state-of-the-art performance when large amounts of training data are available, they struggle in scenarios with only limited data, due to the large function space required to parameterize complex end-to-end mappings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).</p><p>Figure <ref type="figure">1</ref>. Hierarchical Equivariant Policy (HEP) is composed of a high-level agent that predicts a coarse translation, a low-level agent that predicts the fine-grained trajectory, and a novel Frame Transfer interface that transfers the coordinate frame of the low-level to the predicted keypose frame from the high-level.</p><p>A promising alternative strategy is to employ a hierarchical structural prior that decomposes the policy into different levels, e.g., a high-level agent responsible for identifying a goal pose and a low-level agent for trajectory refinement. Hierarchical methods can reduce the complexity of the policy function space by delegating long-horizon reasoning to the high-level module and fine-grained control to the low-level module, enabling efficient learning and execution. Despite their promise, one underexplored question in hierarchical policy learning is what the right interface between levels should be. For example, in robotic manipulation, existing hierarchical methods <ref type="bibr">(Ma et al., 2024;</ref><ref type="bibr">Xian et al., 2023)</ref> often impose rigid constraints on the interface between the high-level and low-level agents, where the high-level action is used as the last pose in the low-level trajectory. 
This constraint limits flexibility and often requires both levels to perform fine-grained reasoning in high-dimensional spaces, negating some of the potential benefits of the hierarchical design. Moreover, prior hierarchical methods focus solely on the hierarchical decomposition and do not exploit the domain symmetries often present in robotic tasks, missing an opportunity to further improve generalization and efficiency.</p><p>In this paper, we propose a novel hierarchical policy learning framework that overcomes these limitations by introducing a more flexible and efficient interface between the high-level and low-level agents. Specifically, our high-level agent predicts a keypose in the form of a coarse 3D location representing a subgoal of the task. This location is then used to construct the coordinate frame for the low-level policy, enabling it to predict trajectories relative to this keypose frame, as shown in Figure <ref type="figure">1</ref>. This Frame Transfer interface maintains a strong inductive bias (by anchoring the low-level policy to a subgoal) yet offers structural flexibility (allowing the low-level policy to refine trajectories locally). Furthermore, Frame Transfer offers a natural fit for integrating domain symmetry by decomposing it into the global symmetry of the subgoal (i.e., the subgoal should transform with the scene) and a local symmetry of the low-level policy (i.e., it should behave consistently in the local keypose frame). By incorporating equivariant structures at both levels, our entire hierarchical system becomes more robust to spatial variations, resulting in significantly improved sample efficiency. 
Lastly, to better encode 3D sensory information, we adopt a stacked voxel representation <ref type="bibr">(Zhou &amp; Tuzel, 2018)</ref>, ensuring rich visual features and fast computation.</p><p>We summarize our contributions as follows:</p><p>&#8226; We propose Hierarchical Equivariant Policy (HEP), a novel, sample-efficient hierarchical policy learning framework.</p><p>&#8226; We introduce Frame Transfer as an interface for hierarchical policy learning, providing effective and flexible policy decomposition.</p><p>&#8226; We theoretically demonstrate the equivariance of HEP, showing its spatial generalizability. Although equivariance has been used in policy learning, our work is the first to study it in a hierarchical policy.</p><p>&#8226; We provide a thorough evaluation of our method in both simulation and the real world. Across 30 RLBench <ref type="bibr">(James et al., 2020)</ref> tasks, HEP outperforms state-of-the-art baselines by an average of 10% to 23% in different settings, with particular improvement on tasks requiring fine control or long-horizon reasoning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Learning from Demonstrations (LfD) enables policies to be trained from human demonstrations and generalized to unseen scenarios. One class of LfD learns abstracted keyframe actions <ref type="bibr">(James &amp; Davison, 2022;</ref><ref type="bibr">James et al., 2022;</ref><ref type="bibr">Shridhar et al., 2023;</ref><ref type="bibr">Gervet et al., 2023;</ref><ref type="bibr">Goyal et al., 2023)</ref> in terms of the target pose of the gripper, then uses motion planning to interpolate between keyframes. This formulation enables learning with fewer decision steps, but is not suitable for non-prehensile actions like door opening or wiping <ref type="bibr">(Xian et al., 2023;</ref><ref type="bibr">Ma et al., 2024)</ref>. Another class of LfD mimics the fine-grained trajectory directly <ref type="bibr">(Song et al., 2020;</ref><ref type="bibr">Ye et al., 2022;</ref><ref type="bibr">Toyer et al., 2020;</ref><ref type="bibr">Zhang et al., 2018;</ref><ref type="bibr">Chi et al., 2023;</ref><ref type="bibr">Zhu et al., 2023;</ref><ref type="bibr">Mandlekar et al., 2021;</ref><ref type="bibr">Zhao et al., 2023;</ref><ref type="bibr">Wang et al., 2024)</ref>, enabling broader task coverage but suffering from overloading the model with details <ref type="bibr">(Zhao et al., 2023)</ref>, covariate shift <ref type="bibr">(Ke et al., 2021)</ref>, and poor performance in long-horizon tasks. 
To bridge these approaches, we introduce Frame Transfer, a novel interface that integrates keyframe-based and trajectory-based models, enhancing flexibility and task adaptability.</p><p>Hierarchical Policy has been explored for action refinement in a coarse-to-fine manner <ref type="bibr">(Levy et al., 2018;</ref><ref type="bibr">Gualtieri &amp; Platt, 2020;</ref><ref type="bibr">James et al., 2022)</ref> or through a two-stage hierarchy for translational and rotational actions <ref type="bibr">(Sharma et al., 2017;</ref><ref type="bibr">Wang et al., 2020;</ref><ref type="bibr">Zhu et al., 2022)</ref>. While these approaches improve over end-to-end policies, they lack integration of keyframe and trajectory actions. Recent works <ref type="bibr">(Xian et al., 2023;</ref><ref type="bibr">Ma et al., 2024)</ref> address this by hierarchically combining a keyframe agent and a trajectory agent, but they fix the goal pose of the trajectory agent to the output of the keyframe agent, limiting the low-level agent's flexibility and demanding precise reasoning from the high-level agent. 
In contrast, our framework enables a more adaptable interface between levels, allowing the low-level agent to refine high-level actions.</p><p>Equivariant Robot Learning leverages geometric symmetries in 3D Euclidean space, and has emerged as a powerful strategy for boosting the sample efficiency, robustness, and generalization of robotic manipulation policies. An expanding line of research demonstrates that encoding equivariance to translations and in-plane rotations directly into the perception-to-action pipeline can dramatically reduce the burden of data collection and retraining when scenes or objects are repositioned. Recent works have explored this principle across a wide spectrum of tasks and architectures, from point-cloud networks to image-conditioned diffusion models <ref type="bibr">(Wang et al., 2022;</ref><ref type="bibr">2021;</ref><ref type="bibr">Liu et al., 2023;</ref><ref type="bibr">Kim et al., 2023;</ref><ref type="bibr">Kohler et al., 2023;</ref><ref type="bibr">Nguyen et al., 2023;</ref><ref type="bibr">2024;</ref><ref type="bibr">Eisner et al., 2024;</ref><ref type="bibr">Gao et al., 2024)</ref>. A particularly active subfield studies bi-equivariant pick-and-place policies that remain consistent under simultaneous transformations of both the target object and the robot gripper. Systems in this class, spanning neural descriptors, transporters, and diffusion planners, have achieved state-of-the-art placement accuracy and transferability <ref type="bibr">(Simeonov et al., 2022;</ref><ref type="bibr">Ryu et al., 2023b;</ref><ref type="bibr">a;</ref><ref type="bibr">Pan et al., 2023;</ref><ref type="bibr">Huang et al., 2022;</ref><ref type="bibr">2024a;</ref><ref type="bibr">c;</ref><ref type="bibr">b)</ref>. 
Complementary research focuses on equivariant grasp synthesis, showing that symmetry-aware grasp networks can generalize to novel object poses with an order of magnitude fewer demonstrations <ref type="bibr">(Zhu et al., 2022;</ref><ref type="bibr">Huang et al., 2023;</ref><ref type="bibr">Hu et al., 2024;</ref><ref type="bibr">Lim et al., 2024)</ref>. At a finer temporal scale, several groups have begun to embed equivariance into trajectory-generation modules for tasks such as tool use, insertion, or deformable manipulation, reporting sharper motion accuracy and improved long-horizon stability <ref type="bibr">(Jia et al., 2023;</ref><ref type="bibr">Wang et al., 2024;</ref><ref type="bibr">Yang et al., 2024b;</ref><ref type="bibr">a)</ref>. Unlike these independent applications, we integrate equivariance into a hierarchical policy that unifies keyframe and trajectory-based learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Problem Definition</head><p>In this paper, we focus on visuomotor policy learning via Behavior Cloning (BC) in robotic manipulation. We aim to learn a policy π : O → A that maps from the observation space O to the action space A.</p><p>To define the observation and action spaces, let s = (x, y, z, q, c) ∈ S = R^3 × SO(3) × R be the space of gripper states, where (x, y, z) is a 3D position, q ∈ SO(3) is an orientation, and c is a gripper aperture (open width). An observation o ∈ O = R^(n×(3+k)) × S^t includes both a point cloud P = {p_i : p_i = (x_i, y_i, z_i, f_i) ∈ R^(3+k)} with k-dimensional point features (e.g., k = 3 for RGB) and t history steps of the gripper state. An action a = {a_1, a_2, . . . , a_m} ∈ A = S^m contains m control steps of the gripper state.</p></div>
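The spaces above can be written down concretely. Below is a minimal Python sketch of the gripper state s = (x, y, z, q, c) and the observation o = (P, history); the type names (GripperState, Observation) are ours, not the paper's.

```python
from typing import List, NamedTuple, Tuple

class GripperState(NamedTuple):
    """s = (x, y, z, q, c): position, orientation (here a wxyz quaternion), aperture."""
    pos: Tuple[float, float, float]
    quat: Tuple[float, float, float, float]
    aperture: float

class Observation(NamedTuple):
    """o = (P, history): a point cloud with k-dim features plus t gripper states."""
    points: List[Tuple[float, ...]]   # each (x, y, z, f_1..f_k); k = 3 for RGB
    history: List[GripperState]       # t most recent gripper states

# An action is m control steps of the gripper state: a = (a_1, ..., a_m).
Action = List[GripperState]

s = GripperState((0.4, 0.0, 0.2), (1.0, 0.0, 0.0, 0.0), 0.04)
o = Observation([(0.5, 0.1, 0.05, 255.0, 0.0, 0.0)], [s])
```

Here the quaternion stands in for the abstract q ∈ SO(3); any orientation parameterization would do.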
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Equivariance</head><p>A function f is equivariant if it commutes with the transformations of a symmetry group G, i.e., ∀g ∈ G, f(gx) = gf(x). This is a mathematical way of expressing that f is symmetric with respect to G: if we evaluate f on transformed versions of the same input, we obtain correspondingly transformed versions of the same output.</p><p>Our objective is to design a policy that is symmetric (equivariant) under the group T(3) × SO(2), where T(3) is the group of 3D translations and SO(2) is the group of planar rotations around the z-axis of the world coordinate system: π(go) = gπ(o) for all g ∈ T(3) × SO(2). This symmetry captures the ground-truth structure of many robotic tasks without enforcing unnecessary out-of-plane rotation equivariance (which is often invalid due to gravity and the canonical pose of objects).</p><p>To define a T(3) × SO(2)-equivariant policy, we first need to define how a group element acts on the observation and the action. Let g = (t, R_θ) ∈ T(3) × SO(2), where t = (t_x, t_y, t_z) and R_θ is the 2 × 2 rotation matrix. Let R̄_θ = diag(R_θ, 1) be the 3 × 3 block-diagonal matrix that rotates the x-y plane and leaves z unchanged. Then g acts on the action a by transforming each gripper pose command, ga = {ga_1, ga_2, . . . , ga_m}, where ga_i = (R̄_θ(x_i, y_i, z_i) + t, R̄_θ q_i, c_i). Likewise, g acts on o by transforming the gripper states in the same way as a, and by transforming the point cloud as gP = {gp_i : gp_i = (R̄_θ(x_i, y_i, z_i) + t, f_i)}, where the point features f_i are left unchanged.</p></div>
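The T(3) × SO(2) action and the equivariance condition can be checked numerically. In this toy sketch (function names are ours), the centroid of a point cloud stands in for an equivariant map f, since the mean commutes with rotations and translations:

```python
import math

def act_on_point(g, p):
    """Apply g = (t, theta) in T(3) x SO(2): rotate (x, y) about z, then translate."""
    (tx, ty, tz), theta = g
    x, y, z = p
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty, z + tz)

def centroid(P):
    """A simple equivariant map f: centroid(gP) = g centroid(P)."""
    n = len(P)
    return tuple(sum(p[i] for p in P) / n for i in range(3))

g = ((0.1, -0.2, 0.3), math.pi / 2)
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.0, 1.0, 1.0)]
# Equivariance check: f(gP) == g f(P)
lhs = centroid([act_on_point(g, p) for p in P])
rhs = act_on_point(g, centroid(P))
assert all(abs(a - b) < 1e-9 for a, b in zip(lhs, rhs))
```

The same check with a non-equivariant f (e.g., one that cubes coordinates) would fail, which is the distinction the policy design below exploits.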
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Voxel Maps as Functions</head><p>In deep learning, voxel maps (3D volumetric data) are typically expressed as tensors. However, it is sometimes convenient to express volumetric data in the form of functions over 3D space. Specifically, given a one-channel voxel map V ∈ R^(1×D×H×W), we may equivalently express V as a continuous function V : R^3 → R, where V(x, y, z) describes the intensity value at the continuous world coordinate (x, y, z). Notice that here the domain of V is the 3D world coordinate frame, not the discrete voxel indices. The relationship between the voxel indices and world coordinates is a linear map defined by the spatial resolution and the origin of the voxel grid.</p><p>Similarly, if we have an m-channel voxel map V ∈ R^(m×D×H×W), we can interpret it as V : R^3 → R^m, where each point (x, y, z) in the volume maps to an m-dimensional feature vector. The group element g = (t, θ) ∈ T(3) × SO(2) acts on a voxel feature map as (gV)(x, y, z) = ρ(θ) V(R̄_θ⁻¹((x, y, z) − t)), where t ∈ T(3) acts on V by translating the voxel location, while θ acts on V by both rotating the voxel location and transforming the feature vector via ρ(θ) ∈ GL(m), an m × m invertible matrix known as a group representation.</p></div>
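The linear map between voxel indices and world coordinates mentioned above is fully determined by the grid origin and resolution; a small sketch (helper names are ours, with a 1 cm grid as an assumed example):

```python
import math

def world_to_index(xyz, origin, res):
    """Voxel index of a world point: (world - origin) / resolution, floored per axis."""
    return tuple(int(math.floor((w - o) / res + 1e-9)) for w, o in zip(xyz, origin))

def index_to_center(idx, origin, res):
    """World coordinate of a voxel's center (the inverse map, up to quantization)."""
    return tuple(o + (i + 0.5) * res for i, o in zip(idx, origin))

origin, res = (-0.5, -0.5, 0.0), 0.01          # 1 cm voxels starting at (-0.5, -0.5, 0)
idx = world_to_index((0.123, -0.042, 0.205), origin, res)   # -> (62, 45, 20)
center = index_to_center(idx, origin, res)                  # -> (0.125, -0.045, 0.205)
```

Reading a tensor-backed V at a world coordinate is then just indexing at `world_to_index(...)`, which is how the functional view V(x, y, z) is realized in practice.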
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Hierarchical Equivariant Policy</head><p>The main contribution of our paper is a Hierarchical Equivariant Policy that leverages equivariant learning in both the high-level and low-level agents and employs a novel Frame Transfer interface to connect them. In this section, we first give an overview of our hierarchical policy structure and the novel Frame Transfer interface. Then, we describe the high-level and low-level agents in detail.</p><p>The overview of our system is shown in Figure <ref type="figure">2</ref>. We factor the policy learning problem into a two-step action prediction using a high-level agent π_high and a low-level agent π_low: t_high = π_high(o) and a = π_low(o, t_high), where t_high ∈ T(3) is a 3D translation predicted by π_high. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Frame Transfer Interface</head><p>The effectiveness of a hierarchical policy depends largely on the design of the high-level action output and its integration with the low-level agent. Prior approaches <ref type="bibr">(Xian et al., 2023;</ref><ref type="bibr">Ma et al., 2024)</ref> often constrain the high-level agent to predict an SE(3) pose, which is then treated as a rigid constraint for the low-level agent by enforcing it as the endpoint of the low-level trajectory. While this design simplifies task decomposition, it restricts flexibility and imposes computational burdens on the high-level agent, which must reason about precise pose constraints in high-dimensional spaces.</p><p>To overcome these limitations, we propose a flexible and efficient Frame Transfer interface (Figure <ref type="figure">2</ref> middle) between the high- and low-level agents that only passes a T(3) frame rather than constraining the pose. Specifically, our high-level agent predicts a 3D translation t_high, which is used as a canonical reference frame for the low-level agent, π_low(o, t_high) = ϕ(τ(o, t_high)) + t_high, where t_high is the 3D translation (i.e., a keypose) predicted by the high-level agent, ϕ is a trajectory generator that produces a trajectory based on the transformed observation, and τ is the Frame Transfer function, which translates the (x, y, z) component of the input observation or action into the keypose frame. Specifically, we define the + and − operators between o or a and t_high as addition and subtraction on the (x, y, z) component of o or a. For example, for a = (x, y, z, q, c), a + t_high = ((x, y, z) + t_high, q, c). The Frame Transfer function τ is then defined as τ(o, t_high) = o − t_high.</p><p>The Frame Transfer interface offers several advantages. First, it provides an efficient mechanism that geometrically embeds the high-level action directly into the input of the low-level agent, ensuring seamless communication between the two levels. 
Second, by representing observations and trajectories in a relative frame, it introduces translation invariance to the low-level agent, simplifying its learning process and improving robustness. Third, unlike prior works <ref type="bibr">(Xian et al., 2023;</ref><ref type="bibr">Ma et al., 2024)</ref> which treat the high-level prediction as a rigid motion-planning constraint (thus forcing the high-level agent to generate accurate SE(3) poses and limiting the policy to open-loop operation), our approach interprets the high-level output as a flexible constraint. This flexibility reduces the computational burden on the high-level agent, as it only predicts a 3D translation, while preserving the system's capability to operate in both open-loop and closed-loop control settings.</p></div>
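A minimal sketch of the interface: τ subtracts t_high from the (x, y, z) components, the low-level generator works in the local frame, and the result is mapped back to the world frame. The nearest-point generator `phi` here is our toy stand-in for the learned trajectory generator, not the paper's model:

```python
def tau(points, t_high):
    """Frame Transfer: tau(o, t_high) = o - t_high on the (x, y, z) component."""
    return [tuple(c - t for c, t in zip(p[:3], t_high)) + tuple(p[3:]) for p in points]

def low_level(points, t_high, phi):
    """pi_low(o) = phi(tau(o, t_high)) + t_high: generate locally, map back to world."""
    return [tuple(c + t for c, t in zip(wp, t_high)) for wp in phi(tau(points, t_high))]

def phi(local_points):
    """Toy generator: approach the point nearest the keypose from 10 cm above."""
    tgt = min(local_points, key=lambda p: sum(c * c for c in p[:3]))[:3]
    return [(tgt[0], tgt[1], tgt[2] + 0.1), tgt]

obs = [(0.52, 0.11, 0.05, 255.0, 0.0, 0.0)]   # one observed point with RGB features
t_high = (0.5, 0.1, 0.0)                      # keypose from the high-level agent
traj = low_level(obs, t_high, phi)            # trajectory back in world coordinates
```

Shifting the scene and the keypose together by any t ∈ T(3) shifts the output by exactly t, regardless of what `phi` does internally; this is the translation property argued formally in section 4.5.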
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">High-level Agent</head><p>To efficiently predict the high-level action t_high ∈ T(3), we represent it as a discretized voxel map V_a : R^3 → R, where V_a(t) represents the probability of translation t (see subsection 3.3). This provides a dense spatial representation and naturally handles translation multi-modality <ref type="bibr">(Shridhar et al., 2023)</ref>. The center of the voxel with the highest predicted probability is then selected as the high-level agent's final output, t_high = arg max V_a. Accordingly, the input observation is voxelized to V_o : R^3 → R^3 (where the output of V_o is RGB), and we use an equivariant network to map V_o to V_a (the voxel encoder is described in subsection 4.3). The entire high-level structure is shown in Figure <ref type="figure">2</ref> top.</p><p>During training, the high-level agent's objective is to minimize the discrepancy between its predicted voxel heatmap V_a and the ground-truth one-hot heatmap V*_a, derived from expert demonstrations, using the cross-entropy loss L_high = −Σ_(x,y,z) V*_a(x, y, z) log V̂_a(x, y, z), where V̂_a(x, y, z) is the probability for voxel (x, y, z) obtained by applying a softmax over the predicted heatmap.</p></div>
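The heatmap objective amounts to a softmax cross-entropy over the flattened voxel grid, with an arg max decoded back to a voxel center at inference time. A minimal numeric sketch on a hypothetical 4-voxel "grid" (helper names are ours):

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logits, target_idx):
    """L_high: negative log-probability of the expert's one-hot voxel."""
    return -math.log(softmax(logits)[target_idx])

def pick_t_high(logits, centers):
    """t_high = center of the voxel with the highest predicted probability."""
    return centers[max(range(len(logits)), key=lambda i: logits[i])]

logits = [0.1, 2.5, -1.0, 0.3]                      # flattened voxel heatmap
centers = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (0.3, 0.0, 0.0)]
t_high = pick_t_high(logits, centers)               # -> (0.1, 0.0, 0.0)
loss = cross_entropy(logits, target_idx=1)          # small: the peak matches the expert
```

With a one-hot target, the sum over voxels collapses to a single −log V̂_a term, which is what `cross_entropy` computes.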
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Stacked Voxel Representation</head><p>As our high-level agent uses 3D voxel grids as the visual input, the voxel encoder plays a crucial role in the policy. Standard 3D convolutional encoders impose a heavy computational burden, which often requires aggressive resolution compression that reduces the fine details in the observation.</p><p>To address this limitation, we adopt Stacked Voxels <ref type="bibr">(Zhou &amp; Tuzel, 2018)</ref> from the 3D vision literature, which preserve fine-grained spatial cues by replacing voxel downsampling with a PointNet <ref type="bibr">(Qi et al., 2017</ref>) that aggregates information from all points within the spatial extent of each voxel.</p><p>Specifically, given a point cloud P, we first partition it into H × W × D point sets, where each set P_j ⊆ P corresponds to the points contained within a voxel j in the H × W × D voxel grid. Each point set P_j is processed by an equivariant PointNet l : P_j → V(j_x, j_y, j_z) to produce a c-dimensional aggregated feature vector for the voxel j. Repeating this for all voxels results in a voxel grid feature map with dimensions c × H × W × D. This feature map is then used as input to subsequent 3D convolutional networks.</p><p>This process, illustrated in Figure <ref type="figure">3</ref>, retains more nuanced shape information compared to simple voxel downsampling. Moreover, we prove that the stacked voxel representation maintains equivariance. (See proof in Appendix C.)</p><p>Proposition 4.1. If the PointNet l is SO(2)-equivariant and T(3)-invariant, i.e., l(gP_j) = ρ(θ)l(P_j), then the stacked voxel representation ν : P → V s.t. ν(P)(j_x, j_y, j_z) = l(P_j) is T(3) × SO(2)-equivariant, i.e., ν(gP) = gν(P).</p><p>In practice, we implement the T(3)-invariance in the PointNet by using the relative position to the center of each voxel, and implement the SO(2)-equivariance using escnn <ref type="bibr">(Cesa et al., 2022)</ref>.</p></div>
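A toy version of the stacked voxel construction: points are bucketed by voxel, featurized by their offset to the voxel center (which yields the T(3)-invariance noted above), and pooled per voxel. A componentwise max stands in for the learned PointNet, and point features are omitted:

```python
import math
from collections import defaultdict

def stacked_voxels(points, origin, res):
    """Per-voxel PointNet-style pooling over coordinates relative to voxel centers."""
    buckets = defaultdict(list)
    for p in points:
        idx = tuple(int(math.floor((c - o) / res)) for c, o in zip(p, origin))
        center = tuple(o + (i + 0.5) * res for i, o in zip(idx, origin))
        buckets[idx].append(tuple(c - m for c, m in zip(p, center)))
    # voxel feature: componentwise max over the voxel's relative offsets
    return {idx: tuple(max(q[d] for q in pts) for d in range(3))
            for idx, pts in buckets.items()}

pts = [(0.012, 0.003, 0.0), (0.018, 0.007, 0.0)]
grid = stacked_voxels(pts, (0.0, 0.0, 0.0), 0.02)   # both points fall in voxel (0, 0, 0)
```

Because only relative offsets enter the pooled feature, translating the points and the grid origin together leaves every voxel feature unchanged, which is the T(3)-invariance the proposition requires of l.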
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Low-level Agent</head><p>After predicting the high-level action t_high and using Frame Transfer to canonicalize the observation, our low-level trajectory generator ϕ needs to create an SE(3) trajectory for the robot gripper. As shown in Figure <ref type="figure">2</ref> bottom, we first process the observation with a stacked voxel encoder, then leverage an equivariant diffusion policy <ref type="bibr">(Wang et al., 2024)</ref> to represent the policy ϕ, which denoises the trajectory from a randomly sampled noisy trajectory. Specifically, we model a conditional noise prediction function ε : (o, a_k, k) → e_k, where the observation o is the denoising condition, a_k is a noisy action, k is the denoising step, and e_k is the predicted noise in a_k s.t. the noise-free action is a = a_k − e_k. The model ε is implemented as an SO(2)-equivariant function, ε(go, ga_k, k) = gε(o, a_k, k), to ensure that the policy ϕ it represents is SO(2)-equivariant, ϕ(go) = gϕ(o). See <ref type="bibr">(Wang et al., 2024)</ref> for more details.</p><p>During training, given an expert observation-trajectory pair (o, a), we first use the translation t_n from the last step a_n as the keypose, then apply Frame Transfer to get ō = τ(o, t_n) and ā = τ(a, t_n). The low-level loss is L_low = ‖ε(ō, ā + e_k, k) − e_k‖², where e_k is a random noise conditioned on a randomly sampled denoising step k.</p></div>
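The training step above can be sketched end to end: canonicalize the expert pair with the keypose t_n, corrupt the action with noise e_k at a random step k, and penalize the squared error of the predicted noise. This toy uses the additive corruption a_k = ā + e_k implied by a = a_k − e_k in the text; the real model is the SO(2)-equivariant denoiser of (Wang et al., 2024), and the helper names are ours:

```python
import random

def frame_transfer(seq, t):
    """tau: subtract the keypose translation t from each step's (x, y, z)."""
    return [[c - tc for c, tc in zip(step, t)] for step in seq]

def low_level_loss(eps_model, obs, traj, rng):
    """L_low = || eps(o_bar, a_bar + e_k, k) - e_k ||^2 at a random step k."""
    t_n = traj[-1]                          # keypose: translation of the last step a_n
    o_bar = frame_transfer(obs, t_n)        # canonicalized observation
    a_bar = frame_transfer(traj, t_n)       # canonicalized expert trajectory
    k = rng.randint(1, 10)                  # random denoising step
    e_k = [[rng.gauss(0.0, 1.0) for _ in step] for step in a_bar]
    a_k = [[c + n for c, n in zip(s, ns)] for s, ns in zip(a_bar, e_k)]
    pred = eps_model(o_bar, a_k, k)
    return sum((p - n) ** 2
               for ps, ns in zip(pred, e_k) for p, n in zip(ps, ns))

traj = [[0.1, 0.2, 0.0], [0.3, 0.1, 0.2]]   # expert (x, y, z) control steps
obs = [[0.3, 0.1, 0.2]]                      # stand-in observation positions
a_bar = frame_transfer(traj, traj[-1])
# An oracle that knows the clean action recovers e_k = a_k - a_bar exactly:
oracle = lambda o, a_k, k: [[c - t for c, t in zip(s, ts)] for s, ts in zip(a_k, a_bar)]
loss = low_level_loss(oracle, obs, traj, random.Random(0))   # essentially zero
```

A model with no access to the clean action must instead learn to predict e_k from the canonicalized observation, which is what the loss trains.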
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Symmetry of Policy</head><p>In this section, we describe the overall T(3) × SO(2) symmetry of our hierarchical architecture. As shown in Figure <ref type="figure">4</ref>, a transformation of the observation should lead to the same transformation in both levels of HEP. Specifically, we decompose the symmetry into a rotation and a translation, and prove each separately.</p><p>Let π be a hierarchical policy composed of a high-level agent π_high, a low-level agent π_low, and the Frame Transfer function τ (see section 4).</p><p>Proposition 4.2 (Hierarchical SO(2) Equivariance). π is SO(2)-equivariant when the following assumptions hold for g ∈ SO(2): 1. The high-level policy π_high is SO(2)-equivariant; 2. The low-level trajectory generator ϕ is SO(2)-equivariant; 3. The Frame Transfer function τ is SO(2)-equivariant.</p><p>In Appendix A we show that the entire hierarchical policy π is SO(2)-equivariant, so that rotating the observation o results in an action rotated in the same way.</p><p>Proposition 4.3 (Hierarchical T(3) Equivariance). π is T(3)-equivariant when the following assumptions hold for t ∈ T(3): 1. The high-level policy π_high is T(3)-equivariant; 2. The Frame Transfer function τ is T(3)-invariant, i.e., it satisfies τ(o, t_high) = τ(o + t, t_high + t).</p><p>Notably, even if the low-level policy π_low is not T(3)-equivariant, the entire hierarchical policy π is T(3)-equivariant. This is proven in Appendix B.</p></div>
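Proposition 4.3's claim, that the hierarchy is translation-equivariant even when the low-level generator is not, can be verified numerically. In this toy sketch the high-level agent is the (T(3)-equivariant) centroid and `phi` is a deliberately non-equivariant cubic map; both are our stand-ins, not the paper's networks:

```python
def pi_high(points):
    """Toy T(3)-equivariant high level: the point-cloud centroid as keypose."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def phi(local_points):
    """Deliberately NOT translation-equivariant (nonlinear in absolute coordinates)."""
    return [tuple(c ** 3 + 0.1 for c in p) for p in local_points]

def pi(points):
    """Full hierarchy: keypose -> frame transfer -> phi -> back to world frame."""
    t_high = pi_high(points)
    local = [tuple(c - t for c, t in zip(p, t_high)) for p in points]     # tau
    return [tuple(c + t for c, t in zip(wp, t_high)) for wp in phi(local)]

P = [(0.1, 0.2, 0.0), (0.3, 0.0, 0.4)]
t = (1.0, -2.0, 0.5)
shifted = [tuple(c + ti for c, ti in zip(p, t)) for p in P]
# pi(o + t) == pi(o) + t even though phi alone is not T(3)-equivariant
lhs = pi(shifted)
rhs = [tuple(c + ti for c, ti in zip(a, t)) for a in pi(P)]
assert all(abs(x - y) < 1e-9 for a, b in zip(lhs, rhs) for x, y in zip(a, b))
```

The translation cancels inside τ before `phi` ever sees it, which is exactly why the proposition places no T(3) requirement on the low-level generator.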
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Simulation Experiment</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental Settings</head><p>To evaluate our policy, we first perform experiments in simulated environments from the RLBench <ref type="bibr">(James et al., 2020)</ref> benchmark, implemented using CoppeliaSim <ref type="bibr">(Rohmer et al., 2013)</ref> and PyRep <ref type="bibr">(James et al., 2019)</ref>. The simulated environments contain a 7-joint Franka Panda robot equipped with a parallel gripper, as well as four RGB-D cameras to provide the point cloud observation.</p><p>Figure <ref type="figure">5</ref>. The simulation tasks from RLBench <ref type="bibr">(James et al., 2020)</ref>: (a) Open Microwave, (b) Stack Wine, (c) Shoes Out of Box. See Appendix F for all environments.</p><p>We evaluate our model on 30 RLBench tasks, among which 20 are widely used in prior works like <ref type="bibr">(Xian et al., 2023)</ref>. The remaining 10 are challenging tasks that demand precise control, such as Lamp On, or long-horizon planning, like Push 3 Buttons. A subset of the 30 simulation tasks is shown in Figure <ref type="figure">5</ref>. Each task is trained using 100 demonstrations; more detailed task descriptions and visualizations are provided in Appendix F.</p><p>We consider two different control settings, open-loop and closed-loop control. In closed-loop control, we use each control step in the dataset as the low-level target, and the next keyframe is used as the label for the high-level agent. In open-loop control, we use the keyframes (i.e., key actions in the entire trajectory like pick, place, etc.) defined by prior work <ref type="bibr">(Shridhar et al., 2023)</ref> as the targets for the high-level agent, then construct the low-level targets by interpolating between consecutive keyframes. In principle, the open-loop setting requires fewer prediction steps to finish a task, while the closed-loop setting makes the policy more responsive. 
Thanks to the flexibility of our Frame Transfer interface, our policy can operate in both settings, while some prior works are limited to the open-loop setting.</p><p>Figure <ref type="figure">6c</ref>. Block Stacking: the robot needs to stack three blocks one by one.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Baseline</head><p>We compare our method against the following baselines. 3D Diffuser Actor: an open-loop agent that combines diffusion policies <ref type="bibr">(Chi et al., 2023)</ref> with 3D scene representations.</p><p>Chained Diffuser: an open-loop hierarchical agent that uses Act3D <ref type="bibr">(Gervet et al., 2023)</ref> in the high-level and diffusion policy in the low-level. Equivariant Diffusion Policy (EquiDiff): an SO(2)-equivariant, closed-loop policy that applies equivariant denoising.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Results</head><p>Table <ref type="table">1</ref> presents the comparison in terms of the evaluation success rates of the last checkpoint across 100 trials.</p><p>Open-loop Results: Our model outperforms the baselines in 28 out of the 30 tasks, achieving an average absolute improvement of 10%. The task where HEP falls short of achieving the best results is Take Money. Further investigation reveals that HEP achieves a 98% success rate at earlier checkpoints but fails at the final checkpoint, likely due to overfitting. Tasks involving precise actions or long-horizon trajectories, e.g., Lamp On and Push 3 Buttons, also exhibit consistently high success rates, demonstrating the adaptability of our method to diverse task requirements.</p><p>We also compare our model with hierarchical diffusion policy <ref type="bibr">(Ma et al., 2024)</ref> in Appendix H.</p><p>Closed-loop Results: Here we consider 10 selected tasks that represent the full diversity and complexity of the complete task set. The closed-loop setting requires longer-horizon trajectories, making it harder to succeed in evaluation. Despite this, our model consistently outperforms EquiDiff across all 10 tasks, achieving an average absolute improvement of 23%. This improvement underscores the effectiveness of HEP in handling the increased complexity of long-horizon decision-making. To validate the impact of our contributions, we perform an ablation study on six tasks considering the following configurations: No Hierarchy: removes the high-level agent and uses the low-level agent only. No Equi: same architecture but removes all equivariant structure. No Stacked Voxel: removes the stacked voxel encoder. No FT: removes the Frame Transfer interface and uses the high-level action as an additional conditioning input to the low-level agent. No Equi No FT: combination of No Equi and No FT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Ablation Study</head><p>As shown in Table <ref type="table">2</ref>, removing equivariance has the most significant negative impact on our model, reducing the mean success rate by 24%. The performance drops when removing Frame Transfer and the stacked voxel encoder are 16% and 10%, respectively, demonstrating the importance of all three key pieces of our model. Moreover, the 10% performance difference between No Equi and No Equi No FT shows the potential of Frame Transfer beyond our model. See Table <ref type="table">7</ref> in the Appendix for the full table.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Real-World Experiment</head><p>In this section, we evaluate our method on a real robot system comprising a UR5 robot and 3 Intel RealSense <ref type="bibr">(Keselman et al., 2017)</ref> D455 RGB-D sensors. Details on the experimental setting are given in Appendix G.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Baseline Comparison</head><p>We experiment on three tasks as shown in Figure <ref type="figure">6</ref>. These tasks are challenging due to their extremely long horizons (each can be divided into 6 to 9 sub-tasks) and the diverse types of manipulation involved. Evaluations are conducted over 20 trials: 10 with object placements similar to those in the training dataset and 10 with unseen placements.</p><p>As shown in Table <ref type="table">3</ref>, our model successfully completes the tasks under open-loop control. Most failures occur due to slight misalignment between the gripper and the object, likely caused by the poor depth quality of the sensors. We further evaluate our model in a closed-loop setting, where it achieves performance similar to the open-loop version in two of the three tasks. However, in Pot Cleaning, while the agent progresses further in the task, it becomes stuck in a recurrent cleaning loop. This likely results from the lack of history information in the observations, preventing the agent from recognizing when to exit the cleaning phase. In contrast, the open-loop version follows a single keypose for cleaning, facilitating a smoother transition to the next stage.</p><p>One-Shot Generalization To evaluate the generalizability of our model, we perform a one-shot experiment where the model is trained to finish a pick-place task with only one demonstration. During testing, the object is placed in unseen poses, as shown in Figure <ref type="figure">7</ref>. The results in Table <ref type="table">4</ref> demonstrate the strong generalizability of our model, which achieves an 80% success rate over 20 trials. For comparison, we evaluate Chained Diffuser under the same setting, but it only succeeds when the toy car is positioned exactly as in the demonstration. This result highlights the superior generalization ability of our approach, enabling robust execution of manipulation tasks from limited training data. 
Robust to Environment Variations In this experiment, we evaluate the robustness of our trained model under environmental variations. Specifically, we introduced modifications to the Block Stacking task during test time by changing the color of the table (Color) and additionally adding un- 0.9 0.9 0.6 related objects as distractors (Color+Objects), as shown in Figure <ref type="figure">8</ref>. The result is shown in Table <ref type="table">5</ref>. Surprisingly, our model demonstrates exceptional adaptability, achieving 90% and 60% success rate under those two test-time variations, whereas the baseline fails to complete the task with those distractions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>In this work, we propose a Hierarchical Equivariant Policy (HEP) for visuomotor policy learning. By utilizing Frame Transfer, our architecture naturally possesses both translational and rotational equivariance. Experimentally, HEP achieves significantly higher performance than previous methods on behavior cloning tasks that require fine motor control.</p><p>While our work provides a solid foundation for hierarchical policies with geometric structure, several future directions remain open. One key limitation is that our experiments focus on tabletop manipulation; extending HEP to more complex robotic tasks, such as humanoid motion, is a promising direction. Another limitation is the lack of memory mechanisms, which can be critical for tasks requiring history information. Future work could explore integrating Transformers <ref type="bibr">(Vaswani, 2017)</ref> to enhance temporal reasoning. Finally, expanding Frame Transfer to incorporate both translational and rotational specification could further improve the effectiveness of hierarchical policies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="A.">Proof: The Full Policy is SO(2) Equivariant</head><p>Let us prove that the policy is SO(2) equivariant, i.e., that it satisfies π(g · o) = g · π(o) for all g ∈ SO(2). We will prove this in two steps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Low-level Equivariance</head><p>First, let us prove that the low-level agent is SO(2) equivariant. The low-level policy can be written as</p><p>The frame transfer functions satisfy</p><p>and the diffusion policy satisfies</p><p>Thus, we have that</p><p>Using the frame transfer function property &#964;</p><p>Using the SO(2) equivariance of the diffusion policy and the properties of the frame transfer functions, we have that</p><p>And by definition,</p><p>Thus, we have that</p></div>
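The low-level equivariance argument can be checked numerically. The sketch below is a toy illustration, not the paper's implementation: the centroid policy `pi_dp` stands in for an actual SO(2)-equivariant diffusion policy, the observation is a 2D point set, and `t_high` is the high-level output used as a coordinate frame.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix for angle theta (an element of SO(2))."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def pi_dp(obs):
    """Toy SO(2)-equivariant low-level policy: centroid of observed points."""
    return obs.mean(axis=0)

def pi_low(obs, t_high):
    """Frame transfer: express obs in the high-level frame t_high,
    run the low-level policy, then map the action back to the world frame."""
    return t_high + pi_dp(obs - t_high)

rng = np.random.default_rng(0)
obs = rng.normal(size=(5, 2))   # five 2D scene points
t_high = rng.normal(size=2)     # high-level output, used as a frame
R = rot(0.7)

a = pi_low(obs, t_high)
a_rot = pi_low(obs @ R.T, R @ t_high)  # rotate the whole scene
assert np.allclose(a_rot, R @ a)       # the action rotates with the scene
```

Because `pi_dp` commutes with rotation and frame transfer only adds and subtracts `t_high`, rotating the scene and the frame together rotates the action, mirroring the chain of equalities above.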
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Full Policy Equivariance</head><p>Using the equivariance of the low-level policy, let us show that the full policy is SO(2) equivariant.</p></div>
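The composition argument can also be sketched end to end. The snippet below is a hypothetical stand-in, not the paper's networks: `pi_high` and `pi_dp` are simple SO(2)-equivariant maps, and the full policy runs the low-level map in the high-level output's frame.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix (an element of SO(2))."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def pi_high(obs):
    """Toy SO(2)-equivariant high-level policy: scene centroid."""
    return obs.mean(axis=0)

def pi_dp(obs):
    """Toy SO(2)-equivariant low-level map (any equivariant map works)."""
    return 0.5 * obs[0] + 0.5 * obs.mean(axis=0)

def pi_full(obs):
    """Full hierarchical policy: the low level acts in the high level's frame."""
    t_high = pi_high(obs)
    return t_high + pi_dp(obs - t_high)

rng = np.random.default_rng(2)
obs = rng.normal(size=(6, 2))
R = rot(1.3)

# Rotating the scene rotates the full policy's output identically.
assert np.allclose(pi_full(obs @ R.T), R @ pi_full(obs))
```

Since the high-level output is equivariant and the low-level step inherits equivariance through frame transfer, their composition is equivariant, which is the content of this subsection.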
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Proof: The Full Policy is T (3) Equivariant</head><p>As defined in subsection 4.1, we treat + and − as operators between o or a and t_high, acting as addition and subtraction on the (x, y, z) component of o or a. Similarly, we define a translation t ∈ T(3) acting on o or a as addition to the (x, y, z) component, written o + t or a + t.</p><p>First, suppose that the high-level policy π_high(o) is T(3)-equivariant, so that ∀t ∈ T(3), π_high(o + t) = t + π_high(o), which is simply the statement that shifting the scene shifts the high-level output in the same way. Now, we show that the full hierarchical policy satisfies the equivariance condition. The low-level policy is determined by</p></div>
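The T(3) case can be illustrated with a minimal numerical sketch. The names below are illustrative stand-ins, not the paper's code; notably, the inner map `f` can be arbitrary, because frame transfer alone supplies translation equivariance.

```python
import numpy as np

def f(obs):
    """Arbitrary (not necessarily equivariant) low-level map."""
    return np.tanh(obs).sum(axis=0)

def pi_low(obs, t_high):
    """Frame transfer in T(3): subtract t_high from the (x, y, z)
    components, act, then add t_high back."""
    return t_high + f(obs - t_high)

rng = np.random.default_rng(1)
obs = rng.normal(size=(4, 3))   # four 3D scene points
t_high = rng.normal(size=3)     # high-level keypose position
t = np.array([0.5, -1.0, 2.0])  # an arbitrary translation in T(3)

a = pi_low(obs, t_high)
a_shift = pi_low(obs + t, t_high + t)  # shift scene and high-level output
assert np.allclose(a_shift, a + t)     # the action shifts by the same t
```

The translation cancels inside `f` (obs + t − (t_high + t) = obs − t_high), so when the high-level output is itself T(3)-equivariant, the full policy shifts with the scene.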
		</text>
</TEI>
