<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Leveraging symmetries in pick and place</title></titleStmt>
			<publicationStmt>
				<publisher>Sage Journals</publisher>
				<date>04/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10558742</idno>
					<idno type="doi">10.1177/02783649231225775</idno>
					<title level='j'>The International Journal of Robotics Research</title>
<idno>0278-3649</idno>
<biblScope unit="volume">43</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>Haojie Huang</author><author>Dian Wang</author><author>Arsh Tangri</author><author>Robin Walters</author><author>Robert Platt</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Robotic pick and place tasks are symmetric under translations and rotations of both the object to be picked and the desired place pose. For example, if the pick object is rotated or translated, then the optimal pick action should also rotate or translate. The same is true for the place pose; if the desired place pose changes, then the place action should also transform accordingly. A recently proposed pick and place framework known as Transporter Net (Zeng, Florence, Tompson, Welker, Chien, Attarian, Armstrong, Krasin, Duong, Sindhwani et al., 2021) captures some of these symmetries, but not all. This paper analytically studies the symmetries present in planar robotic pick and place and proposes a method of incorporating equivariant neural models into Transporter Net in a way that captures all symmetries. The new model, which we call Equivariant Transporter Net, is equivariant to both pick and place symmetries and can immediately generalize pick and place knowledge to different pick and place poses. We evaluate the new model empirically and show that it is much more sample-efficient than the non-symmetric version, resulting in a system that can imitate demonstrated pick and place behavior using very few human demonstrations on a variety of imitation learning tasks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Pick and place is an important paradigm in robotic manipulation where a complex manipulation problem can be decomposed into a sequence of grasp (pick) and place operations. Recently, multiple learning approaches have been proposed to solve this problem, including <ref type="bibr">Zeng et al. (2021)</ref> and <ref type="bibr">Wang et al. (2021)</ref>. These methods focus on a simple version of the planar pick and place problem where the method looks at the scene and outputs a single pick and a single place pose. This problem has an important structure in the form of symmetries in SE(2) that can be expressed with respect to the pick and place pose. The pick symmetry is easiest to see. If the object to be grasped is rotated (in the plane), then the optimal grasp pose must also rotate. A similar symmetry exists in the place pose. If an object is to be placed into an environment in a particular way, then if the environment rotates, the desired place pose must also rotate. Leveraging symmetries of the task could result in significant gains in sample efficiency <ref type="bibr">(Zhu et al., 2022;</ref><ref type="bibr">Jia et al., 2023)</ref>. Why is sample efficiency important in robot learning? Although robotic simulators can provide a huge amount of data for training a policy, there is an inevitable sim-to-real gap in applying the learned policy directly to real robots. On the other hand, real-world robot data is expensive to collect, so sample efficiency is crucial to learning a policy from a limited number of human demonstrations.</p><p>If we are to design a robotic learning system for pick and place, it should ideally encode the symmetries described above. This structure exists in the problem, and encoding it into our learned solutions can simplify learning. The question is how to accomplish this. 
This paper examines the symmetries that exist in the pick and place problem by identifying invariant and equivariant equations that we would expect to be preserved. Then, we consider existing pick and place models and find that those architectures express only some of the problem symmetries. Finally, we propose a novel pick and place model, which we call Equivariant Transporter Net, that encodes all symmetries, and we show that it outperforms models that do not preserve the relevant symmetries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Symmetries in transporter net</head><p>This paper builds on the Transporter Net model <ref type="bibr">(Zeng, Florence, Tompson, Welker, Chien, Attarian, Armstrong, Krasin, Duong, Sindhwani et al., 2021)</ref>. Transporter Net is a sample-efficient model for learning planar pick and place behaviors through imitation learning. Compared to many other approaches <ref type="bibr">(Qureshi et al., 2021;</ref><ref type="bibr">Curtis et al., 2022)</ref>, it does not need to be pre-trained on the involved objects; it only needs to be trained on the given demonstrations. Transporter Net achieves sample efficiency in this setting by encoding the symmetry of the picked object into the model. Once the model learns to pick and place an object presented in one orientation, that knowledge immediately generalizes to a finite set of other pick poses. This is illustrated in Figure <ref type="figure">1(a)</ref>. The left side of Figure <ref type="figure">1(a)</ref> shows a pick-place problem where the robot must pick the orange object and place it inside the green outline. Because the model encodes the symmetry of the picked object, the ability to solve the place task on the left side of Figure <ref type="figure">1(a)</ref> immediately implies an ability to solve the place task on the right side of Figure <ref type="figure">1(a)</ref>, where the object to be picked has been rotated. We will refer to this as an SO(2)-place symmetry. Since Transporter Net uses a set of discrete rotations, it actually achieves C_n-place symmetry, where C_n is the finite cyclic subgroup of SO(2) that contains a set of n rotations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Equivariant Transporter Net</head><p>This paper analyzes the symmetries present in the pick and place problem and expands Transporter Net in the following ways. First, we constrain the pick model to be equivariant (an expression of symmetry) with respect to the SO(2) group by incorporating equivariant convolutional layers into the pick model. That is, if the object to be picked is rotated, the pick pose also rotates. We refer to this as an SO(2)-pick symmetry. The second way we extend Transporter Net is by making it equivariant with respect to changes in place orientation. That is, if the place model learns how to place an object in one orientation, that knowledge generalizes immediately to different place orientations. Our resulting place model is equivariant to changes in both pick and place orientation, and its symmetry group can be viewed as a direct product of two groups, SO(2) &#215; SO(2), as illustrated in Figure <ref type="figure">1(b)</ref>. This expanded symmetry improves the sample efficiency of our model by enabling it to generalize over a larger set of problems. Finally, we also propose a goal-conditioned version of Equivariant Transporter Net where the desired place pose is provided to the system in the form of an image, as shown in Figure <ref type="figure">8</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">Contributions</head><p>Our specific contributions are as follows. (1) We systematically analyze the symmetries present in the planar pick and place problem. (2) We propose Equivariant Transporter Net, a novel version of Transporter Net that has C_n-equivariant pick symmetry and C_n &#215; C_n-equivariant place symmetry. (3) We propose a variation of Equivariant Transporter Net that can be used with standard grippers rather than just suction cups. (4) We propose a goal-conditioned version of Equivariant Transporter Net. (5) We evaluate the approach both in simulation tasks and on physical robot versions of three of the gripper tasks. Our results indicate that our approach is more sample-efficient than the baselines and therefore learns better policies from a small number of demonstrations. Video and code are available at <ref type="url">https://haojhuang.github.io/etp_page/</ref>.</p><p>This paper extends the recent work <ref type="bibr">(Huang, Wang, Walters and Platt, 2022a)</ref> in the following ways. First, we cover the concepts, algorithms, and results in a more comprehensive way. Second, we generalize our proofs of equivariance from C_n to any subgroup of SO(2). We also analyze the extension to SO(3) mathematically and provide intuition. Third, we propose a goal-conditioned extension of the work and show that the new method outperforms the baselines on the benchmark of goal-conditioned tasks. Finally, we add an ablation study that characterizes the model for differently sized cyclic groups, C_n.</p><p>Figure 1. (a) Once Transporter Network <ref type="bibr">(Zeng et al., 2021)</ref> learns to place an object when it is presented in one orientation, the model is immediately able to generalize to new object orientations. (b) Our proposed Equivariant Transporter Network is able to generalize over both pick and place orientation. We view this as SO(2) &#215; SO(2)-place symmetry of the model.</p><p>
<ref type="bibr">Gualtieri and Platt, 2021)</ref> assumes that object mesh models are available in order to run ICP <ref type="bibr">(Besl and McKay, 1992)</ref> and align the object model with segmented observations or completions <ref type="bibr">(Yuan et al., 2018;</ref><ref type="bibr">Huang et al., 2021)</ref>. Other work learns a category-level pose estimator <ref type="bibr">(Yoon et al., 2003;</ref><ref type="bibr">Deng et al., 2020)</ref> or key-point detector <ref type="bibr">(Nagabandi et al., 2020;</ref><ref type="bibr">Liu et al., 2020;</ref><ref type="bibr">Manuelli et al., 2019)</ref> by training on a large dataset. Recently, <ref type="bibr">Wen, Lian, Bekris and Schaal (2022)</ref> realize a closed-loop intra-category policy by mimicking the pose trajectory extracted from a few video demonstrations. However, these methods often require expensive object-specific labels or pre-training, making them difficult to use widely. Recent advances in deep learning have provided other ways to rearrange objects from perceptual data. <ref type="bibr">Qureshi et al. (2021)</ref> represent the scene as a graph over segmented objects to do goal-conditioned planning; <ref type="bibr">Curtis et al. (2022)</ref> propose a general system consisting of a perception module, grasp module, and robot control module to solve multi-step manipulation tasks. These approaches often require prior knowledge such as a good segmentation module and a human-level hierarchy. End-to-end models <ref type="bibr">(Zakka et al., 2020;</ref><ref type="bibr">Khansari et al., 2020;</ref><ref type="bibr">Devin et al., 2020;</ref><ref type="bibr">Berscheid et al., 2020)</ref> that directly map input observations to actions can learn quickly and generalize well. <ref type="bibr">Shridhar, Manuelli and Fox (2022a)</ref> learn one multitask policy with language-conditioned imitation learning. 
<ref type="bibr">Shridhar, Manuelli and Fox (2022b)</ref> directly extend this idea to 3D keyframe multi-task policy learning with Perceiver IO transformer <ref type="bibr">(Jaegle et al., 2021)</ref>. <ref type="bibr">Wu et al. (2020)</ref> achieve fast learning speed on deformable-object manipulation tasks with reinforcement learning. However, most methods need to be trained on large datasets. For example, <ref type="bibr">Khansari et al. (2020)</ref> collects a dataset with 7.2 million samples. <ref type="bibr">Devin et al. (2020)</ref> collects 40K grasps and places per task. <ref type="bibr">Zakka et al. (2020)</ref> collects 500 disassembly sequences for each kit. The focus of this paper is on improving the sample efficiency of this class of methods on various manipulation tasks.</p><p>1.4.2. Equivariance learning in manipulation. Fully Convolutional Networks (FCNs) are translationally equivariant and have been shown to improve learning efficiency in many manipulation tasks <ref type="bibr">(Zeng, Song, Yu, Donlon, Hogan, Bauza, Ma, Taylor, Liu, Romo et al., 2018b;</ref><ref type="bibr">Morrison et al., 2018)</ref>. The idea of encoding SE(2) symmetries in the structure of neural networks is first introduced in G-Convolution <ref type="bibr">(Cohen and Welling, 2016)</ref>. The extension work proposes an alternative architecture, Steerable CNN <ref type="bibr">(Cohen and Welling, 2017)</ref>. <ref type="bibr">Weiler and Cesa (2019)</ref> propose a general framework for implementing E(2)-Steerable CNNs. <ref type="bibr">Weiler, Geiger, Welling, Boomsma and Cohen (2018)</ref> first investigated the SE(3) steerable convolution kernels for volumetric data with the trick of vectorizing. <ref type="bibr">Cesa et al. (2021)</ref> parameterizes filters with a band-limited basis to build E(3)-steerable kernels. <ref type="bibr">Thomas et al. (2018)</ref> and <ref type="bibr">Fuchs et al. 
(2020)</ref> extended the equivariance to graph neural networks.</p><p>In the context of robot learning, <ref type="bibr">Zhu et al. (2022)</ref> decouple rotation and translation symmetries to enable the robot to learn a planar grasp policy online within 1.5 h. Compared with <ref type="bibr">Zhu et al. (2022)</ref>, who formulated the planar grasping task as a bandit problem, our work focuses on pick-place tasks and learns from demonstrations. <ref type="bibr">Wang et al. (2022)</ref> use SE(2) equivariance in Q learning to solve multi-step sequential pick-place manipulation tasks. Compared with <ref type="bibr">Wang et al. (2022)</ref>, our work leverages the larger SO(2) &#215; SO(2) symmetry group for the pick-conditioned place policy and tackles rearrangement tasks through imitation learning <ref type="bibr">(Hussein et al., 2017;</ref><ref type="bibr">Hester et al., 2018;</ref><ref type="bibr">Vecerik et al., 2017)</ref>. Recently, various SE(3)-equivariant architectures <ref type="bibr">(Thomas et al., 2018;</ref><ref type="bibr">Fuchs et al., 2020;</ref><ref type="bibr">Chen et al., 2021;</ref><ref type="bibr">Deng et al., 2021)</ref> have been proposed and applied to solve manipulation problems. <ref type="bibr">Simeonov et al. (2022)</ref> use Vector Neurons <ref type="bibr">(Deng, Litany, Duan, Poulenard, Tagliasacchi, and Guibas, 2021)</ref> to get SE(3)-invariant object representations so that the model can manipulate objects in the same category with a few training demonstrations. <ref type="bibr">Huang, Wang, Zhu, Walters and Platt (2022b)</ref> leverage the SE(3) invariance of the grasping evaluation function to enable better grasping performance. <ref type="bibr">Xue et al. (2022)</ref> use SE(3)-equivariant key points to infer the object's pose for pick and place. 
However, most SE(3)-equivariant pick-place methods <ref type="bibr">(Simeonov et al., 2022;</ref><ref type="bibr">Xue et al., 2022</ref>) require a segmentation model and a pre-trained point descriptor for each category, which limits their adaptation to various tasks. Although our proposed pick-place symmetry is defined on SE(2) in this work, we will briefly analyze how to extend the idea to SE(3) pick-place problems in Proposition 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background on symmetry groups</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">The groups SO(2) and C_n</head><p>In this work, we primarily focus on rotations expressed by the group SO(2) and its cyclic subgroup C_n &#8834; SO(2). SO(2) contains the continuous planar rotations {Rot_&#952; : 0 &#8804; &#952; &lt; 2&#960;}. The discrete subgroup C_n = {Rot_&#952; : &#952; &#8712; {2&#960;i/n | 0 &#8804; i &lt; n}} contains only rotations by angles which are multiples of 2&#960;/n. The special Euclidean group SE(2) = SO(2) &#215; R^2 describes all translations and rotations of R^2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Representation of a group</head><p>1. The trivial representation &#961;_0 : SO(2) &#8594; GL_1 assigns &#961;_0(g) = 1 for all g &#8712; G, that is, no transformation under rotation.</p><p>2. The standard representation</p><p><formula notation="TeX">\rho_1(\mathrm{Rot}_\theta) = \begin{pmatrix} \cos\theta &amp; -\sin\theta \\ \sin\theta &amp; \cos\theta \end{pmatrix}</formula></p><p>represents each group element by its standard rotation matrix. Notice that &#961;_0 and &#961;_1 can be used to represent elements from either SO(2) or C_n.</p><p>3. The regular representation &#961;_reg of C_n acts on a vector in R^n by cyclically permuting its coordinates:</p><p><formula notation="TeX">\rho_{\mathrm{reg}}(\mathrm{Rot}_{2\pi/n})(x_0, x_1, \ldots, x_{n-2}, x_{n-1}) = (x_{n-1}, x_0, x_1, \ldots, x_{n-2}).</formula></p><p>We can rotate by multiples of 2&#960;/n by <formula notation="TeX">\rho_{\mathrm{reg}}(\mathrm{Rot}_{2\pi i/n}) = \rho_{\mathrm{reg}}(\mathrm{Rot}_{2\pi/n})^{i}</formula>.</p><p>4. The quotient representation of C_n for k dividing n is denoted <formula notation="TeX">\rho^{C_n/C_k}_{\mathrm{quot}}</formula> and acts on R^(n/k) by permuting |C_n|/|C_k| channels: <formula notation="TeX">\rho^{C_n/C_k}_{\mathrm{quot}}(\mathrm{Rot}_{2\pi i/n})(x)_j = (x)_{j + i \bmod (n/k)}</formula>, which implies features that are invariant under the action of C_k.</p><p>5. The irreducible representation &#961;^i_irrep could be considered as the basis function with the order/frequency of i, such that any representation &#961; of G could be decomposed as a direct sum of them:</p><p><formula notation="TeX">\rho(g) = Q^{\top} \Big( \bigoplus_{i} \rho^{i}_{\mathrm{irrep}}(g) \Big) Q,</formula></p><p>where Q is an orthogonal matrix.</p><p>For more details, we refer interested readers to <ref type="bibr">Serre (1977)</ref>, <ref type="bibr">Weiler and Cesa (2019)</ref>, <ref type="bibr">Lang and Weiler (2020)</ref>, and <ref type="bibr">Cesa et al. (2021)</ref>.</p></div>
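<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a small numerical illustration of the regular representation, the sketch below builds the permutation matrix for a rotation by i steps of 2&#960;/n and checks that one step sends (x_0, x_1, x_2, x_3) to (x_3, x_0, x_1, x_2). The helper name rho_reg and the choice n = 4 are assumptions made for this example only; they are not taken from the authors' code.</p>
<ab><![CDATA[
```python
import numpy as np


def rho_reg(i, n):
    """Regular representation of Rot_{2*pi*i/n} in C_n as an n x n permutation matrix."""
    # Rolling the rows of the identity by i gives (P @ x)[r] = x[(r - i) % n].
    return np.roll(np.eye(n, dtype=int), shift=i, axis=0)


n = 4
x = np.array([10, 11, 12, 13])             # stands for (x_0, x_1, x_2, x_3)
shifted = rho_reg(1, n) @ x                # one rotation step: (x_3, x_0, x_1, x_2)
composed = rho_reg(1, n) @ rho_reg(2, n)   # should equal rho_reg(3, n)
```
]]></ab>
<p>Because the matrices are permutations, composing rotations corresponds to adding their indices modulo n, which is the homomorphism property a representation must satisfy.</p></div>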
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Feature map transformations</head><p>We formalize images and 2D feature maps as feature vector fields, that is, functions f : R^2 &#8594; R^c, which assign a feature vector f(x) &#8712; R^c to each position x &#8712; R^2. While in practice we discretize and truncate the domain of f to {(i, j) : 1 &#8804; i &#8804; W, 1 &#8804; j &#8804; W}, here we will consider it to be continuous for the purpose of analysis. The action of an element g &#8712; SO(2) on f is a combination of a rotation in the domain of f via &#961;_1 (this rotates the pixel positions) and a transformation in the channel space R^c (also known as the fiber space) by &#961; &#8712; {&#961;_0, &#961;_1, &#961;_reg, &#961;_irrep}. If &#961; = &#961;_reg, then the channels cyclically permute according to the rotation. If &#961; = &#961;_0, the channels do not change. We denote this action (the action of g on f via &#961;) by</p><p><formula notation="TeX" n="1">T^{\rho}_g(f)(x) = \rho(g) \cdot f\big(\rho_1(g)^{-1} x\big).</formula></p><p>For example, the action of <formula notation="TeX">T^{\rho_{\mathrm{reg}}}_g(f)</formula> is illustrated in Figure <ref type="figure">2</ref> for a rotation of g = &#960;/2 on a 2 &#215; 2 image f that uses &#961;_reg. The expression &#961;_1(g)^(-1) x rotates the pixels via the standard representation. Multiplication by &#961;(g) = &#961;_reg(g) permutes the channels. For brevity, we will denote <formula notation="TeX">T^{\mathrm{reg}}_g = T^{\rho_{\mathrm{reg}}}_g</formula> and <formula notation="TeX">T^{0}_g = T^{\rho_0}_g</formula>.</p></div>
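<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two ingredients of this action (rotating the pixel grid and permuting the channels) can be sketched for the special case C_4, where rotations are exact on a pixel grid. The function names act_trivial and act_regular are hypothetical, and the counterclockwise direction used by np.rot90 is an assumed convention; the check below only verifies that the action composes like the group.</p>
<ab><![CDATA[
```python
import numpy as np


def act_trivial(f, k):
    """T^0_g: rotate the pixel grid of a (channels, H, W) map by k quarter turns."""
    return np.rot90(f, k=k, axes=(1, 2))


def act_regular(f, k):
    """T^reg_g: rotate the pixel grid and cyclically permute the channels by k steps."""
    return np.roll(act_trivial(f, k), shift=k, axis=0)


rng = np.random.default_rng(0)
f = rng.random((4, 5, 5))        # 4 channels carrying the regular representation of C_4
g1, g2 = 1, 2
lhs = act_regular(act_regular(f, g2), g1)    # apply g2, then g1
rhs = act_regular(f, (g1 + g2) % 4)          # apply the composed rotation once
```
]]></ab>
<p>The pixel rotation and the channel permutation act on different axes, so they commute; this is the same fact used later when the regular action is split into a trivial action and a channel permutation.</p></div>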
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Equivariant mappings and steerable kernels</head><p>A function F is equivariant if it commutes with the action of the group,</p><p><formula notation="TeX" n="2">T^{\mathrm{out}}_g[F(f)] = F(T^{\mathrm{in}}_g[f]),</formula></p><p>where T^in_g transforms the input to F by the group element g while T^out_g transforms the output of F by g. For example, if f is an image, then SO(2)-equivariance of F implies that it acts on f in the same way regardless of the orientation in which f is presented. That is, if F takes an image f rotated by g (RHS of equation (<ref type="formula">2</ref>)), then it is possible to recover the same output by evaluating F for the un-rotated image f and rotating its output (LHS of equation (<ref type="formula">2</ref>)). The most general equivariant linear mappings between spaces of feature fields are convolutions with G-steerable kernels <ref type="bibr">(Weiler et al., 2018;</ref><ref type="bibr">Jenner and Weiler, 2021)</ref>. Denote the input field type as &#961;_in : G &#8594; R^(d_in &#215; d_in) and the output field type as &#961;_out : G &#8594; R^(d_out &#215; d_out). The G-steerable kernels are convolution kernels K : R^n &#8594; R^(d_out &#215; d_in) satisfying the steerability constraint</p><p><formula notation="TeX" n="3">K(g \cdot x) = \rho_{\mathrm{out}}(g)\, K(x)\, \rho_{\mathrm{in}}(g)^{-1},</formula></p><p>where n is the dimensionality of the space.</p></div>
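<div xmlns="http://www.tei-c.org/ns/1.0"><p>For the simplest case of trivial input and output field types, the steerability constraint reduces to K(g &#183; x) = K(x). The sketch below is an illustration under that assumption, restricted to C_4 and single-channel images: it symmetrizes a random kernel over the four rotations and checks numerically that cross-correlation with it commutes with rotating the image. The helper correlate_valid is written for this example and is not a library routine; it is not how steerable kernels are parameterized in practice.</p>
<ab><![CDATA[
```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate_valid(f, K):
    """Valid-mode cross-correlation: out[i, j] = sum_{a,b} f[i+a, j+b] * K[a, b]."""
    windows = sliding_window_view(f, K.shape)
    return np.einsum("ijab,ab->ij", windows, K)


rng = np.random.default_rng(0)
K = rng.random((3, 3))
# Averaging over the four rotations enforces K(g x) = K(x) for every g in C_4.
K_sym = sum(np.rot90(K, k) for k in range(4)) / 4.0

f = rng.random((8, 8))
out_of_rotated = correlate_valid(np.rot90(f), K_sym)    # F(T_g f)
rotated_out = np.rot90(correlate_valid(f, K_sym))       # T_g F(f)

# The unconstrained kernel K does not commute with the rotation.
generic_gap = np.abs(
    correlate_valid(np.rot90(f), K) - np.rot90(correlate_valid(f, K))
).max()
```
]]></ab>
<p>The symmetrized kernel satisfies the equivariance relation of equation (2) up to floating-point error, whereas the unconstrained kernel leaves a clearly nonzero gap.</p></div>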
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Problem statement</head><p>This paper considers behavior cloning for planar pick and place problems. These problems are planar in the sense that the observation is a top-down image and the pick and place actions are motions to coordinates in the plane. Given a set of demonstrations, each containing a sequence of one or more observation-action pairs (o_t, a_t), the objective is to infer a policy p(a_t|o_t) where the action a_t = (a_pick, a_place) describes both the pick and place components of the action, and the observation o_t describes the current state in terms of a top-down image of the workspace.</p><p>Our model will encode this policy by factoring it into p(a_pick|o_t) and p(a_place|o_t, a_pick) and representing them as two separate neural networks. This policy can be used to solve tasks that are solvable in a single time step (i.e., a single pick and place action) as well as tasks that require multiple pick and place actions to solve. Both a_pick and a_place are parameterized in terms of SE(2) coordinates (u, v, &#952;), where u, v denote the pixel coordinates of the gripper position and &#952; denotes gripper orientation. &#952;_pick is defined with respect to the world frame and &#952;_place is the delta action between the pick pose and place pose.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Transporter network</head><p>Before describing Equivariant Transporter Net, we analyze the original Transporter Net <ref type="bibr">(Zeng et al., 2021)</ref> architecture from a different perspective.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Description of transporter net</head><p>Transporter Network <ref type="bibr">(Zeng et al., 2021)</ref> solves the planar pick and place problem using the architecture shown in Figure <ref type="figure">3</ref>.</p><p>The pick network f_pick : o_t &#8614; p(u, v) maps an image o_t onto a probability distribution p(u, v) over pick position (u, v) &#8712; R^2. The output pick position a*_pick is calculated by maximizing f_pick(o_t) over (u, v). (Since <ref type="bibr">Zeng et al. (2021)</ref> use suction cups to pick, that work ignores pick orientation.) The place position and orientation are calculated as follows. First, an image patch c centered on a*_pick is cropped from o_t to represent the pick action as well as the object. Then, the crop c is rotated n times to produce a stack of n rotated crops. We denote this stack of crops as</p><p><formula notation="TeX" n="4">\mathcal{R}_n(c) = \{\, T^0_{2\pi i/n}(c) \mid 0 \le i &lt; n \,\},</formula></p><p>where we refer to R_n as the "lifting" operator of C_n. Then, R_n(c) is encoded using a neural network &#968;. The original image, o_t, is encoded by a separate neural network &#966;. The distribution over place location is evaluated by taking the cross-correlation between &#968; and &#966;,</p><p><formula notation="TeX" n="5">f_{\mathrm{place}}(o_t, c) = \psi(\mathcal{R}_n(c)) \star \phi(o_t),</formula></p><p>where &#968; is applied independently to each of the rotated channels in R_n(c). Place position and orientation are calculated by maximizing f_place over the pixel position (for position) and the orientation channel (for orientation).</p></div>
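<div xmlns="http://www.tei-c.org/ns/1.0"><p>The place computation described above (lift the crop, cross-correlate each rotated crop with the scene, then maximize over pixel position and orientation channel) can be sketched as follows. This is a toy illustration, not the Transporter Net implementation: both encoders are replaced by the identity, the group is fixed to C_4 so that np.rot90 is exact, and the names lift and place_scores are hypothetical.</p>
<ab><![CDATA[
```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate_valid(f, K):
    """Valid-mode cross-correlation: out[i, j] = sum_{a,b} f[i+a, j+b] * K[a, b]."""
    windows = sliding_window_view(f, K.shape)
    return np.einsum("ijab,ab->ij", windows, K)


def lift(c):
    """R_4(c): stack of the crop rotated by 0, 1, 2 and 3 quarter turns."""
    return np.stack([np.rot90(c, k) for k in range(4)])


def place_scores(o, c):
    """One score map per orientation channel (identity encoders, assumption)."""
    return np.stack([correlate_valid(o, kernel) for kernel in lift(c)])


rng = np.random.default_rng(0)
c = rng.random((3, 3))              # toy crop of the picked object
o = np.zeros((10, 10))              # toy scene
o[4:7, 5:8] = np.rot90(c, 2)        # goal: the crop rotated by pi, at offset (4, 5)

scores = place_scores(o, c)
k, i, j = (int(s) for s in np.unravel_index(np.argmax(scores), scores.shape))
```
]]></ab>
<p>In this toy scene the maximum lies in orientation channel 2 at offset (4, 5), since by the Cauchy-Schwarz inequality the correlation is largest where the rotated crop matches the scene exactly.</p></div>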
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Analysis of transporter net</head><p>The model architecture described above gives Transporter Network the following equivariance property.</p><p>Proposition 1. The Transporter Net place network f_place is C_n-equivariant. That is, given g &#8712; C_n, object image crop c, and scene image o_t,</p><p><formula notation="TeX">f_{\mathrm{place}}(o_t, T^0_g(c)) = \rho_{\mathrm{reg}}(-g)\, f_{\mathrm{place}}(o_t, c).</formula></p><p>Proposition 1 expresses the following intuition. A rotation of g applied to the orientation of the object to be picked results in a &#8722;g change in the placing angle, which is represented by a permutation along the channel axis of the placing feature maps. We denote the permutation in the channel space as &#961;_reg(&#8722;g). This is a symmetry over the cyclic group C_n &#8834; SO(2) which is encoded directly into the model. It enables the model to immediately generalize over different orientations of the object to be picked and thereby improves sample efficiency.</p><p>To prove Proposition 1, we start with some common lemmas. In order to understand continuous rotations of image data, it is helpful to consider a k-channel image as a mapping f : R^2 &#8594; R^k where the input R^2 defines the pixel space. We consider images centered at (0, 0), and for non-integer values (x, y) we consider f(x, y) to be the interpolated pixel value. Similarly, let K : R^2 &#8594; R^(l &#215; k) be a convolutional kernel where k is the number of input channels and l is the number of output channels. Although the input space is R^2, we assume the kernel is r &#215; r pixels and K(x, y) is zero outside this set. The convolution can then be expressed by</p><p><formula notation="TeX">(K \star f)(v) = \sum_{w \in \mathbb{Z}^2} K(w)\, f(v + w).</formula></p><p>Without loss of generality, assume that f : R^2 &#8594; R and define f&#771; : R^2 &#8594; R^n to be the n-fold duplication of f such that</p><p><formula notation="TeX">\tilde{f}(x, y) = \big(f(x, y), \ldots, f(x, y)\big).</formula></p><p>Figure <ref type="figure">3</ref>. The architecture of transporter net.</p><p>For such inputs and kernels, we have the following permutation equivariance.</p><p>Lemma 1.</p><p><formula notation="TeX">(\rho_{\mathrm{reg}}(g) K) \star \tilde{f} = \rho_{\mathrm{reg}}(g)\,(K \star \tilde{f}).</formula></p><p>Proof. Write K = (K_1, ..., K_n) for the kernels applied to the n copies of f and h = K &#8902; f&#771; = (h_1, ..., h_n) with h_i = K_i &#8902; f. It is clear that permuting the 1 &#215; 1 kernels K_i also permutes h_i, so &#961;_reg(g)h = (&#961;_reg(g)K) &#8902; f&#771; as desired.</p><p>We require one more lemma on the equivariance of the lifting operator R_n.</p><p>Lemma 2.</p><p><formula notation="TeX">\mathcal{R}_n(T^0_g c) = \rho_{\mathrm{reg}}(-g)\, \mathcal{R}_n(c).</formula></p><p>Proof of Proposition 1. We prove the C_n-place equivariance of Transporter Net under rotations of the picked object,</p><p><formula notation="TeX">f_{\mathrm{place}}(o_t, T^0_g(c)) = \rho_{\mathrm{reg}}(-g)\, f_{\mathrm{place}}(o_t, c).</formula></p><p>Proof. Since &#968; is applied independently to each of the rotated channels in R_n(c), we denote by &#968;_n the map that applies &#968; to each component, so that by Lemma 2</p><p><formula notation="TeX">f_{\mathrm{place}}(o_t, T^0_g(c)) = \psi_n(\mathcal{R}_n(T^0_g c)) \star \phi(o_t) = \psi_n(\rho_{\mathrm{reg}}(-g)\, \mathcal{R}_n(c)) \star \phi(o_t),</formula></p><p>where &#966; is the scene encoder. Since &#968;_n applies &#968; on each component, it is equivariant to the permutation of components and thus the above equation becomes</p><p><formula notation="TeX">f_{\mathrm{place}}(o_t, T^0_g(c)) = \big(\rho_{\mathrm{reg}}(-g)\, \psi_n(\mathcal{R}_n(c))\big) \star \phi(o_t).</formula></p><p>Finally applying Lemma 1 gives</p><p><formula notation="TeX">f_{\mathrm{place}}(o_t, T^0_g(c)) = \rho_{\mathrm{reg}}(-g)\big(\psi_n(\mathcal{R}_n(c)) \star \phi(o_t)\big) = \rho_{\mathrm{reg}}(-g)\, f_{\mathrm{place}}(o_t, c).</formula></p><p>The main idea of the proof is shown in Figure <ref type="figure">4</ref>. Namely, &#968;(R_n(&#183;)) is equivariant in the sense that rotating the crop c induces a cyclic shift in the channels of the output. Formally, &#968;(R_n(T^0_g c)) = &#961;_reg(&#8722;g)&#968;(R_n(c)). Noting that a permutation of the filters K in the convolution K &#8902; &#966;(o_t) induces the same permutation in the output feature maps completes the proof. Here, &#968; is a simple CNN with no rotational equivariance. The equivariance results from the lifting operator R_n.</p><p>However, only the place network of Transporter Net has the C_n-equivariance. In contrast, our proposed method incorporates not only rotational equivariance in the pick network but also C_n &#215; C_n-equivariance in the place network.</p></div>
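<div xmlns="http://www.tei-c.org/ns/1.0"><p>Proposition 1 can be checked numerically in a toy setting. In the sketch below, which is an illustration rather than the authors' code, the crop encoder psi multiplies by a fixed random mask and is therefore deliberately not rotation-equivariant, the scene encoder is the identity, and the group is C_4. All function names are hypothetical. Rotating the crop by one quarter turn should shift the orientation channels of the output by minus one step.</p>
<ab><![CDATA[
```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(1)
MASK = rng.random((3, 3))        # fixed weights, so psi is NOT rotation-equivariant


def correlate_valid(f, K):
    """Valid-mode cross-correlation: out[i, j] = sum_{a,b} f[i+a, j+b] * K[a, b]."""
    windows = sliding_window_view(f, K.shape)
    return np.einsum("ijab,ab->ij", windows, K)


def psi(crop):
    """Toy crop encoder applied independently to every rotated crop."""
    return np.tanh(crop * MASK)


def f_place(o, c):
    """psi(R_4(c)) cross-correlated with the scene (identity scene encoder)."""
    lifted = [np.rot90(c, k) for k in range(4)]
    return np.stack([correlate_valid(o, psi(x)) for x in lifted])


o = rng.random((9, 9))
c = rng.random((3, 3))
g = 1                                              # one quarter turn of the crop
lhs = f_place(o, np.rot90(c, g))                   # f_place(o_t, T^0_g c)
rhs = np.roll(f_place(o, c), shift=-g, axis=0)     # rho_reg(-g) f_place(o_t, c)
psi_is_equivariant = np.allclose(psi(np.rot90(c)), np.rot90(psi(c)))
```
]]></ab>
<p>The two sides agree even though psi itself fails the equivariance test, which matches the observation that the symmetry comes from the lifting operator rather than from the encoder.</p></div>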
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Equivariant transporter</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Equivariant pick</head><p>Our approach to the pick network is similar to that in Transporter Net <ref type="bibr">(Zeng et al., 2021)</ref> except that: (1) we explicitly encode the pick symmetry into the pick networks, thereby making pick learning more sample-efficient; (2) we consider the pick orientation so that we can use parallel-jaw grippers rather than just suction grippers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1.">Model.</head><p>We propose an equivariant model for detecting the planar pick pose. First, we decompose the learning process of a_pick &#8712; SE(2) into two parts,</p><p><formula notation="TeX">p(a_{\mathrm{pick}}) = p(u, v)\, p(\theta \mid (u, v)),</formula></p><p>where p(u, v) denotes the probability of success when a pick exists at pixel coordinates u, v and p(&#952;|(u, v)) is the probability that the pick at u, v should be executed with a gripper orientation of &#952;. The distributions p(u, v) and p(&#952;|(u, v)) are modeled as two neural networks:</p><p><formula notation="TeX">f_p(o_t) \mapsto p(u, v),</formula></p><p><formula notation="TeX">f_\theta(o_t, (u, v)) \mapsto p(\theta \mid (u, v)).</formula></p><p>Given this factorization, we can query the maximum of p(a_pick) by evaluating <formula notation="TeX">(\hat{u}, \hat{v}) = \arg\max_{(u, v)} p(u, v)</formula> and then <formula notation="TeX">\hat{\theta} = \arg\max_{\theta} p(\theta \mid \hat{u}, \hat{v})</formula>. This is illustrated in Figure <ref type="figure">5</ref>. The bottom of Figure <ref type="figure">5</ref> shows the maximization of f_p at a*_pick. The right side shows the evaluation of f_&#952; for the image patch centered at a*_pick.</p></div>
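<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-stage query described above can be sketched as follows. The function select_pick and the callable theta_scores_at are hypothetical stand-ins for the outputs of the two networks, and the discretization of the orientation into n bins of width 2&#960;/n is an assumption of this example.</p>
<ab><![CDATA[
```python
import numpy as np


def select_pick(p_uv, theta_scores_at, n):
    """Maximize p(u, v) first, then p(theta | u, v) over n discrete orientations."""
    u, v = (int(s) for s in np.unravel_index(np.argmax(p_uv), p_uv.shape))
    theta_scores = theta_scores_at(u, v)       # length-n vector for the chosen pixel
    k = int(np.argmax(theta_scores))
    return u, v, 2.0 * np.pi * k / n


p_uv = np.zeros((5, 5))
p_uv[1, 3] = 1.0                               # toy pick-location distribution


def toy_theta(u, v):
    """Toy orientation scores; a real model would evaluate a crop around (u, v)."""
    return np.array([0.1, 0.7, 0.2, 0.0])


u, v, theta = select_pick(p_uv, toy_theta, n=4)
```
]]></ab>
<p>Evaluating the orientation scores only at the selected pixel avoids computing an orientation distribution for every location in the image.</p></div>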
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2.">Pick symmetry.</head><p>There are two equivariant relationships that we would expect to be satisfied for planar picking:</p><p><formula notation="TeX" n="11">f_p(T^0_g(o_t)) = T^0_g(f_p(o_t)),</formula></p><p><formula notation="TeX" n="12">f_\theta(T^0_g(o_t), T^0_g(u, v)) = s(g)\, f_\theta(o_t, (u, v)),</formula></p><p>where s is the shift operator and satisfies s(g)f(x) = f(x + g). Equation (<ref type="formula">11</ref>) states that the pick location distribution found in an image rotated by g &#8712; SO(2) (LHS of equation (<ref type="formula">11</ref>)) should correspond to the distribution found in the original image subsequently rotated by g (RHS of equation (<ref type="formula">11</ref>)).</p><p>Equation (<ref type="formula">12</ref>) says that the pick orientation distribution at the rotated grasp point T^0_g(u, v) in the rotated image T^0_g(o_t) (LHS of equation (<ref type="formula">12</ref>)) should be shifted by g relative to the grasp orientation at the original grasp points in the original image (RHS of equation (<ref type="formula">12</ref>)).</p><p>We encode both f_p and f_&#952; using equivariant convolutional layers <ref type="bibr">(Weiler and Cesa, 2019)</ref> which constrain the models to represent only those functions that satisfy equations (<ref type="formula">11</ref>) and (<ref type="formula">12</ref>). Specifically, we select the trivial representation as the output type for f_p and the regular representation as the output type for f_&#952;, which is a special case of equation (<ref type="formula">12</ref>):</p><p><formula notation="TeX" n="13">f_\theta(T^0_g(o_t), T^0_g(u, v)) = \rho_{\mathrm{reg}}(g)\, f_\theta(o_t, (u, v)).</formula></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.3.">Gripper orientation using the quotient group.</head><p>A key observation in planar picking is that, for many robots, the gripper is bilaterally symmetric, that is, the grasp outcome is invariant when the gripper is rotated by &#960;. We can encode this additional symmetry to reduce redundancy and save computational cost using the quotient group SO(2)/C_2, which identifies orientations that are &#960; apart. When using this quotient group for gripper orientation, s(g) in equation (<ref type="formula">12</ref>) is replaced with s(g mod &#960;) and &#961;_reg in equation (<ref type="formula">13</ref>) is replaced with <formula notation="TeX">\rho^{C_n/C_2}_{\mathrm{quot}}</formula>.</p></div>
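<div xmlns="http://www.tei-c.org/ns/1.0"><p>The effect of identifying orientations that are &#960; apart can be illustrated on a vector of n orientation scores. The sketch below pools each pair of bins that differ by &#960;; this pooling is an assumption chosen for illustration and is not the authors' implementation, which selects a quotient representation as the output type of the network. The function name fold_by_pi is hypothetical.</p>
<ab><![CDATA[
```python
import numpy as np


def fold_by_pi(scores):
    """Pool orientation bins that are pi apart: n channels become n/2 channels."""
    half = scores.shape[0] // 2
    return scores[:half] + scores[half:]


n = 8
rng = np.random.default_rng(0)
scores = rng.random(n)                   # toy scores over n gripper angles
folded = fold_by_pi(scores)

flipped = np.roll(scores, n // 2)        # rotate the gripper by pi
stepped = np.roll(scores, 3)             # rotate the gripper by 3 * (2*pi/n)
```
]]></ab>
<p>A rotation by &#960; leaves the pooled features unchanged, and any other rotation acts on them as a cyclic shift modulo n/2, so only half as many orientation channels are needed.</p></div>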
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Equivariant place</head><p>Assuming that the object does not move during picking, and given the picked object represented by the image patch c centered on a_pick, the place network models the distribution of a_place = (u_place, v_place, &#952;_place) by</p><p><formula notation="TeX" n="14">f_{\mathrm{place}}(o_t, c) \mapsto p(a_{\mathrm{place}} \mid o_t, a_{\mathrm{pick}}),</formula></p><p>where p(a_place|o_t, a_pick) denotes the probability that the object at a_pick in scene o_t should be placed at a_place.</p><p>Our place model architecture closely follows that of Transporter Net <ref type="bibr">(Zeng et al., 2021)</ref>. The main difference is that we explicitly encode equivariance constraints on both the scene encoder &#966; and the crop encoder &#968;. As a result of this change: (1) we are able to simplify the model by transposing the lifting operation R_n and the processing by &#968;; (2) our new model is equivariant with respect to a larger symmetry group C_n &#215; C_n, compared to Transporter Net, which is only equivariant over C_n.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1.">Equivariant &#966; and &#968;.</head><p>We explicitly encode both &#966; and &#968; as equivariant models that satisfy the following constraints:</p><p><formula notation="TeX" n="15">\phi(T^0_g(o_t)) = T^0_g(\phi(o_t)),</formula></p><p><formula notation="TeX" n="16">\psi(T^0_g(c)) = T^0_g(\psi(c)),</formula></p><p>for g &#8712; SO(2). The equivariance constraint of equation (<ref type="formula">15</ref>) says that when the input image rotates, we would expect the place location to rotate correspondingly. This constraint helps the model generalize across place orientations. The constraint of equation (<ref type="formula">16</ref>) says that when the picked object rotates (represented by the image patch c), then the place orientation should correspondingly rotate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2.">Place model.</head><p>When the equivariance constraint of equation (<ref type="formula">16</ref>) is satisfied, we can exchange R_n (the lifting operation) with ψ:</p><p>This equality is useful because it means that we only need to evaluate ψ for one image patch and rotate the resulting feature map, rather than processing the stack of image patches R_n(c), which is computationally cheaper. The resulting place model is then:</p><p>where equation (18) substitutes Ψ(c) = R_n(ψ(c)) to simplify the expression. Here, we use f′_place to denote Equivariant Transporter Net defined using equivariant f and ψ, in contrast to the baseline Transporter Net f_place of equation (<ref type="formula">5</ref>). Note that both f_place and f′_place satisfy Proposition 1. However, f_place accomplishes this by symmetrizing a non-equivariant network (i.e., evaluating ψ(R_n(c))), whereas our model f′_place encodes the symmetry directly into ψ. <gap/> (<ref type="formula">15</ref>) and (<ref type="formula">16</ref>). Essentially, we go from the C_n-place symmetric model to a C_n × C_n-place symmetric model.</p><p>That is, given rotations g_1 ∈ C_n of the picked object and g_2 ∈ C_n of the scene, we have that</p><p>Proposition 2 is illustrated in Figure <ref type="figure">6</ref>. The top of Figure <ref type="figure">6</ref>, going left to right, shows the rotation of both the object by g_1 (in orange) and the place pose by g_2 (in green). The LHS of equation (<ref type="formula">19</ref>) evaluates f′_place for these two rotated images. The lower left of Figure <ref type="figure">6</ref> shows</p><p>Going left to right at the bottom of Figure <ref type="figure">6</ref> shows the pixel rotation by T_{g_2}^0 and the channel permutation by g_2 − g_1 (RHS of equation (<ref type="formula">19</ref>)).</p><p>To prove Proposition 2, we introduce one more lemma. Lemma 3.</p><p>Proof. We evaluate the left-hand side of equation (<ref type="formula">20</ref>):</p><p>Re-indexing the sum with y is by definition</p><p>as desired.</p><p>Proof of Proposition 2. Recall Ψ(c) = ψ(R_n(c)). We now prove Proposition 2.</p><p>Proof. We first prove the equivariance under rotations of the placement o_t. We claim</p><p>Evaluating the left-hand side of equation (<ref type="formula">21</ref>),</p><p>= T_g^0 ρ_reg(g)(Ψ(c) ⋆ f(o_t)) (Lemma 1)</p><p>= T_g^reg (Ψ(c) ⋆ f(o_t)).</p><p>In the last step, T_g^reg = ρ_reg(g) T_g^0 = T_g^0 ρ_reg(g), since T_g^0 and ρ_reg(g) commute: ρ_reg(g) acts on the channel space and T_g^0 acts on the base space. This proves the claim of equation (<ref type="formula">21</ref>). Recall Ψ(c) = R_n(ψ(c)). Using the equivariance of ψ, Proposition 1 could be reformulated as</p><p>Evaluating the left-hand side of equation (<ref type="formula">22</ref>),</p><p>(equivariance of ψ)</p><p>= ρ_reg(−g)(ψ(R_n(c)) ⋆ f(o_t)) (Proposition 1)</p><p>= ρ_reg(−g)(R_n(ψ(c)) ⋆ f(o_t)) (equivariance of ψ)</p><p>= ρ_reg(−g)(Ψ(c) ⋆ f(o_t)).</p><p>Combining equation (21) with equation (<ref type="formula">22</ref>) yields Proposition 2.</p></div>
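<div xmlns="http://www.tei-c.org/ns/1.0"><p>The C_n × C_n place symmetry can be checked numerically in a toy setting. The sketch below is only an illustration under strong assumptions: n is fixed to 4 so that rotations are exact quarter turns (np.rot90), ψ and f are both taken to be the identity map (which is trivially equivariant), and the function names rot, lift, cross_correlate, and place_scores are hypothetical. It verifies that rotating the crop by g_1 and the scene by g_2 rotates the output spatially by g_2 and cyclically shifts its orientation channels by g_2 − g_1.</p>
<p><![CDATA[
```python
import numpy as np


def rot(x, k):
    """Rotate the last two (spatial) axes by k quarter turns."""
    return np.rot90(x, k, axes=(-2, -1))


def lift(kernel, n=4):
    """R_n for n = 4: stack the kernel rotated by i quarter turns, i = 0..n-1."""
    return np.stack([rot(kernel, i) for i in range(n)])


def cross_correlate(kernels, fmap):
    """Valid-mode cross-correlation of each kernel in the stack with a 2D map."""
    windows = np.lib.stride_tricks.sliding_window_view(fmap, kernels.shape[1:])
    return np.einsum("uvij,nij->nuv", windows, kernels)


def place_scores(crop_feat, scene_feat):
    """Psi(c) cross-correlated with f(o_t); psi and f are the identity here."""
    return cross_correlate(lift(crop_feat), scene_feat)


rng = np.random.default_rng(0)
crop = rng.standard_normal((5, 5))     # stands in for psi(c)
scene = rng.standard_normal((16, 16))  # stands in for f(o_t)
base = place_scores(crop, scene)       # shape (4, 12, 12)

for k1 in range(4):          # rotation g_1 of the picked object
    for k2 in range(4):      # rotation g_2 of the scene
        lhs = place_scores(rot(crop, k1), rot(scene, k2))
        # Spatial rotation by g_2, channel shift by g_2 - g_1.
        rhs = rot(np.roll(base, k2 - k1, axis=0), k2)
        assert np.allclose(lhs, rhs)
```
]]></p>
<p>Because the rotated kernels are obtained by rotating a single feature map, ψ is evaluated once and only the cheap lifting step is repeated n times, which mirrors the simplification discussed in Section 5.2.2.</p></div>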
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2.">Translational symmetry.</head><p>Note that in addition to the two rotational symmetries enforced by our model, it also has translational symmetry. Since the rotational symmetry is realized by additional restrictions on the kernel weights of the convolutional networks, it comes on top of the underlying shift equivariance of the convolutional network. Thus, the full symmetry group enforced is the group generated by C_n × C_n × (R^2, +). Equivariant neural networks learn effectively on a lower-dimensional space, namely the equivalence classes of samples under the group action.</p><p>5.3.3. From C_n × C_n-place symmetry to SO(2) × SO(2). The above place symmetry is limited to the cyclic group due to the role of R_n, though as n → ∞, C_n approaches SO(2). We show the generalization of the C_n × C_n-place symmetry to the SO(2) × SO(2)-place symmetry below. Given g ∈ G for G ⊆ SO(2), an equivariant model f satisfying f(T_g^0(o_t)) = T_g^0(f(o_t)), and a function Ψ : c ↦ K satisfying the equivariant constraint Ψ(T_g^0 c) = T_g^0 Ψ(c), where c is the crop, c ∈ R^2, and K : R^2 → R^{d_out × d_trivial} is a 2D steerable kernel with the trivial representation as the input type, the cross-correlation between Ψ(c) and f(T_g^0(o_t)) satisfies</p><p>Proposition 3 states that to satisfy the cross-type place symmetry, one necessary condition is that the output of Ψ is a steerable kernel. It generalizes Proposition 2 to either C_n or SO(2). In fact, Ψ(c) = R_n(ψ(c)), which combines the lift operator R_n and the equivariant constraint of ψ shown in equation (<ref type="formula">16</ref>), is a special case of Ψ(c). Here, R_n : R^2 → K outputs a steerable kernel K that takes the regular representation of C_n as the output type and satisfies the steerability constraint of equation (3). When using irreducible representations as the output type, we can instantiate ρ_out(g_2 − g_1) in the RHS of equation (<ref type="formula">23</ref>) as ρ_irrep(g_2 − g_1), which is equivalent to s(g_2 − g_1) after the inverse Fourier transform.</p><p>To prove Proposition 3, we first introduce another lemma.</p><p>Lemma 4. A 2D steerable kernel K : R^2 → R^{d_out × d_trivial} satisfies</p><p>Proof. Recall that ρ_0(g) is an identity mapping. Substituting ρ_in with ρ_0(g) and g^{−1} with g in the steerability constraint K(g · x) = ρ_out(g) K(x) ρ_in(g)^{−1} shown in equation (<ref type="formula">3</ref>) completes the proof.</p><p>Lemma 4 states that when the input type is the trivial representation, a spatial rotation of the steerable kernel is the same as the inverse channel space transformation. With Lemma 4 in hand, we start the proof of Proposition 3.</p><p>Proof. Similar to the proof of Proposition 2, we first show the equivariance under rotations of the placement o_t. We claim</p><p>Starting from the left-hand side of equation (<ref type="formula">25</ref>),</p><p>Then, we state the equivariance under rotations of the picked object as</p><p>Evaluating the left-hand side of equation (<ref type="formula">26</ref>),</p><p>Combining equation (<ref type="formula">25</ref>) with equation (<ref type="formula">26</ref>) yields Proposition 3.</p><p>Note that Proposition 3 gives a way to realize the SO(2) version of our model and provides some insight into extending it to 3D signals without limitations, that is, generating 3D dynamic steerable kernels from the crop signal. In this work, however, we primarily focus on the discrete C_n group, since it is easy to compare with the baseline Transporter Net on the Ravens-10 Benchmark.</p></div>
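<div xmlns="http://www.tei-c.org/ns/1.0"><p>The substitution described in the proof of Lemma 4 can be written out step by step. The derivation below is a sketch that starts from the steerability constraint quoted in the proof and assumes the usual convention (T_g^0 K)(x) = K(g^{-1} x) for the action of a rotation on a field with trivial representation; the final line is therefore one plausible reading of the lemma, in agreement with its verbal summary, rather than a verbatim restatement.</p>
<p><![CDATA[
```latex
\begin{align*}
K(g \cdot x) &= \rho_{\mathrm{out}}(g)\, K(x)\, \rho_{\mathrm{in}}(g)^{-1}
  && \text{steerability constraint, equation (3)} \\
K(g \cdot x) &= \rho_{\mathrm{out}}(g)\, K(x)
  && \rho_{\mathrm{in}}(g) = \rho_0(g) = 1 \\
K(g^{-1} \cdot x) &= \rho_{\mathrm{out}}(g^{-1})\, K(x)
  && \text{replace } g \text{ by } g^{-1} \\
(T_g^0 K)(x) &= \rho_{\mathrm{out}}(g^{-1})\, K(x)
  && \text{assumed convention } (T_g^0 K)(x) = K(g^{-1} x)
\end{align*}
```
]]></p>
<p>Read this way, rotating the kernel spatially by g has the same effect as applying the inverse transformation ρ_out(g^{-1}) on its output channels.</p></div>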
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Analyzing equivariance under Proposition 2</head><p>We summarize some important properties that follow from the larger symmetry group of our place network and provide an intuitive explanation for each one. Recall that Proposition 2 states</p><p>Then we have the following properties:</p><p>5.4.1. Equivariance property. Setting either g_1 = 0 or g_2 = 0 we get, respectively,</p><p>These show the equivariance of our network f_place under a rotation g ∈ C_n of either the object or the placement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.2.">Invariance property.</head><p>Setting g_1 = g_2, we get</p><p>This equation demonstrates that a rotation g of the whole observation o_t, including the objects, does not change the placing angle but rotates the placing location by g. Although data augmentation could help non-equivariant models learn this property, our networks satisfy it by construction. Note that for the discrete group, data augmentation propagates to every element within the group.</p><p>5.4.3. Relativity property. Related to equation (<ref type="formula">27</ref>), we also have</p><p>This equation defines the dual relationship between a rotation of c by g and an inverse rotation −g of o_t. Intuitively, c could be considered as the L-shaped block and o_t can be regarded as the L-shaped slot. A rotation of the picked object is equivalent to an inverse rotation of the placement, up to a transformation of the output.</p></div>
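<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three properties of Section 5.4 can be obtained by direct substitution. The worked equations below assume that Proposition 2 has the form given in the first line, which is a reading based on the description of equation (19) (a pixel rotation by T_{g_2}^0 and a channel permutation by g_2 − g_1) and on the fact that the two operators commute; the symbol ⋆ denotes cross-correlation.</p>
<p><![CDATA[
```latex
\begin{align*}
\Psi(T_{g_1}^0 c) \star f(T_{g_2}^0 o_t)
  &= \rho_{\mathrm{reg}}(g_2 - g_1)\, T_{g_2}^0 \big(\Psi(c) \star f(o_t)\big)
  && \text{assumed form of Proposition 2} \\
\Psi(T_{g}^0 c) \star f(o_t)
  &= \rho_{\mathrm{reg}}(-g) \big(\Psi(c) \star f(o_t)\big)
  && g_1 = g,\ g_2 = 0 \\
\Psi(c) \star f(T_{g}^0 o_t)
  &= \rho_{\mathrm{reg}}(g)\, T_{g}^0 \big(\Psi(c) \star f(o_t)\big)
  && g_1 = 0,\ g_2 = g \\
\Psi(T_{g}^0 c) \star f(T_{g}^0 o_t)
  &= T_{g}^0 \big(\Psi(c) \star f(o_t)\big)
  && g_1 = g_2 = g,\ \rho_{\mathrm{reg}}(0) = \mathrm{id} \\
\Psi(T_{g}^0 c) \star f(o_t)
  &= T_{g}^0 \big(\Psi(c) \star f(T_{-g}^0 o_t)\big)
  && \text{lines 2 and 3 with } g \to -g
\end{align*}
```
]]></p>
<p>The last line expresses the relativity property: rotating the picked object by g gives the same orientation-channel shift as rotating the scene by −g, and the two outputs differ only by the spatial rotation T_g^0.</p></div>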
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Goal-conditioned equivariant transporter network</head><p>The goal-conditioned pick-place task is an important branch of learning manipulation skills, where the goal could be represented as language instructions, images, or other customized definitions.</p><p>Seita et al. (2021) extended Transporter Net to solve image-based goal-conditioned tasks. In this setting, the goal is represented explicitly as an image that is part of the problem input rather than implicitly as part of the observation. Two goal-conditioned architectures are proposed in Seita et al. (2021). Transporter-Goal-Stack stacks the current image o_t and the goal image o_g channel-wise and passes the result as input through a standard Transporter Network. Transporter-Goal-Split processes the goal image o_g through a separate fully convolutional network f_goal to generate dense features, which are combined with the dense features of f_query(o_t) using the Hadamard product to infer the goal-conditioned pick</p><p>and evaluate the goal-conditioned place with</p><p>where f_query[a_pick] denotes the crop of the dense feature map f_query centered on a_pick and ⊙ is the Hadamard product.</p><p>Since the pick-place symmetries also exist in goal-conditioned tasks, we realize the goal-conditioned equivariant transporter with some simple modifications. Denoting the channel-wise concatenation by ∥, the C_n-equivariance of the picking network holds when stacking o_t and o_g as the input:</p><p>The C_n × C_n-equivariance of the placing model also holds</p><p>Based on the two equations above, we define Equivariant-Transporter-Goal-Stack to solve goal-conditioned tasks.</p><p>5.6. Model architecture details 5.6.1. Pick model f_p (equation (<ref type="formula">9</ref>)). The input to f_p is a 4-channel RGB-D image o_t ∈ R^{4×H×W}. The output is a feature map p(u, v) ∈ R^{H×W} which encodes a distribution over pick locations. f_p is implemented as an 18-layer equivariant residual network with a U-Net <ref type="bibr">(Ronneberger et al., 2015)</ref> as the main block. The U-Net has eight residual blocks (each block contains two equivariant convolution layers <ref type="bibr">(Weiler and Cesa, 2019)</ref> and one skip connection): four residual blocks <ref type="bibr">(He et al., 2016)</ref> are used for the encoder and the other four residual blocks are used for the decoder. The encoding process trades spatial dimensions for channels with max-pooling in each block; the decoding process upsamples the feature embedding with bilinear upsampling operations. The first layer maps the trivial representation of o_t to the regular representation, and the last equivariant layer transforms the regular representation back to the trivial representation, followed by an image-wide softmax. ReLU activations <ref type="bibr">(Nair and Hinton, 2010)</ref> are interleaved inside the network. 5.6.2. Pick model f_θ (equation (<ref type="formula">10</ref>)). Given the picking location (u*, v*), the pick angle network f_θ takes as input a crop c ∈ R^{4×H_1×W_1} centered on (u*, v*) and outputs the distribution p(θ | u, v) ∈ R^{n/2}, where n is the size of the rotation group (i.e., n = |C_n|). The first layer maps the trivial representation of c to a quotient regular representation and is followed by three residual blocks containing max-pooling operators. The result is passed to two equivariant convolution layers and then to an average pooling layer. 5.6.3. Place models f and ψ. Our place model has two equivariant convolution networks, f and ψ, and both have architectures similar to f_p. The network f takes as input a zero-padded version of the 4-channel RGB-D observation o_t, pad(o_t) ∈ R^{4×(H+d)×(W+d)}, and generates a dense feature map, f(pad(o_t)) ∈ R^{(H+d)×(W+d)}, where d is the padding size. The network ψ takes as input the image patch c ∈ R^{4×H_2×W_2} and outputs ψ(c) ∈ R^{H_2×W_2}. After applying the rotations of C_n to ψ(c), the transformed dense embeddings Ψ(c) ∈ R^{n×H_2×W_2} are cross-correlated with f(pad(o_t)) to generate the placing action distribution p(a_place | o_t, a_pick) ∈ R^{n×H×W}, where the channel axis n corresponds to the placing angles 2πi/n for 0 ≤ i &lt; n. 5.6.4. Group types and sizes. The networks f_p, ψ, and f each map ρ_0 → ρ_0 and are all defined using C_6 regular representations in the intermediate layers. The ablation study of the group size of the latent features is discussed in the Experiments section. The network f_θ : ρ_0 → ρ_quot is defined using the quotient representation C_36/C_2, which corresponds to the number of allowed pick orientations. The lift operator R_n is implemented with the C_36 cyclic group, which allows 36 different place orientations. Both the number of allowed pick orientations and the number of allowed place orientations are hyper-parameters and could be selected flexibly based on the precision required by the task. Our choice of the π/18 discretization, that is, 18 bilaterally symmetric pick orientations and 36 place orientations, follows the settings of the Ravens-10 benchmark <ref type="bibr">(Zeng et al., 2021)</ref>.</p></div>
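<div xmlns="http://www.tei-c.org/ns/1.0"><p>The orientation discretization described in Sections 5.6.3 and 5.6.4 can be made concrete with a short sketch. The function names place_angles, pick_angles, and decode_place are hypothetical, and decoding the place action by a simple argmax over the n × H × W score volume is an assumption made for illustration; only the angle formula 2πi/n, n = 36, and the 18 pick orientations come from the text.</p>
<p><![CDATA[
```python
import numpy as np


def place_angles(n=36):
    """Place orientations 2*pi*i/n for 0 <= i < n (step pi/18 when n = 36)."""
    return 2 * np.pi * np.arange(n) / n


def pick_angles(n=36):
    """Pick orientations for a bilaterally symmetric gripper.

    The quotient C_n / C_2 keeps n // 2 orientations covering [0, pi)
    with the same step 2*pi/n.
    """
    return 2 * np.pi * np.arange(n // 2) / n


def decode_place(scores):
    """Return (u, v, theta) of the highest-scoring entry of an (n, H, W) array.

    Taking the argmax is an illustrative choice, not a statement about how
    the action is selected in the original system.
    """
    n = scores.shape[0]
    i, u, v = np.unravel_index(np.argmax(scores), scores.shape)
    return int(u), int(v), 2 * np.pi * i / n


scores = np.zeros((36, 8, 8))
scores[9, 2, 5] = 1.0
u, v, theta = decode_place(scores)  # channel 9 of 36 is a quarter turn
```
]]></p>
<p>With n = 36, both the pick and the place heads share the π/18 angular resolution, while the pick head needs only half as many channels.</p></div>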
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.5.">Training details.</head><p>We train Equivariant Transporter Network with the Adam <ref type="bibr">(Kingma and Ba, 2014)</ref> optimizer with a fixed learning rate of 10^−4. It takes about 0.8 s to complete one SGD step with a batch size of one on an NVIDIA Tesla V100 SXM2 GPU. Compared with the baseline Transporter Net, which takes around 0.6 s to complete one SGD step in the same setting, the equivariant constraint on the weight update increases the computational load per step by about 33%. Nevertheless, Equivariant Transporter Net converges faster than the baseline Transporter Net, as shown in Figure <ref type="figure">9</ref>. This is because the larger symmetry group results in a lower-dimensional sample space and thus better coverage by the training data. For each task, we train a single-policy network and evaluate the performance every 1k steps on 100 unseen tests. On most tasks, the best performance is achieved in less than 10k SGD steps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Experiments</head><p>We evaluate Equivariant Transporter using the Ravens-10 Benchmark <ref type="bibr">(Zeng et al., 2021)</ref> and our variations thereof.</p><p>6.1. Tasks 6.1.1. Ravens-10 tasks. Ravens-10 is a behavior cloning simulation benchmark for manipulation, where each task has an oracle that can sample expert demonstrations from the distribution of successful picking and placing actions with access to the ground-truth pose of each object. The 10 tasks of Ravens can be classified into three categories: single-object manipulation tasks (block-insertion, align-box-corner); multiple-object manipulation tasks (place-red-in-green, towers-of-Hanoi, stack-block-pyramid, palletizing-boxes, assembling-kits, packing-boxes); and deformable-object manipulation tasks (manipulating-rope, sweeping-piles).</p><p>Here we provide a short description of the Ravens-10 environment; we refer readers to <ref type="bibr">Zeng et al. (2021)</ref> for details. The poses of objects and placements in each task are randomly sampled in the workspace without collision. Performance on each task is evaluated in one of two ways:</p><p>(1) Pose: the translation and rotation errors relative to the target pose are less than the thresholds τ = 1 cm and ω = π/12, respectively. Tasks: block-insertion, towers-of-Hanoi, place-red-in-green, align-box-corner, stack-block-pyramid, assembling-kits. Partial scores are assigned to multiple-action tasks. (2) Zone: Ravens-10 discretizes the 3D bounding box of each object into 2 cm³ voxels. The total reward is calculated as (number of voxels in the target zone)/(total number of voxels). Tasks: palletizing-boxes, packing-boxes, manipulating-rope, sweeping-piles. Note that pushing objects could also be parameterized with a pick and a place that correspond to the starting pose and the ending pose of the end effector.</p><p>1. Block-insertion: picking up an L-shaped block and placing it into an L-shaped fixture.</p><p>2. Place-red-in-green: picking up red cubes and placing them into green bowls. There could be multiple bowls and cubes with different colors.</p><p>3. Towers-of-Hanoi: sequentially picking up disks and placing them onto pegs such that all three disks initialized on one peg are moved to another, and that only smaller disks can be on top of larger ones.</p><p>4. Align-box-corner: picking up a randomly sized box and placing it so that one of its corners aligns with a green L-shaped marker on the tabletop. This task requires precision and the ability to generalize to new box sizes.</p><p>5. Stack-block-pyramid: sequentially picking up six blocks and stacking them into a 3-2-1 pyramid.</p><p>6. Palletizing-boxes: picking up 18 boxes and stacking them on top of a pallet.</p><p>7. Assembling-kits: picking five shaped objects (randomly sampled with replacement from a set of 20 objects, where 14 objects are used during training and six objects are held out for testing) and fitting them to the corresponding silhouettes of the objects on a board. This task requires generalizing to new objects.</p><p>8. Packing-boxes: picking and placing randomly sized boxes tightly into a randomly sized container.</p><p>9. Manipulating-rope: manipulating a deformable rope such that it connects the two endpoints of an incomplete 3-sided square (colored in green).</p><p>10. Sweeping-piles: pushing piles of small objects (randomly initialized) into a desired target goal zone on the tabletop marked with green boundaries. The task is implemented with a pad-shaped end effector.</p></div>
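<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two evaluation modes of Section 6.1.1 can be sketched as follows. This is a simplified illustration rather than the benchmark's own code: poses are assumed to be planar (x, y, θ) tuples in meters and radians, object symmetries are ignored, and the names angle_diff, pose_success, and zone_reward are hypothetical. The thresholds τ = 1 cm and ω = π/12 and the voxel-ratio reward are taken from the text.</p>
<p><![CDATA[
```python
import math


def angle_diff(a, b):
    """Signed difference a - b wrapped to [-pi, pi)."""
    return (a - b + math.pi) % (2 * math.pi) - math.pi


def pose_success(pose, target, tau=0.01, omega=math.pi / 12):
    """Pose criterion: translation error < tau (m) and rotation error < omega (rad).

    pose and target are (x, y, theta) tuples; this planar form is an
    assumption made for the sketch.
    """
    translation_error = math.hypot(pose[0] - target[0], pose[1] - target[1])
    rotation_error = abs(angle_diff(pose[2], target[2]))
    return translation_error < tau and rotation_error < omega


def zone_reward(voxels_in_zone, total_voxels):
    """Zone criterion: fraction of an object's voxels inside the target zone."""
    return voxels_in_zone / total_voxels


ok = pose_success((0.005, 0.0, 0.1), (0.0, 0.0, 0.0))    # 5 mm, 0.1 rad off
too_far = pose_success((0.02, 0.0, 0.0), (0.0, 0.0, 0.0))  # 2 cm off
```
]]></p>
<p>Wrapping the angular difference matters near the 0/2π boundary, where two nearly identical orientations would otherwise appear almost a full turn apart.</p></div>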
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Ravens-10 tasks modified for the parallel-jaw gripper</head><p>We select five tasks (block-insertion, align-box-corner, place-red-in-green, stack-block-pyramid, palletizing-boxes) from Ravens-10 and replace the suction cup with the Franka Emika gripper, which requires additional picking angle inference. Figure <ref type="figure">7</ref> illustrates the initial state and completion state for each of these five tasks. For each of these five tasks, we defined an oracle agent. Since the Transporter Net framework assumes that the object does not move during picking, we defined these expert generators such that this was the case.</p><p>6.2.1. Goal-conditioned tasks. We design four image-based goal-conditioned tasks (goal-conditioned block-insertion, goal-conditioned block-pyramid, goal-conditioned four blocks-no, goal-conditioned cable-align) based on Ravens-10, as shown in Figure <ref type="figure">8</ref>. For each of the four tasks, objects are generated with random poses in the workspace and there is no placement in the observation o_t. The robot must use pick-place actions to move the objects to the target poses specified in the goal images. For the goal-conditioned cable-align task, the robot needs to align the rope to the straight line shown in the goal image.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Training and evaluation</head><p>6.3.1. Training. For each task, we produce a dataset of n expert demonstrations, where each demonstration contains a sequence of one or more observation-action pairs (o_t, a_t) (or (o_t, o_g, a_t) for goal-conditioned tasks). Each action a_t contains an expert picking action a_pick and an expert placing action a_place. We use a_t to generate one-hot pixel maps as the ground-truth labels for our picking model and placing model. The models are trained using a cross-entropy loss.</p><p>6.3.2. Metrics. We measure performance the same way as it was measured in Transporter Net <ref type="bibr">(Zeng et al., 2021</ref>), using a metric in the range of 0 (failure) to 100 (success). Partial scores are assigned to multiple-action tasks. For example, in the block-stacking task where the agent needs to construct a 6-block pyramid, each successful rearrangement is credited with a score of 16.667. We report the highest validation performance during training, averaging over 100 unseen tests for each task. 6.3.3. Baselines. We compare our method against Transporter Net <ref type="bibr">(Zeng et al., 2021)</ref> as well as the following baselines previously used in the Transporter Net paper <ref type="bibr">(Zeng et al., 2021)</ref>. Form2Fit <ref type="bibr">(Zakka et al., 2020)</ref> introduces a matching module that measures the L_2 distance between high-dimensional descriptors of picking and placing locations. Conv-MLP is a common end-to-end model <ref type="bibr">(Levine et al., 2016)</ref> which outputs a pick and a place using convolution layers and MLPs (multi-layer perceptrons). GT-State MLP is a regression model composed of an MLP that accepts the ground-truth poses and 3D bounding boxes of objects in the environment. GT-State MLP 2-step outputs the actions sequentially with two MLP networks and feeds a_pick to the second step. All regression baselines learn mixture densities <ref type="bibr">(Bishop, 1994)</ref> with a log-likelihood loss.</p><p>For goal-conditioned tasks, we compare Equivariant-Transporter-Goal-Stack with two baselines, Transporter-Goal-Stack and Transporter-Goal-Split. 6.3.4. Adaptation of Transporter Net for picking using a parallel-jaw gripper. In order to compare our method against Transporter Net on the five parallel-jaw gripper tasks, we must modify Transporter Net to handle the gripper. We accomplish this by lifting the input scene image o_t over C_n <ref type="bibr">(Zeng, Song, Welker, Lee, Rodriguez and Funkhouser, 2018a)</ref>, producing a stack of differently oriented input images which is provided as input to the pick network f_pick. The results are then counter-rotated at the output of f_pick, and each channel corresponds to one pick orientation.</p><p>6.4. Results for the Ravens-10 benchmark tasks 6.4.1. Task success rates. Table <ref type="table">1</ref> shows the performance of our model on the Ravens-10 tasks for different numbers of demonstrations used during training. All tests are evaluated on unseen configurations, that is, random poses of objects, and three tasks (align-box-corner, assembling-kits, packing-boxes) use unseen objects. Our proposed Equivariant Transporter Net outperforms all the other baselines in nearly all cases. The amount by which our method outperforms the others is largest when the number of demonstrations is smallest, that is, with only one or 10 demonstrations. With just 10 demonstrations per task, our method can achieve a ≥95% success rate on 7/10 tasks. With either one or 10 demonstrations, the performance of our model is better than that of baselines trained with 1000 demonstrations on 5/10 tasks.</p></div>
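<div xmlns="http://www.tei-c.org/ns/1.0"><p>The supervision described in Section 6.3.1, a one-hot pixel map trained with a cross-entropy loss, can be sketched in a few lines. The snippet is a minimal NumPy illustration under the assumption that the softmax is taken over the whole map (all pixels for the pick map, all orientation-pixel entries for the place map); the names one_hot_map and cross_entropy are hypothetical and no training framework is implied.</p>
<p><![CDATA[
```python
import numpy as np


def one_hot_map(shape, index):
    """Ground-truth label: zeros everywhere except a 1 at the expert action."""
    label = np.zeros(shape)
    label[index] = 1.0
    return label


def cross_entropy(logits, label):
    """Cross-entropy between a softmax over all entries of logits and label."""
    z = logits.ravel() - logits.max()      # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum())    # log-softmax over the whole map
    return float(-(label.ravel() * log_p).sum())


# Pick label on an H x W map and place label on an n x H x W volume.
pick_label = one_hot_map((4, 5), (1, 2))
place_label = one_hot_map((36, 4, 5), (9, 1, 2))

# With uninformative (all-zero) logits the loss equals log(number of entries).
pick_loss = cross_entropy(np.zeros((4, 5)), pick_label)
place_loss = cross_entropy(np.zeros((36, 4, 5)), place_label)
```
]]></p>
<p>Treating the place orientation as an extra axis of the same one-hot volume lets location and angle be supervised jointly by a single loss term.</p></div>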
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.2.">Training efficiency.</head><p>Another interesting consequence of our more structured model is that training is much faster. Figure <ref type="figure">9</ref> shows task success rates as a function of the number of SGD steps for two tasks (Block-Insertion and Sweeping-Piles). Our equivariant model converges much faster in both cases. It indicates that the large symmetry group enables the model to learn on a low-dimension space and achieve better convergence speed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5.">Results for parallel-jaw gripper tasks</head><p>Table 2 compares the success rate of Equivariant Transporter with the baseline Transporter Net on parallel-jaw gripper tasks. Again, our method outperforms the baseline in nearly all cases. One interesting observation that can be made by comparing Tables <ref type="table">1</ref> and <ref type="table">2</ref> is that both Equivariant Transporter and the baseline Transporter do better on the gripper versions of the tasks than on the original Ravens-10 versions. This is likely because the expert demonstrations we developed for the gripper versions of the tasks have less stochastic gripper poses during picking than those in the original Ravens-10 benchmark.</p></div>
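<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.6.">Results for goal-conditioned equivariant transporter net</head><p>Table <ref type="table">3</ref> compares the performance of Equivariant-Transporter-Goal-Stack with the two baselines (Transporter-Goal-Stack, Transporter-Goal-Split) on goal-conditioned tasks. Our model performs better than the baselines on all the tasks. In most cases, the performance gap between our model and the baselines becomes larger as the number of demonstrations decreases. This shows that the sample efficiency of our proposed method carries over to goal-conditioned tasks.</p><p>6.7. Ablation study 6.7.1. Ablations. We performed an ablation study to evaluate the relative importance of the equivariant models in</p>
<table><head>Table <ref type="table">1</ref>. Performance comparisons on the Ravens-10 benchmark (suction gripper). Success rate (mean %) versus the number of demonstration episodes (1, 10, 100, or 1000) used in training.</head>
<row role="label"><cell>Method</cell><cell cols="4">Block-insertion</cell><cell cols="4">Place-red-in-green</cell><cell cols="4">Towers-of-Hanoi</cell><cell cols="4">Align-box-corner</cell><cell cols="4">Stack-block-pyramid</cell></row>
<row role="label"><cell>Demonstrations</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell></row>
<row><cell>Equivariant transporter</cell><cell>100</cell><cell>100</cell><cell>100</cell><cell>100</cell><cell>98.5</cell><cell>100</cell><cell>100</cell><cell>100</cell><cell>88.1</cell><cell>95.7</cell><cell>100</cell><cell>100</cell><cell>41.0</cell><cell>99.0</cell><cell>100</cell><cell>100</cell><cell>34.6</cell><cell>80.0</cell><cell>90.8</cell><cell>95.1</cell></row>
<row><cell>Transporter network</cell><cell>100</cell><cell>100</cell><cell>100</cell><cell>100</cell><cell>84.5</cell><cell>100</cell><cell>100</cell><cell>100</cell><cell>73.1</cell><cell>83.9</cell><cell>97.3</cell><cell>98.1</cell><cell>35.0</cell><cell>85.0</cell><cell>97.0</cell><cell>98.0</cell><cell>13.3</cell><cell>42.6</cell><cell>56.2</cell><cell>78.2</cell></row>
<row><cell>Form2Fit</cell><cell>17.0</cell><cell>19.0</cell><cell>23.0</cell><cell>29.0</cell><cell>83.4</cell><cell>100</cell><cell>100</cell><cell>100</cell><cell>3.6</cell><cell>4.4</cell><cell>3.7</cell><cell>7.0</cell><cell>7.0</cell><cell>2.0</cell><cell>5.0</cell><cell>16.0</cell><cell>19.7</cell><cell>17.5</cell><cell>18.5</cell><cell>32.5</cell></row>
<row><cell>Conv. MLP</cell><cell>0.0</cell><cell>5.0</cell><cell>6.0</cell><cell>8.0</cell><cell>0.0</cell><cell>3.0</cell><cell>25.5</cell><cell>31.3</cell><cell>0.0</cell><cell>1.0</cell><cell>1.9</cell><cell>2.1</cell><cell>0.0</cell><cell>2.0</cell><cell>1.0</cell><cell>1.0</cell><cell>0.0</cell><cell>1.8</cell><cell>1.7</cell><cell>1.7</cell></row>
<row><cell>GT-state MLP</cell><cell>4.0</cell><cell>52.0</cell><cell>96.0</cell><cell>99.0</cell><cell>0.0</cell><cell>0.0</cell><cell>3.0</cell><cell>82.2</cell><cell>10.7</cell><cell>10.7</cell><cell>6.1</cell><cell>5.3</cell><cell>47.0</cell><cell>29.0</cell><cell>29.0</cell><cell>59.0</cell><cell>0.0</cell><cell>0.2</cell><cell>1.3</cell><cell>15.3</cell></row>
<row><cell>GT-state MLP 2-step</cell><cell>6.0</cell><cell>38.0</cell><cell>95.0</cell><cell>100</cell><cell>0.0</cell><cell>0.0</cell><cell>19.0</cell><cell>92.8</cell><cell>22.0</cell><cell>6.4</cell><cell>5.6</cell><cell>3.1</cell><cell>49.0</cell><cell>12.0</cell><cell>43.0</cell><cell>55.0</cell><cell>0.0</cell><cell>0.8</cell><cell>12.2</cell><cell>17.5</cell></row>
<row role="label"><cell>Method</cell><cell cols="4">Palletizing-boxes</cell><cell cols="4">Assembling-kits</cell><cell cols="4">Packing-boxes</cell><cell cols="4">Manipulating-rope</cell><cell cols="4">Sweeping-piles</cell></row>
<row role="label"><cell>Demonstrations</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell><cell>1</cell><cell>10</cell><cell>100</cell><cell>1000</cell></row>
<row><cell>Equivariant transporter</cell><cell>75.3</cell><cell>98.9</cell><cell>99.6</cell><cell>99.6</cell><cell>63.8</cell><cell>90.6</cell><cell>98.6</cell><cell>100</cell><cell>98.3</cell><cell>99.4</cell><cell>99.6</cell><cell>100</cell><cell>31.0</cell><cell>85.0</cell><cell>92.3</cell><cell>98.4</cell><cell>97.9</cell><cell>99.5</cell><cell>100</cell><cell>100</cell></row>
<row><cell>Transporter network</cell><cell>63.2</cell><cell>77.4</cell><cell>91.7</cell><cell>97.9</cell><cell>28.4</cell><cell>78.6</cell><cell>90.4</cell><cell>94.6</cell><cell>56.8</cell><cell>58.3</cell><cell>72.1</cell><cell>81.3</cell><cell>21.9</cell><cell>73.2</cell><cell>85.4</cell><cell>92.1</cell><cell>52.4</cell><cell>74.4</cell><cell>71.5</cell><cell>96.1</cell></row>
<row><cell>Form2Fit</cell><cell>21.6</cell><cell>42.0</cell><cell>52.1</cell><cell>65.3</cell><cell>3.4</cell><cell>7.6</cell><cell>24.2</cell><cell>37.6</cell><cell>29.9</cell><cell>52.5</cell><cell>62.3</cell><cell>66.8</cell><cell>11.9</cell><cell>38.8</cell><cell>36.7</cell><cell>47.7</cell><cell>13.2</cell><cell>15.6</cell><cell>26.7</cell><cell>38.4</cell></row>
<row><cell>Conv. MLP</cell><cell>31.4</cell><cell>37.4</cell><cell>34.6</cell><cell>32.0</cell><cell>0.0</cell><cell>0.2</cell><cell>0.2</cell><cell>0.0</cell><cell>0.3</cell><cell>9.5</cell><cell>12.6</cell><cell>16.1</cell><cell>3.7</cell><cell>6.6</cell><cell>3.8</cell><cell>10.8</cell><cell>28.2</cell><cell>48.4</cell><cell>44.9</cell><cell>45.1</cell></row>
<row><cell>GT-state MLP</cell><cell>0.6</cell><cell>6.4</cell><cell>30.2</cell><cell>30.1</cell><cell>0.0</cell><cell>0.0</cell><cell>1.2</cell><cell>11.8</cell><cell>7.1</cell><cell>1.4</cell><cell>33.6</cell><cell>56.0</cell><cell>5.5</cell><cell>11.5</cell><cell>43.6</cell><cell>47.4</cell><cell>7.2</cell><cell>20.6</cell><cell>63.2</cell><cell>74.4</cell></row>
<row><cell>GT-state MLP 2-step</cell><cell>0.6</cell><cell>9.6</cell><cell>32.8</cell><cell>37.5</cell><cell>0.0</cell><cell>0.0</cell><cell>1.6</cell><cell>4.4</cell><cell>4.0</cell><cell>3.5</cell><cell>43.4</cell><cell>57.1</cell><cell>6.0</cell><cell>8.2</cell><cell>41.5</cell><cell>58.7</cell><cell>9.7</cell><cell>21.4</cell><cell>66.2</cell><cell>73.9</cell></row>
</table></div>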
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.8.">Experiments on a physical robot</head><p>We evaluated Equivariant Transporter on a physical robot in our lab. The simulator was not used in this experiment: all demonstrations were performed on the real robot.</p><p>6.8.1. Setup. We used a UR5 robot with a Robotiq-85 end effector. The workspace was a 40 cm × 40 cm region on a table beneath the robot (see Figure <ref type="figure">12</ref>). The observations o_t were 200 × 200 depth images obtained using an Occipital Structure Sensor that was mounted pointing directly down at the table (see Figure <ref type="figure">11</ref>).</p><p>6.8.2. Tasks. We evaluated Equivariant Transporter on three of the Ravens-10 gripper-modified tasks: block-insertion, placing boxes in bowls, and stacking a pyramid of blocks. Since our sensor only measures depth (and not RGB), we modified the box-in-bowls task such that box color was irrelevant to success, that is, the task is simply to put any box into a bowl.</p><p>6.8.3. Demonstrations. We obtained 10 human demonstrations of each task. These demonstrations were obtained by releasing the UR5 brakes and pushing the arm physically so that the harmonic actuators were back-driven.</p><p>6.8.4. Training and evaluation. For each task, a single-policy agent was trained for 10k SGD steps. During testing, objects were randomly placed on the table. A task was considered to have failed when a single incorrect pick or place occurred. We evaluated the model on 20 unseen configurations of each task.</p><p>6.8.5. Results. Table <ref type="table">5</ref> shows results from 20 runs of each of the three tasks. Notice that the success rates here are higher than they were for the corresponding tasks performed in simulation (Table <ref type="table">2</ref>). This is likely because the criteria for task success in simulation (less than 1 cm translational error and less than π/12 rotation error) were more conservative than what is actually required in the real world.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.9.">Discussion</head><p>Equivariant networks are built on top of conventional convolution kernels with the steerability constraint. The constraint does not break the mechanism of weight sharing and updating and thus preserves the robustness of learning and reasoning of CNNs. As shown in Figure <ref type="figure">11</ref>, Equivariant Transporter Net can handle real-sensor noise. Frequently, the crop c contains multiple objects. For instance, on the stack-block-pyramid task shown in Figure <ref type="figure">12</ref>, the crop includes not only the block to be picked but also neighboring blocks or parts of them. During training, data augmentation also generates images with partially observed objects. For example, on the block-insertion task, it shifts part of the L-shaped block or the slot out of the scene. Some special shapes, such as elongated objects, might be difficult to represent with an image crop and may be better suited to the goal-conditioned version of our model. Some high-precision tasks, such as gear assembly, are more sensitive to discretization and may be tackled more easily with the SO(2) version of our model. Compared to their ability to learn pick and place skills efficiently, the one-shot generalization and sequential decision-making abilities of both Transporter Net and Equivariant Transporter Net seem less compelling. As shown in Table <ref type="table">1</ref>, they achieved less than a 50% success rate when trained with one demonstration on the align-box-corner task, which requires the agent to generalize the skill to boxes with random sizes and colors at test time. The performance on the stack-block-pyramid task trained with one demonstration is below 40%, since a collapse may leave some blocks tilted, which yields out-of-distribution data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion and limitations</head><p>This paper explores the symmetries present in the pick and place problem and finds that they can be described by pick symmetry and place symmetry, which correspond to the group of different pick and place poses. We evaluate the Transporter Network model proposed in <ref type="bibr">Zeng et al. (2021)</ref> and find that it encodes half of the place symmetry (the C_n-place symmetry). We propose a novel version of Transporter Net, Equivariant Transporter Net, which we show encodes both types of symmetries. The large symmetry group could also be extended to solve goal-conditioned tasks. We evaluate our model on the Ravens-10 Benchmark and its variations and compare it against multiple strong baselines. Finally, we demonstrate that the method can effectively be used to learn manipulation policies on a physical robot. One limitation of our framework as it is presented in this paper is that it relies entirely on behavior cloning. A clear direction is to integrate more on-policy learning, which we believe would enable us to handle more complex tasks. Other interesting directions include a multi-task language-conditioned equivariant agent, a closed-loop policy, and a 3D Equivariant Transporter Net.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Declaration of conflicting interests</head><p>The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.</p></div></body>
		</text>
</TEI>
