<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Symmetric Models for Visual Force Policy Learning</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>05/13/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10558745</idno>
					<idno type="doi">10.1109/ICRA57147.2024.10610728</idno>
					
					<author>Colin Kohler</author><author>Anuj Shrivatsav Srikanth</author><author>Eshan Arora</author><author>Robert Platt</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[While it is generally acknowledged that force feedback is beneficial to robotic control, applications of policy learning to robotic manipulation typically only leverage visual feedback. Recently, symmetric neural models have been used to significantly improve the sample efficiency and performance of policy learning across a variety of robotic manipulation domains. This paper explores an application of symmetric policy learning to visual-force problems. We present Symmetric Visual Force Learning (SVFL), a novel method for robotic control which leverages visual and force feedback. We demonstrate that SVFL can significantly outperform state-of-the-art baselines for visual force learning and report several interesting empirical findings related to the utility of learning force feedback control policies in both general manipulation tasks and scenarios with low visual acuity.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>There are a variety of manipulation tasks where it is essential to use both vision and force feedback as part of the control policy. Peg insertion with tight tolerances, for example, is a task that is nearly impossible to solve without leveraging force feedback in some form. The classical approach is to use an admittance controller with a remote center of compliance to help the peg slide into the hole <ref type="bibr">[1]</ref>. However, this is a very limited use of force feedback, and it should be possible to use force information in a more comprehensive way. One of the core challenges is the difficulty of simulating the complex force interactions at the robot end effector, which depend primarily on the contact modeling used by the physics engine. While there have been major efforts to improve contact simulation by refining the contact geometry, friction model, and contact constraints <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>, state-of-the-art physics engines often violate real-world physical constraints, limiting the applicability of simulation-based models to the real world.</p><p>An obvious alternative approach is to leverage machine learning, i.e., model-free reinforcement learning (RL), to obtain force-feedback-assisted policies. This is in contrast to vision-only RL, where the policy takes only visual feedback <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>. Visual force RL makes it possible to adapt control policies directly to the mechanical characteristics of the system as they exist in the physical world, without the need to model those dynamics first. However, this assumes that we can run RL online directly in the physical world, something that is hard to do due to the poor sample efficiency of RL. 
RL is well known to require an enormous amount of data in order to learn even simple policies effectively. While visual force RL might, in principle, be able to learn effective policies, this sample inefficiency prevents us from learning policies directly on physical equipment.</p><p>In order to improve the sample efficiency of RL in visual force problems, one common approach is to learn a helpful latent representation during a pretraining phase <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>. This generally takes the form of self-supervised robot "play" in the domain of interest that must precede actual policy learning. Unfortunately, this is both cumbersome and brittle, as the latent representation does not generalize well outside the situations experienced during the play phase. This is especially prevalent in the visual force domain: the noisy nature of force sensors means there will be many force observations not experienced during pretraining, leading to poor latent predictions during policy learning.</p><p>This paper develops an alternative approach to the problem of visual force learning based on exploiting domain symmetries using equivariant learning <ref type="bibr">[13]</ref>. Recently, symmetric neural networks have been shown to dramatically improve the sample efficiency of RL in robotic manipulation domains <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>. However, this work has focused exclusively on visual feedback and has not yet been applied to visual force learning. 
This is of particular note as the domain symmetries used by these equivariant models are also present in both force and proprioceptive data. This can be seen in Fig. <ref type="figure">1</ref>, where the image, force, and proprioceptive observations all transform consistently under rotations in the xy plane. This paper makes three main contributions. First, we propose a novel method for visual force policy learning called Symmetric Visual Force Learning (SVFL) which exploits the underlying symmetry of manipulation tasks to improve sample efficiency and performance. Second, we empirically evaluate the importance of force feedback assisted control across a variety of manipulation domains and find that force feedback is helpful for nearly all tasks, not just contact-rich domains like peg insertion where we would expect it to be important. Finally, we explore the role of force-assisted policies in domains with low visual acuity and characterize the degree to which force models can compensate for poor visual information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>Contact-Rich Manipulation. Contact-rich manipulation tasks, e.g., peg insertion, screw fastening, and edge following, are well-studied areas in robotic manipulation due to their prevalence in manufacturing domains. These tasks are often solved by hand-engineered policies which utilize force feedback and very accurate state estimation <ref type="bibr">[1]</ref>, resulting in policies that perform well in structured environments but do not generalize to the large range of real-world variability. More recent work has proposed the use of reinforcement learning to address this by training neural network policies which combine vision and proprioception <ref type="bibr">[6]</ref>, <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>. However, while these methods have been shown to perform well across a variety of domains and task variations, they require a high level of visual acuity, such that the task is solvable solely using image observations. In practice, this means these methods are unsuitable for a large portion of contact-rich manipulation tasks, which require a high degree of precision and often include visual obstructions. Multimodal Learning. A common approach to multimodal learning is to first learn a latent dynamics model which compactly represents the high-dimensional observations and then use this model for policy learning. This technique has recently been adapted to various robotics domains to combine heterogeneous data. <ref type="bibr">[18]</ref> combine vision and haptic information using a GAN but do not utilize their latent representation for manipulation policies. <ref type="bibr">[12]</ref> and <ref type="bibr">[11]</ref> learn physics models from cross-modal visual-tactile data for a series of tasks but do not use this learned representation for either a hand-crafted policy or policy learning. 
Our work is most closely related to <ref type="bibr">[10]</ref>, <ref type="bibr">[9]</ref>, which we use as baselines in this work. <ref type="bibr">[10]</ref> combine vision, force, and proprioceptive data using a variational latent model learned from self-supervision and use this model to learn a policy for peg insertion. <ref type="bibr">[9]</ref> learn a multimodal latent heatmap using a cross-modal visuo-tactile transformer (VTT) which distributes attention spatially. They show that by combining VTT with stochastic latent actor critic (SLAC), they can learn policies that solve a number of manipulation tasks. In comparison to these works, we propose a sample-efficient deterministic multimodal representation that is learned end-to-end without the need for pretraining. Equivariant Neural Networks. Equivariant networks were first introduced as G-Convolutions <ref type="bibr">[19]</ref> and Steerable CNNs <ref type="bibr">[13]</ref>, <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref>. Since their inception they have been applied across varied datatypes including images <ref type="bibr">[20]</ref>, spherical data <ref type="bibr">[22]</ref>, <ref type="bibr">[23]</ref>, and point clouds <ref type="bibr">[24]</ref>. More recent work has expanded the use of equivariant networks to reinforcement learning <ref type="bibr">[15]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[25]</ref> and robotics <ref type="bibr">[26]</ref>, <ref type="bibr">[27]</ref>, <ref type="bibr">[28]</ref>, <ref type="bibr">[29]</ref>. Compared to these prior works, which focus on a single data modality, this work studies the effectiveness of combining heterogeneous datatypes while preserving the symmetry inherent in each data modality.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>III. BACKGROUND</head><p>Equivariant Neural Networks. A function is equivariant if it respects the symmetries of its input and output spaces. 
Specifically, a function f : X &#8594; Y is equivariant with respect to a symmetry group G if it commutes with all transformations g &#8712; G: f (&#961; x (g)x) = &#961; y (g)f (x), where &#961; x and &#961; y are the representations of the group G that define how the group element g &#8712; G acts on x &#8712; X and y &#8712; Y , respectively. An equivariant function is a mathematical way of expressing that f is symmetric with respect to G: if we evaluate f on transformed versions of the same input, we obtain correspondingly transformed versions of the same output. Although this symmetry can be learned <ref type="bibr">[30]</ref>, in this work we require the symmetry group G and representation &#961; x to be known at design time. For example, in a convolutional model, this can be accomplished by tying the kernel weights together so that they satisfy K(gy) = &#961; out (g)K(y)&#961; in (g)^-1, where &#961; in and &#961; out denote the representations of the group at the input and output of the layer <ref type="bibr">[31]</ref>. End-to-end equivariant models can be constructed by combining equivariant convolutional layers and equivariant activation functions. In order to leverage symmetry in this way, it is common to transform the input so that standard group representations work correctly, e.g., to transform an image to a top-down view so that image rotations correspond to object rotations. Extrinsic Equivariance. Real-world problems often contain symmetry corruptions such as oblique viewing angles and occlusions. This is particularly prevalent in robotics domains, where the state of the world is rarely fully observable. In these domains we consider the symmetry to be latent: we know that some symmetry is present in the problem but cannot easily express how that symmetry acts in the input space. 
We refer to this relationship as extrinsic equivariance <ref type="bibr">[25]</ref>, in which the equivariance constraint in the network enforces equivariance on out-of-distribution data. While extrinsic equivariance is not ideal, it does not necessarily increase error and has been shown to provide significant performance improvements in reinforcement learning <ref type="bibr">[25]</ref>.</p></div>
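The kernel constraint above can be made concrete with a small, self-contained numerical check of f(&#961;x(g)x) = &#961;y(g)f(x). The sketch below (our toy code, not the paper's; SVFL uses the escnn library) builds a C4 "lifting" correlation from four rotated copies of one kernel and verifies that rotating the input rotates each output map and cyclically permutes the group channels:

```python
import numpy as np

def correlate2d_valid(x, k):
    """Plain 'valid' cross-correlation of a 2-D image with a 2-D kernel."""
    H, W = x.shape
    h, w = k.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + h, j:j + w] * k)
    return out

def lift_c4(x, k):
    """Lifting correlation for the rotation group C4: correlate the image
    with all four rotated copies of the kernel, producing one output
    channel per group element (a regular-representation feature map)."""
    return np.stack([correlate2d_valid(x, np.rot90(k, r)) for r in range(4)])

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7))
k = rng.standard_normal((3, 3))

y = lift_c4(x, k)                 # f(x)
y_rot = lift_c4(np.rot90(x), k)   # f(rho_x(g) x), g = 90-degree rotation

# Equivariance: rotating the input rotates each output map spatially and
# cyclically permutes the group channels (rho_y(g) acts on both axes).
for r in range(4):
    assert np.allclose(y_rot[r], np.rot90(y[(r - 1) % 4]))
```

Tying the four channels to rotated copies of a single kernel is exactly the weight-tying K(gy) = &#961;out(g)K(y)&#961;in(g)^-1 specializes to for C4 with a trivial input representation.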
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. APPROACH</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Problem Statement</head><p>We model the visual force control problem as a discrete-time finite-horizon Markov decision process (MDP), M = (S, A, T, R, &#947;), where states s &#8712; S encode visual, force, and proprioceptive data and actions a &#8712; A command small end effector displacements. The MDP transitions at a frequency of 20 Hz, and the commanded hand displacements are provided as positional inputs to a lower-level Cartesian-space admittance controller that runs at 500 Hz with a fixed stiffness. The hand is constrained to point straight down at the table (along the -z direction).</p><p>State is a tuple s = (I, f, e) &#8712; S. I &#8712; R 4&#215;h&#215;w is a 4-channel RGB-D image captured from a fixed camera pointed at the workspace. f encodes recent readings of the wrist force-torque sensor, with planar and vertical force and moment components (f xy , f z , m xy , m z ), and e = (e xy , e z , e &#955; ) encodes the end effector position and gripper aperture (proprioception).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. O(2) Symmetries in Visual Force Domains</head><p>In order to leverage symmetric models for visual force policy learning, we utilize the group invariant MDP framework. A group invariant MDP is an MDP with reward and transition functions that are invariant under the group action, R(s, a) = R(&#961; s (g)s, &#961; a (g)a) and T (s, a, s &#8242; ) = T (&#961; s (g)s, &#961; a (g)a, &#961; s (g)s &#8242; ), for elements of an appropriate symmetry group g &#8712; G <ref type="bibr">[14]</ref>, <ref type="bibr">[32]</ref>. &#961; s and &#961; a are representations of the group G that define how group elements act on state and action. This paper focuses on discrete subgroups of O(2) such as the dihedral groups D 4 or D 8 that represent rotations and reflections in the xy plane, i.e. the plane of the table. We utilize the D 8 group in our experiments.</p><p>In order to express visual force manipulation as a group invariant MDP, we must define how the group operates on state and action such that the transition and reward invariance equalities described above are approximately satisfied. State is s = (I, f, e) = (I, f xy , f z , m xy , m z , e xy , e z , e &#955; ). Since we are focused on rotations and reflections in the plane about the z axis, only the xy variables are affected. Therefore, a group element g &#8712; G acts on s via &#961; s (g)s = (&#961; 0 (g)I, &#961; 1 (g)f xy , f z , &#961; 1 (g)m xy , m z , &#961; 1 (g)e xy , e z , e &#955; ),</p><p>where &#961; 0 (g) is a linear operator that rotates/reflects the pixels in an image by g and &#961; 1 (g) is the standard representation of rotation/reflection in the form of a 2 &#215; 2 orthonormal matrix. Turning to action, a = (&#955;, &#8710;p xy , &#8710;p z , &#8710;p &#952; ), we define &#961; a (g)a = (&#955;, &#961; 1 (g)&#8710;p xy , &#8710;p z , &#8710;p &#952; ). 
Given these definitions, visual force manipulation satisfies the transition and reward invariance constraints, R(s, a) = R(&#961; s (g)s, &#961; a (g)a) and T (s, a, s &#8242; ) = T (&#961; s (g)s, &#961; a (g)a, &#961; s (g)s &#8242; ). This is illustrated for transition invariance in Fig. <ref type="figure">1</ref>.</p></div>
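The state and action representations above can be sketched directly for the quarter-turn subgroup of D 8 , where both the image rotation &#961;0 and the 2&#215;2 standard representation &#961;1 are exact. All function and variable names below are ours, for illustration only:

```python
import numpy as np

# rho_1(g): standard representation of a 90-degree rotation (exact integers).
R90 = np.array([[0, -1],
                [1,  0]])

def rho_s(state):
    """Apply a 90-degree rotation g to s = (I, f_xy, f_z, m_xy, m_z, e_xy, e_z, e_lam).
    Only the xy components transform; the z components and gripper aperture
    e_lam are invariant, and the image rotates pixel-wise (rho_0)."""
    I, f_xy, f_z, m_xy, m_z, e_xy, e_z, e_lam = state
    return (np.rot90(I, axes=(1, 2)),  # rotate each channel of the CHW image
            R90 @ f_xy, f_z,
            R90 @ m_xy, m_z,
            R90 @ e_xy, e_z, e_lam)

def rho_a(action):
    """Apply the same rotation to a = (lambda, dp_xy, dp_z, dp_theta)."""
    lam, dp_xy, dp_z, dp_th = action
    return (lam, R90 @ dp_xy, dp_z, dp_th)

rng = np.random.default_rng(1)
s = (rng.standard_normal((4, 8, 8)), rng.standard_normal(2), 0.3,
     rng.standard_normal(2), -0.1, rng.standard_normal(2), 0.5, 0.04)

# Applying the generator four times is the identity element of the subgroup.
s4 = s
for _ in range(4):
    s4 = rho_s(s4)
assert np.allclose(s4[0], s[0]) and np.allclose(s4[1], s[1]) and np.allclose(s4[5], s[5])
```

Extending this sketch to the full D 8 group would add the 45-degree rotation (which requires image interpolation) and a reflection, each again acting on the xy vectors through a 2&#215;2 orthonormal matrix.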
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Model Architecture</head><p>As we discuss in the next section, we do policy learning using SAC, which requires a critic (a Q-function) and an actor. In our method, both actor and critic employ the same encoder architecture, which encodes state into a latent representation. Since our state s = (I, f, e) &#8712; S is multimodal (i.e. vision, force, and proprioception), our backbone is actually three encoders, the outputs of which are concatenated (Fig. <ref type="figure">2</ref>). The image encoder (top left in Fig. <ref type="figure">2</ref>) is a series of seven equivariant convolutional layers. The force encoder (middle left) is a single equivariant self-attention layer. The proprioceptive encoder (bottom left) is a four-layer equivariant MLP. Each of these encoders respects the equivariance and invariance of its data modality according to the relationships described in Section IV-B. These equivariant networks are implemented using the escnn <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref> library, where all the hidden layers are defined using regular representations.</p><p>The force encoder is of particular note due to its use of single-headed self-attention. The input is a set of T tokens, f &#8712; R T &#215;6 , that encode the most recent T measurements from the force-torque sensor. In order to make this model equivariant, we replace the key, query, and value networks with equivariant models. For the standard implementation of self-attention, Attn = softmax(f W Q (f W K ) T )f W V , the resulting group self-attention operation is equivariant <ref type="bibr">[35]</ref>: Attn(&#915;(g)X f ) = &#915;(g)Attn(X f ),</p><p>where, for simplicity of this analysis, we define &#915; to be the linear representation of the action of a group element g &#8712; G and X f &#8712; R T &#215;6&#215;|G| .<ref type="foot">foot_1</ref> </p></div>
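A toy version of this equivariance property can be checked numerically. In the sketch below (ours, not the paper's code), the group generator acts on the feature axis by a cyclic shift (the regular representation of C4), and the key/query/value matrices are circulant, hence commute with the shift; these circulant weights are a stand-in for escnn's regular-representation linear layers:

```python
import numpy as np

def circulant(v):
    """Circulant matrix C[i, j] = v[(j - i) % n]; circulants commute with
    cyclic shifts, so x -> x @ C is an equivariant linear map for the
    regular representation of C_n."""
    n = len(v)
    return np.stack([np.roll(v, i) for i in range(n)], axis=0)

def attn(X, Wq, Wk, Wv):
    """Standard single-head self-attention over T tokens."""
    scores = X @ Wq @ (X @ Wk).T
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return A @ X @ Wv

n = 4                                 # |C4|: features carry one regular rep
rng = np.random.default_rng(2)
Wq, Wk, Wv = (circulant(rng.standard_normal(n)) for _ in range(3))
X = rng.standard_normal((6, n))       # T = 6 tokens

# Gamma(g): the group generator acts on features by a cyclic permutation.
P = np.roll(np.eye(n), 1, axis=1)

lhs = attn(X @ P, Wq, Wk, Wv)         # Attn(Gamma(g) X)
rhs = attn(X, Wq, Wk, Wv) @ P         # Gamma(g) Attn(X)
assert np.allclose(lhs, rhs)
```

The attention scores are invariant because P is orthogonal and commutes with Wq Wk^T, so the softmax weights do not change; the group action then passes through the value projection, giving Attn(&#915;(g)X) = &#915;(g)Attn(X).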
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Equivariant SAC</head><p>For policy learning, we use Soft Actor Critic (SAC) <ref type="bibr">[36]</ref> combined with the model architecture described above. This can be viewed as a variation of Equivariant SAC <ref type="bibr">[14]</ref> that is adapted to visual force control problems. The policy is a network &#960; : S &#8594; A &#215; A &#963; , where A &#963; is the space of action standard deviations. We define the group action on the action space of the policy network &#257; &#8712; A &#215; A &#963; as &#961; &#257; (g)&#257; = (&#961; a (g)a, a &#963; ), where a &#963; &#8712; A &#963; and g &#8712; G. The actor network &#960; is defined as a mapping s &#8594; &#257; that satisfies the equivariance constraint &#960;(&#961; s (g)s) = &#961; &#257; (g)&#960;(s). The critic is a Double Q-network <ref type="bibr">[37]</ref>, q : S &#215; A &#8594; R, that satisfies the invariance constraint q(&#961; s (g)s, &#961; a (g)a) = q(s, a). Both the actor and critic are implemented using the escnn library. For the critic, the output is a trivial representation. For the actor, the output is a mixed representation containing one standard representation for the (x, y) actions, one signed representation for the &#952; action, and seven trivial representations for the (&#955;, z) actions alongside the action standard deviations.</p><p>TABLE I: Number of trainable parameters in the policy learning tasks. Because they are latent representation learning methods, VTT and PoE share encoders between the actor and the critic, so their parameter counts are smaller than those of SVFL and CNN. Additionally, we utilize a smaller model for PoE, as increasing the size of the PoE model has been shown to worsen performance <ref type="bibr">[9]</ref>. SVFL: 2.4E6; CNN: 2.5E6; VTT: 1.19E6; PoE: 2.9E5.</p></div>
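One way to see the critic's invariance constraint q(&#961;s(g)s, &#961;a(g)a) = q(s, a) is through symmetrization: averaging any function over the group orbit yields an exactly invariant one. SVFL instead enforces the constraint architecturally via escnn's equivariant layers (avoiding the |G|-fold evaluation cost), but the toy sketch below, with our own names and a vector toy state, illustrates the property being enforced:

```python
import numpy as np

# Toy setting: state s = e_xy in R^2, action a = dp_xy in R^2, group C4.
R90 = np.array([[0., -1.], [1., 0.]])
GROUP = [np.linalg.matrix_power(R90, k) for k in range(4)]

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 4))

def q_raw(s, a):
    """An arbitrary (non-invariant) quadratic critic."""
    x = np.concatenate([s, a])
    return float(x @ W @ x)

def q_sym(s, a):
    """Group-averaged critic: exactly C4-invariant by construction,
    since acting with g only permutes the terms of the mean."""
    return np.mean([q_raw(g @ s, g @ a) for g in GROUP])

s = rng.standard_normal(2)
a = rng.standard_normal(2)
g = GROUP[1]
assert abs(q_sym(g @ s, g @ a) - q_sym(s, a)) < 1e-9
```

The same orbit-averaging argument applies to the actor if the output is additionally transformed back by &#961;&#257;(g)&#8315;&#185;; constrained equivariant layers realize both constraints in a single forward pass.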
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENTS</head><p>We performed a series of experiments both in simulation and on physical hardware to validate our approach, Symmetric Visual Force Learning (SVFL). First, we benchmark SVFL's performance in simulation against several alternative approaches. Second, we perform ablations that measure the contributions of different input modalities for different tasks under both ideal and degraded visual observations. Finally, we validate the approach on physical hardware.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Simulated Experiments</head><p>Tasks. We evaluate SVFL across nine manipulation tasks from the BulletArm benchmark <ref type="bibr">[38]</ref>, which uses the PyBullet <ref type="bibr">[39]</ref> simulator: Block Picking, Block Pushing, Block Pulling, Block Corner Pulling, Mug Picking, Household Picking, Peg Insertion, Drawer Opening, and Drawer Closing (Fig. <ref type="figure">3</ref>). The workspace size is 0.4m &#215; 0.4m &#215; 0.26m. The minimum z height is slightly beneath the table, allowing the arm to come in contact with the table. The visual observation is a 4&#215;76&#215;76 image that is cropped to 4&#215;64&#215;64 during training and testing. The force data consists of the most recent 64 readings from the F/T sensor. The maximum movement allowed for any action is limited to &#8710;x, &#8710;y, &#8710;z &#8712; [-2.5cm, 2.5cm], &#8710;&#952; &#8712; [-&#960;/16, &#960;/16], and &#955; &#8712; [e min , e max ], where e min and e max are the joint limits of the gripper. For all tasks, a sparse reward function is used where a reward of +1 is given at task completion and 0 otherwise.</p><p>Baselines. We benchmark our method against two prominent alternative methods for visual force (or visual tactile) learning that have been proposed recently: Visuo-Tactile Transformers (VTT) <ref type="bibr">[9]</ref> and Product of Experts (PoE) <ref type="bibr">[10]</ref>. We also compare against a non-symmetric version of our model (CNN) that is identical in every way except that it does not use equivariant layers. Both PoE and VTT are latent representation methods which rely on a self-supervised pretraining phase to build a compact latent representation of the underlying states, providing increased sample efficiency. Due to this pretraining, these methods represent attractive options for on-robot policy learning. In both baselines we use the encoder architectures proposed in <ref type="bibr">[9]</ref>, which were shown to outperform those in <ref type="bibr">[10]</ref>. PoE encodes the different input modalities independently using separate encoders and combines them using a product of experts <ref type="bibr">[10]</ref>. VTT combines modalities by using cross-modal attention on force to build latent representations that focus attention on important task features in the visual state space <ref type="bibr">[9]</ref>. For further details on these baselines, see <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref>. The latent encoders are pretrained for 10^4 steps on expert data to predict the reconstruction of the state, contact and alignment embeddings, and the reward. All methods use Prioritized Experience Replay (PER) <ref type="bibr">[40]</ref> pre-loaded with 50 episodes of expert data. We augment the transitions by randomly cropping the visual observations and applying random rotations to the full observation. Parameter counts for all methods can be found in Table <ref type="table">I</ref>.</p><p>Results. We compare our method (SVFL) against the two baselines (PoE and VTT) and the non-symmetric model (CNN) on the nine domains described above. Results are shown in Fig. <ref type="figure">4</ref>. All results are averaged over five runs starting from independent random seeds. Compared to the baselines, SVFL has significantly higher success rates and sample efficiency in all cases. We note the lower performance of VTT and PoE relative to <ref type="bibr">[9]</ref> (namely in the Peg Insertion task), which we attribute to our use of sparse reward functions, which have been shown to lead to poor latent representations when using SLAC <ref type="bibr">[41]</ref>.</p></div>
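The two transition augmentations described above (random crops of the 76&#215;76 visual observation to 64&#215;64, and random rotations applied consistently to the full observation) can be sketched as follows. This is our illustrative code, not the paper's; rotations are restricted here to exact quarter-turns so the image rotation needs no interpolation, whereas the paper's D 8 augmentation would also involve 45-degree rotations and reflections:

```python
import numpy as np

def random_crop(img, out=64, rng=None):
    """Randomly crop a CHW image (e.g. 4x76x76 -> 4x64x64)."""
    if rng is None:
        rng = np.random.default_rng()
    _, H, W = img.shape
    i = rng.integers(0, H - out + 1)
    j = rng.integers(0, W - out + 1)
    return img[:, i:i + out, j:j + out]

def random_rotation(img, f_xy, m_xy, e_xy, dp_xy, rng=None):
    """Apply one random quarter-turn to the whole transition: the image is
    rotated pixel-wise while every planar vector (force, moment, position,
    action displacement) is rotated by the matching 2x2 matrix, so the
    invariance R(s, a) = R(rho_s(g)s, rho_a(g)a) is preserved."""
    if rng is None:
        rng = np.random.default_rng()
    k = int(rng.integers(0, 4))
    R = np.linalg.matrix_power(np.array([[0., -1.], [1., 0.]]), k)
    img = np.rot90(img, k, axes=(1, 2))
    return img, R @ f_xy, R @ m_xy, R @ e_xy, R @ dp_xy

rng = np.random.default_rng(4)
obs = rng.standard_normal((4, 76, 76))
crop = random_crop(obs, rng=rng)
assert crop.shape == (4, 64, 64)
```

The key point is that the rotation is applied to image and vector channels with the same group element; augmenting them independently would break the invariance the symmetric model relies on.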
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Sensor Modality Ablation</head><p>Although it is intuitive that force data should help learn better policies on manipulation tasks, especially on contact-rich tasks like peg insertion, it is important to validate this assumption and to measure the benefits that can be gained by using both vision and force feedback rather than vision alone. Recall that our state representation can be factored into three modalities, s = (I, f, e), where I is an image (vision), f is force, and e is the configuration of the robot hand (proprioception). Here, we compare the performance of SVFL with all three modalities against a vision-only model, a vision/force model, and a vision/proprioception model on the same tasks as in Section V-A. Results are shown in Fig. <ref type="figure">5</ref>. The results indicate that the inclusion of each additional sensor modality improves sample efficiency and performance for policy learning, with all three sensor modalities performing best in most cases. However, notice that the degree to which force (and proprioceptive) data helps depends upon the task. For example, the addition of force feedback drastically improves performance in Peg Insertion but has almost no effect in Block Pulling. There are, however, many tasks between these extremes. In Drawer Opening and Block Picking the force-aware policy converges to a slightly higher success rate than the non-force-assisted policies. The fact that force feedback is usually helpful, even in tasks where one might not expect it, is interesting. This suggests that there is real value in incorporating force feedback into a robotic learning pipeline, even when there is a non-trivial cost to doing so.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Role of Force Feedback When Visual Acuity is Degraded</head><p>We also perform experiments in the context of degraded visual acuity to determine what happens if the visual input to our model is scaled down significantly. Specifically, we evaluate the model on RGB-D images rescaled (bilinear interpolation) to four different sizes: 64 &#215; 64, 32 &#215; 32, 16 &#215; 16, and 8 &#215; 8. Aside from the rescaling, all other aspects of the model match the SVFL method detailed in the previous section. This experiment gives an indication of how force data can compensate for low-resolution cameras, cloudy environments, or smudged camera lenses. Fig. <ref type="figure">6</ref> shows performance at convergence at the four different levels of visual resolution. We note several interesting observations. First, the importance of visual acuity is dependent on the task, e.g., high visual acuity is very important for Block Picking but not very important for Block Pulling. Second, force information generally tends to help the most in low visual acuity scenarios. Finally, while force data generally improves performance, it cannot compensate for the loss of information in extreme visual degradation in tasks which require high visual acuity.</p></div>
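The bilinear rescaling used to emulate degraded acuity can be sketched in a few lines of numpy (our code; the actual experiments may well use a library resampler such as the one in PyTorch or PIL):

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear interpolation of a CHW image to (out_h, out_w), used to
    emulate degraded visual acuity (e.g. 4x64x64 -> 4x8x8)."""
    C, H, W = img.shape
    ys = np.linspace(0, H - 1, out_h)   # sample positions in the source grid
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[None, :, None]       # fractional weights, broadcast CHW
    wx = (xs - x0)[None, None, :]
    top = img[:, y0][:, :, x0] * (1 - wx) + img[:, y0][:, :, x1] * wx
    bot = img[:, y1][:, :, x0] * (1 - wx) + img[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

rng = np.random.default_rng(5)
obs = rng.standard_normal((4, 64, 64))
low = bilinear_resize(obs, 8, 8)
assert low.shape == (4, 8, 8)
```

Note that the depth channel is rescaled together with RGB, so the 8&#215;8 condition degrades both appearance and geometry cues simultaneously.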
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Real-World On-Robot Policy Learning</head><p>We repeat the simulated Block Picking policy learning experiment from Section V-B in the real world to evaluate our method's performance. Fig. <ref type="figure">7</ref> shows the experimental setup, which includes a UR5e robotic arm, a Robotiq gripper, a wrist-mounted force-torque sensor, and an Intel RealSense camera. The block is a 5cm wooden cube that is randomly posed in the workspace. We utilize AprilTags to track the block for use in reward/termination checking and to automatically reset the workspace by moving the block to a new pose at the start of each episode. These tags are not utilized during policy learning.</p><p>In order to facilitate faster learning, we modify a number of environmental parameters in our real-world setup. We use a workspace size of 0.3m&#215;0.3m&#215;0.26m and a sparse reward function. We increase the number of expert demonstrations to 100 (from 50) and reduce the maximum number of steps per episode to 25 (from 50). Additionally, we reduce the action space by removing control of the gripper orientation and increase the maximum amount of movement the policy can take in one step to 5cm (from 2.5cm). We utilize the same model architecture as in Section V-A.</p><p>Fig. <ref type="figure">7</ref> shows the learning curve of the full SVFL model alongside the various subsets of data modalities available to our method. We train all models for 3000 steps, which takes around 4 hours. As in the simulation results, the full SVFL model is both more sample efficient than, and outperforms, the SVFL modality subsets. Additionally, we see that force sensing is a vital component in this setting, with the force-aware models achieving a 90% success rate compared to the 60% success rate of the non-force-aware models (at 3000 training steps).  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Real-World Block Centering</head><p>To further examine the advantages of SVFL, we conduct a supervised learning experiment using the same experimental setup as in Fig. <ref type="figure">7</ref>. Here we learn a function, h : (I, f ) &#8594; {0, 1} 4 , that maps visual-force observations to a four-way classification denoting the direction in which the gripper would need to move in order to grasp the block after a finger collides with the block. The idea is to mimic the most common failure case we see during policy learning in block picking, where the grasp is slightly offset from the block. The dataset is generated by a human teleoperator, where each sample is the most recent sensor observation immediately following the collision. The goal of the teleoperator is to mimic a failed grasp in which one finger comes into contact with the block. We generate 200 data samples and split the dataset into 100 training samples and 100 testing samples. We generate a diverse set of interactions by varying the position of the gripper in relation to the block, the amount of force (by varying the amount of movement when coming into contact with the block), and the pose of the block.</p><p>We compare the classification accuracy of the baseline SVFL model against the non-symmetric version of the model. We examine the effect of three different types of input: Vision Only (V), Force Only (F), and Vision &amp; Force (V+F). In each case, in order to measure the models' ability to generalize, we evaluate performance on training sets of differing sizes: 10, 25, 50, and 100 samples. Table <ref type="table">II</ref> shows the accuracy of the models on the held-out test dataset. Notice that in all cases the symmetric model does much better than its non-symmetric counterpart, across both training set sizes and input types.</p><p>TABLE II: Experiment on robotic hardware. Prediction accuracy (%) on the test set for models trained with different amounts of training data. We compare the performance of equivariant and non-symmetric versions of the vision encoder (V), the force encoder (F), and the fusion of these two encoders (V+F). Mean and standard error are given over three runs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. DISCUSSION &amp; LIMITATIONS</head><p>This paper proposes Symmetric Visual Force Learning (SVFL), an approach to policy learning with visual force feedback that incorporates SE(2) symmetries into the model. Our experiments demonstrate that SVFL outperforms two recent high-profile baselines, PoE <ref type="bibr">[10]</ref> and VTT <ref type="bibr">[9]</ref>, by a significant margin, both in terms of learning speed and final performance. We also report a couple of interesting empirical findings. First, we find that force feedback is helpful across a wide variety of policy learning scenarios, not just those where one would expect force feedback to help, e.g., Peg Insertion. Second, we find that the positive effect of incorporating force feedback increases as visual acuity decreases. A limitation of this work is that although we expect our framework to be extensible to haptic feedback, this paper focuses on force feedback only. Additionally, we constrain our problem to top-down manipulation and planar symmetries in SE(2), and there is therefore significant scope to extend this work to SE(3) symmetries. Finally, this paper focuses primarily on RL, but the encoder architectures should be widely applicable to other learning techniques such as imitation learning.</p></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>Although we omit the positional encoding here, this does not affect the result <ref type="bibr">[35]</ref>.</p></note>
		</body>
		</text>
</TEI>
