<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Match Policy: A Simple Pipeline from Point Cloud Registration to Manipulation Policies</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>05/19/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10652219</idno>
					<idno type="doi">10.1109/ICRA55743.2025.11128725</idno>
					
					<author>Haojie Huang</author><author>Haotian Liu</author><author>Dian Wang</author><author>Robin Walters</author><author>Robert Platt</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Many manipulation tasks require the robot to rearrange objects relative to one another. Such tasks can be described as a sequence of relative poses between parts of a set of rigid bodies. In this work, we propose MATCH POLICY, a simple but novel pipeline for solving high-precision pick and place tasks. Instead of predicting actions directly, our method registers the pick and place targets to the stored demonstrations. This transfers action inference into a point cloud registration task and enables us to realize nontrivial manipulation policies without any training. MATCH POLICY is designed to solve high-precision tasks with a key-frame setting. By leveraging the geometric interaction and the symmetries of the task, it achieves extremely high sample efficiency and generalizability to unseen configurations. We demonstrate its state-of-the-art performance across various tasks on the RLBench benchmark compared with several strong baselines and test it on a real robot with six tasks. Videos and code are available on https://haojhuang.github.io/match_page/.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Many complex manipulation tasks can be decomposed into a sequence of pick-place actions, each of which can further be interpreted as inferring two geometric relationships: the pick pose is the relative pose between the gripper and the pick target, and the place pose is the relative pose between the pick target and the place target. Previous imitation learning methods <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref> directly predicted the pick-place actions given the entire observation signal after being trained with a copious amount of demonstrations. However, these methods did not highlight the importance of local geometric relationships and thus struggled to learn high-precision manipulation policies such as those required to solve the Plug-Charger and Insert-Knife tasks in RLBench <ref type="bibr">[4]</ref>. Meanwhile, recent works <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref> leverage segmented point clouds to reason about the geometric interaction between object instances. However, they often require a significant amount of effort before being applied to real robots. Methods like NDFs <ref type="bibr">[5]</ref> and its variation <ref type="bibr">[9]</ref> require significant per-object pretraining and thus cannot simply be reused on different object sets. 
Methods like Tax-Pose <ref type="bibr">[8]</ref> and RPDiff <ref type="bibr">[7]</ref> can only predict a single-step, single-task action after hours of training, which dramatically limits their potential on multi-step and long-horizon tasks.</p><p>To address the constraints of current methods and provide a convenient tool for robotic pick-place policies that requires minimal effort to deploy across different tasks, we propose MATCH POLICY, a simple pipeline that transfers manipulation policy learning into point cloud registration (PCR).</p><p>MATCH POLICY constructs a combined point cloud of the desired scene using segmented point clouds, where objects are arranged in the expected configuration. As illustrated in Fig <ref type="figure">1</ref>, we store a collection of combined point clouds from the demonstration data. During inference, the point clouds of the pick and place objects are registered to these stored point clouds, and the resulting registration poses are used to compute the action. Unlike prior works that require heavy training, we realize this pipeline with optimization-based methods: MATCH POLICY makes use of RANSAC and ICP and produces the pick-place policy immediately after demonstration collection.</p><p>Our proposed method has several key advantages. First, the PCR step matches the local geometric details shown in the demonstration to the new observation, enabling the agent to solve high-precision tasks like Plug-Charger and Insert-Knife. Second, MATCH POLICY exhibits great sample efficiency, i.e., the ability to learn good policies with relatively few expert demonstrations. We demonstrate that it can achieve compelling performance with only one demonstration and can generalize to many different novel poses in various experiments. 
Finally, MATCH POLICY shows high adaptability when tested with different camera settings, e.g., a single camera view and low-resolution cameras, as well as on tasks with long horizons and articulated objects.</p><p>Our contributions in this work are as follows. 1) We provide a simple yet novel pipeline that realizes a manipulation pick-place policy without any training. 2) We show the precision and sample-efficiency benefits of this method. 3) We demonstrate that it achieves compelling performance in both simulated and real-robot experiments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>Point Cloud Registration. Point cloud registration (PCR) is the problem of finding the best transformation that aligns two point clouds <ref type="bibr">[10]</ref>. Current methods can be divided into non-learning optimization methods and deep learning-based techniques <ref type="bibr">[11]</ref>. The non-learning methods include two representative families, Iterative Closest Point (ICP) <ref type="bibr">[12]</ref> and RANSAC <ref type="bibr">[13]</ref>, as well as their variations. ICP <ref type="bibr">[12]</ref> and its variations <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref> usually require an initial guess; they search for the closest corresponding points and estimate the transformation until convergence. RANSAC-based methods <ref type="bibr">[18]</ref>, <ref type="bibr">[13]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref> can be interpreted as outlier-detection methods and have also demonstrated effective registration results. These non-learning methods are plug-and-play for any objects, although they often require sufficient overlap to guarantee successful registration <ref type="bibr">[21]</ref>. Current research also focuses on deep learning models that learn local and global geometric representations to calculate correspondences <ref type="bibr">[22]</ref>. Deep Closest Point (DCP) utilizes DGCNN <ref type="bibr">[23]</ref> to embed local features and a pointer network to calculate the correspondence <ref type="bibr">[24]</ref>. PRNet <ref type="bibr">[25]</ref> introduced keypoint recognition to address partial-to-partial point cloud registration. Recently, methods like Predator <ref type="bibr">[26]</ref> and PEAL <ref type="bibr">[27]</ref> introduced attention blocks to locate the overlap region and produce correspondences. 
In this work, we investigate non-learning optimization methods on robotic pick-place data.</p><p>Manipulation Learning on Point Cloud. As a flexible and informative representation of the robot's operating environment, point clouds have demonstrated superior effectiveness compared to other visual formats, such as RGB-D images <ref type="bibr">[28]</ref>, <ref type="bibr">[29]</ref>, <ref type="bibr">[30]</ref>. Recent works have broadly applied this rich representation across various robotic manipulation problems, including reinforcement learning <ref type="bibr">[31]</ref>, <ref type="bibr">[32]</ref>, <ref type="bibr">[33]</ref>, closed-loop policy learning <ref type="bibr">[28]</ref>, <ref type="bibr">[34]</ref>, <ref type="bibr">[35]</ref>, <ref type="bibr">[36]</ref>, key-point based methods <ref type="bibr">[37]</ref>, <ref type="bibr">[38]</ref>, <ref type="bibr">[3]</ref>, and robotic pick-place <ref type="bibr">[5]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[39]</ref>, <ref type="bibr">[40]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[41]</ref>. However, a major challenge in utilizing previous policy learning frameworks is their computational complexity and the significant effort required to adapt them to new tasks. In contrast, our method provides a simple and convenient solution for realizing a manipulation pick-place policy without parameterization or training. It can be deployed immediately after demonstration collection.</p><p>Manipulation Learning with Sample Efficiency. Robotic tasks defined in 3D Euclidean space are invariant to translations, rotations, and reflections, which redefine the coordinate frame but do not otherwise alter the task. 
Recent advancements in equivariant modeling <ref type="bibr">[42]</ref>, <ref type="bibr">[43]</ref>, <ref type="bibr">[44]</ref>, <ref type="bibr">[45]</ref>, <ref type="bibr">[46]</ref> provide a powerful tool to encode symmetries in robotics and other related fields <ref type="bibr">[47]</ref>, <ref type="bibr">[48]</ref>, <ref type="bibr">[49]</ref>. <ref type="bibr">[50]</ref>, <ref type="bibr">[51]</ref>, <ref type="bibr">[52]</ref>, <ref type="bibr">[53]</ref> used equivariant models to leverage pick symmetries for grasp learning. <ref type="bibr">[54]</ref> proposed an equivariant policy for deformable and articulated object manipulation on top of a pre-trained equivariant visual representation. <ref type="bibr">[55]</ref>, <ref type="bibr">[56]</ref> used equivariant maps to enable efficient planning. Other works <ref type="bibr">[57]</ref>, <ref type="bibr">[5]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[58]</ref>, <ref type="bibr">[59]</ref>, <ref type="bibr">[60]</ref>, <ref type="bibr">[39]</ref>, <ref type="bibr">[40]</ref>, <ref type="bibr">[8]</ref>, <ref type="bibr">[41]</ref> leverage symmetries in pick and place and achieve high sample efficiency. <ref type="bibr">[61]</ref>, <ref type="bibr">[62]</ref> explore symmetry under language-conditioned policies and realize few-shot learning with language-steerable kernels. Recently, <ref type="bibr">[63]</ref>, <ref type="bibr">[64]</ref>, <ref type="bibr">[65]</ref>, <ref type="bibr">[66]</ref>, <ref type="bibr">[67]</ref>, <ref type="bibr">[68]</ref>, <ref type="bibr">[69]</ref>, <ref type="bibr">[70]</ref>, <ref type="bibr">[54]</ref>, <ref type="bibr">[71]</ref> realized equivariant closed-loop policies and demonstrated better generalization performance with fewer demonstrations. Compared with previous work, our method takes advantage of point cloud registration to realize an equivariant policy and shows an improvement in sample efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROBLEM STATEMENT</head><p>Consider a set of expert demonstrations {D_i}_{i=1}^{n}, where each demonstration D_i consists of a sequence of picks and places. We represent each pick or place sample with object-centric point clouds and their transformations, of the form (P_a, P_b, T_a, T_b, ℓ), where P_a ∈ ℝ^{n×3} and P_b ∈ ℝ^{m×3} are point clouds representing the two objects of interest, T_a ∈ ℝ^{4×4} and T_b ∈ ℝ^{4×4} are two rigid transformations in SE(3), represented in homogeneous coordinates, that transform P_a and P_b into the desired configuration, and ℓ is the language description of the action and objects. In our setting, if ℓ describes a pick action, (P_a, P_b) represent the gripper and the pick target; if it indicates a preplace/place action, (P_a, P_b) indicate the placement and the object to arrange, respectively. Our goal is to model the policy function f : (P_a, P_b, ℓ) → a, which outputs the gripper movement a ∈ SE(3) and generalizes to newly observed point clouds in different configurations. The policy generates multi-step pick-place actions in an open-loop manner, and each single-step action is parameterized as (a_pick, a_preplace, a_place)<ref type="foot">foot_0</ref>.</p></div>
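Concretely, the SE(3) elements above act on point clouds via 4×4 homogeneous matrices. A minimal numpy sketch of this representation (ours, not from the paper's code; function names are illustrative):

```python
import numpy as np

def make_se3(rotation: np.ndarray, translation: np.ndarray) -> np.ndarray:
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 SE(3) matrix."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def transform_points(T: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Apply T to an (n, 3) point cloud in homogeneous coordinates."""
    homo = np.hstack([points, np.ones((len(points), 1))])  # (n, 4)
    return (homo @ T.T)[:, :3]

# 90-degree rotation about z, followed by a translation of (1, 2, 3)
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
T_a = make_se3(Rz, np.array([1.0, 2.0, 3.0]))
moved = transform_points(T_a, np.array([[1.0, 0.0, 0.0]]))  # -> [[1., 3., 3.]]
```

Composition of such matrices by multiplication is exactly how the relative-pose actions in Section IV are computed.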
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. METHOD</head><p>We first explain the procedure (Fig <ref type="figure">1</ref>) of MATCH POLICY, which takes segmented point clouds as input and outputs the key-frame actions (a_pick, a_preplace, a_place). We denote P̂_a and P̂_b as the observed point clouds during inference, to distinguish them from the demonstrated point clouds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Procedure of MATCH POLICY</head><p>Storing Combined Point Clouds P_ab. We first construct the combined point cloud P_ab from the demonstration sample (P_a, P_b, T_a, T_b, ℓ) by</p><p>P_ab = (T_a • P_a) ∪ (T_b • P_b),</p><p>where • transforms the two segmented point clouds with T_a and T_b to the desired configuration, and ∪ concatenates the two transformed point clouds. In other words, P_ab represents either the desired pick configuration or the desired preplace/place configuration, as shown in Fig <ref type="figure">1</ref>. Compared to using the entire scene's point cloud, this approach reduces occlusion and filters out irrelevant information. Each P_ab is described by the language description ℓ. Taking the Phone-on-Base task shown in Fig <ref type="figure">1</ref> and <ref type="figure">2</ref> as an example, there are three P_ab denoted by three descriptions: "pick up the phone", "preplace the phone above the base" and "place the phone on the base". We store each pair (ℓ, P_ab) as a key-value element for every demonstration, resulting in a single dictionary across all tasks and demonstrations.</p><p>Registering P̂_a and P̂_b to P_ab. During inference, we first extract the point clouds of the objects of interest from the observation. After retrieving P_ab with the language description ℓ as the key, our registration model f_r : (P̂_a, P̂_b, P_ab) → (T̂_a, T̂_b) outputs the poses that match P̂_a and P̂_b to the combined point cloud P_ab. We realize f_r with optimization-based registration methods: specifically, we begin by applying RANSAC <ref type="bibr">[20]</ref> to obtain an initial alignment, followed by colored ICP <ref type="bibr">[17]</ref> for iterative refinement.</p><p>Apart from inferring the registration poses, we calculate the fitness score S = (# of inlier correspondences) / (# of points in the source cloud) to measure the registration quality. 
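The combined-cloud construction and the fitness score above can be sketched in a few lines of numpy (a hedged illustration; the function names are ours, not the paper's codebase):

```python
import numpy as np

def combine_clouds(P_a, P_b, T_a, T_b):
    """P_ab = (T_a • P_a) ∪ (T_b • P_b): transform each segmented cloud into
    the demonstrated configuration, then concatenate."""
    def apply(T, P):
        return P @ T[:3, :3].T + T[:3, 3]
    return np.vstack([apply(T_a, P_a), apply(T_b, P_b)])

def fitness(num_inliers: int, num_source_points: int) -> float:
    """S = (# inlier correspondences) / (# points in the source cloud)."""
    return num_inliers / num_source_points

# Toy demo: a 2-point "gripper" kept in place and a 3-point "phone" lifted 5 cm
P_a = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.1]])
P_b = np.array([[0.2, 0.0, 0.0], [0.2, 0.1, 0.0], [0.3, 0.0, 0.0]])
T_a = np.eye(4)
T_b = np.eye(4)
T_b[:3, 3] = [0.0, 0.0, 0.05]
P_ab = combine_clouds(P_a, P_b, T_a, T_b)  # shape (5, 3)
```

In the paper's actual pipeline the registration itself is done with RANSAC and colored ICP; this sketch only shows how the target cloud they register against is assembled.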
We run the registration model f_r multiple times with random seeds on every sample that matches the key and calculate the fitness score for each run; multiple runs are needed because RANSAC is a stochastic algorithm. This yields a set of registration results {(T̂_a, T̂_b, S_a, S_b)}, from which we select the best pair of registration poses by the highest average fitness score over S_a and S_b.</p><p>Calculating a_pick, a_preplace and a_place. After estimating the registration poses (T̂_a, T̂_b) for pick, preplace and place with the language key ℓ respectively, we calculate the pick action as the relative pose that brings the gripper to the current pick target P̂_b, i.e., a_pick = (T̂_b)^{-1} T̂_a. The preplace and place actions are determined by moving the pick target P̂_b, while keeping the placement P̂_a stationary, to match the desired configuration, i.e., a_place = (T̂_a)^{-1} T̂_b. Finally, our method outputs (a_pick, a_preplace, a_place), which can be used to control the robot arm. This process can be repeated to infer a sequence of key-frame actions to solve complex tasks.</p></div>
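The action computation and best-run selection described above amount to a few matrix operations; a minimal numpy sketch (names are ours, for illustration only):

```python
import numpy as np

def pick_action(T_a_hat: np.ndarray, T_b_hat: np.ndarray) -> np.ndarray:
    """a_pick = (T_b)^-1 T_a: pose of the gripper relative to the pick target."""
    return np.linalg.inv(T_b_hat) @ T_a_hat

def place_action(T_a_hat: np.ndarray, T_b_hat: np.ndarray) -> np.ndarray:
    """a_place = (T_a)^-1 T_b: move the grasped object; placement stays fixed."""
    return np.linalg.inv(T_a_hat) @ T_b_hat

def select_best(runs: list) -> dict:
    """Keep the registration run with the highest mean fitness over (S_a, S_b)."""
    return max(runs, key=lambda r: 0.5 * (r["S_a"] + r["S_b"]))

# Toy example: the pick target is registered 1 m along x from the gripper pose
T_a = np.eye(4)
T_b = np.eye(4)
T_b[:3, 3] = [1.0, 0.0, 0.0]
a_pick = pick_action(T_a, T_b)  # translation is [-1, 0, 0] in the target frame
runs = [{"S_a": 0.9, "S_b": 0.5}, {"S_a": 0.8, "S_b": 0.8}]
best = select_best(runs)        # second run wins: mean fitness 0.8 > 0.7
```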
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Sample Efficiency Analysis on MATCH POLICY</head><p>We now analyze our method through the lens of equivariance, under a mild assumption. Since RANSAC is a stochastic voting scheme, computing a greater number of iterations increases the probability that an optimal registration is produced, especially when the overlapping area is above 50%. We can therefore assume our registration model f_r is optimal after enough runs:</p><p>Assumption 1: f_r : (P̂_a, P̂_b, P_ab) → (T̂_a, T̂_b) is optimal. For all g ∈ SE(3), f_r has the following properties: (a) f_r(P̂_a, P̂_b, g • P_ab) = (g T̂_a, g T̂_b); (b) f_r(g • P̂_a, P̂_b, P_ab) = (T̂_a g^{-1}, T̂_b); (c) f_r(P̂_a, g • P̂_b, P_ab) = (T̂_a, T̂_b g^{-1}).</p><p>With Assumption 1, we conclude three equivariance properties of MATCH POLICY that improve its sample efficiency. In the following, we only include a_pick and a_place to reduce redundancy, and use f_pick and f_place to represent the pick and place predictors of the policy function f.</p><p>Invariant Symmetry. We first show that MATCH POLICY generates invariant predictions of (a_pick, a_place) when the demonstration point cloud P_ab transforms.</p><p>Proposition 1: a_pick and a_place are invariant to a transformation g ∈ SE(3) acting on P_ab.</p><p>Proof: By Assumption 1a, if P_ab is transformed by g ∈ SE(3), the calculated registration poses become g T̂_a and g T̂_b. The new pick action is a'_pick = (g T̂_b)^{-1} (g T̂_a) = (T̂_b)^{-1} g^{-1} g T̂_a = a_pick. Similarly, the new place action a'_place = a_place. Proposition 1 states that demonstrations whose combined point clouds P_ab differ only by a transformation in the group yield the same action prediction from a single demonstration. It enables our method to achieve good performance with very few demonstrations.</p><p>Bi-equivariant Place Symmetry. 
As noted in previous work <ref type="bibr">[6]</ref>, <ref type="bibr">[60]</ref>, <ref type="bibr">[58]</ref>, <ref type="bibr">[39]</ref>, <ref type="bibr">[40]</ref>, the relative place actions that rearrange an object B relative to another object A are bi-equivariant. That is, independent transformations of object A by g_a ∈ SE(3) and of object B by g_b ∈ SE(3) result in a transformed action (a'_place = g_a a_place g_b^{-1}) that completes the rearrangement in the new configuration. Leveraging bi-equivariant symmetries generalizes the stored place knowledge to different configurations and improves sample efficiency.</p><p>Proposition 2: The place action inference of MATCH POLICY is bi-equivariant: f_place(g_a • P_a, g_b • P_b) = g_a f_place(P_a, P_b) g_b^{-1}.</p><p>Proof: If P_a and P_b are transformed by g_a and g_b respectively, the calculated registration poses become T̂_a g_a^{-1} and T̂_b g_b^{-1} (Assumption 1bc). The new place action is a'_place = (T̂_a g_a^{-1})^{-1} (T̂_b g_b^{-1}) = g_a (T̂_a)^{-1} T̂_b g_b^{-1} = g_a a_place g_b^{-1}, which satisfies bi-equivariance. The bi-equivariant design of our method significantly improves sample efficiency: it enables our model to operate effectively on a lower-dimensional space, the equivalence classes of samples under the SE(3) × SE(3) group.</p><p>Equivariant Pick Symmetry. Lastly, we show that the pick action inference is also equivariant to transformations of the pick target.</p><p>Proposition 3: f_pick(P_a, g_b • P_b) = g_b f_pick(P_a, P_b).</p><p>Proof: By Assumption 1c, if the pick target is transformed by g_b, the output registration pose becomes T̂_b g_b^{-1}, and the new pick action is a'_pick = (T̂_b g_b^{-1})^{-1} T̂_a = g_b (T̂_b)^{-1} T̂_a = g_b a_pick, which realizes the pick equivariance.</p><p>Proposition 3 states that if the pick target transforms, the pick action transforms accordingly. The equivariant pick symmetry enables our method to generalize the stored pick knowledge to many different poses with few demonstrations. </p></div>
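All three symmetry properties follow from the relative-pose formulas a_pick = (T̂_b)^{-1} T̂_a and a_place = (T̂_a)^{-1} T̂_b and can be checked numerically; a small numpy verification (ours):

```python
import numpy as np

def random_se3(rng):
    """Random SE(3) element: QR-based rotation (det fixed to +1) plus translation."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    T = np.eye(4)
    T[:3, :3] = q
    T[:3, 3] = rng.standard_normal(3)
    return T

rng = np.random.default_rng(0)
Ta, Tb = random_se3(rng), random_se3(rng)  # registration poses for P_a, P_b
g, ga, gb = random_se3(rng), random_se3(rng), random_se3(rng)
inv = np.linalg.inv

a_pick, a_place = inv(Tb) @ Ta, inv(Ta) @ Tb

# Proposition 1: transforming P_ab by g (poses become g Ta, g Tb) changes nothing
assert np.allclose(inv(g @ Tb) @ (g @ Ta), a_pick)

# Proposition 2: transforming P_a by ga and P_b by gb (poses become Ta ga^-1,
# Tb gb^-1) conjugates the place action: a'_place = ga a_place gb^-1
assert np.allclose(inv(Ta @ inv(ga)) @ (Tb @ inv(gb)), ga @ a_place @ inv(gb))

# Proposition 3: transforming the pick target by gb left-multiplies a_pick
assert np.allclose(inv(Tb @ inv(gb)) @ Ta, gb @ a_pick)
```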
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row role="label"><cell>Model</cell><cell># demos</cell><cell>phone-on-base</cell><cell>stack-wine</cell><cell>put-plate</cell><cell>slide-roll</cell><cell>plug-charger</cell><cell>insert-knife</cell></row><row><cell>MATCH POLICY (ours)</cell><cell>1</cell><cell>93.33 (± 4.62)</cell><cell>74.67 (± 2.31)</cell><cell>10.67 (± 2.31)</cell><cell>0</cell><cell>34.67 (± 16.65)</cell><cell>44.00 (± 8.00)</cell></row><row><cell>Imagination Policy [6]</cell><cell>1</cell><cell>4.00 (± 4.52)</cell><cell>2.67 (± 2.61)</cell><cell>1.33 (± 2.61)</cell><cell>2.78 (± 2.72)</cell><cell>0</cell><cell>0</cell></row><row><cell>MATCH POLICY (ours)</cell><cell>10</cell><cell>100 (± 0.00)</cell><cell>98.67 (± 2.31)</cell><cell>13.33 (± 6.11)</cell><cell>7.24 (± 9.05)</cell><cell>40.00 (± 4.00)</cell><cell>61.33 (± 2.31)</cell></row><row><cell>Imagination Policy [6]</cell><cell>10</cell><cell>90.67 (± 2.61)</cell><cell>97.33 (± 2.61)</cell><cell>34.67 (± 10.45)</cell><cell>23.61 (± 5.44)</cell><cell>26.67 (± 13.82)</cell><cell>42.67 (± 9.42)</cell></row><row><cell>RPDiff [7]</cell><cell>10</cell><cell>62.67 (± 5.22)</cell><cell>32.00 (± 4.52)</cell><cell>5.33 (± 5.22)</cell><cell>0</cell><cell>0</cell><cell>2.67 (± 2.61)</cell></row><row><cell>RVT [2]</cell><cell>10</cell><cell>56.00 (± 4.52)</cell><cell>18.67 (± 2.61)</cell><cell>53.33 (± 6.91)</cell><cell>0</cell><cell>0</cell><cell>8.00 (± 4.52)</cell></row><row><cell>PerAct [1]</cell><cell>10</cell><cell>66.67 (± 11.39)</cell><cell>5.33 (± 2.62)</cell><cell>12.00 (± 4.52)</cell><cell>0</cell><cell>0</cell><cell>0</cell></row><row><cell>3D Diffuser Actor [3]</cell><cell>10</cell><cell>29.33 (± 5.22)</cell><cell>26.67 (± 14.55)</cell><cell>12.00 (± 0.00)</cell><cell>0</cell><cell>0</cell><cell>0</cell></row><row><cell>Key-Frame Expert</cell><cell>-</cell><cell>100</cell><cell>100</cell><cell>74.6</cell><cell>56</cell><cell>72</cell><cell>90.6</cell></row></table><p>TABLE I. Performance comparisons on the RLBench benchmark. Success rate (%) on 25 tests when using 1 or 10 demonstration episodes. Results are averaged over 3 runs.</p><p>The above analyses provide a theoretical explanation of why MATCH POLICY can be sample efficient. Generally, real-robot data is noisy, and a transformation of the objects will not result in exactly the same transformation of the observed point clouds due to occlusion and distortion. Nonetheless, the PCR method remains useful for identifying correspondences between key geometric features. We further evaluate our proposed method across different settings and a variety of experiments, providing detailed analyses of the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. IMPLEMENTATION</head><p>In our implementation, we first collect robot data in the traditional form (o_t, a_t), where o_t is the observation captured by one or multiple RGB-D cameras and a_t = (pick, preplace, place) is defined in the world frame. We then segment P_a and P_b using masks. The masks can either be ground-truth masks from a simulator or computed using current segmentation methods <ref type="bibr">[72]</ref>, <ref type="bibr">[73]</ref>, which are beyond the scope of this work. The gripper point cloud is directly sampled from the gripper mesh file at its canonical pose. For the pick, we store T_a as the pick pose and T_b as an identity matrix. For the place, T_a is an identity matrix, while T_b is calculated as the relative pose between the pick and place poses. After constructing P_ab, we downsample it by voxelizing the input with a 4 mm voxel size. Since dictionary lookups have an average time complexity of O(1), we store all the combined point clouds from various tasks in a single dictionary. We also store the color information for each point cloud. RANSAC <ref type="bibr">[20]</ref> and colored ICP <ref type="bibr">[17]</ref> are implemented using Open3D <ref type="bibr">[74]</ref>.</p></div>
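A hedged sketch of the 4 mm voxelization and the language-keyed dictionary described above (ours; the paper uses Open3D's own voxel downsampling, and this numpy stand-in keeps one representative point per occupied voxel rather than averaging):

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel: float = 0.004) -> np.ndarray:
    """Keep one representative point per occupied voxel (4 mm by default)."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(idx)]

# One flat dictionary across all tasks and demos: language key -> list of P_ab
demo_bank: dict = {}

def store(language_key: str, P_ab: np.ndarray) -> None:
    demo_bank.setdefault(language_key, []).append(voxel_downsample(P_ab))

def retrieve(language_key: str) -> list:
    return demo_bank[language_key]  # average O(1) dictionary lookup

pts = np.array([[0.000, 0.0, 0.0],
                [0.001, 0.0, 0.0],   # same 4 mm voxel as the first point
                [0.010, 0.0, 0.0]])  # different voxel
store("pick up the phone", pts)
```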
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. SIMULATIONS</head><p>We first test our proposed method on several compelling simulated tasks with a limited number of demonstrations. Our primary baseline is Imagination Policy <ref type="bibr">[6]</ref>, which is also a bi-equivariant model using segmentations. We adopt their simulated experimental settings and baselines to ensure a fair comparison.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Simulated Experiments</head><p>Task Description. We use the same six RLBench tasks <ref type="bibr">[4]</ref> as used by Imagination Policy <ref type="bibr">[6]</ref>. Phone-on-Base: The agent is asked to pick up the phone and place it onto the phone base correctly. Stack-Wine: This task includes grabbing the wine bottle and putting it on the wooden rack at one of three specified locations. Put-Plate: The agent must pick up the plate and insert it between the red spokes in the colored dish rack. The colors of the other spokes are randomly generated from the full set of 19 color instances. Slide-Roll: This task involves grasping the toilet roll and sliding it onto its stand; it is a high-precision task. Plug-Charger: The agent is asked to pick up the charger and plug it into the power supply on the wall. This also requires high precision. Insert-Knife: This task requires picking up the knife from the chopping board and sliding it into its slot in the knife block. The different 3D tasks are shown graphically in Fig <ref type="figure">2</ref>. During the test, object poses are randomly sampled at the beginning of each episode and the agent must generalize to novel poses.</p><p>Baselines. Our method is compared against five strong baselines: Imagination Policy <ref type="bibr">[6]</ref> is our primary baseline. It consumes the segmented P_a and P_b to generate the combined point cloud with a point flow model <ref type="bibr">[75]</ref> and uses SVD to calculate the registration poses. RPDiff <ref type="bibr">[7]</ref> consumes segmented P_a and P_b and denoises the relative pose iteratively. PerAct <ref type="bibr">[1]</ref> is a multi-task behavior cloning agent using a transformer to process voxel grids to learn a language-conditioned policy. RVT <ref type="bibr">[2]</ref> projects the 3D observation onto five orthographic images and uses the dense feature map of each image to generate 3D actions. 
3D Diffuser Actor <ref type="bibr">[3]</ref> is a variation of Diffusion Policy <ref type="bibr">[76]</ref> that denoises noisy actions conditioned on point cloud features. Key-Frame Expert: Since some tasks are very complex, we also report the performance of the expert agent that uses the key-frame actions extracted from the demonstration, to measure the effects of path planning. Other methods like NDFs <ref type="bibr">[5]</ref> and its variation <ref type="bibr">[9]</ref> are not included since they require per-object pretraining.</p><p>Settings and metrics. There are four cameras (front, right shoulder, left shoulder, hand) pointing toward the workspace. We use the ground-truth mask to segment P_a and P_b for RPDiff <ref type="bibr">[7]</ref>, Imagination Policy <ref type="bibr">[6]</ref> and our method. Since the environments we test on are relatively uncluttered, making it relatively easy to extract object masks from images or voxel maps, segmentation is not a performance bottleneck for RVT <ref type="bibr">[2]</ref>, which uses orthographic images, or PerAct <ref type="bibr">[1]</ref>, which uses voxel maps. All methods are evaluated on 25 unseen configurations, and each evaluation is averaged over 3 evaluation seeds.</p><p>Results. We report the success rate of each method in Table <ref type="table">I</ref> and draw several findings from it. (1) With 10 demos, MATCH POLICY outperforms all the baselines on 4 out of 6 tasks. (2) It also performs better with only 1 demo than all baselines trained with 10 demos on 3 out of 6 tasks. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Test on Different Camera Settings</head><table><row role="label"><cell>Camera setting</cell><cell># demos</cell><cell>phone-on-base</cell><cell>stack-wine</cell><cell>insert-knife</cell></row><row><cell>480 × 480 images</cell><cell>10</cell><cell>100</cell><cell>98.67</cell><cell>61.33</cell></row><row><cell>128 × 128 images</cell><cell>10</cell><cell>100</cell><cell>92.00</cell><cell>40.00</cell></row><row><cell>Single Front View</cell><cell>10</cell><cell>76.00</cell><cell>88.00</cell><cell>12.00</cell></row></table><p>TABLE II. Ablation study on camera settings. Success rate (%) on 25 tests.</p><p>The ability to adapt to various camera settings is a crucial aspect of a model's robustness and versatility. We test our method with three different camera settings: (1) four RGB-D cameras with high resolution; (2) four RGB-D cameras with low resolution; (3) a single front camera view with high resolution. Table <ref type="table">II</ref> reports the results on three tasks. It shows that low-resolution images slightly decrease performance and that multi-view images provide a more complete observation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Task with Articulated Object</head><p>Unlike rigid bodies, articulated objects consist of several movable parts linked together. We test our proposed method on Open-Microwave to illustrate its potential for articulated object manipulation. As shown in Fig <ref type="figure">3</ref>, the task requires the robot to grasp the microwave handle and open the door. We segment the two movable parts of the microwave and predict the relative pose between the gripper and the door and the relative pose between the door and the frame. The results are reported in Table <ref type="table">III</ref>.</p><p>TABLE III. Performance on Open-Microwave. Success rate (%) on 25 tests using 10 demonstrations.</p><p>Please note that complex articulated object manipulation can be implemented in the same way, by inferring the relative poses between links. The majority of failure cases occurred when the registration model inaccurately registered the door, leading to errors in the place action.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Task with Long Horizon</head><table><row role="label"><cell>Task</cell><cell># demos</cell><cell>MATCH POLICY</cell><cell>expert</cell></row><row><cell>put-item-in-drawer</cell><cell>10</cell><cell>96.00</cell><cell>96.00</cell></row></table><p>TABLE IV. Performance on Put-Item-in-Drawer. Success rate (%) on 25 tests using 10 demonstrations.</p><p>Many complex tasks can be decomposed into a sequence of pick-place actions. We test MATCH POLICY on a long-horizon task, Put-Item-in-Drawer. As shown in Fig <ref type="figure">4</ref>, the agent needs to first open the bottom drawer, then pick up the red block, and finally put it in the drawer. We address it by inferring two pick-place actions; the results are included in Table <ref type="table">IV</ref>. Our proposed method achieves a 96% success rate, the same as the oracle agent's performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. REAL-ROBOT EXPERIMENTS</head><p>Our proposed method can be efficiently deployed on real-robot tasks without any training, and we test it on 6 real-robot tasks by collecting only 10 demos for each task. To evaluate MATCH POLICY comprehensively, we assess its performance across different numbers of camera views and fitness thresholds.</p><p>Hardware Settings. The experiment is performed on a UR5 robot with a 48cm × 48cm × 48cm workspace. Three RealSense 455 cameras are mounted around the workspace. To collect the demos, we released the UR5 brakes and physically guided the arm. Execution on the robot requires a collision-free pick-and-place trajectory that connects the predicted actions. We use MoveIt with RRT-star as our motion planner to generate the trajectory.</p><table><row role="label"><cell>Task</cell><cell>pick 1: #success/#trials (%)</cell><cell>place 1: #success/#trials (%)</cell><cell>pick 2: #success/#trials (%)</cell><cell>place 2: #success/#trials (%)</cell><cell>completion rate (%)</cell></row><row><cell>Putting-Banana</cell><cell>14/15 (93.3)</cell><cell>14/14 (100)</cell><cell>-</cell><cell>-</cell><cell>93.3</cell></row><row><cell>Hanging-Mug</cell><cell>13/15 (87.0)</cell><cell>6/13 (46.2)</cell><cell>-</cell><cell>-</cell><cell>40.0</cell></row><row><cell>Inserting-Flower</cell><cell>6/15 (40.0)</cell><cell>4/6 (66.7)</cell><cell>-</cell><cell>-</cell><cell>27.0</cell></row><row><cell>Pouring-Ball</cell><cell>14/15 (93.3)</cell><cell>10/14 (71.4)</cell><cell>-</cell><cell>-</cell><cell>66.7</cell></row><row><cell>Packing-Shoes</cell><cell>15/15 (100)</cell><cell>15/15 (100)</cell><cell>15/15 (100)</cell><cell>15/15 (100)</cell><cell>100</cell></row><row><cell>Arranging-Letters</cell><cell>14/15 (93.3)</cell><cell>11/14 (78.6)</cell><cell>10/15 (66.7)</cell><cell>9/10 (90)</cell><cell>66.7</cell></row></table><p>TABLE V. Performance on real-robot experiments. Success rate (%) on 15 tests for each task using 10 demonstrations.</p><p>Tasks. As shown in Fig <ref type="figure">5</ref>, we test six different real-robot tasks. The action space of all the tasks is defined in SE(3). Putting-Banana: This task asks the agent to pick up the banana and put it on the plate. Hanging-Mug: The robot needs to pick up the mug and hang it on the peg of the mug holder. Inserting-Flower: This task includes picking up the flower and plugging it into the base. Pouring-Ball: The agent is asked to grasp the small blue cup and pour the white ball into the big green cup. Packing-Shoes: It is a multi-step task that includes picking up a pair of shoes and placing them on the shelf. 
Arranging-Letters: It is a multi-step task requiring the agent to arrange three letters with two pick-place actions.</p><p>Results. Objects are randomly placed in the workspace during testing, and we run 15 tests for each task. Our results are included in Table <ref type="table">V</ref>. We report the action success rate for each step as well as the task completion rate. For Putting-Banana, Packing-Shoes, and Arranging-Letters, we use a single camera view and set the fitness threshold to 0.7 to filter out low-confidence actions. For the other three tasks, we use three camera views without the fitness threshold. As shown in the first row of Fig <ref type="figure">5</ref>, real-sensor data is typically noisy due to occlusion and distortion. Nevertheless, MATCH POLICY achieves above a 90% success rate on Putting-Banana and Packing-Shoes with only 10 demos. We observe a performance drop on Hanging-Mug and Inserting-Flower, and we hypothesize that sensor noise is a key factor in this decline: it causes critical features like the hanger's hook and the mug's handle to disappear, which results in inaccurate registration poses and affects task performance. We provide the real-robot videos in the supplementary materials.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSION</head><p>In this work, we proposed MATCH POLICY, a simple yet effective pipeline that leverages point cloud registration to address manipulation pick-place tasks. Our goal is to provide a convenient and practical tool for pick-place policies that requires minimal effort to deploy across different tasks. MATCH POLICY demonstrates significant improvement in sample efficiency and strong performance across various tasks. It can also be applied to long-horizon tasks and articulated object manipulation. We also provide a theoretical analysis of its performance, focusing on symmetry properties. Finally, we validate our proposed method on a real robot through six tasks and different settings.</p><p>One limitation of this work is that it can only output an open-loop policy instead of a high-frequency closed-loop policy. We believe the pick-place policy nonetheless holds significant value for solving real-world challenges, particularly in industrial settings such as picking, packing, sorting, and stacking. Another limitation is that the current formulation requires segmenting the object of interest. Fortunately, a large number of SOTA segmentation methods <ref type="bibr">[73]</ref>, <ref type="bibr">[72]</ref> provide a convenient tool to address this. Lastly, this work assumes the object has been seen before and cannot generalize to novel objects. We believe current research on unseen object registration <ref type="bibr">[77]</ref>, <ref type="bibr">[78]</ref>, <ref type="bibr">[79]</ref>, <ref type="bibr">[22]</ref> can gradually address this issue, and we leave it as future work.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>It can also be referred to as a key-frame action. The preplace action is important for solving complex tasks, e.g., Inserting and Hanging, without any predefined prior actions.</p></note>
		</body>
		</text>
</TEI>
