<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Self-Supervised Unseen Object Instance Segmentation via Long-Term Robot Interaction</title></titleStmt>
			<publicationStmt>
				<publisher>Robotics: Science and Systems Foundation</publisher>
				<date>07/10/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10494269</idno>
					<idno type="doi">10.15607/RSS.2023.XIX.017</idno>
					
					<author>Yangxiao Lu</author><author>Ninad Khargonkar</author><author>Zesheng Xu</author><author>Charles Averill</author><author>Kamalesh Palanisamy</author><author>Kaiyu Hang</author><author>Yunhui Guo</author><author>Nicholas Ruozzi</author><author>Yu Xiang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce a novel robotic system for improving unseen object instance segmentation in the real world by leveraging long-term robot interaction with objects. Previous approaches either grasp or push an object and then obtain the segmentation mask of the grasped or pushed object after one action. Instead, our system defers the decision on segmenting objects until after a sequence of robot pushing actions. By applying multi-object tracking and video object segmentation on the images collected via robot pushing, our system can generate segmentation masks of all the objects in these images in a self-supervised way. These include images where objects are very close to each other, which is where existing object segmentation networks usually make errors. We demonstrate the usefulness of our system by fine-tuning segmentation networks trained on synthetic data with real-world data collected by our system. We show that, after fine-tuning, the segmentation accuracy of the networks is significantly improved both in the same domain and across different domains. In addition, we verify that the fine-tuned networks improve top-down robotic grasping of unseen objects in the real world.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Object perception is a critical task in robot manipulation. Model-based methods leverage 3D models of objects and solve the 6D object pose estimation problem to localize objects in 3D <ref type="bibr">[12,</ref><ref type="bibr">37,</ref><ref type="bibr">33,</ref><ref type="bibr">35]</ref>. Using the estimated object poses and the 3D models of objects, a planning scene can be set up for manipulation trajectory planning. However, requiring a 3D model for every object that needs to be manipulated is not feasible in the real world. Recent model-free approaches for object perception focus on segmenting unseen objects from images <ref type="bibr">[38,</ref><ref type="bibr">40,</ref><ref type="bibr">14]</ref>. A segmented point cloud of an object can be used in grasp planning for robot manipulation <ref type="bibr">[21,</ref><ref type="bibr">29]</ref>. In this way, an object can be grasped from partial observations without using its 3D model.</p><p>Recent model-based and model-free methods for object perception train neural networks to recognize objects. Since it is difficult to obtain large-scale real-world datasets in robot manipulation settings, synthetic data is widely used for training <ref type="bibr">[32,</ref><ref type="bibr">39,</ref><ref type="bibr">5]</ref>. Although models trained with synthetic data can be directly used in the real world by leveraging domain randomization <ref type="bibr">[31]</ref> or domain transfer <ref type="bibr">[6,</ref><ref type="bibr">44]</ref> techniques, these models still have errors in the real world due to the sim-to-real gap. The question we would like to address in this paper is how a robot can automatically obtain training data in the real world to improve its object segmentation model pre-trained with synthetic data.<note place="foot" n="1">Video, dataset and code are available at <ref type="url">https://irvlutd.github.io/SelfSupervisedSegmentation</ref>.</note> 
We focus on improving Unseen Object Instance Segmentation (UOIS) to facilitate robot manipulation.</p><p>Interactive perception <ref type="bibr">[7]</ref> emphasizes that robots can apply actions to their environment and utilize the visual-motor relationship to improve perception. In the context of object recognition, two widely used interaction types are robot grasping and pushing. Previous works have explored leveraging robot grasping or pushing to obtain object segmentation data in a self-supervised way <ref type="bibr">[24,</ref><ref type="bibr">15,</ref><ref type="bibr">41]</ref>. All these methods can only obtain the segmentation mask of the grasped or pushed object by comparing the scene before and after grasping <ref type="bibr">[24]</ref> or utilizing optical flow to segment the moved objects in robot pushing <ref type="bibr">[15,</ref><ref type="bibr">41]</ref>. Segmenting objects from one action has two drawbacks: first, the method cannot segment unmoved objects in the scene; second, if two objects are moved together, the method will segment them as one object. Although <ref type="bibr">[41]</ref> proposes to train a classifier to decide whether a single object is pushed or not, the classifier is trained in simulation and therefore still suffers from the sim-to-real gap.</p><p>To overcome the limitations of existing work on self-supervised object segmentation via robot interaction, we propose a new system that leverages long-term robot interaction to segment unseen objects in a self-supervised way. Our key idea is to defer the decision on object segmentation until a robot has interacted with all the objects in a scene for a period of time. Intuitively, if a robot has pushed the objects in a scene a number of times, i.e., around 20 pushes for 5 objects in our experiments, these objects are very likely to be separated from each other. 
Once the objects are separated, existing approaches to unseen object segmentation such as <ref type="bibr">[38,</ref><ref type="bibr">19]</ref> can successfully segment them. In this way, our system can segment all the objects in the scene rather than only the object pushed in one action. More importantly, the system enables the robot to propagate a correctly segmented mask of each object to all the images collected during robot pushing, including images where objects are very close to each other. This is achieved by combining multi-object tracking, which extracts object tracklets, i.e., segments of objects in video frames, with video object segmentation, where an initial mask of an object can be propagated to all other frames. The system utilizes the object tracklet to select a good initial mask for propagation. Consequently, our system enables a robot to collect a sequence of images of objects in a scene and obtain segmentation masks of all the objects in these images.</p><p>We demonstrate the usefulness of our system by using the collected real-world images to fine-tune existing, pretrained object segmentation models <ref type="bibr">[19]</ref>. We show that after fine-tuning, the object segmentation accuracy of the model can be significantly improved. The improvement is achieved in the same domain as the fine-tuning data as well as on the benchmark datasets for evaluating unseen object instance segmentation <ref type="bibr">[25,</ref><ref type="bibr">28]</ref>. Fig. <ref type="figure">1</ref> illustrates the fine-tuning process. In addition, we show that using the fine-tuned segmentation model can improve top-down grasping performance in a table clearing task where a robot is asked to put all the objects on a table into a bin. 
In summary, the contributions of our work are as follows.</p><p>&#8226; We introduce a novel robotic system that leverages long-term robot interaction to segment unseen objects in a self-supervised way.</p><p>&#8226; Our system illustrates that combining multi-object tracking and video object segmentation with robot pushing can help robots to singulate objects from each other in cluttered scenes.</p><p>&#8226; We demonstrate that using our system to collect real-world images for fine-tuning can improve object segmentation accuracy and robot grasping performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Unseen Object Instance Segmentation</head><p>Different from category-based object instance segmentation methods <ref type="bibr">[17,</ref><ref type="bibr">8,</ref><ref type="bibr">9]</ref> that focus on segmenting object instances among a set of pre-defined object categories, unseen object instance segmentation emphasizes segmenting arbitrary objects that are present in input images. The testing objects can be novel such that a segmentation model has not seen them during training. Earlier works on UOIS utilize low-level image cues such as edges, contours, and surface normals to group pixels into objects <ref type="bibr">[25,</ref><ref type="bibr">34,</ref><ref type="bibr">11]</ref>. These bottom-up approaches tend to over-segment objects since there is no object-level supervision to learn the concept of objects. Recent approaches to UOIS leverage large-scale synthetic data and deep neural networks to segment unseen objects <ref type="bibr">[39,</ref><ref type="bibr">38,</ref><ref type="bibr">40,</ref><ref type="bibr">14]</ref>. These methods significantly improve object segmentation accuracy, which enables robotic grasping of unseen objects <ref type="bibr">[21,</ref><ref type="bibr">29]</ref>. However, since these models are trained with synthetic data, they still suffer from the sim-to-real gap. The primary error is under-segmentation in the real world. When objects are very close to each other, the models trained with synthetic data cannot separate them. Recently, Zhang et al. <ref type="bibr">[44]</ref> proposed to apply test-time domain adaptation to improve the segmentation performance, where a set of images without ground truth labels in the test domain is used to adapt the segmentation network. Our system is complementary to domain adaptation techniques since it is able to obtain training images with ground truth labels automatically. Therefore, we can use supervised learning to fine-tune segmentation networks. 
More importantly, we show that, after fine-tuning in one domain, the performance of the segmentation networks can be improved in other domains, which avoids adaptation in every testing domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Self-Supervised Robot Perception</head><p>Self-supervised learning is an attractive learning paradigm where training data and training signals can be obtained automatically without human labor. Since a robot can naturally interact with its environment to collect data <ref type="bibr">[7]</ref>, self-supervised learning for robot perception has received more attention recently. One type of approach utilizes multi-view consistency of images captured from different viewpoints to obtain the ground truth annotations for learning. Multi-view consistency based self-supervised learning has been applied to object segmentation <ref type="bibr">[42]</ref>, object detection <ref type="bibr">[20]</ref>, 6D object pose estimation <ref type="bibr">[13]</ref> and dense pixel-wise correspondences <ref type="bibr">[26,</ref><ref type="bibr">16]</ref> in robot manipulation settings. Another type of approach leverages robot actions such as grasping and pushing to interact with objects and then computes scene differences <ref type="bibr">[24]</ref> or optical flow <ref type="bibr">[15,</ref><ref type="bibr">41]</ref> before and after applying an action to obtain ground truth labels of objects for learning. Our system falls into this category: we also employ robot pushing with optical flow to help segment objects in a self-supervised way. The main novelty of our system compared to previous methods on self-supervised object segmentation <ref type="bibr">[15,</ref><ref type="bibr">41]</ref> is that we leverage long-term robot pushing to segment all the objects in a collected video sequence, while previous methods can only segment the grasped or pushed object in an image.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. SELF-SUPERVISED UNSEEN OBJECT INSTANCE SEGMENTATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. System Overview</head><p>The motivation to build our system is to fix segmentation errors in existing UOIS methods <ref type="bibr">[38,</ref><ref type="bibr">19]</ref>. These methods are trained with synthetic RGB-D images generated using 3D models of objects. Due to the sim-to-real gap and the arrangements of objects in the simulator, these methods often cannot separate objects that are very close to each other. One example is shown in the first initial segmentation image in Fig. <ref type="figure">2</ref>, where five objects are packed together and MSMFormer <ref type="bibr">[19]</ref> outputs only one mask for all five objects. In grasping applications, a robot cannot grasp these objects due to the incorrect segmentation result. Our idea to fix these errors is to obtain ground truth masks of these packed objects in a self-supervised way by leveraging robot interaction with objects. Then, we can use these images with the corresponding ground truth masks to fine-tune the segmentation networks <ref type="bibr">[38,</ref><ref type="bibr">19]</ref>.</p><p>With enough data for fine-tuning, the networks should be able to segment closely packed objects.</p><p>The main challenge in this scenario is obtaining the ground truth masks when objects are close to each other. Previous methods that leverage robot interaction to obtain object masks <ref type="bibr">[15,</ref><ref type="bibr">41]</ref> can only obtain one mask of the pushed or grasped object in an image. They cannot generate masks of all the objects in the scene because they only use one robot action and try to figure out which object has been moved. Instead, in our system, we allow the robot to continuously push objects in a random fashion, and we capture images before and after each pushing action, with around 20 pushes for each scene in our experiments. 
Finally, we use these images to perform multi-object tracking and video object segmentation. In this way, our system can generate masks of all the objects in the image sequence including the first image, where all the objects are close to each other. Fig. <ref type="figure">2</ref> illustrates an overview of our system. The collected images with their generated masks can be used to fine-tune existing methods for unseen object instance segmentation <ref type="bibr">[38,</ref><ref type="bibr">19]</ref> in order to improve their performance in the real world. We introduce each component of the system in the following sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Data Collection via Robot Pushing</head><p>Since our goal is to collect hard-to-segment images to fine-tune the segmentation networks, we intentionally put objects together for each scene at the beginning of the data collection process. After setting up a scene on a tabletop, a robot starts pushing these objects. A Fetch mobile manipulator is employed in our system, and an RGB-D image is captured with the RGB-D camera on the Fetch robot before and after each push action.</p><p>Different from methods that carefully learn a pushing or grasping policy for singulation <ref type="bibr">[41]</ref>, we design a simple pushing strategy using object instance segmentation from MSMFormer <ref type="bibr">[19]</ref> as input. This is because our system does not require all the objects to be singulated at the end of the interaction. As long as an object has been separated from other objects for a period of time during pushing, the system is able to generate correct segmentation masks for it thanks to the multi-object tracking and video object segmentation techniques utilized in the system. When one push action cannot separate two objects because both objects move together, multiple push actions may separate them. Therefore, our system benefits from long-term robot interactions with a sequence of pushes.</p><p>Specifically, suppose at time <formula notation="tex">t</formula>, the system captures an RGB-D image <formula notation="tex">I_t</formula>. We obtain a set of <formula notation="tex">n_t</formula> object segmentation masks <formula notation="tex">\{o_t^i\}_{i=1}^{n_t}</formula> on <formula notation="tex">I_t</formula> by running the MSMFormer network on it. These masks are illustrated as the initial segmentation in Fig. <ref type="figure">2</ref>. Based on the object segmentation, the robot randomly selects an object to push. First, a 3D bounding box is computed for each segmented object by bounding the 3D point cloud of the object. 
Using the depth image and the camera intrinsic parameters, we can back-project the depth image into a 3D point cloud of the scene in the camera frame. Since we also know the camera pose in the robot frame, we can convert the point cloud into the robot frame. Using the segmentation mask of each object, we can extract the points of the object and compute a 3D bounding box for it in the robot frame. Second, according to the center of the 3D bounding box, the robot decides to either push the object to the left or to the right. We select the pushing direction to always push the object towards the center of the robot, which prevents objects from being pushed outside the reach of the robot. Third, a motion trajectory is planned to the left side (pushing right) or right side (pushing left) of the object. We used the MoveIt motion planning framework to plan the trajectories. Then the planned trajectory is executed to move the robot arm to the pushing location. Finally, the pushing action is achieved by adding an offset to the shoulder joint of the Fetch arm depending on the pushing direction.</p><p>Note that our pushing strategy cannot match the singulation results of learned policies or strategies designed for singulation. However, singulation is not our main goal. We also want to collect diverse datasets for learning. Our pushing strategy is effective at separating and perturbing objects in the scene in order to generate diverse images. In addition, although the initial segmentation has errors, it can still be used to guide the pushing process. A sequence of pushing actions and the generated images are shown in Fig. <ref type="figure">2</ref>.</p></div>
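<div xmlns="http://www.tei-c.org/ns/1.0"><p>The geometric part of the pushing strategy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the system: it assumes a pinhole camera with intrinsics fx, fy, cx, cy, a 4x4 homogeneous camera-to-robot transform, axis-aligned bounding boxes, and a robot frame whose y axis points to the robot's left with the robot center at y = 0. All function names are hypothetical, and motion planning with MoveIt is not covered.</p>
<eg xml:space="preserve"><![CDATA[
```python
def back_project(depth, fx, fy, cx, cy):
    """Back-project a depth image (rows of depths in metres) into 3D points
    in the camera frame. Returns {(u, v): (x, y, z)}; zero depth is skipped."""
    points = {}
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points[(u, v)] = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    return points


def to_robot_frame(point, transform):
    """Apply a 4x4 homogeneous camera-to-robot transform to one 3D point."""
    x, y, z = point
    return tuple(
        transform[r][0] * x + transform[r][1] * y + transform[r][2] * z + transform[r][3]
        for r in range(3)
    )


def bounding_box(points):
    """Axis-aligned 3D bounding box (min corner, max corner) of a point list."""
    mins = tuple(min(p[k] for p in points) for k in range(3))
    maxs = tuple(max(p[k] for p in points) for k in range(3))
    return mins, maxs


def push_direction(box, robot_center_y=0.0):
    """Push toward the robot center. Assumes +y is the robot's left, so an
    object on the left (center y above the robot center) is pushed right."""
    mins, maxs = box
    center_y = 0.5 * (mins[1] + maxs[1])
    return "right" if center_y > robot_center_y else "left"
```
]]></eg>
<p>In use, the pixels of one segmentation mask would select the object's points from the back-projected cloud before the bounding box is computed; the returned direction then determines on which side of the object the arm is placed.</p></div>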
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Optical Flow-based Multi-Object Tracking</head><p>After the data collection via robot pushing, we obtain a sequence of images <formula notation="tex">I_1, I_2, \ldots, I_N</formula> with the corresponding initial segmented objects <formula notation="tex">\{o_t^i\}_{i=1}^{n_t}</formula> for <formula notation="tex">t = 1, \ldots, N</formula>, where N &#8776; 20 in our experiments. Since there are errors in these initial masks, our next task is to fix these errors and obtain correct segmentation masks for all the objects in the image sequence. Our idea is to leverage the observation that if a mask incorrectly includes more than one object, after a robot push, the mask will be broken down into multiple objects. On the other hand, if a mask correctly segments one object, after pushing, the mask will remain the same. However, one pushing action may not be able to singulate an object successfully. Therefore, we leverage a sequence of robot pushing actions in our system. In this case, if a mask remains the same after several pushing actions, it is highly likely to be a correct segmentation.</p><p>In order to compare the initial segmentation masks across image frames, we need to associate masks across frames. This problem is studied in the literature as tracking by detection <ref type="bibr">[43,</ref><ref type="bibr">4,</ref><ref type="bibr">36,</ref><ref type="bibr">27]</ref>. The most important component in a tracking-by-detection method is a similarity measurement between two object detections across video frames, which can be learned from data <ref type="bibr">[27]</ref> or defined using image features <ref type="bibr">[36]</ref>. In our system, since we do not have much data to learn the similarity measurement in robotic manipulation settings, we design one based on optical flow between image frames.</p><p>Let <formula notation="tex">o_{t_1}^i</formula> be a mask on image <formula notation="tex">I_{t_1}</formula> and <formula notation="tex">o_{t_2}^j</formula> be a mask on image <formula notation="tex">I_{t_2}</formula>. We would like to compute a similarity score between the two masks as <formula notation="tex">s(o_{t_1}^i, o_{t_2}^j)</formula>. We only consider adjacent images in data association. 
Therefore, we can assume <formula notation="tex">t_2 = t_1 + 1</formula>. We leverage optical flow between the two images to define the similarity score. Let <formula notation="tex">\hat{o}_{t_2}^i = o_{t_1}^i + f_{t_1,t_2}^i</formula> be the propagated mask of object <formula notation="tex">o_{t_1}^i</formula> to frame <formula notation="tex">I_{t_2}</formula> using forward flow <formula notation="tex">f_{t_1,t_2}^i</formula>. Similarly, we can propagate the mask of object <formula notation="tex">o_{t_2}^j</formula> to frame <formula notation="tex">I_{t_1}</formula> using backward flow: <formula notation="tex">\hat{o}_{t_1}^j = o_{t_2}^j + f_{t_2,t_1}^j</formula>. The similarity score between the two masks is defined as</p><p>where the <formula notation="tex">\mathrm{IoU}(\cdot, \cdot)</formula> function computes the intersection over union between two binary masks. Intuitively, one mask is propagated to another image using optical flow and compared to the other mask. Fig. <ref type="figure">3</ref> illustrates two examples of the computed matching scores. In case (a), at time <formula notation="tex">t_1</formula>, the initial segmentation cannot separate the corn and the salt bottle. The propagated mask to time <formula notation="tex">t_2</formula> cannot match the mask of the corn at time <formula notation="tex">t_2</formula> well. Therefore, the matching score is low. In case (b), the masks of the tomato match well using both the forward flow and the backward flow. The matching score is high. When the optical flow estimation is accurate, the similarity score in Eq. (1) serves as a good measurement for data association between objects. In our system, we use the RAFT <ref type="bibr">[30]</ref> network to compute optical flow.</p><p>With the above similarity score, one could leverage existing multi-object tracking methods such as network flow-based approaches <ref type="bibr">[43,</ref><ref type="bibr">27]</ref> or Markov decision process-based approaches <ref type="bibr">[36]</ref> to generate trajectories of objects across image frames. Instead, we found that a simple greedy search algorithm works well in the tabletop robot pushing settings since there are no long-term occlusions between objects or new objects coming in and out in these settings. The greedy data association algorithm starts from one mask in the last image frame <formula notation="tex">I_N</formula>. 
Then it associates the mask to a previous mask which has the highest matching score if their matching score is larger than a pre-defined threshold, and repeats this process until the highest matching score is smaller than the threshold. In this way, it generates a tracklet for one object.</p><p>After that, it selects a remaining mask and repeats the process to generate the next tracklet. We start the data association from the last frame and proceed backward because objects are likely to be separated at the end of the robot pushing, which helps object tracking.</p></div>
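<div xmlns="http://www.tei-c.org/ns/1.0"><p>The flow-based matching and the greedy backward association can be sketched as follows. This is an illustrative sketch only. Masks are represented as sets of (x, y) pixels and optical flow as a per-pixel displacement dictionary, both simplifications. The exact form of the similarity score in Eq. (1) is not reproduced in this text, so the sketch assumes the mean of the forward and backward propagated IoU; the order in which remaining masks are visited (later frames first) is also an assumption. All names are hypothetical.</p>
<eg xml:space="preserve"><![CDATA[
```python
def iou(a, b):
    """Intersection over union of two masks given as sets of (x, y) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def propagate(mask, flow):
    """Move every mask pixel by its rounded flow vector flow[(x, y)] = (dx, dy)."""
    return {(x + round(flow[(x, y)][0]), y + round(flow[(x, y)][1])) for (x, y) in mask}


def similarity(mask_a, mask_b, forward_flow, backward_flow):
    """ASSUMED score: mean of the forward and backward propagated IoU."""
    return 0.5 * (iou(propagate(mask_a, forward_flow), mask_b)
                  + iou(propagate(mask_b, backward_flow), mask_a))


def greedy_tracklets(scores, counts, threshold):
    """Greedy backward data association.

    scores[t][(i, j)]: similarity of mask i in frame t and mask j in frame t + 1.
    counts[t]: number of initial masks in frame t.
    Returns tracklets as lists of (frame, mask index) in temporal order.
    """
    used = set()
    tracklets = []
    for start in range(len(counts) - 1, -1, -1):  # last frame first
        for j in range(counts[start]):
            if (start, j) in used:
                continue
            track = [(start, j)]
            used.add((start, j))
            t, cur = start, j
            while t > 0:
                candidates = [(scores[t - 1].get((i, cur), 0.0), i)
                              for i in range(counts[t - 1])
                              if (t - 1, i) not in used]
                if not candidates:
                    break
                best, i = max(candidates)
                if best <= threshold:
                    break  # no sufficiently similar previous mask
                track.append((t - 1, i))
                used.add((t - 1, i))
                t, cur = t - 1, i
            tracklets.append(track[::-1])
    return tracklets
```
]]></eg>
<p>A merged mask in an early frame receives low scores against the separated masks in the next frame, so it ends up in its own short tracklet instead of contaminating the tracklets of the individual objects.</p></div>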
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Mask Propagation via Long-Term Object Segmentation</head><p>The output from the multi-object tracking algorithm is a set of tracklets <formula notation="tex">\{T_i\}_{i=1}^{M}</formula>, where tracklet <formula notation="tex">T_i = (o_{t_1}^i, o_{t_2}^i, \ldots, o_{t_m}^i)</formula> consists of a sequence of object masks from the initial segmentation. The lengths of these tracklets can be different. The majority of masks in each tracklet correctly segment one object, since wrong initial segmentation masks have low matching scores as illustrated in Fig. <ref type="figure">3</ref>. If we can utilize the extracted tracklets and propagate the correct masks to all the image frames for all the objects, we can obtain correct segmentation masks for the data collected via robot pushing.</p><p>To achieve this goal, we utilize a state-of-the-art video object segmentation method named XMem <ref type="bibr">[10]</ref>. Given an initial mask of an object, XMem can segment the object in the following video frames. It maintains a memory buffer that stores the features of the target object, which enables it to segment the target in long video sequences and handle occlusions. In traditional video segmentation scenarios, the initial mask of a target is given manually on the first video frame. In our case, we need to generate the initial mask automatically. It is critical to select a correct initial mask for an object. Otherwise, a wrong mask will be propagated to other frames. We utilize the observation that if a mask being pushed still has high matching scores (Eq. (<ref type="formula">1</ref>)) to the previous mask and the next mask in a tracklet, the mask is likely to contain a single object. Therefore, we select the pushed mask with the highest matching score as the initial mask to initialize XMem. The segmentation proceeds in two directions: one toward the first frame and the other toward the last frame of the collected image sequence. Fig. 
<ref type="figure">4</ref> shows two examples of the object segmentation with XMem. After all the tracklets are processed, the segmentation masks are combined to generate the final segmentation of the images (see Fig. <ref type="figure">2</ref>). In this way, our system can obtain segmentation masks of objects when they are very close to each other.</p></div>
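<div xmlns="http://www.tei-c.org/ns/1.0"><p>The selection of the initial mask and the two propagation directions can be sketched as follows. This is a minimal sketch under assumptions: the text does not state how the scores to the previous and the next mask are combined, so the sketch takes their minimum, and it only considers pushed masks that have both neighbors in the tracklet. The function names are hypothetical, and the call to the video object segmentation model itself is omitted.</p>
<eg xml:space="preserve"><![CDATA[
```python
def select_initial_mask(link_scores, pushed):
    """Pick the tracklet entry used to initialize mask propagation.

    link_scores[k]: matching score between tracklet entries k and k + 1.
    pushed[k]: True if entry k is the mask of the object that was pushed.
    Returns the index of the pushed entry whose weaker neighbor link is the
    strongest (ASSUMED combination rule), or None if no entry qualifies.
    """
    best_index, best_score = None, -1.0
    for k, was_pushed in enumerate(pushed):
        if not was_pushed or k == 0 or k == len(pushed) - 1:
            continue  # need both a previous and a next mask
        score = min(link_scores[k - 1], link_scores[k])
        if score > best_score:
            best_index, best_score = k, score
    return best_index


def propagation_orders(start_frame, num_frames):
    """Frame orders for the two propagation passes from the initial frame:
    one back to the first frame, one forward to the last frame."""
    backward = list(range(start_frame - 1, -1, -1))
    forward = list(range(start_frame + 1, num_frames))
    return backward, forward
```
]]></eg>
<p>For a tracklet with link scores 0.9, 0.6, 0.95, 0.92 and pushes recorded at entries 1 and 3, the sketch selects entry 3, because both of its neighbor links are high.</p></div>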
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. APPLICATIONS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Transfer Learning for Object Segmentation</head><p>Our system can be used to collect images with the corresponding object segmentation masks in a self-supervised way. Then we can use these images to fine-tune the object segmentation networks to improve their performance. Since the collected data include correct segmentation masks when objects are very close to each other, the fine-tuned model is able to fix segmentation errors and correctly separate objects in cluttered scenes.</p><p>For fine-tuning, we start with a segmentation model trained with synthetic data. We used MSMFormer <ref type="bibr">[19]</ref> in our experiments, which is also used to generate the initial segmentation masks for robot pushing. We initialize the network with the pre-trained weights on the synthetic data, and then train the network for a number of epochs on the collected real-world data with a smaller learning rate. We conducted an ablation study on different fine-tuning strategies. Specifically, the backbone of the network can be fixed or be trainable during fine-tuning. The fine-tuning data can be a mixture of synthetic images and real-world images or real-world images only. The effects of these strategies are presented in Section V.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Top-Down Robot Grasping</head><p>Unseen object instance segmentation can facilitate robot grasping of unknown objects as demonstrated in previous works <ref type="bibr">[21,</ref><ref type="bibr">22]</ref>. These methods use the segmented point clouds of objects to plan grasps. Improvement in object segmentation can benefit the grasp planning stage and subsequently improve the grasping performance. In this work, we show that using our collected data for fine-tuning can improve object segmentation and, consequently, top-down grasping. With accurate object segmentation, top-down grasp planning can be achieved in an analytic way. A top-down grasp for a two-finger gripper is defined as the 3D location p = (x, y, z), orientation &#952; of the gripper in the x, y plane and the width w between the two fingers, where the z axis is the gravity direction. The grasping position p is defined as the object center, where the object center is computed as the mean of the segmented point cloud of the object. The grasping orientation &#952; is computed to align the gripper with the second largest principal component of the object point cloud in the x, y plane. In this way, the robot can grasp the narrower side of a long object. Finally, the width between the two fingers is determined by the width of the object along the second largest principal component of the object point cloud in the x, y plane. It can be shown that if the center of mass of the object is the same as the object center, a grasp computed in this way can achieve force closure. The described grasp planning algorithm relies on accurate segmentation of all the objects in a scene. We can use it to verify the benefit of our system in collecting data to improve object segmentation for robot grasping.</p></div>
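<div xmlns="http://www.tei-c.org/ns/1.0"><p>The analytic top-down grasp described above can be sketched with a closed-form principal axis of the 2x2 covariance of the points in the x, y plane. This is a minimal sketch, not the planner used in the experiments: the function name is hypothetical, the point cloud is assumed to be already segmented and expressed in a frame whose z axis is the gravity direction, and the finger width is taken as the raw extent of the points along the minor axis without any safety margin.</p>
<eg xml:space="preserve"><![CDATA[
```python
import math


def top_down_grasp(points):
    """Compute (position, theta, width) of a top-down grasp from a list of
    (x, y, z) points belonging to one segmented object."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n

    # Covariance of the point cloud projected onto the x, y plane.
    sxx = sum((p[0] - cx) ** 2 for p in points) / n
    syy = sum((p[1] - cy) ** 2 for p in points) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points) / n

    # Angle of the largest principal component of a 2x2 covariance matrix.
    major = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # The second largest component is perpendicular to it.
    theta = major + math.pi / 2.0

    # Object extent along the second principal component gives the width.
    ax, ay = math.cos(theta), math.sin(theta)
    proj = [(p[0] - cx) * ax + (p[1] - cy) * ay for p in points]
    width = max(proj) - min(proj)
    return (cx, cy, cz), theta, width
```
]]></eg>
<p>For an object that is long along x and narrow along y, the returned orientation is perpendicular to the long axis and the width equals the narrow extent, which matches the intent of grasping the narrower side of a long object.</p></div>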
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Datasets and Evaluation Metrics</head><p>Data Collected by the Robot. We used a set of play food for kids as the objects for robot interaction. For reproducibility, these objects can be purchased from <ref type="bibr">[1]</ref>. A Fetch mobile manipulator is used for data collection. Five different objects are used in each scene, and the robot performs around 20 pushing actions for each scene to collect images before and after each pushing action. In total, we collected images from 20 scenes. Images from 15 scenes are used for fine-tuning and the remaining images are used for testing the fine-tuned model in the same domain. Specifically, 321 images are used for fine-tuning, while 107 images are available for testing. Each image contains an average of 6 objects, but no more than 8 objects.</p><p>Evaluation Datasets. We evaluate the performance of our fine-tuned models on the pushing test dataset from our system, the Object Clutter Indoor Dataset (OCID) <ref type="bibr">[28]</ref> and the Object Segmentation Database (OSD) <ref type="bibr">[25]</ref>. The dataset from robot interaction is in the same domain as our collected data for fine-tuning, whereas OCID and OSD are in different domains. The OCID dataset contains 2,390 RGB-D images, with at most 20 objects and on average 7.5 objects per image. The OSD dataset is composed of 111 RGB-D images, with up to 15 objects and an average of 3.3 objects per image.</p><p>Evaluation Metrics. We analyze the object segmentation performance through precision, recall, and F-measure <ref type="bibr">[39,</ref><ref type="bibr">38]</ref>.</p><p>To obtain the values for these three metrics, we first calculate the values between all pairs of predictions and ground truth objects. Subsequently, we employ the Hungarian algorithm with pairwise F-measure to match predictions with the ground truth objects. 
Consequently, the precision, recall, and F-measure are determined by <formula notation="tex">P = \frac{\sum_i |c_i \cap g(c_i)|}{\sum_i |c_i|}, \quad R = \frac{\sum_i |c_i \cap g(c_i)|}{\sum_j |g_j|}, \quad F = \frac{2PR}{P + R}</formula>, where <formula notation="tex">c_i</formula> indicates the segmentation for the predicted object <formula notation="tex">i</formula>, <formula notation="tex">g(c_i)</formula> is the segmentation for the corresponding ground truth object of <formula notation="tex">c_i</formula>, and <formula notation="tex">g_j</formula> denotes the segmentation for the ground truth object <formula notation="tex">j</formula>. Overlap P/R/F are the above three metrics when the intersection over union between two segmentation masks is used to determine the amount of true positives. Boundary P/R/F are also used to measure the sharpness of the predicted boundary against the ground truth boundary, where the intersection pixels of the two boundaries determine the amount of true positives. Additionally, Overlap F-measure &#8805; 75% is the percentage of objects segmented with a certain accuracy <ref type="bibr">[23]</ref>.</p></div>
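<div xmlns="http://www.tei-c.org/ns/1.0"><p>The matching-based overlap metrics can be sketched as follows. The sketch assumes the formulation commonly used for these metrics in prior UOIS work: true positives are the pixels shared by each matched prediction and ground truth pair, precision divides by the total predicted pixels, and recall divides by the total ground truth pixels. Masks are sets of (x, y) pixels. For brevity, an exhaustive search over assignments stands in for the Hungarian algorithm; it finds the same optimal matching but is only practical for a handful of objects. All function names are hypothetical.</p>
<eg xml:space="preserve"><![CDATA[
```python
from itertools import permutations


def pair_f(pred, gt):
    """Pairwise F-measure between one predicted and one ground truth mask."""
    inter = len(pred & gt)
    if inter == 0:
        return 0.0
    p, r = inter / len(pred), inter / len(gt)
    return 2.0 * p * r / (p + r)


def match(preds, gts):
    """One-to-one matching that maximizes the total pairwise F-measure.
    Exhaustive stand-in for the Hungarian algorithm (small inputs only).
    Returns a list of (prediction index, ground truth index) pairs."""
    swap = len(preds) > len(gts)
    small, large = (gts, preds) if swap else (preds, gts)
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(j, i) if swap else (i, j) for i, j in enumerate(perm)]
        score = sum(pair_f(preds[a], gts[b]) for a, b in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    # Drop pairs without any overlap; they contribute no true positives.
    return [(a, b) for a, b in best_pairs if preds[a] & gts[b]]


def overlap_prf(preds, gts):
    """Overlap precision, recall and F-measure over all objects in an image."""
    true_pos = sum(len(preds[a] & gts[b]) for a, b in match(preds, gts))
    pred_total = sum(len(c) for c in preds)
    gt_total = sum(len(g) for g in gts)
    p = true_pos / pred_total if pred_total else 0.0
    r = true_pos / gt_total if gt_total else 0.0
    f = 2.0 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```
]]></eg>
<p>An under-segmented prediction that covers two equally sized ground truth objects with one mask can only be matched to one of them, so both precision and recall drop to 0.5 in that case.</p></div>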
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Ablation Studies on the Fine-tuning Strategies</head><p>We first investigate how to fine-tune the pre-trained segmentation networks with our collected real-world data. Regarding the training data for fine-tuning, we have two types of data: the 321 real-world images obtained via robot pushing and the synthetic images from the Tabletop Object Dataset <ref type="bibr">[39]</ref>. The synthetic dataset consists of 280,000 RGB-D images and is used for training most unseen object instance segmentation networks <ref type="bibr">[39,</ref><ref type="bibr">38,</ref><ref type="bibr">19]</ref>. In this work, we use the MSMFormer model <ref type="bibr">[19]</ref> trained on the Tabletop Object Dataset for fine-tuning, since it achieves very competitive performance and is end-to-end trainable. MSMFormer consists of two stages in segmenting objects, where the first stage segments the whole input image while the second stage performs zoom-in refinement for each segment from the first stage.</p><p>We have two choices on using these data for fine-tuning: i) using the real-world images only, ii) using both real-world images and synthetic images. Likewise, we have two choices on how to fine-tune the backbone network in MSMFormer: i) fixing the backbone during fine-tuning, ii) fine-tuning the backbone. We conduct ablation studies on the four combinations and present the results on the OCID and the OSD datasets in Table <ref type="table">I</ref>. We fine-tune the models for 6 epochs as the training loss converges quickly, where each epoch loops over the 321 real-world images once. We employ the AdamW optimizer <ref type="bibr">[18]</ref> with a learning rate of 1e-5. We set the batch size to 4. When using the mixture dataset for fine-tuning, for the first-stage model of MSMFormer, we randomly select 2 samples from the synthetic dataset and 2 samples from the real-world pushing dataset for each batch. 
For the second-stage model (zoom-in model), each batch has 3 random samples from the synthetic dataset and 1 pushing sample. Since the second-stage model has 8 decoder layers, it tends to overfit to the real images due to its high complexity. Therefore, we use more synthetic images in a batch for the second-stage model.</p><p>Table <ref type="table">I</ref> shows that the performance of MSMFormer fine-tuned using only the small set of real-world pushing images is worse on the OCID dataset. This is due to overfitting to these real data. Using both the synthetic data and the real-world data for fine-tuning improves performance on both datasets. Using the mixture dataset is motivated by continual learning approaches such as <ref type="bibr">[2,</ref><ref type="bibr">3]</ref>, which maintain a buffer of previously seen data. In our case, we can consider the synthetic dataset to be a data buffer. Table I also reveals that using learnable backbones achieves better performance than fixed backbones due to more flexibility in learning. According to these results, our fine-tuning strategy is to train the pre-trained MSMFormer with mixture data and learnable backbones. We use this fine-tuning strategy in the following experiments.</p></div>
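<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mixed-batch composition described above can be sketched as follows. This is a minimal illustration rather than the training code used in the paper: the function name mixed_batches is hypothetical, samples are stand-in tuples instead of images, and two choices are assumptions, namely that an epoch is defined by one pass over the real-world images and that an incomplete final batch is dropped.</p>
<p>
```python
import random


def mixed_batches(real, synthetic, n_real, n_synth, seed=0):
    """Yield batches mixing real and synthetic samples for one epoch.

    Assumptions (not stated in the paper): the epoch is one shuffled pass
    over `real`, synthetic samples are drawn at random for every batch,
    and an incomplete final batch is dropped.
    """
    rng = random.Random(seed)
    order = list(real)
    rng.shuffle(order)
    for start in range(0, len(order) - n_real + 1, n_real):
        real_part = order[start:start + n_real]
        synth_part = rng.sample(synthetic, n_synth)  # distinct synthetic samples
        yield synth_part + real_part


# Stand-in data: 321 real pushing images and a small synthetic pool.
real_images = [("real", i) for i in range(321)]
synthetic_images = [("synthetic", i) for i in range(1000)]

# First stage: 2 synthetic + 2 real samples per batch.
stage_one = list(mixed_batches(real_images, synthetic_images, n_real=2, n_synth=2))
# Second stage (zoom-in): 3 synthetic + 1 real sample per batch.
stage_two = list(mixed_batches(real_images, synthetic_images, n_real=1, n_synth=3))
print(len(stage_one), len(stage_two))
```
</p>
<p>Both configurations keep the batch size at 4. Only the ratio of synthetic to real samples changes, which is how the second stage is exposed to fewer real images per update.</p></div>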
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Ablation Studies on the Number of Fine-tuning Images</head><p>Our collected pushing training set has 15 scenes in total. We investigate the correlation between the number of images and the performance of the fine-tuned model. We partition the training set according to scenes and gradually add more scenes to the fine-tuning dataset. Table <ref type="table">II</ref> shows the performance of the MSMFormer models fine-tuned with datasets of different sizes. We can see that the performance on the OCID and OSD datasets continually improves as the number of scenes increases. After 12 scenes, the model performance begins to saturate. According to this experiment, a small number of real-world images is sufficient for fine-tuning, which avoids collecting a large number of images in the real world. We use all 15 scenes with 321 images for fine-tuning in the following experiments.</p></div>
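<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scene-wise partitioning can be sketched as building nested fine-tuning subsets, so that each larger subset contains all images of the smaller ones. The function name nested_scene_subsets, the scene identifiers, the file names, and the step sizes (3, 6, 9, 12, 15) are hypothetical choices for illustration. The paper only states that scenes are added gradually up to 15.</p>
<p>
```python
def nested_scene_subsets(images_by_scene, steps):
    """Return {k: images of the first k scenes} for every k in steps.

    images_by_scene: dict mapping a scene id to its list of images.
    Scenes are taken in sorted id order (an assumption), so every
    subset is contained in the next larger one.
    """
    scene_ids = sorted(images_by_scene)
    subsets = {}
    for k in steps:
        chosen = scene_ids[:k]
        subsets[k] = [img for scene in chosen for img in images_by_scene[scene]]
    return subsets


# Stand-in data: 15 scenes with made-up image names, 3 images per scene.
scenes = {s: ["scene%02d_img%d.png" % (s, i) for i in range(3)] for s in range(15)}
subsets = nested_scene_subsets(scenes, steps=(3, 6, 9, 12, 15))
for k, images in subsets.items():
    print(k, "scenes:", len(images), "images")
```
</p>
<p>Splitting by scene rather than by image keeps all images of one pushing sequence in the same subset, which matches the way the training set is partitioned in this ablation.</p></div>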
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Object Instance Segmentation in the Same Domain</head><p>Table <ref type="table">III</ref> presents the evaluation results of the models before and after fine-tuning on the 107 real-world test images. Since the pushing test dataset has the same settings as the fine-tuning dataset, we view the pushing test dataset as being in the same domain. It is clear that the fine-tuned models significantly improve the segmentation accuracy in the same domain. When a robot enters a new domain, it can use our system to collect a few images to improve object segmentation in this new domain. We experiment with fine-tuning both the RGB version and the RGB-D version of MSMFormer. In addition, we investigate the effect of fine-tuning on each stage of the segmentation network. "Zoom-in" in Table <ref type="table">III</ref> indicates the second-stage network. From the table, we can see that fine-tuning consistently improves the performance over the original models. The best performance is achieved by fine-tuning both stages of MSMFormer.</p><p>Generally, RGB-D models tend to surpass RGB models due to the additional depth input. However, we can observe that the fine-tuned two-stage RGB model (RGB with zoom-in) achieves the same Overlap F-measure and a higher Boundary F-measure compared to the fine-tuned two-stage RGB-D model. This result indicates that it is possible to segment unseen objects with RGB images only, as long as we can obtain RGB training images with ground truth labels. Our system provides a solution by utilizing robot interaction for data collection. It is worth noting that using RGB images only is valuable since certain objects, such as transparent objects or metal objects, cannot be captured well by depth images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Object Instance Segmentation across Domains</head><p>To evaluate the performance of the fine-tuned models across domains, we test them on the OCID and OSD datasets and compare the achieved results with the state-of-the-art methods in Table <ref type="table">IV</ref>. From the table, we can see that the fine-tuned models improve over the state-of-the-art methods on the OCID dataset for both RGB and RGB-D input. On the OSD dataset, UOAIS-Net <ref type="bibr">[5]</ref> achieves better performance for RGB-D input by utilizing photo-realistic synthetic images for training.</p><p>In most cases, the fine-tuning strategy improves the models pre-trained on synthetic images. However, the RGB-D fine-tuned zoom-in refinement is not as effective as the original zoom-in refinement on the OCID dataset. The primary reason for this is that the environment and objects in our pushing dataset are simpler and more restricted than those presented in the OCID dataset. The combination of the fine-tuned first-stage model and the original zoom-in part is more effective on the OCID dataset. We visualize the differences between using the original models and the fine-tuned models on different datasets in Fig. <ref type="figure">5</ref>. The fine-tuned models are able to separate adjacent objects, which mitigates the under-segmentation problem both in the same domain as the fine-tuning images and in the different domains of the OCID and OSD datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Top-Down Grasping with Object Instance Segmentation</head><p>We show the usefulness of the proposed system for grasping unknown objects in a table-top setting where the objects are placed in a cluttered environment. A Fetch mobile manipulator is used for the experiments, with its parallel jaw gripper for grasping and its built-in RGB-D camera for perception. We compute the top-down grasp after segmenting all the objects in the scene via the procedure described in Section IV-B. We formulate the experiment as a pick-and-place task where the goal is to clear the table and place all the objects in a nearby bin. One example is shown in Fig. <ref type="figure">6</ref>.</p><p>The experiment is conducted with two sets of unknown objects (i.e., not seen during fine-tuning or training), with each set containing five objects. For each object set, we consider the pick-and-place task with four different initial configurations of the object placement on the table, ranging from highly cluttered to well separated, as shown in Fig. <ref type="figure">7</ref>. The pick-and-place grasping trials are conducted with the baseline<note place="foot" n="2">Baseline: MSMFormer R34 + Zoom-in in Table V-A</note> and fine-tuned<note place="foot" n="3">Fine-tuned: MSMFormer R34* + Zoom-in* in Table V-A</note> segmentation models with RGB-D input for each configuration to bring out the relative improvement from fine-tuning on data collected using the proposed method.</p><p>Given a configuration for object arrangement on the table, there are five pick-and-place trials, one for each of the five objects. A trial is counted as a success if a grasp of an object guided by its segmentation boundary allows for a successful pick-and-place operation; otherwise it is counted as a failure. A hard failure occurs for a scene if the segmentation masks are incorrect at the beginning, when all 5 objects are in the scene. 
Such an error is not favorable because of the possibility of collision and damage to the gripper, and hence the grasping is stopped in this case, and none of the objects count towards the success rate metric. It potentially occurs if the segmentation model is not able to establish clear boundaries between nearby objects, which induces errors in the grasping pipeline, specifically in positioning the gripper for picking up the object. For example, cases 1-A and 1-B with the baseline model in Table V are hard failures due to segmentation errors at the very start. Consequently, no feasible grasping motion is found for any object in the scene, and hence they have no score for the respective trials. Therefore, accurate object segmentation is critical for grasping in cluttered scenes.</p><p>We obtain data for the 40 individual trials (10 objects in total, across 4 table-top configurations for each) for each of the baseline and fine-tuned models and report their number of successful actions. As shown in Table <ref type="table">V</ref>, the grasp success rate clearly improves when using the fine-tuned model, especially in scenes with high clutter. This highlights the need for precise segmentation masks of objects in cluttered scenes, as any errors in this stage are likely to affect downstream applications like grasping. Additional details and qualitative results will be provided in the supplementary material.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION AND FUTURE WORK</head><p>We introduced a robotic system for self-supervised unseen object instance segmentation. Our system leverages robot pushing to interact with objects and collect images before and after each pushing action. In order to generate segmentation masks of objects in the collected images, the system allows the robot to push objects until a sequence of images is collected. An optical-flow-based multi-object tracking algorithm and a video object segmentation method are then combined to segment object instances in the collected images automatically. Using a sequence of images from robot pushing enables the system to segment all the objects in the sequence, including objects that are very close to each other. To the best of our knowledge, this is the first system that leverages long-term robot interaction for object segmentation.</p><p>We verify the usefulness of the system by using the collected images to fine-tune object segmentation networks. Our experiments show that the fine-tuned networks achieve better segmentation accuracy both in the same domain and in different domains. We also demonstrate that improving object segmentation with fine-tuning benefits top-down robot grasping in a pick-and-place task, where accurate object segmentation can be used to plan grasps in cluttered scenes.</p><p>For future work, we plan to extend the system beyond tabletop scenarios, for example to segmenting objects inside bins or cabinets. Robot interaction in these environments requires motion planning to account for the constraints from the environments. Robot pushing may not be sufficient in these environments. We plan to investigate different interaction actions such as grasping and scooping for data collection.</p></div></body>
		</text>
</TEI>
