<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>MOD-UV: Learning Mobile Object Detectors from Unlabeled Videos</title></titleStmt>
			<publicationStmt>
				<publisher>European Conference on Computer Vision</publisher>
				<date>09/29/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10545657</idno>
					<idno type="doi"></idno>
					
					<author>Yihong Sun</author><author>Bharath Hariharan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Embodied agents must detect and localize objects of interest, e.g. traffic participants for self-driving cars. Supervision in the form of bounding boxes for this task is extremely expensive. As such, prior work has looked at unsupervised instance detection and segmentation, but in the absence of annotated boxes, it is unclear how pixels must be grouped into objects and which objects are of interest. This results in over-/under-segmentation and irrelevant objects. Inspired by the human visual system and practical applications, we posit that the key missing cue for unsupervised detection is motion: objects of interest are typically mobile objects that frequently move, and their motions can specify separate instances. In this paper, we propose MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only. We begin with instance pseudo-labels derived from motion segmentation, but introduce a novel training paradigm to progressively discover small objects and static-but-mobile objects that are missed by motion segmentation. As a result, though only learned from unlabeled videos, MOD-UV can detect and segment mobile objects from a single static image. Empirically, we achieve state-of-the-art performance in unsupervised mobile object detection on Waymo Open, nuScenes, and KITTI Datasets without using any external data or supervised models. Code is available at github.com/YihongSun/MOD-UV.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fig. <ref type="figure">1</ref>: Our approach, MOD-UV, learns from unlabeled videos in Waymo Open <ref type="bibr">[46]</ref> only and can reliably detect and segment mobile objects from a single input image.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Embodied agents such as self-driving cars must detect and localize objects of interest such as traffic participants to operate safely and effectively. Today, building such a detector requires the expensive and laborious annotation of millions of boxes over thousands of images. This process is so expensive that the largest detection dataset is orders of magnitude smaller than classification datasets and has far fewer classes. The limited set of classes further runs the risk of missing important object categories (e.g. snowplows for self-driving applications).</p><p>These concerns have motivated research into unsupervised object detection techniques that automatically discover objects from unlabeled data <ref type="bibr">[6,</ref><ref type="bibr">54,</ref><ref type="bibr">55]</ref>. Under the hood, these techniques use self-supervised features to segment unlabeled images and produce candidate object annotations, which are then used to train a detector. However, while promising, these approaches often produce many uninteresting and irrelevant "objects" in cluttered scenes (e.g. buildings and roads) and over- or under-segment objects of interest (e.g. multiple detections partitioning a large bus or a single detection grouping a row of parked cars). These failures shouldn't be surprising: after all, how can a completely unsupervised feature representation encode which assortment of windows, doors and wheels belongs together as an object, and which objects are of interest?</p><p>In this paper, we argue that a key missing cue for addressing the aforementioned issues in unsupervised instance detection is motion. In a practical sense, objects of interest are commonly mobile objects that frequently move. For example, robots performing navigation tasks must plan their trajectories carefully around objects that can move of their own volition. 
Thus, we argue that if we see similar groups of pixels frequently move of their own volition in unlabeled videos (e.g. vehicles and pedestrians in driving videos), this is sufficient information for building a mobile object detector that can detect such instances in static frames.</p><p>The importance of motion as a perceptual cue (the Gestalt principle of common fate) is well known <ref type="bibr">[35]</ref>. Indeed, motion-based grouping is one of the first forms of grouping to appear developmentally in human infants <ref type="bibr">[45]</ref> and can bootstrap other grouping cues <ref type="bibr">[34]</ref>. There is also some prior work on using motion-based grouping in computer vision to produce (pseudo) ground-truth for feature learning <ref type="bibr">[36]</ref> and discovering isolated, salient objects <ref type="bibr">[10,</ref><ref type="bibr">11]</ref>. However, there is a big gap between motion segmentation and the kind of ground-truth we need for training a full-fledged instance-level object detector. First, motion segmentation produces a binary segmentation; this must be resolved into individual instances. Second, it only identifies moving objects and does not include objects (e.g. parked cars) that are static but mobile. Finally, motion segmentation only identifies nearby objects, since the pixel motion of faraway objects is too subtle to discern. Thus motion segmentation alone will still under-segment and miss many mobile objects: a problem for building object detectors.</p><p>Here we propose a new training scheme to address these challenges. Our approach (MOD-UV, a Mobile Object Detector learned from Unlabeled Videos only; Figure <ref type="figure">1</ref>) trains on unlabeled videos alone and produces a mobile object detector that can run on static frames. We first generate pseudo training labels from motion segmentation estimated by our prior unsupervised framework Dynamo-Depth <ref type="bibr">[47]</ref>. 
We then propose a new training scheme to address the challenges above, resulting in a final mobile object detector that detects 12&#215; more mobile objects than the initial motion segmentation.</p><p>We test MOD-UV on self-driving scenes but evaluate on a variety of datasets. Specifically, we compare to recent state-of-the-art unsupervised object detectors and demonstrate improvements across the board, with notable improvements in Box AR by 6.6 on Waymo Open <ref type="bibr">[46]</ref>, 4.9 on nuScenes <ref type="bibr">[4]</ref> and 6.2 on KITTI <ref type="bibr">[17]</ref>.</p><p>In sum, our contributions are:</p><p>1. We argue that motion as a cue is sufficient for unsupervised training of instance-level object detectors. 2. We propose a new training scheme that trains on unlabeled videos to produce a mobile object detector that can run on static images. 3. We demonstrate marked improvements over unsupervised object detection baselines across a range of datasets and metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Unsupervised Object Detection/Discovery from Images. Learning to identify and localize objects from unlabeled images is a challenging task, since object information must be obtained without any explicit human annotations.</p><p>A long line of work seeks to discover prominent objects in large image collections <ref type="bibr">[9,</ref><ref type="bibr">[51]</ref><ref type="bibr">[52]</ref><ref type="bibr">[53]</ref>. However, these approaches are fundamentally limited by the quality of the object proposals. More recently, Locatello et al. <ref type="bibr">[30]</ref> and DINOSAUR <ref type="bibr">[38]</ref> consider object discovery as object-centric learning <ref type="bibr">[18]</ref> and decompose a complex scene into independent objects. Nevertheless, the reconstruction objective is difficult to scale and likely to discover irrelevant patches as well.</p><p>More recent work has relied on the fact that bottom-up segmentation algorithms, when applied to self-supervised pretrained representations, yield good object proposals. Specifically, pseudo mask labels can be generated from DINO <ref type="bibr">[7]</ref> features to train downstream object detectors <ref type="bibr">[41,</ref><ref type="bibr">42,</ref><ref type="bibr">54]</ref>. MaskDistill <ref type="bibr">[50]</ref> extends this by distilling from the affinity graph produced by DINO <ref type="bibr">[7]</ref> features, while TokenCut <ref type="bibr">[56]</ref> and CutLER <ref type="bibr">[55]</ref> use Normalized Cuts <ref type="bibr">[39]</ref>. HASSOD <ref type="bibr">[6]</ref> leverages hierarchical adaptive clustering, which improves the detection of small objects and object parts. This line of work now produces detectors that can run on static images, similar to our work. However, the detected objects can often be irrelevant (e.g. buildings and road) or over-/under-segment objects of interest. 
In contrast, our proposed MOD-UV discovers and detects a more meaningful and practical set of mobile objects instead, and can learn from their apparent motion in unlabeled videos only without relying on any additional datasets.</p><p>Unsupervised Object Detection/Discovery from 3D. In addition to unlabeled images, 3D information is also useful for discovering objects. Herbst et al. <ref type="bibr">[21]</ref> and MODEST <ref type="bibr">[62]</ref> discover non-persistent objects via multiple traversals with 3D sensors. Garcia et al. <ref type="bibr">[16]</ref> discover salient objects by late-fusing color and depth segmentation from RGB-D inputs, while Tian et al. <ref type="bibr">[49]</ref> generate candidate segments from LiDAR 3D point clouds. In comparison, MOD-UV does not require any additional sensors or modalities beyond unlabeled videos, which allows our method to work in more general settings.</p><p>Unsupervised Object Detection/Discovery from Videos. Inspired by the Gestalt principle of common fate <ref type="bibr">[35]</ref>, another class of related work discovers objects via their apparent motions observed in videos <ref type="bibr">[31,</ref><ref type="bibr">59,</ref><ref type="bibr">61]</ref>. By leveraging optical flow information from an input video, a binary segmentation of the moving objects can be extracted <ref type="bibr">[25,</ref><ref type="bibr">26,</ref><ref type="bibr">36,</ref><ref type="bibr">43,</ref><ref type="bibr">57,</ref><ref type="bibr">60,</ref><ref type="bibr">63,</ref><ref type="bibr">64]</ref>. Lian et al. <ref type="bibr">[28]</ref> propose further improvements for cases of articulated/deformable objects and shadows/reflections by relaxing the common fate assumption. Another line of work uses a reconstruction objective to identify the moving object <ref type="bibr">[1,</ref><ref type="bibr">48]</ref>. Du et al. <ref type="bibr">[13]</ref> model explicit object geometry and physical dynamics by exploiting motion cues. 
Bao et al. <ref type="bibr">[2]</ref> improve training for object-centric representation via an additional motion segmentation regularization, while SAVi++ <ref type="bibr">[14]</ref> incorporates LiDAR data when training an object-centric video model. Unlike MOD-UV, these approaches do not build a static image detector. However, the output segmentation can be used as an initialization for our approach.</p><p>Closer to our work, Pathak et al. <ref type="bibr">[36]</ref>, Croitoru et al. <ref type="bibr">[11]</ref> and Choudhury et al. <ref type="bibr">[10]</ref> train a single-frame binary segmentation network on video frames as input and leverage object motion as supervision. Furthermore, LOCATE <ref type="bibr">[44]</ref> applies graph-cut to obtain a binary motion mask from DINO <ref type="bibr">[7]</ref> and optical flow feature similarities, which in turn is treated as the pseudo-label for bootstrapped self-training of a downstream segmentation network. However, these techniques can only detect a single salient object per frame. In contrast, MOD-UV generalizes to multi-object detection beyond single-object saliency detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Method</head><p>Problem setup: We assume an uncurated collection of unlabeled videos as input. In particular, we assume that these videos are obtained by an embodied agent observing, and optionally acting in the world. Solely from the unlabeled videos, the goal is to learn a detector that operates from a single frame and can detect and segment all mobile objects that can move of their own volition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Initialization with Unsupervised Motion Segmentation</head><p>A key insight in MOD-UV is that if an object can move, it is likely that it does move many times in the collected data. Thus, we start by identifying moving objects in the videos; they can be initial seeds for learning about mobile objects.</p><p>Fortunately, the task of identifying independently moving pixels from unlabeled videos is a well-studied one <ref type="bibr">[3,</ref><ref type="bibr">37,</ref><ref type="bibr">40,</ref><ref type="bibr">58]</ref>. In particular, many recent techniques have been proposed that learn motion segmentation without supervision from unlabeled videos. Many of these techniques also produce depth and camera motion <ref type="bibr">[23,</ref><ref type="bibr">27,</ref><ref type="bibr">32,</ref><ref type="bibr">37]</ref>. Here, we use our prior work, Dynamo-Depth <ref type="bibr">[47]</ref>. Dynamo-Depth trains on unlabeled videos and learns both a monocular depth estimator as well as a motion segmentation network. We use the outputs of these trained networks on our unlabeled videos as a starting point. Concretely, we denote the input set of unlabeled videos as {v i }, with each video v i containing consecutive frames I 1 , . . . , I n and known camera intrinsics. For each frame I i , we obtain its estimated motion mask m i and estimated monocular depth d i .</p><p>With the given binary motion mask m i , we first need to partition the moving pixels into instance-level labels. While disjoint moving regions can be easily separated, multiple moving objects in the same region would require additional information (e.g. 3D information) to separate. Therefore, for each image I i , we project the corresponding moving pixels in m i into pseudo 3D point clouds P i via the estimated monocular depth d i and inverse camera intrinsics K -1 . 
<ref type="foot">1</ref></p><p>Then, we cluster P i via DBSCAN <ref type="bibr">[15]</ref> to get a pseudo depth-aware partition of the motion mask m i , which we treat as the initial pseudo-labels, L (0) i . We evaluate the quality of these pseudo-labels qualitatively in Figure <ref type="figure">2</ref> and quantitatively in the top rows of Table <ref type="table">4</ref>. We find that these pseudo-labels have high precision, but have two severe limitations. First, they only identify moving objects, so they miss objects that are static but can move (e.g. parked cars). Second, they miss almost all faraway objects, which tend to be small. This is because the apparent pixel motion of faraway objects is very hard to detect.</p><p>To tackle this issue of limited recall, we propose two self-training stages, Moving2Mobile and Large2Small, that progressively recover more mobile objects in the scene by aligning the training distribution of static and small objects, respectively, with the available large moving objects in the initial pseudo-labels.</p></div>
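<div xmlns="http://www.tei-c.org/ns/1.0"><p>The back-projection step of Section 3.1 can be illustrated with a minimal sketch. It assumes a skew-free pinhole camera, so that applying the inverse intrinsics to a homogeneous pixel has a closed form; the function name and the parameters fx, fy, cx, cy are illustrative and not taken from the released code. The clustering step (DBSCAN in the text) is not shown; the returned points would be passed to a clustering routine.</p>
<p><![CDATA[
```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_moving_pixels(
    mask: Sequence[Sequence[int]],
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
) -> List[Point3D]:
    """Lift every moving pixel (mask value 1) to a pseudo 3D point.

    Assumption: skew-free pinhole intrinsics, so K^-1 [u, v, 1]^T equals
    ((u - cx) / fx, (v - cy) / fy, 1); scaling by depth gives the 3D point.
    """
    points: List[Point3D] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (is_moving, d) in enumerate(zip(mask_row, depth_row)):
            if is_moving:
                points.append((d * (u - cx) / fx, d * (v - cy) / fy, d))
    return points


# Toy 2x3 frame with two moving pixels at different depths.
mask = [[0, 1, 0],
        [0, 0, 1]]
depth = [[5.0, 2.0, 5.0],
         [5.0, 5.0, 4.0]]
points = backproject_moving_pixels(mask, depth, fx=2.0, fy=2.0, cx=1.0, cy=0.0)
```
]]></p>
<p>Because the third coordinate is the estimated depth, two moving regions that touch in the image but lie at different depths end up far apart in this point cloud, which is what lets a density-based clustering separate them.</p></div>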
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Self-Training for Unsupervised Mobile Object Detection</head><p>Here, we describe each self-training stage of MOD-UV, as we progressively discover mobile objects to train the final mobile object detector.</p><p>Moving2Mobile: Learning to Detect Static Objects. As shown on the left of Figure <ref type="figure">2</ref>, the initial pseudo-labels L (0) i , while having high precision, fail to capture the large static objects, e.g. the black sedan at the bottom. However, a parked black sedan looks the same as a moving black sedan if all one has is a single frame. In other words, moving and static objects are indistinguishable when observed from a single frame.</p><p>Thus, in Moving2Mobile, we simply train a detector to reproduce the pseudo-labeled instances in L (0) i , but with only a single frame as input. Since object motion is not apparent in a single frame, this detector cannot distinguish moving objects from static ones and is thus forced to detect anything that shares the appearance of moving objects, including static mobile objects.<note place="foot" n="1">-&#8594; &#8226; denotes the conversion to homogeneous coordinates.</note></p><p>However, there may exist domain-specific statistical regularities that give hints to object motion even in a single frame. For example, lit-up tail-lights might indicate that the car is stopped, while a highway background might suggest that the cars are moving. To prevent the detector from overfitting to these priors, we stop training early. Afterwards, we treat the high-confidence predictions of the detector as the pseudo-labels for the next round, L (1) i .</p><p>Large2Small: Learning to Detect Small Objects. As shown in Figure <ref type="figure">2</ref>, L (1) i , the pseudo-labels after Moving2Mobile, appropriately recover the large static objects; however, the smaller objects remain absent. 
Intuitively, faraway objects have much smaller apparent pixel motion (and thus are absent from L (0) i ) and also look different from large moving objects (and thus are absent from L (1) i ). To learn to detect small objects, we create a new training dataset by scaling down both the image and the pseudo-labels (while also padding the image to maintain image size). We then train a separate "small object" detector on this new dataset. Intuitively, by training on the scaled-down training pair, the detector must detect the same object at a much smaller scale, directly promoting the extension to small objects. Also, since L (1) i came from a heavily-regularized detector, we maintain and finetune the pseudo-labels for larger objects by training a second detector from scratch in parallel, on the training pair (I i , L (1) i ) without down-scaling or padding. Notably, this is different from traditional scale jittering, since a single detector would be discouraged from detecting small objects at larger scales due to the limitations in L (1) i . Upon convergence, we have a Large-object detector trained at original scale and a Small-object detector trained at a reduced scale. After aggregating their predictions and resolving conflicting proposals, we obtain the final pseudo-labels, L (2) i . We note that separating out large and small object detectors in this way has been explored in supervised face detection <ref type="bibr">[22]</ref>.</p><p>L (2) i , the final pseudo-labels after Large2Small, successfully recover both static and small objects without introducing excessive false positives. From here, we train the final detector from scratch on the training pair (I i , L (2) i ) to convergence.</p></div>
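<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of how the scaled-down training pair for the small-object detector could be built, the sketch below rescales box pseudo-labels and computes the padding needed to restore the original canvas size. It assumes that the shrunken image is placed at the top-left corner of the canvas, which the text does not specify; the function names are hypothetical, and masks would be resized in the same way as the image.</p>
<p><![CDATA[
```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def downscale_boxes(boxes: List[Box], scale: float) -> List[Box]:
    """Scale pseudo-label boxes by `scale` about the top-left image corner.

    Assumption: the shrunken image is pasted at the top-left of a canvas of
    the original size, so no translation offset is added to the boxes.
    """
    if not 0.0 < scale <= 1.0:
        raise ValueError("scale must lie in (0, 1]")
    return [(x0 * scale, y0 * scale, x1 * scale, y1 * scale)
            for x0, y0, x1, y1 in boxes]


def padding_for(width: int, height: int, scale: float) -> Tuple[int, int]:
    """Right and bottom padding that restores the original canvas size."""
    return width - round(width * scale), height - round(height * scale)


# A large pseudo-labeled object becomes a small one at the reduced scale.
boxes_small = downscale_boxes([(400.0, 200.0, 800.0, 600.0)], scale=0.25)
pad_right, pad_bottom = padding_for(1000, 800, scale=0.25)
```
]]></p>
<p>With a scale of 0.25, a 400-pixel-wide pseudo-label becomes a 100-pixel-wide training target, so the small-object detector sees the appearance of large moving objects at the size at which faraway objects appear.</p></div>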
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Implementation details</head><p>We follow the official code release by Dynamo-Depth <ref type="bibr">[47]</ref> and train the system on Waymo Open <ref type="bibr">[46]</ref>. During initial pseudo-label generation, we binarize the estimated motion mask via a threshold of 0.1 and cluster the pseudo 3D points P i via DBSCAN <ref type="bibr">[15]</ref> using a 10-by-10 local pixel neighborhood connectivity.</p><p>We adopt Mask R-CNN <ref type="bibr">[19]</ref> with a ResNet-50 <ref type="bibr">[20]</ref> backbone as the detector architecture. We initialize the backbone via two strategies, namely MoCo v2 <ref type="bibr">[8]</ref> on randomly sampled Waymo <ref type="bibr">[46]</ref> patches and MoCo v2 on ImageNet <ref type="bibr">[12]</ref>, denoted as MOD-UV &#8225; and MOD-UV, respectively. We use the Adam optimizer <ref type="bibr">[24]</ref> with an initial learning rate of 1e-4, decayed by 1/2 after 10 epochs. During Moving2Mobile, we train a detector for 3 epochs, with scale jittering from 0.5 to 1.0. Since early stopping is applied, we adopt a lower confidence threshold of 0.5 to compute the next-round pseudo-labels L (1) i . During Large2Small, we train both the large and small detectors for 20 epochs, with fixed scaling at 1.0 and 0.25, respectively. As both are trained to convergence, we adopt higher confidence thresholds of 0.9 and 0.8, respectively, to compute the next-round pseudo-labels L (2) i with aggregation. For the final round, we train the detector from scratch for 20 epochs, with scale jittering from 0.5 to 1.0. The self-training in MOD-UV takes 27 hours on 1 NVIDIA A6000 GPU.</p></div>
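<div xmlns="http://www.tei-c.org/ns/1.0"><p>For quick reference, the per-stage settings listed above can be collected in one place. The dictionary layout and key names below are illustrative and do not mirror the released configuration files. The learning-rate function assumes a single halving once 10 epochs have completed, which is one reading of "decay by 1/2 after 10 epochs".</p>
<p><![CDATA[
```python
from typing import Dict, Optional, Tuple, TypedDict


class StageConfig(TypedDict):
    epochs: int
    scale_range: Tuple[float, float]   # (min, max) image scale during training
    score_threshold: Optional[float]   # confidence cut for next-round labels


# Values as stated in the implementation details; key names are hypothetical.
STAGES: Dict[str, StageConfig] = {
    "moving2mobile": {"epochs": 3, "scale_range": (0.5, 1.0), "score_threshold": 0.5},
    "large2small_large": {"epochs": 20, "scale_range": (1.0, 1.0), "score_threshold": 0.9},
    "large2small_small": {"epochs": 20, "scale_range": (0.25, 0.25), "score_threshold": 0.8},
    "final": {"epochs": 20, "scale_range": (0.5, 1.0), "score_threshold": None},
}


def learning_rate(epoch: int, base_lr: float = 1e-4) -> float:
    """Assumed schedule: one halving after the first 10 epochs (0-indexed)."""
    return base_lr if epoch < 10 else base_lr * 0.5
```
]]></p>
<p>Under this reading, the 3-epoch Moving2Mobile stage never reaches the decay, which is consistent with its role as an early-stopped, heavily regularized stage.</p></div>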
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Setup</head><p>Datasets. For evaluation, we focus our attention on self-driving datasets, since uncurated, unlabeled video data is available and detecting mobile objects is of interest for autonomous vehicles. We train both MOD-UV &#8225; and MOD-UV on Waymo <ref type="bibr">[46]</ref> and compare our performance with baselines on Waymo. We then also evaluate generalization to nuScenes <ref type="bibr">[4]</ref>, KITTI <ref type="bibr">[17]</ref>, and COCO <ref type="bibr">[29]</ref>.</p><p>We split the 798 sequences from the Waymo train set into 762 for training and 36 for validation. After method development concludes, we evaluate on the held-out 1,881 test images (averaging 28.4 mobile instances per image) from the Waymo val set <ref type="bibr">[33]</ref>. Additionally, nuScenes, KITTI, and COCO are only used for evaluating generalization. We test on 3,249 front-camera images (average of 8.2 mobile instances per image) from the nuImages validation set for nuScenes and 7,481 images (average of 6.9 mobile instances per image) in the 2D Detection training set for KITTI. For COCO, we evaluate on 870 images (average of 3.8 mobile instances per image) in COCO val 2017 that contain ground vehicles.</p><p>Baselines. For the task of unsupervised mobile object detection, there are no directly comparable baselines to the best of our knowledge. Therefore, we consider methods for unsupervised object detection, namely CutLER <ref type="bibr">[55]</ref> and HASSOD <ref type="bibr">[6]</ref>, as the closest points of comparison.</p><p>CutLER <ref type="bibr">[55]</ref> uses normalized cuts on DINO features (trained on ImageNet) to generate pseudo-labels that are used to train a Cascade Mask R-CNN <ref type="bibr">[5]</ref> with a ResNet-50 backbone. We evaluate the official checkpoint. 
Furthermore, we consider an additional baseline where L (1) i is directly predicted via CutLER, which effectively ablates the use of motion cues, as denoted by CutLER L2S .</p><p>HASSOD <ref type="bibr">[6]</ref> is a follow-up to CutLER. It discovers objects on COCO <ref type="bibr">[29]</ref> via a hierarchical adaptive clustering of DINO features (trained on ImageNet). The hierarchical clustering yields three "levels": objects, object parts, and object sub-parts. For consistency, we consider all three hierarchical levels for evaluation. As before, we use its officially released checkpoint. Since MOD-UV &#8225; is trained on Waymo Open <ref type="bibr">[46]</ref>, we also consider a version of HASSOD solely trained on Waymo, which we denote as HASSOD &#8224; .</p><p>CutLER <ref type="bibr">[55]</ref> and HASSOD <ref type="bibr">[6]</ref> are trained to detect all objects in an image regardless of their ability to move. However, our evaluation and approach are focused on mobile objects. We therefore also consider an "oracle" version of these baselines where we additionally remove any CutLER and HASSOD predictions that overlap by less than 0.1 in IoU with the ground-truth instances. These oracles, namely CutLER * , CutLER L2S * , HASSOD * , and HASSOD &#8224; * , are grayed out in the tables to indicate additional ground-truth-based filtering.</p><p>We also consider a fully-supervised Mask R-CNN (Sup. Mask R-CNN) trained on COCO <ref type="bibr">[29]</ref> for an oracle comparison, marked in gray.</p><p>Metrics. We evaluate both Average Recall (AR Box 100 and AR Mask 100 ) and Average Precision (AP Box and AP Mask ). Since the task is unsupervised and no semantic information is given during training, we follow prior art <ref type="bibr">[6,</ref><ref type="bibr">55]</ref> and evaluate class-agnostic AR and AP by treating all predicted and ground-truth instances as a single class of "foreground" or "mobile" objects.</p></div>
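<div xmlns="http://www.tei-c.org/ns/1.0"><p>The ground-truth-based filtering used for the oracle baselines can be sketched as follows. The sketch assumes axis-aligned boxes and box IoU; the text does not say whether box or mask IoU is used, and the function names are illustrative. A prediction is kept when its best IoU with any ground-truth instance reaches the 0.1 threshold.</p>
<p><![CDATA[
```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def oracle_filter(preds: List[Box], gts: List[Box], min_iou: float = 0.1) -> List[Box]:
    """Drop predictions whose IoU with every ground-truth box is below min_iou."""
    return [p for p in preds if any(box_iou(p, g) >= min_iou for g in gts)]


ground_truth = [(0.0, 0.0, 10.0, 10.0)]
predictions = [
    (0.0, 0.0, 10.0, 10.0),    # exact hit, IoU 1.0
    (5.0, 0.0, 15.0, 10.0),    # partial overlap, IoU 1/3
    (8.0, 8.0, 20.0, 20.0),    # sliver of overlap, IoU 4/240
    (50.0, 50.0, 60.0, 60.0),  # unrelated region, IoU 0
]
kept = oracle_filter(predictions, ground_truth)
```
]]></p>
<p>The low threshold only discards predictions on regions unrelated to any annotated mobile object, so poorly localized detections still count against the baseline's precision.</p></div>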
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Unsupervised Mobile Object Detection and Segmentation</head><p>In-Domain Performance on Waymo. Table <ref type="table">1</ref> compares MOD-UV &#8225; and MOD-UV against CutLER, HASSOD, and HASSOD &#8224; in unsupervised mobile object detection on Waymo. We also report performance for a Mask R-CNN trained on COCO as an oracle in the first row of Table <ref type="table">1</ref>.</p><p>Recall. We report AR 0.5 (average recall with IoU = 0.5), AR, AR S , AR M , and AR L for both box and mask predictions. MOD-UV significantly outperforms prior art across all recall metrics except the recall for large objects, where it is comparable. The improvement is especially large for small objects (4.7&#215; higher AR Box S than the nearest competitor). Compared to a supervised Mask R-CNN trained on COCO <ref type="bibr">[29]</ref>, MOD-UV closes the gap in Box AR S from 11.6 to 4.8. Our gains are also much larger on the AR 0.5 metric (nearly 2&#215; the prior state of the art). This suggests that we detect significantly more objects than prior work, but their localization can be improved. Even so, we still show a 6-point improvement on overall AR. We also found HASSOD to underperform when trained solely on Waymo, which we suspect is due to the uncurated nature of self-driving scenes.</p><p>Precision. We report AP at an overlap threshold of 0.5, as well as AP, AP S , AP M and AP L . Since CutLER and HASSOD are trained to detect all objects in an image regardless of their ability to move, we also compare to oracle versions of these techniques with ground-truth-based filtering. On Waymo, MOD-UV significantly outperforms prior art across all precision metrics. Even with ground-truth filtering (in gray), MOD-UV still consistently outperforms baselines on all precision metrics (except for larger objects, where it is comparable). Specifically, MOD-UV outperforms the nearest competitor (with ground-truth filtering) by . 
Intriguingly, compared to a supervised Mask R-CNN trained on COCO <ref type="bibr">[29]</ref>, MOD-UV closes the gap in Mask AP S from 4.3 to 0.8 points.</p><p>Generalization to Out-of-Domain Data. We next take our detector trained on Waymo, and apply it out of the box on nuScenes, KITTI, and COCO.</p><p>Recall. As shown in Table <ref type="table">2</ref> and Table <ref type="table">3</ref>, on nuScenes and KITTI, MOD-UV consistently outperforms prior art across all AR metrics except for large objects, achieving a more than 1.5&#215; improvement on AR 0.5 over the nearest competitor on nuScenes. MOD-UV also shows large gains on small objects, improving AR Box S by 2.4&#215; on nuScenes and over 1.7&#215; on KITTI.</p><p>Finally, we evaluate on COCO, which is in-domain for HASSOD and a big domain shift for MOD-UV. Notably, MOD-UV maintains superiority on AR S and AR at IoU = 0.5, while being comparable to HASSOD on AR and AR M .</p><p>Precision. This improvement is also seen in AP. On both nuScenes and KITTI, MOD-UV consistently outperforms the baselines on all AP metrics, except for AP L , where it is comparable to prior art with ground-truth filtering. Specifically, MOD-UV improves upon HASSOD * (with ground-truth filtering) on Box AP by 1.2 on nuScenes and 2.6 on KITTI, with notable improvements on AP Box S by over 1.8&#215; on nuScenes and 2.6&#215; on KITTI. Even on COCO, MOD-UV outperforms prior art without ground-truth filtering on all metrics except AP L .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Qualitative Results</head><p>In Figure <ref type="figure">3</ref>, we show qualitative examples of MOD-UV against CutLER and HASSOD after ground-truth filtering, which we denote by CutLER * and HASSOD * , respectively. In addition, we highlight the regions containing small objects with an additional zoom-in. Without using any annotations, MOD-UV detects mobile objects accurately, recovering many more small and faraway objects than prior art. In contrast, due to their reliance on image features from static images, both CutLER and HASSOD tend to group multiple objects into a single proposal (seen in the second row in Waymo).</p><p>Notably, MOD-UV reliably detects static and small mobile objects in the scene without an excessive number of false positives. This improvement mostly originates from the proposed Moving2Mobile and Large2Small stages (see ablation below).</p><p>Beyond accurate detection and segmentation on Waymo, MOD-UV demonstrates impressive generalization when applied to nuScenes, KITTI, and COCO.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Ablation Study</head><p>We conduct ablation studies with MOD-UV &#8225; trained solely on Waymo <ref type="bibr">[46]</ref> to understand the effects of each proposed component, including pseudo-label generation, static object discovery, small object discovery, and final round training.</p><p>Motion Cues for Initial Pseudo-Labels. As shown in Table <ref type="table">1</ref>, there is a consistent AR improvement for small and medium objects in CutLER L2S over CutLER. This highlights the effectiveness of our Large2Small strategy in improving detector performance on small objects, which is a challenge for all unsupervised detectors/object discovery techniques because of the limited signal on small objects. Despite this gain, MOD-UV still outperforms CutLER L2S because motion offers a stronger cue to separate small objects that appear close to each other in pixel space, as shown in Figure <ref type="figure">3</ref>.</p><p>Table <ref type="table">4</ref>: Ablation Study on the processing of pseudo-labels with MOD-UV &#8225; . We report pseudo-label mask quality in terms of AR Mask on the training set of Waymo Open <ref type="bibr">[46]</ref>.</p><p>Pseudo-Label Generation. In Table <ref type="table">4</ref>, we measure the quality of the pseudo-labels in terms of AR for All, the Static, and the Moving instances.</p><p>In the top of Table <ref type="table">4</ref>, compared to using 2D contours to generate pseudo-labels, clustering pseudo 3D points from monocular depth estimations improves Moving AR L by 9.4 and Moving AR by 2.5. This underlines the benefit of leveraging 3D information for partitioning moving instances. Nevertheless, the Static AR is notably smaller, with Static AR L only up to 10% of Moving AR L . Also, small and medium objects are almost entirely missed, with All AR S at 0.0 and All AR M at 0.5, compared to All AR L at 10.0. 
These observations further verify the bias pointed out in Section 3.1, where static and small objects are mostly absent in the initial pseudo-labels generated from motion segmentation.</p><p>Furthermore, in the bottom half of Table <ref type="table">4</ref>, we demonstrate the effectiveness of self-training in the Moving2Mobile stage in recovering static objects from the initial pseudo-labels with a varying number of training epochs. It is worth noting that regardless of convergence, self-training successfully improves Static AR, with improvements as high as 23.4 on Static AR L . Interestingly, as the initial pseudo-labels contain mostly moving objects, additional training beyond 3 epochs shows clear degradation in performance (reducing Static AR L by 5.0 while retaining Moving AR L ), as the trained detector overfits to moving instances by exploiting contextual priors. In addition to improvements on All AR L , the Moving2Mobile stage also slightly raises All AR M , by up to 4.1. Nevertheless, with no improvements on AR S , it is clear that the Moving2Mobile stage alone cannot alleviate the negative bias from motion segmentation.</p><p>Self-Training Pipeline. In Table <ref type="table">5</ref>, we evaluate every combination of the 3-stage self-training pipeline with MOD-UV &#8225; to measure the contribution of each stage. Here, the Moving2Mobile stage is again shown to be essential for recovering static objects from the initial pseudo-labels. When Moving2Mobile is ablated, performance decreases across all combinations, with notable reductions in AR by 4.9 and AP by 3.9 when it alone is ablated from MOD-UV &#8225; .</p><p>Additionally, when Large2Small is ablated, AR S reduces by nearly 4&#215; and AP S by 3&#215;, underlining its importance for small object detection. 
Lastly, the final self-training round effectively learns the aggregated proposals from Large2Small, improving Box AR by 8.1 over the large-object detector trained at the original scale and by 4.1 over the small-object detector trained at reduced scale.</p></div>
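<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the AR breakdown used in this ablation concrete, the following is a minimal sketch of how recall could be averaged over IoU thresholds and restricted to a subset of ground-truth instances (e.g. only Moving or only Static ones). It is an illustration under stated assumptions rather than the evaluation code used for the tables: the threshold range 0.50 to 0.95 follows the common COCO convention, each ground-truth instance is scored by its best-overlapping prediction without one-to-one matching, and the function name average_recall and the subset flags are hypothetical.</p>
<p><![CDATA[
```python
import numpy as np

# IoU thresholds 0.50, 0.55, ..., 0.95 (COCO-style averaging; an assumption here).
IOU_THRESHOLDS = np.arange(50, 100, 5) / 100.0


def average_recall(iou, subset=None, thresholds=IOU_THRESHOLDS):
    """Average recall of ground-truth instances over IoU thresholds.

    iou:    (num_gt, num_pred) array of mask IoUs.
    subset: optional boolean array of length num_gt selecting which
            ground-truth instances to score (e.g. only moving ones).
    """
    iou = np.asarray(iou, dtype=float)
    # Best-overlapping prediction for every ground-truth instance
    # (simplification: no one-to-one matching is enforced).
    best = iou.max(axis=1) if iou.shape[1] > 0 else np.zeros(iou.shape[0])
    if subset is not None:
        best = best[np.asarray(subset, dtype=bool)]
    if best.size == 0:
        return 0.0
    # Fraction of instances recalled at each threshold, then averaged.
    recalls = [(best >= t).mean() for t in thresholds]
    return float(np.mean(recalls))
```
]]></p>
<p>Calling average_recall once with no subset and once each with hypothetical per-instance Moving and Static flags would yield the three kinds of columns discussed above.</p></div>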
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We argue that motion is an important cue for unsupervised object detection, and propose the task of unsupervised mobile object detection. We propose a new training pipeline, MOD-UV, that bootstraps from motion segmentation but removes its bias by discovering static and small objects. MOD-UV achieves significant improvement over prior self-supervised detectors on multiple datasets.</p><p>Limitations. Our work assumes that all mobile objects move often in the given unlabeled video dataset. Although MOD-UV can ideally learn and detect all mobile objects, in practice, the learning-based framework can only learn and detect things that frequently move in the videos. That said, for general applications where autonomous agents can manipulate their surroundings, an agent can still learn to detect rarely moving objects by interacting with its environment, e.g. poking at static objects within reach.</p><p>Societal Impact. Being an unsupervised detection framework, our work does not introduce any negative societal impacts beyond those of object detection itself.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Implementation Details</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 Mask Aggregation</head><p>When aggregating the predictions of the "large object" detector, M L , and the "small object" detector, M S , we found that M S contains mostly parts of large objects and small objects, while M L contains mostly large objects and groups of small objects. Intuitively, NMS is less suited for this aggregation task due to the presence of object parts and groups. Thus, we implement our aggregation as shown in Algo. 1. Here, we first filter out smaller overlapping masks in M L (e.g. ones that are covered by larger masks by more than a filtFrac of 0.75), as the smaller objects should be found via M S instead. We also filter out larger overlapping masks in M S with the same filtFrac, as the larger objects should be found via M L instead.</p><p>After directly matching the masks in M L and M S with a matchThrd of 0.5, if a subset of proposals in M S sufficiently covers a mask in M L (over a coverFrac of 0.5), then we consider the large proposal to likely be a group of instances and only keep the subset. Otherwise, we consider the subset to likely be object parts and keep the large proposal.</p><p>As shown in Table <ref type="table">8</ref>, our aggregation approach improves upon NMS with a matchThrd of 0.5 by 1.8 in AR 0.5 and 1.5 in AP 50 .</p></div>
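<div xmlns="http://www.tei-c.org/ns/1.0"><p>The aggregation steps described in A.1 can be sketched as follows. This is a hedged reading of the prose, not a reproduction of Algo. 1 or of the released code: masks are assumed to be boolean NumPy arrays of equal shape, a directly matched pair is assumed to keep the M L copy, and the candidate subset for a large mask is assumed to be the small masks lying mostly inside it (reusing filtFrac for that test). All function names are hypothetical.</p>
<p><![CDATA[
```python
import numpy as np


def area(m):
    return int(m.sum())


def inter(a, b):
    return int(np.logical_and(a, b).sum())


def iou(a, b):
    i = inter(a, b)
    u = area(a) + area(b) - i
    return i / u if u else 0.0


def drop_covered_small(masks, filt_frac):
    """Within one set, drop masks covered > filt_frac by a larger mask."""
    return [m for i, m in enumerate(masks)
            if not any(j != i and area(o) > area(m)
                       and inter(m, o) > filt_frac * area(m)
                       for j, o in enumerate(masks))]


def drop_covering_large(masks, filt_frac):
    """Within one set, drop masks covering > filt_frac of a smaller mask."""
    return [m for i, m in enumerate(masks)
            if not any(j != i and area(o) < area(m)
                       and inter(m, o) > filt_frac * area(o)
                       for j, o in enumerate(masks))]


def aggregate(masks_l, masks_s, filt_frac=0.75, match_thrd=0.5, cover_frac=0.5):
    """Merge large-object (M_L) and small-object (M_S) mask proposals."""
    large = drop_covered_small([np.asarray(m, bool) for m in masks_l], filt_frac)
    small = drop_covering_large([np.asarray(m, bool) for m in masks_s], filt_frac)
    keep, dropped = [], set()  # `dropped` holds indices into `small`
    for big in large:
        matched = [k for k, s in enumerate(small) if iou(big, s) >= match_thrd]
        if matched:
            # Same object found by both detectors: keep one copy
            # (assumption: the M_L mask).
            dropped.update(matched)
            keep.append(big)
            continue
        # Assumption: the candidate subset is the small masks mostly inside `big`.
        inside = [k for k, s in enumerate(small)
                  if k not in dropped and inter(big, s) > filt_frac * area(s)]
        union = np.zeros_like(big)
        for k in inside:
            union |= small[k]
        if inter(big, union) > cover_frac * area(big):
            continue  # `big` is likely a group: keep only the small subset
        dropped.update(inside)  # subset is likely object parts: keep `big`
        keep.append(big)
    keep.extend(s for k, s in enumerate(small) if k not in dropped)
    return keep
```
]]></p>
<p>The default arguments mirror the filtFrac, matchThrd, and coverFrac values quoted in the text. Unlike NMS, the decision for a large proposal depends on how much of it the small proposals jointly cover, which is what lets the sketch tell groups apart from object parts.</p></div>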
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Evaluation Datasets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Waymo Open Dataset</head><p>We evaluate performance on Waymo using all images in the val set of the Waymo Open Dataset <ref type="bibr">[46]</ref>. We obtain the instance-level object masks from the panoptic annotations <ref type="bibr">[33]</ref>, and treat the following object categories as mobile: car, truck, bus, other_vehicle, bicycle, motorcycle, trailer, pedestrian, bicyclist, motorcyclist, bird, and ground_animal.</p><p>In total, there are 1,881 test images with 53,387 mobile instances labeled.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 nuScenes Dataset</head><p>We evaluate performance on nuScenes using all FRONT camera images in the val set of nuImages <ref type="bibr">[4]</ref>. We consider all object categories under the super-categories animal, human, and vehicle to be mobile. In total, there are 3,249 test images with 26,618 mobile instances labeled.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 KITTI Dataset</head><p>We evaluate performance on KITTI via all images in the train set of KITTI 2D Detection Dataset <ref type="bibr">[17]</ref>. We consider all object categories labeled in the dataset to be mobile. In total, there are 7,481 test images with 51,865 mobile instances labeled. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 COCO Dataset</head><p>We evaluate performance on COCO via all images that contain street vehicles in the val2017 set <ref type="bibr">[29]</ref>. We consider the following object categories to be mobile: car, truck, bus, bicycle, motorcycle.</p><p>In total, there are 870 test images with 3,319 mobile instances labeled. For reference, the original val2017 set contains 5,000 images with 36,781 labeled instances (regardless of mobility).</p></div>
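<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the subset selection described in B.4, the sketch below filters a COCO-format annotation dictionary (with the standard images, annotations, and categories keys) down to the listed street-vehicle categories. It is a minimal sketch rather than the authors' preprocessing script: the function name select_mobile_subset is hypothetical, and whether crowd annotations are counted is not stated in the text, so they are left in here.</p>
<p><![CDATA[
```python
# Category names treated as mobile in B.4.
MOBILE_NAMES = frozenset({"car", "truck", "bus", "bicycle", "motorcycle"})


def select_mobile_subset(coco, mobile_names=MOBILE_NAMES):
    """Return (image_ids, annotations) restricted to mobile categories.

    coco: dict in COCO annotation format with "categories" and
          "annotations" lists.
    """
    mobile_ids = {c["id"] for c in coco["categories"]
                  if c["name"] in mobile_names}
    # Keep only instances whose category is mobile.
    anns = [a for a in coco["annotations"] if a["category_id"] in mobile_ids]
    # An image is selected if it contains at least one mobile instance.
    image_ids = sorted({a["image_id"] for a in anns})
    return image_ids, anns
```
]]></p>
<p>Applied to the dictionary loaded from a val2017 instance annotation file, the lengths of the two returned lists correspond to the image and instance counts of the evaluated subset, up to the crowd-annotation caveat above.</p></div>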
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Additional Quantitative Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Complete Metrics</head><p>For additional performance metrics on Waymo Open <ref type="bibr">[46]</ref>, please refer to Table <ref type="table">6</ref>, which corresponds to Table <ref type="table">1</ref>. nuScenes <ref type="bibr">[4]</ref> results can be found in Table <ref type="table">7</ref> (corresponding to Table <ref type="table">2</ref>). Finally, KITTI <ref type="bibr">[17]</ref> and COCO <ref type="bibr">[29]</ref> results can be found in Table <ref type="table">9</ref> (corresponding to Table <ref type="table">3</ref>).</p><p>Interestingly, we found that motion segmentation trained from a reconstruction objective <ref type="bibr">[47]</ref> contains specific biases, such as the inclusion of object shadows and missing object parts due to smooth object regions. This results in localization errors and reduced performance for large objects at IoU&gt;0.75. As shown in Table <ref type="table">6</ref>, MOD-UV outperforms all prior art even for large objects at IoU=0.5 on Waymo Open.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Mask Aggregation</head><p>In Table <ref type="table">8</ref>, we evaluate the effects of different mask aggregation techniques on the performance of the final detector. Notably, different strategies have minimal effects on Box AR and Box AP, which suggests that the mask aggregation step is robust. Still, we do observe a small performance gain for medium objects over the Non-Maximum Suppression (NMS) technique and a small performance gain for large objects over the use of lowered confidence thresholds, "0.5/0.5" instead of "0.9/0.8".</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3 Backbone Pre-training</head><p>In Table <ref type="table">10</ref>, we evaluate the effects of different backbone initializations on our proposed learning scheme. Since MOD-UV depends on multiple rounds of self-training, training detectors with a backbone initialized from scratch (&#8709;) reduces performance significantly. Interestingly, the choice of pre-training technique is less influential, as MOD-UV achieves similar performance across them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.4 Reproducibility and Hyperparameter Selection</head><p>To ensure reproducibility, we repeat the experiment for MOD-UV &#8225; 3 times with randomly generated seeds and obtain 95% confidence intervals of 16.9 &#177; 1.3 and 10.3 &#177; 1.5 for Box AR and Box AP, respectively, on Waymo Open. Please refer to Table <ref type="table">11</ref> for the standard deviation of each metric. The number of epochs for Moving2Mobile, the scale jittering rates, and the confidence thresholds for self-training were selected on a small Waymo validation set during the development of our paper, while the test set was held out entirely. The rest of the hyperparameters, including number of epochs, learning rates, and decay, were set arbitrarily and not tuned since training converged.</p></div></body>
		</text>
</TEI>
