<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>MultiBodySync: Multi-Body Segmentation and Motion Estimation via 3D Scan Synchronization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10285240</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Jiahui Huang</author><author>He Wang</author><author>Tolga Birdal</author><author>Minhyuk Sung</author><author>Federica Arrigoni</author><author>Shi-Min Hu</author><author>Leonidas Guibas</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We present MultiBodySync, a novel, end-to-end trainable multi-body motion segmentation and rigid registration framework for multiple input 3D point clouds. The two non-trivial challenges posed by this multi-scan multibody setting that we investigate are: (i) guaranteeing correspondence and segmentation consistency across multiple input point clouds capturing different spatial arrangements of bodies or body parts; and (ii) obtaining robust motion-based rigid body segmentation applicable to novel object categories. We propose an approach to address these issues that incorporates spectral synchronization into an iterative deep declarative network, so as to simultaneously recover consistent correspondences as well as motion segmentation. At the same time, by explicitly disentangling the correspondence and motion segmentation estimation modules, we achieve strong generalizability across different object categories. Our extensive evaluations demonstrate that our method is effective on various datasets ranging from rigid parts in articulated objects to individually moving objects in a 3D scene, be it single-view or full point clouds.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Motion analysis in dynamic point clouds is an emerging area, required by various applications such as surveillance, autonomous driving, and robotic manipulation. Our human-made environments are dominated by rigid body movements, ranging from articulated objects to solids like furniture or vehicles. These settings require us to address rigid motions of objects or object parts, a task often referred to as the multi-body motion estimation problem. Despite its importance, previous work has mainly focused on specific scenarios with known category semantics, like category-level articulated object segmentation <ref type="bibr">[41]</ref>, indoor scene instance relocalization <ref type="bibr">[65]</ref>, or car movement detection <ref type="bibr">[72]</ref>, leaving generic motion segmentation relatively unexplored. Unlike traditional single-scan analysis algorithms such as semantic segmentation <ref type="bibr">[39]</ref>, the most challenging part of multi-body motion analysis is to disambiguate and distinguish rigid bodies. There, we are naturally required to jointly process and relate multiple inputs, to effectively find consistent motion-based part/object segmentations as well as point correspondences to enable a multi-way registration. It is even more challenging when the capture is not temporally dense, i.e., an intermittent acquisition that does not follow a stream such as a video, and might contain large pose variations, hampering naive temporal tracking.</p><p>In this paper, we introduce a multi-scan multi-body segmentation and motion estimation problem, where the goal is to simultaneously discover and register rigid bodies from multiple scans, represented either as full or partial point clouds, where objects come from unseen categories. 
As an effective solution, we present MultiBodySync, a fully end-to-end trainable deep declarative architecture <ref type="bibr">[23]</ref> able to process an arbitrary number of unordered point sets. As shown in Fig. <ref type="figure">1</ref>, given a set of scans, MultiBodySync begins by relating pairs of scans via 3D scene flow <ref type="bibr">[78,</ref><ref type="bibr">64]</ref> and confidence estimation. Then, the following two differentiable (permutation and segmentation) synchronization modules, which are central to our approach, respectively enforce the consistency of pairwise point correspondences and motion segmentation labelings across different scans. Our design explicitly decouples geometry and motion, making MultiBodySync generalizable to unseen categories without sacrificing robustness.</p><p>We evaluate MultiBodySync on various datasets composed of full synthetic point clouds and partial real scans with articulated and solid objects. We also contribute a new dataset, DynLab, with 8 scenes and 64 scan fragments of distinctly moving objects. Our extensive evaluations demonstrate that our algorithm outperforms the state-of-the-art by a large margin on both multi-body motion segmentation and motion estimation. In brief, our contributions are: 1. We introduce a novel end-to-end trainable architecture for solving the multi-scan multi-body motion estimation and segmentation problem. 2. We theoretically analyze the spectral characteristics of the proposed weighted permutation synchronization. 3. To the best of our knowledge, we showcase the first cross-category generalization for the task at hand on both synthetic and real datasets, for both articulated part-level and object-level regimes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>Dynamic scene understanding. The modeling of 3D dynamic scenes in deep learning literature is often formulated as a 4D data analysis, as done in seminal works like <ref type="bibr">[43,</ref><ref type="bibr">19]</ref>. The ability to infer spatiotemporal geometric properties has recently motivated research in 3D scene flow as a form of low-level dynamic scene representation <ref type="bibr">[42,</ref><ref type="bibr">60,</ref><ref type="bibr">69,</ref><ref type="bibr">51,</ref><ref type="bibr">47,</ref><ref type="bibr">54,</ref><ref type="bibr">44]</ref>. Domain-specific knowledge can be employed to give better predictions as done in autonomous driving <ref type="bibr">[29,</ref><ref type="bibr">5,</ref><ref type="bibr">72]</ref> or articulated object analysis <ref type="bibr">[77,</ref><ref type="bibr">68]</ref>. The most recent dynamic SLAM works <ref type="bibr">[31,</ref><ref type="bibr">7,</ref><ref type="bibr">81,</ref><ref type="bibr">75]</ref> also rely heavily on semantic cues. While some works <ref type="bibr">[47,</ref><ref type="bibr">54]</ref> advocate continuous temporal-dynamics modeling, we instead assume discrete non-sequential input and enforce consistency using synchronization. Similarly, <ref type="bibr">[26,</ref><ref type="bibr">65]</ref> propose to perform instance-level re-localization in a changed scene. Nevertheless, we do not assume a pre-segmentation of the scene, but instead perform joint motion segmentation.</p><p>Multi-body motion. Provided point correspondences between two point clouds/images, rigid-body motion segmentation becomes a multi-model fitting problem, amenable to factorization techniques <ref type="bibr">[20,</ref><ref type="bibr">40,</ref><ref type="bibr">76]</ref>, clustering <ref type="bibr">[32]</ref>, graph optimization <ref type="bibr">[45,</ref><ref type="bibr">35,</ref><ref type="bibr">11]</ref> or deep learning <ref type="bibr">[38]</ref>. 
Among others, <ref type="bibr">[78]</ref> handles raw scans and segments the rigidly moving parts using a Recurrent Neural Network (RNN). <ref type="bibr">[28]</ref> fits non-parametric part models to sequential 3D data without needing explicit correspondences. However, to our best knowledge, no prior work can handle multiple scans while enforcing multi-way consistency like we do.</p><p>Synchronization. The art of consistently recovering absolute quantities from a collection of ratios is now a basic component of the classical multi-view/shape analysis pipelines <ref type="bibr">[56,</ref><ref type="bibr">14,</ref><ref type="bibr">15]</ref>. Various aspects of the problem have been vastly studied: different group structures <ref type="bibr">[25,</ref><ref type="bibr">24,</ref><ref type="bibr">12,</ref><ref type="bibr">2,</ref><ref type="bibr">1,</ref><ref type="bibr">33,</ref><ref type="bibr">27,</ref><ref type="bibr">1,</ref><ref type="bibr">66,</ref><ref type="bibr">18,</ref><ref type="bibr">59,</ref><ref type="bibr">62,</ref><ref type="bibr">4,</ref><ref type="bibr">6]</ref>, closed-form solutions <ref type="bibr">[4,</ref><ref type="bibr">2,</ref><ref type="bibr">1]</ref>, robustness <ref type="bibr">[17]</ref>, certifiability <ref type="bibr">[55]</ref>, global optimality <ref type="bibr">[13]</ref>, learning-to-synchronize <ref type="bibr">[34,</ref><ref type="bibr">50,</ref><ref type="bibr">22]</ref> and uncertainty quantification <ref type="bibr">[61,</ref><ref type="bibr">10,</ref><ref type="bibr">9,</ref><ref type="bibr">12]</ref>. In this work, we are concerned with synchronizing correspondence sets, otherwise known as permutation synchronization (PS) <ref type="bibr">[48]</ref> and motion segmentations <ref type="bibr">[3]</ref>. 
PS is rich in the variety of algorithms: low-rank formulations <ref type="bibr">[80,</ref><ref type="bibr">67]</ref>, convex programming <ref type="bibr">[30]</ref>, distributed optimization <ref type="bibr">[30]</ref>, multi-graph matching <ref type="bibr">[57]</ref> or Riemannian optimization <ref type="bibr">[12]</ref>. Out of all those, we are interested in the spectral methods of <ref type="bibr">[2,</ref><ref type="bibr">46]</ref> as they provide efficient, closed-form solutions deployable within a deep declarative network <ref type="bibr">[23]</ref> like ours.</p><p>To the best of our knowledge, synchronization of correspondences <ref type="bibr">[46]</ref> or motion segmentation <ref type="bibr">[3]</ref> have not been explored in the context of deep learning. This is what we do in this paper to tackle the consistent multi-body motion estimation and segmentation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>Problem setting and notation. Suppose we observe a set of</p><p>where each point cloud X k = x k 1 , ..., x k i , ..., x k N contains N points in R 3 sampled from the same object with S independently moving rigid parts indexed by s. Each point is assumed to belong to one of the S rigid parts and we denote the binary point-part association matrices as</p><p>i belongs to the s th rigid part and G k is = 0 otherwise<ref type="foot">foot_0</ref> . The rigid motions for each part s in each point cloud k are defined as</p><p>and the translational part being t k s &#8712; R 3 . Our final goal is to infer G and T given X . Summary. The core of our approach is a fully differentiable deep network fusing rigid dynamic information from multiple 3D scans as outlined in Fig. <ref type="figure">2</ref>. We begin by explicitly predicting pairwise soft correspondences across all pairs of point clouds while enforcing consistency via a weighted permutation synchronization ( &#167; 3.1). Next, the point clouds are segmented using a novel motion-based segmentation network and further synchronized by a subsequent motion segmentation synchronization module ( &#167; 3.2). Finally, the correspondences and segmentations are used to recover the 6-DoF transformation for each of the individual rigid parts. The whole procedure can be iterated to refine the results. The pipeline can be readily trained end-to-end and we describe our training procedure in &#167; 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Flow Estimation and Synchronization</head><p>Our approach starts with point correspondence estimation between all $\binom{K}{2}$ pairs of point clouds. We tackle this problem by predicting a 3D scene flow</p><p>= X l holds up to a permutation. The architecture of &#981; flow , inspired by Point PWC-Net <ref type="bibr">[73]</ref>, is detailed in the supplementary material.</p><p>Flow signals, estimated in a pairwise fashion, are not informed about the multiview configuration at our disposal. To ensure multi-way consistent flows, we employ the weighted variant of permutation synchronization <ref type="bibr">[46]</ref> inspired by <ref type="bibr">[22,</ref><ref type="bibr">34]</ref> where a closed-form solution is given under spectral relaxation. We begin with the observation that any flow F kl would induce a soft assignment matrix P kl &#8712; M N &#215;N based on the nearest-neighbor distances:</p><p>where &#964; is the temperature of the softmax. The multinomial manifold M of row-stochastic matrices is a continuous relaxation of the (partial) permutation group P.</p><p>Outlier filtering. To take into account the noise, missing points, or errors in the network, we further associate a confidence value c kl i &#8712; R to each point x k i and its corresponding flow vector f kl i through another network &#981; conf (&#8226;) : R 7&#215;N &#8594; R N inspired by OANet <ref type="bibr">[82]</ref>. The inputs to this network are the tuples {(x k i ,</p><p>i=1 and we provide the architectural details in the supplementary material. The last dimension of this tuple measures the quality of the flow vector via the distance between the transformed points and their nearest neighbors, thereby detecting spurious flow predictions. The final w kl in Eq (3) reflects the overall quality of the corresponding P kl . Here we choose w kl as the average confidence of all points, i.e., $w^{kl} = \frac{1}{N}\sum_{i=1}^{N} c^{kl}_i$. Consistent correspondences. 
We now use the predictions {P kl , w kl } (k,l) to achieve multiview consistent assignments. To this end, we deploy a differentiable synchronization algorithm inspired by <ref type="bibr">[46]</ref>. We first introduce absolute permutation matrices P k which map each point in X k to a universe space and stack them as p = [. . . , (P k ) &#8868; , . . . ] &#8868; . We solve for the best p minimizing:</p><p>Theorem 1 (Weighted synchronization). The spectral solution to the weighted synchronization problem in Eq (2)</p><p>p is given by the N eigenvectors of L corresponding to the smallest N eigenvalues, where L &#8712; R KN &#215;KN is the weighted Graph Connection Laplacian (GCL) constructed by tiling all P kl matrices weighted by the related w kl :</p><p>Proof. Please refer to the supplementary material.</p><p>This spectral solution requires only an eigendecomposition lending itself to easy differentiation <ref type="bibr">[34,</ref><ref type="bibr">22]</ref>. The synchronized soft correspondence Pkl is then extracted as the (k, l)-th N &#215; N block of pp &#8868; . As a consequence of the relaxation, we cannot ensure that each sub-matrix of pp &#8868; would be a valid permutation. To preserve differentiability we avoid Hungarian-like projection operators <ref type="bibr">[46]</ref> and propose to directly compute the induced flow Fkl = ..., f kl i , ... using a softmax normalization on the synchronized soft correspondences:</p><p>Intuitively, this amounts to using the normalized synchronized result as a soft-assignment matrix, diminishing the effect of non-corresponding matches (false positives).</p></div>
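<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the spectral solution of Theorem 1 concrete, the following NumPy sketch synchronizes a toy set of pairwise assignments. It is an illustration under stated assumptions rather than the implementation used in our pipeline: the Laplacian layout (block-diagonal degrees minus the weighted block adjacency), the helper names weighted_permutation_sync and block, and the final rescaling by K are all choices made for this example, and the inputs are exact permutations instead of network-predicted soft assignments.</p>
<p><![CDATA[
```python
import numpy as np

def weighted_permutation_sync(P, w, K, N):
    """Spectral sketch of weighted permutation synchronization.

    P[(k, l)] is the N x N (soft) assignment from scan k to scan l and
    w[(k, l)] its scalar weight; both (k, l) and (l, k) must be present.
    """
    A = np.zeros((K * N, K * N))   # weighted block adjacency
    deg = np.zeros(K)              # total weight attached to each scan
    for (k, l), P_kl in P.items():
        A[k * N:(k + 1) * N, l * N:(l + 1) * N] = w[(k, l)] * P_kl
        deg[k] += w[(k, l)]
    A = 0.5 * (A + A.T)            # symmetrize so that eigh applies
    # Assumed Laplacian layout: block-diagonal degrees minus adjacency.
    L = np.kron(np.diag(deg), np.eye(N)) - A
    _, vecs = np.linalg.eigh(L)    # eigenvalues come in ascending order
    U = vecs[:, :N]                # eigenvectors of the N smallest eigenvalues
    # For clean inputs U U^T equals p p^T / K, hence the rescaling by K.
    return K * (U @ U.T)

def block(M, k, l, N):
    """Return the (k, l)-th N x N block of a KN x KN matrix."""
    return M[k * N:(k + 1) * N, l * N:(l + 1) * N]

# Toy data: three scans of four points, related by hidden permutations.
rng = np.random.default_rng(0)
K, N = 3, 4
absolute = [np.eye(N)[rng.permutation(N)] for _ in range(K)]
P = {(k, l): absolute[k] @ absolute[l].T
     for k in range(K) for l in range(K) if k != l}
w = {(k, l): 1.0 + 0.5 * abs(k - l) for (k, l) in P}  # symmetric weights

synced = weighted_permutation_sync(P, w, K, N)
```
]]></p>
<p>On this noise-free input each block of the synchronized matrix reproduces the corresponding pairwise permutation. With noisy soft assignments the blocks are no longer valid permutations, which is why a normalization step such as the softmax described above is applied afterwards.</p></div>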
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Motion Segmentation</head><p>Based upon the multiview consistent flow output Fkl , we now predict the point-part associations G. Since we are not provided with a consistent labeling of the parts, instead of predicting G k directly, we estimate for all $\binom{K}{2}$ point cloud pairs a relative motion segmentation matrix &#950; kl &#8712; [0, 1] N &#215;N , where &#950; kl ij is 1 when x k i and x l j belong to the same rigid body, and 0 otherwise.</p><p>Our motion segmentation network &#981; mot (&#8226;) : R 12&#215;N &#8594; R N &#215;N illustrated in Fig. <ref type="figure">3</ref> takes the point cloud pair X k , X l as well as flow Fkl , Flk estimated from the previous step as input and outputs the matrix &#950; kl . It begins with a PointNet++ <ref type="bibr">[53]</ref> predicting a transformation Tkl i for each point x k i &#8712; X k<ref type="foot">foot_1</ref> . The predictions map the part in X k containing x k i to X l . We then compute a residual matrix &#946; kl &#8712; R 3&#215;N &#215;N based on Tkl i , whose element is:</p><p>where &#8226; denotes the action of T. One can easily verify that the smaller the norm of the (i, j)-th entry of &#946; kl is, the more likely that x k i and x l j are in the same rigid part. Therefore, it contains valuable information for deducing the motion segmentation &#950; kl . We apply N denoising mini-PointNet <ref type="bibr">[52]</ref> &#981; mlp (&#8226;) to each horizontal 3 &#215; N slice of &#946; kl , concatenated with X l to get a likelihood score for each pair of points (x k i &#8712; X k , x l j &#8712; X l ). The network output &#950; kl net is subsequently computed by applying a sigmoid on the output:</p><p>Motion segmentation consistency. Given all pairwise motion information &#950; kl , we adopt the method of Arrigoni and Pajdla <ref type="bibr">[3]</ref> to compute an absolute motion segmentation g &#8712; R KN &#215;S as a stack of matrices in G. 
Once again, this is an instance of a synchronization problem, with the stacked relative and absolute motion segmentation matrices being:</p><p>A spectral approach similar to the one in &#167; 3.1 optimizes for g so that Z = gg &#8868; is best satisfied. Then, g is given by the S leading eigenvectors of Z, scaled by the square root of its S largest eigenvalues. Here, the point-part association matrices G k are relaxed to fuzzy segmentations by allowing their entries to take real values. As a subsequent step similar to &#167; 3.1, we replace the projection step with a row-wise softmax on g to maintain differentiability. Note that the output &#950; kl net of &#981; mot is unnormalized, meaning that any submatrix in Z can be written as &#950; kl = &#963; kl &#950; kl net , where &#963; acts as a normalizer. This is akin to encoding a confidence in the norm of the matrix &#950; kl net and requires us to solve a weighted synchronization. However, as we prove in the following theorem, such a solution would involve an anisotropic scaling of the eigenvectors as a function of the number of points belonging to each part. As this piece of information is not available at runtime, we take an alternative approach and approximate the scaling factor as q kl = mean(&#950; kl net ) and pre-factor it out of &#950; kl net , by letting &#950; kl = &#950; kl net /q kl . In this way, we ensure that the eigenvectors yield the synchronized motion segmentation.</p><p>Theorem 2. Under mild assumptions, the solution to the segmentation synchronization problem using a nonuniformly weighted matrix will result in a proportionally scaled version of the solution obtained by the eigenvectors of the unweighted matrix Z.</p><p>Proof. Please refer to the supplementary material.</p><p>As we show in our supplement, entry k in the decomposed eigenvalues is related to the number of points belonging to motion k. 
To compute the number of rigid bodies S, i.e., determine how many eigenvectors to use in g, the spectrum of Z is analyzed during test time: We estimate S as the number of eigenvalues that are larger than &#945;-percent of the sum of the first 10 eigenvalues. For training, we just fix S = 6 as an over-parametrization.</p><p>Pose Computation and Iterative Refinement. We finally estimate the motion for each part using a weighted Kabsch algorithm <ref type="bibr">[37,</ref><ref type="bibr">22]</ref> followed by a joint pose estimation. During test time we also iterate our pipeline several times to gradually refine the correspondence and segmentation estimation by transforming input point clouds according to the estimated T and adding back the residual flow onto the flow predicted at the previous iteration. The details are provided in our supplementary.</p></div>
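<div xmlns="http://www.tei-c.org/ns/1.0"><p>The spectral segmentation synchronization and the eigenvalue-based estimate of S described above can be sketched in a few lines of NumPy. This is a hedged illustration, not our released code: the function name synchronize_segmentation, the reading of the threshold as a fraction alpha of the sum of the 10 largest eigenvalues, and the sign convention for eigenvectors (each column is flipped so that its entries sum to a non-negative value) are assumptions introduced for the example, and the toy input is a noise-free matrix Z built from known labels.</p>
<p><![CDATA[
```python
import numpy as np

def synchronize_segmentation(Z, alpha=0.05, max_rank=10):
    """Estimate the number of rigid bodies S and a soft segmentation from Z."""
    Z = 0.5 * (Z + Z.T)                         # enforce symmetry for eigh
    vals, vecs = np.linalg.eigh(Z)              # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
    # S = number of leading eigenvalues above alpha * (sum of the top ones).
    threshold = alpha * vals[:max_rank].sum()
    S = int(np.sum(vals[:max_rank] > threshold))
    # Leading eigenvectors scaled by the square roots of their eigenvalues.
    g = vecs[:, :S] * np.sqrt(np.clip(vals[:S], 0.0, None))
    # Assumed sign fix: eigenvectors are only defined up to sign.
    g = g * np.where(g.sum(axis=0) < 0, -1.0, 1.0)
    # Row-wise softmax replaces a hard projection to keep things smooth.
    e = np.exp(g - g.max(axis=1, keepdims=True))
    return S, e / e.sum(axis=1, keepdims=True)

# Toy data: K = 2 scans with N = 6 points each and two rigid parts.
gt = np.array([0, 0, 0, 0, 1, 1,    # labels of the points in scan 1
               0, 0, 0, 1, 1, 1])   # labels of the points in scan 2
G = np.eye(2)[gt]                   # stacked one-hot point-part associations
Z = G @ G.T                         # entry is 1 iff two points share a part

S_est, soft = synchronize_segmentation(Z, alpha=0.05)
pred = soft.argmax(axis=1)
```
]]></p>
<p>In this toy case the two non-zero eigenvalues of Z equal the part sizes (7 and 5 points), consistent with the relation between eigenvalues and point counts noted above. Parts of equal size would make the leading eigenvalues coincide, in which case the eigenvectors are only defined up to a rotation and this simple sign fix is not sufficient.</p></div>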
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Network Training</head><p>We propose to train each learnable component of our pipeline separately in a pairwise manner and then fine-tune their parameters using the full pipeline. Specifically, we first train the flow estimation network &#981; flow supervised with ground-truth flow: $L^{kl}_{\mathrm{flow}} = \| F^{kl} - F^{kl,\mathrm{gt}} \|_F^2$. Given the trained &#981; flow , the confidence estimation network &#981; conf is trained based on its output using a binary cross-entropy (BCE) loss supervised by comparing whether the error of the predicted flow is under a certain threshold:</p><p>with $c^{kl,\mathrm{gt}}_i = 1$ if $\| f^{kl}_i - f^{kl,\mathrm{gt}}_i \|_2^2 &lt; &#491;_f$ and 0 otherwise. The motion segmentation network &#981; mot is trained using joint supervision over the estimated transformation residual and the final motion segmentation matrix: $L^{kl}_{\mathrm{seg}} = L^{kl}_{\mathrm{trans}} + L^{kl}_{\mathrm{group}}$, where each term is defined as:</p><p>After we train all the networks (i.e., &#981; flow , &#981; conf and &#981; mot ), the entire pipeline is trained end-to-end with supervision on both the pairwise flow loss $\sum_{k=1}^{K}\sum_{l=1}^{K} L^{kl}_{\mathrm{flow}}$ and the IoU (Intersection-over-Union) loss, defined as:</p><p>where A is an S &#215; S binary assignment matrix which we find using the Hungarian algorithm. The flow supervision is added to both the output of the flow network and the final pairwise rigid flow computed as</p></div>
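<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a small worked example of the confidence supervision described above, the sketch below derives the binary targets from the squared flow error and evaluates a binary cross-entropy on them. It is only an illustration: the function names confidence_targets and bce_loss are hypothetical, the loss is written in its standard mean-reduced form, and the predicted confidences are assumed to already lie in (0, 1), for instance after a sigmoid. The threshold value 0.1 matches the setting reported in our experiments.</p>
<p><![CDATA[
```python
import numpy as np

def confidence_targets(flow_pred, flow_gt, eps_f=0.1):
    """Target is 1 where the squared flow error is below eps_f, else 0."""
    sq_err = np.sum((flow_pred - flow_gt) ** 2, axis=-1)
    return (sq_err < eps_f).astype(np.float64)

def bce_loss(conf, target, tiny=1e-12):
    """Mean binary cross-entropy; conf is assumed to lie in (0, 1)."""
    conf = np.clip(conf, tiny, 1.0 - tiny)   # guard against log(0)
    per_point = -(target * np.log(conf) + (1.0 - target) * np.log(1.0 - conf))
    return float(np.mean(per_point))

# Toy pair of scans with two points: one accurate and one spurious flow.
flow_pred = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0]])
flow_gt = np.array([[0.1, 0.0, 0.0],
                    [0.0, 0.0, 0.0]])
targets = confidence_targets(flow_pred, flow_gt, eps_f=0.1)

conf = np.array([0.9, 0.2])                  # hypothetical network outputs
loss = bce_loss(conf, targets)
```
]]></p>
<p>Here the first point has a squared error of 0.01 and is labeled as an inlier, while the second has a squared error of 1.0 and is labeled as an outlier.</p></div>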
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>Datasets. Our algorithm is tested on two main datasets: SAPIEN <ref type="bibr">[74]</ref> and the DynLab dataset contributed by this work. SAPIEN consists of realistic simulated articulated models with annotated part mobility. We ensure that the categories used for training and validation do not overlap with the test set, finally leading to 720 articulated objects from 20 different categories. We then perform K virtual 3D scans of the models, with each scan capturing the same object with a different camera (and hence object) pose and object articulating state. Later, furthest point sampling is applied to down-sample the number of points to N . DynLab (Dynamic Laboratory) contains 8 different scenes in a laboratory, each with 2-3 rigidly moving solid objects from various categories. Each of the scenes is captured 8 times and reconstructed using ElasticFusion <ref type="bibr">[71]</ref>; between captures, the object positions are randomly changed. The dataset also contains manual annotations of the object segmentation mask and rigid absolute transformations. For benchmarking, in each scene we choose different combinations of the 8 captures, leading to a total of $8 \cdot \binom{8}{4} = 560$ dataset items. We believe the two different scenarios (articulated single object and moving rigid bodies) reflected in the test sets are sufficient to verify the robustness and the general applicability of our algorithm.</p><p>The training data for articulated objects are generated using the dataset from <ref type="bibr">[79]</ref>, containing manually annotated semantic segmentation of 16 categories. Similar to <ref type="bibr">[78]</ref>, we generate K random motions for each connected semantic part of the shapes. For the training data of solid objects, we randomly sample independent motions for multiple objects taken from ShapeNet <ref type="bibr">[16]</ref> as if they are floating and rotating in the air. 
Please refer to the supplementary material for detailed data specifications and visualizations.</p><p>Metrics. Two main metrics are used: (1) EPE3D (End-Point Error in 3D) of all $\binom{K}{2}$ pairs of point clouds. The mean and standard deviation (+/-) measure the rigid 3D flow estimation quality: While the mean reflects an overall error in the transformation, the standard deviation shows how consistent the estimate is among all pairs, a desirable property in the multi-scan setting. (2) Segmentation accuracy assesses the motion segmentation quality. We use mIoU (mean Intersection-over-Union) and RI (Rand Index) to score the output based on 'Multi-Scan' and 'Per-Scan' segmentations. For 'Multi-Scan', we evaluate the points from all K clouds altogether, revealing the consistency of the labeling across multiple scans. For 'Per-Scan', we compute the score for each of the clouds separately and evaluate the mean and standard deviation across all scans.</p><p>Training. &#981; flow , &#981; mot and &#981; conf are trained using the Adam optimizer with an initial learning rate of $10^{-3}$ and a 0.5/0.7/0.7 decay every 400K iterations for the three networks. The batch sizes are set to 32/8/32, respectively. The entire pipeline is trained end-to-end using K = 4 point clouds, with a learning rate of $10^{-6}$. The gradient computation for eigen-decomposition will sometimes lead to numerical instabilities <ref type="bibr">[21]</ref>, so we roll back that iteration when the gradient norm is large. Our algorithm is implemented using PyTorch <ref type="bibr">[49]</ref> with N = 512, &#964; = 0.01, &#491; f = 0.1. We set &#945; = 0.05 for articulated objects and &#945; = 0.15 for solid objects.</p></div>
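<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who want to reproduce the metrics, the following NumPy sketch shows one plausible way to compute the per-pair EPE3D with its mean and standard deviation, and the Rand Index between two labelings. The function names epe3d and rand_index and the toy inputs are hypothetical, and the exact evaluation protocol (for example how pairs are enumerated or how mIoU assigns labels) is not covered here.</p>
<p><![CDATA[
```python
import numpy as np

def epe3d(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and ground-truth 3D flow."""
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees if both labelings put it in the same segment, or both
    put it in different segments, so the score ignores label names.
    """
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)   # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))

# Toy EPE3D over two scan pairs, then mean and standard deviation.
gt_flow = np.array([[3.0, 4.0, 0.0],
                    [0.0, 0.0, 0.0]])
pair_errors = np.array([epe3d(np.zeros((2, 3)), gt_flow),   # poor estimate
                        epe3d(gt_flow.copy(), gt_flow)])    # exact estimate
epe_mean, epe_std = pair_errors.mean(), pair_errors.std()

# Toy Rand Index: a relabeled but identical partition, and a wrong one.
ri_perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ri_partial = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```
]]></p>
<p>Because the Rand Index is invariant to permutations of the label names, it can be applied directly to motion segmentations whose part indices are arbitrary, whereas mIoU requires matching predicted parts to ground-truth parts first.</p></div>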
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Results on Articulated Objects</head><p>Baselines. Given our new multi-scan multi-body setting, we made adaptations to previous methods and compared to the following 4 baselines: Flow Accuracy. Tab. 1 shows that despite being based on <ref type="bibr">[78]</ref>, our method gives the lowest flow error and variance across different view pairs. This is thanks to the correspondence consistency among the provided K scans enforced by our synchronization module. The NPP method suffers from a surprisingly high flow error mainly because the point-level correspondence is not explicitly modeled. Note that PointNet++ and MeteorNet are excluded because they only output point-wise segmentations.</p><p>Segmentation Accuracy. For the segmentation benchmark, we achieve a significantly better result than all the baselines as shown in Tab. 2. Among the baselines, Me- One important aspect of our network is that it can generalize to different objects and motions without re-training. To qualitatively showcase this, we use two additional dynamic RGB-D sequences from <ref type="bibr">[63]</ref> and <ref type="bibr">[58]</ref>. For each sequence, we use four views and back-project the depth map into point clouds for inference. As shown in Fig. <ref type="figure">5</ref>, our model, trained on full objects of the synthetic SAPIEN dataset, can generalize to real dynamic depth sequences, producing consistent motion-based segmentation. This is possible thanks to the property that our network anchors on the motion and not on the specific geometry.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Results on Full Objects</head><p>In DynLab, each rigid body (i.e. object) is now semantically meaningful, so apart from the 4 baseline methods from &#167; 4.1, we additionally compare to the following two alternatives: (5) InstSeg (Instance Segmentation): We take the state-of-the-art indoor semantic instance module PointGroup <ref type="bibr">[36]</ref> trained on the ScanNet dataset to segment each input cloud. (6) Geometric: We use the Ward-linkage <ref type="bibr">[70]</ref> to agglomeratively cluster the points in each scan. In order to obtain consistent segmentation across multiple inputs, we associate the segmentations between two different scans using a Hungarian search over the object assignment matrix, whose element is the root mean squared error measuring the fitting quality between any combinations of the object associations.</p><p>Interestingly, as listed in Tab. 3, all the previous deep methods lead to unsatisfactory results on this dataset. PointNet++ and MeteorNet are found to be inaccurate because by design they associate labels at the level of semantics (not motion) and no explicit consistencies across scans are considered. Even though the InstSeg method is trained on a large-scale scene dataset, it is impossible for it to cover all real-world categories, so wrong detections are observed in some scenes. The geometric approach is less robust in cluttered scenes where no obvious geometric cues can be used. Our method is motion-induced and is hence robust to geometric variations and out-of-distribution semantics, outperforming all baselines. A typical failure scenario for these approaches is visualized in Fig. <ref type="figure">6</ref>. We show additional qualitative results in Fig. <ref type="figure">7</ref>, demonstrating our ability to accurately segment, associate, and compute correct object transformations even if there are large pose changes.</p><p>Tab. 
4 shows the rigid flow estimation result against the baselines. Apart from the influence of wrong per-scan segmentation and cross-scan associations, the iterative closest point (ICP) <ref type="bibr">[8]</ref> method used to register object scans can also suffer from poor initializations. Our approach not only reaches the lowest mean error, but also respects the motion consistency across multiple scans.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Ablation Study and System Analysis</head><p>Effect of synchronization. For permutation synchronization ( &#167; 3.1), we can directly feed the network-predicted flow vector F kl to subsequent steps instead of using synchronized Fkl (Ours: NS, NW), or use an unweighted version of the synchronization by setting all w kl = 1 (Ours: S, NW). However, as shown quantitatively in Tab. 1, both variants result in higher flow error due to the failure to find consistent correspondences. Similar results can be observed on DynLab dataset as demonstrated in the two sub-figures of Fig. <ref type="figure">6</ref>, where direct flow prediction failed because the geometric variation is too large between two scans.</p><p>Effect of K. Our method can be naturally applied to an arbitrary number of views K even if we train using 4 views, because by design the learnable parameters are unaware of the input counts. As shown in Fig. <ref type="figure">9</ref>, the segmentation accuracy improves given more views. This is because the introduction of additional scans helps build the connection between existing scans and benefits the 'co-segmentation' process.   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Number of iterations.</head><p>As pointed out in &#167; 3.2, our pipeline can be run for multiple iterations to refine the results; an example is given in Fig. <ref type="figure">8</ref>. As shown in Fig. <ref type="figure">9</ref>, our method works better with more iterations because the estimated flows become more accurate. Beyond a few iterations, however, further iterations are unnecessary because the results have already converged. Timing. Our experiments are conducted using an Nvidia GeForce GTX 1080 card. For the input of 4 scans, the running time of our full model is &#8764;870ms per iteration. The entirety of a 4-iteration scheme hence takes &#8764;3.5s, while <ref type="bibr">[78]</ref> and <ref type="bibr">[28]</ref> take 11.5s and 60s, respectively, in comparison.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>We presented MultiBodySync, a pipeline for simultaneously segmenting and registering multiple dynamic scans with multiple rigid bodies. We, for the first time, incorporated weighted permutation synchronization and motion segmentation synchronization into a fully-differentiable pipeline for generating consistent results across all input point clouds. However, currently MultiBodySync is not scalable to a large number (like hundreds) of scans or rigid bodies. Future directions include improvement of the pipeline's scalability and robustness in more complicated and dynamic settings.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Throughout our paper we use superscript k, l to index point-clouds, subscript i, j to index points and subscript s to index rigid parts.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>In practice, instead of predicting Tkl i directly, we estimate a residual motion w.r.t. the already obtained flow vectors similar to the method in<ref type="bibr">[78]</ref>. This procedure is detailed in our supplementary material.</p></note>
		</body>
		</text>
</TEI>
