<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>PartSLIP: Low-shot part segmentation for 3D point clouds via pretrained image-language models</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10431858</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>M. Liu</author><author>Y. Zhu</author><author>H. Cai</author><author>S. Han</author><author>Z. Ling</author><author>F. Porikli</author><author>H. Su</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Figure 1. We propose PartSLIP, a zero/few-shot method for 3D point cloud part segmentation by leveraging pretrained image-language models. The figure shows text prompts and corresponding semantic segmentation results (zoom in for details). Our method also supports part-level instance segmentation. See Figure 5 and Figure 7 for more results.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Human visual perception can parse objects into parts and generalize to unseen objects, which is crucial for understanding their structure, semantics, mobility, and functionality. 3D part segmentation plays a critical role in empowering machines with such ability and facilitates a wide range of applications, such as robotic manipulation, AR/VR, and shape analysis and synthesis <ref type="bibr">[2,</ref><ref type="bibr">31,</ref><ref type="bibr">39,</ref><ref type="bibr">69]</ref>.</p><p>Recent part-annotated 3D shape datasets <ref type="bibr">[40,</ref><ref type="bibr">67,</ref><ref type="bibr">72]</ref> have promoted advances in designing various data-driven approaches for 3D part segmentation <ref type="bibr">[34,</ref><ref type="bibr">44,</ref><ref type="bibr">65,</ref><ref type="bibr">73]</ref>. While standard supervised training enables these methods to achieve remarkable results, they often struggle with out-of-distribution test shapes (e.g., unseen classes). However, compared to image datasets, these 3D part-annotated datasets are still orders of magnitude smaller in scale, since building 3D models and annotating fine-grained 3D object parts are laborious and time-consuming. It is thus challenging to provide sufficient training data covering all object categories. For example, the recent PartNet dataset <ref type="bibr">[40]</ref> contains only 24 object categories, far fewer than what an intelligent agent would encounter in the real world.</p><p>To design a generalizable 3D part segmentation module, many recent works have focused on the few-shot setting, assuming only a few 3D shapes of each category during training. They design various strategies to learn better representations, and complement vanilla supervised learning <ref type="bibr">[33,</ref><ref type="bibr">53,</ref><ref type="bibr">54,</ref><ref type="bibr">60,</ref><ref type="bibr">80]</ref>. 
While they show improvements over the original pipeline, there is still a large gap between what these models can do and what downstream applications need. The problem of generalizable 3D part segmentation is still far from being solved. Another parallel line of work focuses on learning the concept of universal object parts and decomposing a 3D shape into a set of (hierarchical) fine-grained parts <ref type="bibr">[37,</ref><ref type="bibr">64,</ref><ref type="bibr">74]</ref>. However, these works do not consider the semantic labeling of parts and may be limited in practical use.</p><p>In this paper, we seek to solve the low-shot (zero- and few-shot) 3D part segmentation problem by leveraging pretrained image-language models, inspired by their recent striking performance in low-shot learning. By pretraining on large-scale image-text pairs, image-language models <ref type="bibr">[1,</ref><ref type="bibr">22,</ref><ref type="bibr">29,</ref><ref type="bibr">45,</ref><ref type="bibr">46,</ref><ref type="bibr">50,</ref><ref type="bibr">76]</ref> learn a wide range of visual concepts and knowledge, which can be referenced by natural language. Thanks to their impressive zero-shot capabilities, they have already enabled a variety of 2D/3D vision and language tasks <ref type="bibr">[10,</ref><ref type="bibr">16,</ref><ref type="bibr">20,</ref><ref type="bibr">47,</ref><ref type="bibr">49,</ref><ref type="bibr">51,</ref><ref type="bibr">77]</ref>.</p><p>As shown in Figure <ref type="figure">1</ref>, our method takes a 3D point cloud and a text prompt as input, and generates both 3D semantic and instance segmentations in a zero-shot or few-shot fashion. 
Specifically, we integrate the GLIP <ref type="bibr">[29]</ref> model, which is pretrained on 2D visual grounding and detection tasks with over 27M image-text pairs and has a strong capability to recognize object parts. To connect our 3D input with the 2D GLIP model, we render multi-view 2D images for the point cloud, which are then fed into the GLIP model together with a text prompt containing part names of interest. The GLIP model then detects parts of interest for each 2D view and outputs detection results in the form of 2D bounding boxes. Since it is non-trivial to convert 2D boxes back to 3D, we propose a novel 3D voting and grouping module to fuse the multi-view 2D bounding boxes and generate 3D instance segmentation for the input point cloud. Also, the pretrained GLIP model may not fully understand our definition of parts only through text prompts. We find that an effective solution is prompt tuning with few-shot segmented 3D shapes. In prompt tuning, we learn an offset feature vector for the language embedding of each part name while fixing the parameters of the pretrained GLIP model. Moreover, we propose a multi-view visual feature aggregation module to fuse the information of multiple 2D views, so that the GLIP model can have a better global understanding of the input 3D shape instead of predicting bounding boxes from each isolated 2D view.</p><p>To better understand the generalizability of various approaches and their performances in low-shot settings, we propose a benchmark PartNet-Ensembled (PartNetE) by incorporating two existing datasets PartNet <ref type="bibr">[40]</ref> and PartNet-Mobility <ref type="bibr">[67]</ref>. Through extensive evaluation on PartNetE, we show that our method enables excellent zero-shot 3D part segmentation. With few-shot prompt tuning, our method not only outperforms existing few-shot approaches by a large margin but also achieves highly competitive performance compared to the fully supervised counterpart. 
We also demonstrate that our method can be directly applied to iPhone-scanned point clouds without significant domain gaps. In summary, our contributions mainly include:</p><p>&#8226; We introduce a novel 3D part segmentation method that leverages pretrained image-language models and achieves outstanding zero-shot and few-shot performance. &#8226; We present a 3D voting and grouping module, which effectively converts multi-view 2D bounding boxes into 3D semantic and instance segmentation. &#8226; We utilize few-shot prompt tuning and multi-view feature aggregation to boost GLIP's detection performance. &#8226; We propose a benchmark PartNetE that benefits future work on low-shot and text-driven 3D part segmentation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">3D Part Segmentation</head><p>3D part segmentation involves two main tasks: semantic segmentation and instance segmentation. Most 3D backbone networks <ref type="bibr">[43,</ref><ref type="bibr">44,</ref><ref type="bibr">56,</ref><ref type="bibr">65]</ref> are capable of semantic segmentation by predicting a semantic label for each geometric primitive (e.g., point or voxel). Existing learning-based approaches solve instance segmentation by incorporating various grouping <ref type="bibr">[9,</ref><ref type="bibr">15,</ref><ref type="bibr">23,</ref><ref type="bibr">30,</ref><ref type="bibr">58,</ref><ref type="bibr">62,</ref><ref type="bibr">63,</ref><ref type="bibr">75]</ref> or region proposal <ref type="bibr">[17,</ref><ref type="bibr">70,</ref><ref type="bibr">73]</ref> strategies into the pipeline. Different from standard training with per-point part labels, some works leverage weak supervision, such as bounding boxes <ref type="bibr">[8,</ref><ref type="bibr">35]</ref>, language reference games <ref type="bibr">[26]</ref>, or IKEA manuals <ref type="bibr">[61]</ref>. Instead of focusing on single objects, <ref type="bibr">[4,</ref><ref type="bibr">42]</ref> also consider part segmentation for scene-scale input. Moreover, unlike the two classical tasks of semantic and instance segmentation, another parallel line of work decomposes a 3D shape into a set of (hierarchical) fine-grained parts but without considering semantic labels <ref type="bibr">[37,</ref><ref type="bibr">64,</ref><ref type="bibr">74]</ref>, which differs from our objective. Recently, some works also propose to learn a continuous implicit semantic field <ref type="bibr">[25,</ref><ref type="bibr">81]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data-Efficient 3D Segmentation</head><p>In order to train a generalizable 3D part segmentation network with low-shot data, many existing efforts focus on leveraging various pretext tasks and auxiliary losses <ref type="bibr">[3,</ref><ref type="bibr">12,</ref><ref type="bibr">14,</ref><ref type="bibr">52,</ref><ref type="bibr">55]</ref>. In addition, <ref type="bibr">[13,</ref><ref type="bibr">41]</ref> study the compositional generalization of 3D parts. <ref type="bibr">[60]</ref> deforms input shapes to align with few-shot template shapes. <ref type="bibr">[53]</ref> leverages 2D contrastive learning by projecting 3D shapes and learning dense multi-view correspondences. <ref type="bibr">[7]</ref> leverages branched autoencoders to co-segment a collection of shapes. Also, some works aim to learn better representations by utilizing prototype learning <ref type="bibr">[80]</ref>, reinforcement learning <ref type="bibr">[33]</ref>, and data augmentation <ref type="bibr">[54]</ref>. Moreover, there is a line of work investigating label-efficient 3D segmentation <ref type="bibr">[18,</ref><ref type="bibr">32,</ref><ref type="bibr">36,</ref><ref type="bibr">68,</ref><ref type="bibr">71,</ref><ref type="bibr">78,</ref><ref type="bibr">79]</ref>, assuming a small portion of training data is annotated (e.g., 0.1% point labels). While the setting may be useful in indoor and autonomous driving scenarios, it is not aligned with our goal since the number of training shapes is already limited in our setup.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">3D Learning with Image-Language Models</head><p>Pretrained image-language models have recently made great strides by pretraining on large-scale image-text pairs <ref type="bibr">[1,</ref><ref type="bibr">22,</ref><ref type="bibr">29,</ref><ref type="bibr">45,</ref><ref type="bibr">46,</ref><ref type="bibr">50,</ref><ref type="bibr">76]</ref>. Due to their learned rich visual concepts and impressive zero-shot capabilities, they have been applied to a wide range of 3D vision tasks, such as 3D avatar generation and manipulation <ref type="bibr">[5,</ref><ref type="bibr">16,</ref><ref type="bibr">21]</ref>, general 3D shape generation <ref type="bibr">[19,</ref><ref type="bibr">24,</ref><ref type="bibr">38,</ref><ref type="bibr">51]</ref>, low-shot 3D shape classification <ref type="bibr">[77]</ref>, neural radiance fields <ref type="bibr">[20,</ref><ref type="bibr">59]</ref>, 3D visual grounding <ref type="bibr">[10,</ref><ref type="bibr">57]</ref>, and 3D representation learning <ref type="bibr">[49]</ref>. To the best of our knowledge, we are one of the first to utilize pretrained image-language models to help with the task of 3D part segmentation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed Method: PartSLIP</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Overview: 3D Part Segmentation with GLIP</head><p>We aim to solve both semantic and instance segmentation for 3D object parts by leveraging pretrained image-language models (ILMs). Various large-scale ILMs have emerged in the past few years. In order to enable generalizable 3D object part segmentation, the pretrained ILM is expected to be capable of generating region-level output (e.g., 2D segmentation or 2D bounding boxes) and recognizing object parts. After comparing several released pretrained ILMs (e.g., CLIP <ref type="bibr">[45]</ref>), we find that the GLIP <ref type="bibr">[29]</ref> model is a good choice. The GLIP <ref type="bibr">[29]</ref> model focuses on 2D visual grounding and detection tasks. It takes as input a free-form text description and a 2D image, and locates all phrases of the text by outputting multiple 2D bounding boxes for the input image. By pretraining on large-scale image-text pairs (e.g., 27M grounding data), the GLIP model learns a wide range of visual concepts (e.g., object parts) and enables open-vocabulary 2D detection.</p><p>Figure <ref type="figure">2</ref> shows our overall pipeline, where we take a 3D point cloud as input. Here, we consider point clouds obtained by unprojecting and fusing multiple RGB-D images, which is a common setup in real-world applications and leads to dense points with colors and normals. To connect the 2D GLIP model with our 3D point cloud input, we render the point cloud from K predefined camera poses. The camera poses are uniformly spaced around the input point cloud, aiming to cover all regions of the shape. Since we assume a dense and colored point cloud input<ref type="foot">foot_0</ref>, we render the point cloud by simple rasterization without introducing significant artifacts. The K rendered images are then fed separately into the pretrained GLIP model along with a text prompt. 
We format the text prompt by concatenating all part names of interest and the object category. For example, for a chair point cloud, the text prompt could be "arm, back, seat, leg, wheel of a chair". Please note that unlike the traditional segmentation networks, which are limited to a closed set of part categories, our method is more flexible and can include any part name in the text prompt. For each 2D rendered image, the GLIP model is expected to predict multiple bounding boxes, based on the text prompt, for all part instances that appear. We then fuse all bounding boxes from K views into 3D to generate semantic and instance segmentation for the input point cloud (Section 3.2).</p><p>The above pipeline introduces an intuitive zero-shot approach for 3D part segmentation without requiring any 3D training. However, its performance may be limited by the GLIP predictions. We thus propose two additional components, which could be incorporated into the above pipeline to encourage more accurate GLIP prediction: (a) prompt tuning with few-shot 3D data, which enables the GLIP model to quickly adapt to the meaning of each part name (Section 3.3); (b) multi-view feature aggregation, which allows the GLIP model to have a more comprehensive visual understanding of the input 3D shape (Section 3.4).</p></div>
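<div xmlns="http://www.tei-c.org/ns/1.0"><p>The prompt format described above can be illustrated with a minimal sketch. The function name build_part_prompt is hypothetical, and the "of a" connector is simply copied from the chair example in the text; how other object categories are phrased is an assumption.</p>
<p><![CDATA[
```python
def build_part_prompt(part_names, object_category):
    """Concatenate all part names of interest and the object category.

    Hypothetical helper: the exact wording fed to GLIP is only known
    from the single chair example given in the text.
    """
    if not part_names:
        raise ValueError("at least one part name is required")
    return ", ".join(part_names) + " of a " + object_category


# Reproduces the chair example from the text.
print(build_part_prompt(["arm", "back", "seat", "leg", "wheel"], "chair"))
```
]]></p>
<p>Because the prompt is plain text, any part name can be added to the list without retraining, which is the flexibility the paragraph above contrasts with closed-set segmentation networks.</p></div>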
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Detected 2D BBoxes to 3D Point Segmentation</head><p>Although the correspondence between 2D pixels and 3D points is available, there are still two main challenges when converting the detected 2D bounding boxes to 3D point segmentation. First, bounding boxes are not as precise as point-wise labels: a 2D bounding box may cover points from other part instances as well. Second, although each bounding box may indicate a part instance, we are not provided with their relations across views, so it is not straightforward to determine which sets of 2D bounding boxes indicate the same 3D part instance.</p><p>Therefore, we propose a learning-free module to convert the GLIP predictions to 3D point segmentation, which mainly includes three steps: (a) oversegment the input point cloud into a collection of super points; (b) assign a semantic label for each super point by 3D voting; and (c) group super points within each part category into instances based on their similarity of bounding box coverage.</p><p>3D Super Point Generation: We follow the method in <ref type="bibr">[28]</ref> to oversegment the input point cloud into a collection of super points. Specifically, we utilize point normal and color as features and solve a generalized minimal partition problem with an &#8467;0-cut pursuit algorithm <ref type="bibr">[27]</ref>. Since points in each generated super point share similar geometry and appearance, we assume they belong to one part instance. The super point partition serves as an important 3D prior when assigning semantic and instance labels. It also speeds up the label assignment, as the number of super points is orders of magnitude smaller than the number of 3D points.</p><p>3D Semantic Voting: While a single bounding box may cover irrelevant points from other parts, we want to leverage information from multiple views and the super point partition to counteract the effect of irrelevant points. 
Specifically, for each pair of super point and part category, we calculate a score s i,j measuring the proportion of the ith super point covered by any bounding box of part category j:</p><formula notation="TeX">s_{i,j} = \frac{\sum_{k} \sum_{p \in SP_i} [VIS_k(p)] \, [\exists b \in BB_k^j : INS_b(p)]}{\sum_{k} \sum_{p \in SP_i} [VIS_k(p)]}</formula><p>where SP i denotes the ith super point, [&#183;] is the Iverson bracket, VIS k (p) indicates whether the 3D point p is visible in view k, BB j k denotes the predicted bounding boxes of part category j in view k, and INS b (p) indicates whether the projection of point p falls inside bounding box b. Note that for each view, we only consider visible points since bounding boxes only contain visible portions of each part instance. Both VIS k (p) and INS b (p) can be computed based on the information from point cloud rasterization. After that, for each super point i, we assign part category j with the highest score s i,j to be its semantic label.</p><p>3D Instance Grouping: In order to group the super points into part instances, we first regard each super point as an individual instance and then consider whether to merge each pair of super points. For a pair of super points SP u and SP v , we merge them if: (a) they have the same semantic label, (b) they are adjacent in 3D, and (c) for each bounding box, they are either both included or both excluded. Specifically, for the second criterion, we find the k nearest neighbors for all points within each super point. If any point in SP v is among the k nearest neighbors of a point in SP u , or vice versa, we consider the super points to be adjacent. For the third criterion, we consider bounding boxes from views where both of them are visible:</p><formula notation="TeX">B = \{ b \mid VIS_k(SP_u) \wedge VIS_k(SP_v), \; b \in BB_k \}</formula><p>where VIS k (SP u ) indicates whether the super point SP u can be (partially) visible in view k and BB k indicates all predicted bounding boxes of view k. Suppose B contains n bounding boxes. We then construct two n dimensional vectors I u and I v , describing the bounding box coverage of SP u and SP v . Specifically, I u [i] is calculated as:</p><formula notation="TeX">I_u[i] = \frac{\sum_{p \in SP_u} [VIS_k(p)] \, [INS_{B[i]}(p)]}{\sum_{p \in SP_u} [VIS_k(p)]}</formula><p>where B[i] denotes the ith bounding box of B and k is the view it belongs to; I v is computed analogously. We consider the third criterion satisfied when I u and I v are sufficiently similar, as controlled by the threshold &#964; (Section 4.2).</p><p>After checking all pairs of super points, the super points are divided into multiple connected components, each of which is then considered to be a part instance. We found that our super point-based module works well in practice.</p></div>
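<div xmlns="http://www.tei-c.org/ns/1.0"><p>The voting and grouping steps described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the data layout (each box stored as a dictionary with a category and the set of point ids projecting inside it), the precomputed list of adjacent super point pairs, the rule that an uncovered super point stays unlabeled, and the normalized L1 distance compared against the threshold are all assumptions introduced here.</p>
<p><![CDATA[
```python
def semantic_scores(superpoints, visible, boxes, categories):
    """Per super point and category: fraction of visible points that fall
    inside any box of that category, accumulated over all views."""
    scores = []
    for sp in superpoints:
        row = {}
        for cat in categories:
            covered = total = 0
            for k, vis in enumerate(visible):
                for p in sp:
                    if p not in vis:
                        continue  # only points visible in view k count
                    total += 1
                    if any(b["category"] == cat and p in b["inside"] for b in boxes[k]):
                        covered += 1
            row[cat] = covered / total if total else 0.0
        scores.append(row)
    return scores


def assign_labels(scores):
    """Highest-scoring category; None if no box covers the super point (assumption)."""
    labels = []
    for row in scores:
        best = max(row, key=row.get)
        labels.append(best if row[best] > 0 else None)
    return labels


def coverage(sp, vis, box):
    """Fraction of the visible points of one super point inside one box."""
    seen = [p for p in sp if p in vis]
    return sum(p in box["inside"] for p in seen) / len(seen)


def group_instances(superpoints, labels, adjacent, visible, boxes, tau=0.3):
    """Merge adjacent, same-label super points with similar box coverage.
    Returns one instance id per super point (union-find roots)."""
    parent = list(range(len(superpoints)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in adjacent:
        if labels[u] is None or labels[u] != labels[v]:
            continue
        iu, iv = [], []
        for k, vis in enumerate(visible):
            # keep only views where both super points are (partially) visible
            if not (vis & set(superpoints[u]) and vis & set(superpoints[v])):
                continue
            for box in boxes[k]:
                iu.append(coverage(superpoints[u], vis, box))
                iv.append(coverage(superpoints[v], vis, box))
        diff = sum(abs(a - b) for a, b in zip(iu, iv))
        norm = max(sum(iu), sum(iv))
        # assumed similarity test: normalized L1 distance below tau
        if norm == 0 or diff / norm < tau:
            parent[find(u)] = find(v)
    return [find(i) for i in range(len(superpoints))]


# Toy example: three super points, two views, two "leg" boxes and one "seat" box.
superpoints = [[0, 1], [2, 3], [4, 5]]
visible = [{0, 1, 2, 3, 4, 5}, {0, 1, 2, 3}]
boxes = [
    [{"category": "leg", "inside": {0, 1, 2, 3}}, {"category": "seat", "inside": {4, 5}}],
    [{"category": "leg", "inside": {0, 1}}],
]
labels = assign_labels(semantic_scores(superpoints, visible, boxes, ["leg", "seat"]))
print(labels, group_instances(superpoints, labels, [(0, 1), (1, 2)], visible, boxes))
```
]]></p>
<p>In the toy example, the first two super points share the label "leg" but are kept apart because a box in the second view covers only one of them, which is the situation criterion (c) is designed to detect.</p></div>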
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Prompt Tuning w/ Few-Shot 3D Data</head><p>In our method, we utilize natural language to refer to a part. However, natural language can be flexible. An object part can be named in multiple ways (e.g., spout and mouth for kettles; caster and wheel for chairs), and the definition of some parts may be ambiguous (see the dispenser in Figure <ref type="figure">1</ref>). We thus hope to finetune the GLIP model using a few 3D shapes with ground truth part segmentation, so that the GLIP model can quickly adapt to the actual definition of the part names in the text prompt.</p><p>Figure <ref type="figure">3</ref> shows the overall architecture of the GLIP model. It first employs a language encoder and an image encoder to extract language features and multi-scale visual features, respectively, which are then fed into a vision-language fusion module to fuse information across modalities. The detection head then takes as input the language-aware image features and predicts 2D bounding boxes. During pretraining, the GLIP network is supervised by both detection loss and image-language alignment loss.</p><p>It is not desirable to change the parameters of the visual module or the entire GLIP model since our goal is to leverage only a few 3D shapes for finetuning. Instead, we follow the prompt tuning strategy introduced in GLIP <ref type="bibr">[29]</ref> to finetune only the language embedding of each part name while freezing the parameters of the pretrained GLIP model. Specifically, we perform prompt tuning for each object category separately. Suppose the input text of an object category includes l tokens and denote the extracted language features (before VL fusion) as f l &#8712; R l&#215;c , where c is the number of channels. We aim to learn offset features f o &#8712; R l&#215;c for f l and feed their summation f l + f o to the remaining GLIP pipeline. 
The offset features f o consist of constant vectors for each token (part name), which can be interpreted as a local adjustment of the part definition in the language embedding space. Note that f o is not predicted by a network but is directly optimized as a trainable variable during prompt tuning. Also, f o will be fixed for each object category after prompt tuning.</p><p>In order to utilize the detection and alignment losses for optimization, we convert the few-shot 3D shapes with ground truth instance segmentation into 2D images with bounding boxes. Specifically, for each 3D point cloud, we render K 2D images from the predefined camera poses. For generating corresponding 2D ground-truth bounding boxes, we project each part instance from 3D to 2D. Note that, after projection, we need to remove occluded points (i.e., invisible points of each view) and noisy points (i.e., visible but isolated in tiny regions) to generate reasonable bounding boxes. We find that by prompt tuning with only one or a few 3D shapes, the GLIP model can quickly adapt to our part definitions and generalize to other instances.</p></div>
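<div xmlns="http://www.tei-c.org/ns/1.0"><p>The generation of 2D ground-truth boxes from projected part instances can be sketched as follows. The text does not specify how noisy points are detected, so this sketch assumes that "isolated in tiny regions" means a connected pixel region (8-connectivity) smaller than a hypothetical min_region threshold. The input is assumed to already contain only the visible (non-occluded) pixels of one part instance in one view; the function name part_bbox is likewise hypothetical.</p>
<p><![CDATA[
```python
def part_bbox(pixels, min_region=4):
    """Tight 2D box (xmin, ymin, xmax, ymax) around one projected part instance.

    pixels: iterable of integer (x, y) coordinates of the visible points.
    Connected regions with fewer than min_region pixels are treated as
    noise and dropped (assumed criterion). Returns None if nothing is left.
    """
    remaining = set(pixels)
    kept = []
    while remaining:
        # flood-fill one 8-connected region
        stack = [remaining.pop()]
        region = []
        while stack:
            x, y = stack.pop()
            region.append((x, y))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (x + dx, y + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        stack.append(neighbor)
        if len(region) >= min_region:
            kept.extend(region)
    if not kept:
        return None
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return (min(xs), min(ys), max(xs), max(ys))


# A 3 x 3 block of part pixels plus one stray pixel far away.
block = [(x, y) for x in range(10, 13) for y in range(20, 23)]
print(part_bbox(block + [(100, 100)]))
```
]]></p>
<p>Without the noise filter, the single stray pixel would stretch the box across most of the image, which is why the text insists on removing such points before computing boxes for prompt tuning.</p></div>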
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Multi-View Visual Feature Aggregation</head><p>The GLIP model is sensitive to camera views. For example, images taken from some unfamiliar views (e.g., the rear view of a cabinet) can be uninformative and confusing, making it difficult for the GLIP model to predict accurately. However, unlike regular 2D recognition tasks, our input is a 3D point cloud, and there are pixel-wise correspondences between different 2D views. Therefore, we hope the GLIP model can leverage these 3D priors to make better predictions instead of focusing on each view in isolation.</p><p>In order to take full advantage of the pretrained GLIP model, we propose a training-free multi-view visual feature aggregation module that could be plugged into the original GLIP network without changing any existing network weights. Specifically, the feature aggregation module takes K feature maps {f k &#8712; R m&#215;m&#215;c } as input, where m is the spatial resolution of the feature map and c is the number of channels. The input feature maps {f k } are generated by the GLIP module separately for each 2D view of the input point cloud. Our feature aggregation module fuses them and generates K fused feature maps {f&#8242; k } of the same shape, which are then used to replace the original feature maps and fed into the remaining layers of the GLIP model.</p><p>As shown in Figure <ref type="figure">4</ref>, for each cell (u, v) of feature map f i , we find its corresponding cell (u i&#8594;k , v i&#8594;k ) in each feature map f k and use their weighted average to serve as the fused feature of the cell:</p><formula notation="TeX">f'_i[u, v] = \frac{\sum_{k} w_{i \to k}[u, v] \, f_k[u_{i \to k}, v_{i \to k}]}{\sum_{k} w_{i \to k}[u, v]}</formula><p>where w i&#8594;k [u, v] is the weight given to view k for cell (u, v) of view i. Specifically, we define P i (u, v) as the set of 3D points that are visible in view i and whose projections lie within cell (u, v). We then choose the cell in view k with the most overlapping 3D points as the corresponding cell:</p><formula notation="TeX">(u_{i \to k}, v_{i \to k}) = \arg\max_{(x, y)} \left| P_i(u, v) \cap P_k(x, y) \right|</formula><p>
Note that if none of the 3D points in P i (u, v) is visible in view k, then feature map f k will not contribute to the fused feature f&#8242; i [u, v]. Since the GLIP model generates multi-scale visual features, our aggregation module fuses features of each scale level separately.</p><p>There are various options for which visual features to fuse (see Figure <ref type="figure">3</ref>). One intuitive choice is to fuse the final visual features before the detection head, and we denote this choice as late fusion. We find that late fusion does not improve, and can even degrade, the original performance. This is mainly because the final visual features contain too much shape information of the predicted 2D bounding boxes. Directly averaging the final visual features can be seen as averaging bounding boxes in 2D, which does not make sense. Instead, we choose to fuse the visual features before the vision-language fusion (denoted as early fusion). Since the text prompt is not involved yet, the visual features mainly describe the geometry and appearance of the input shape. Fusing these features across views with the 3D priors can thus lead to a more comprehensive visual understanding of the input shape.</p></div>
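<div xmlns="http://www.tei-c.org/ns/1.0"><p>The aggregation step can be illustrated for a single scale level with a small pure-Python sketch. It is not the authors' implementation. Feature maps are represented as dictionaries from cell coordinates to feature vectors, and each view also stores the set of visible 3D point ids per cell; both layouts are assumptions. The weight used here, the fraction of a cell's points that also fall into the corresponding cell of the other view, is an assumed choice that is merely consistent with the text: it becomes zero when none of the points is visible in that view.</p>
<p><![CDATA[
```python
def aggregate(cells, feats):
    """Fuse per-view feature maps using 3D point correspondences.

    cells[k]: dict mapping a cell (u, v) to the set of point ids that are
              visible in view k and project into that cell.
    feats[k]: dict mapping a cell (u, v) to its feature vector (list of floats).
    Returns fused feature maps with the same layout as feats.
    """
    fused = []
    for i, cells_i in enumerate(cells):
        fused_i = {}
        for cell, feat in feats[i].items():
            pts = cells_i.get(cell, set())
            if not pts:
                # background cell without 3D points: keep as is (assumption)
                fused_i[cell] = list(feat)
                continue
            acc = [0.0] * len(feat)
            total = 0.0
            for k, cells_k in enumerate(cells):
                # corresponding cell: the one sharing the most 3D points
                best, overlap = None, 0
                for other, pts_k in cells_k.items():
                    shared = len(pts & pts_k)
                    if shared > overlap:
                        best, overlap = other, shared
                if best is None:
                    continue  # no shared visible points: view k does not contribute
                weight = overlap / len(pts)  # assumed weighting
                total += weight
                acc = [a + weight * f for a, f in zip(acc, feats[k][best])]
            fused_i[cell] = [a / total for a in acc]
        fused.append(fused_i)
    return fused


# Two views with one-dimensional features.
cells = [{(0, 0): {1, 2, 3}}, {(0, 0): {1}, (0, 1): {2, 3}}]
feats = [{(0, 0): [1.0]}, {(0, 0): [3.0], (0, 1): [5.0]}]
print(aggregate(cells, feats))
```
]]></p>
<p>Each view always contributes to its own cells with weight one, so the denominator never vanishes for cells that contain points. Applying the same routine to early visual features rather than the final ones corresponds to the early fusion choice argued for above.</p></div>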
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets and Metrics</head><p>To evaluate the generalizability of various approaches and their performances in the low-shot setting, we curate an ensembled dataset named PartNet-Ensembled (PartNetE), which consists of shapes from existing datasets PartNet <ref type="bibr">[40]</ref> and PartNet-Mobility <ref type="bibr">[67]</ref>. Note that PartNet-Mobility contains more object categories but fewer shape instances, and PartNet contains more shape instances but fewer object categories. We thus utilize shapes from PartNet-Mobility for few-shot learning and testing, and use shapes from PartNet to serve as additional large-scale training data for transfer learning. As a result, the test set of PartNetE contains 1,906 shapes covering 45 object categories. In addition, we randomly reserve 8 shapes from each of the 45 object categories for few-shot training. Also, we may utilize the additional 28,367 shapes from PartNet for training, which cover 17 out of 45 object categories and have part annotations consistent with the test set. Some of the original part categories in PartNet (e.g., "back frame vertical bar" for chairs) are too fine-grained and ambiguous to evaluate unsupervised text-driven part segmentation approaches. We thus select a subset of 103 parts when constructing the PartNetE dataset, which covers both common coarse-grained parts (e.g., chair back and tabletop) and fine-grained parts (e.g., wheel, handle, button, knob, switch, touchpad) that may be useful in downstream tasks such as robotic manipulation. See the supplementary material for more details of the dataset.</p><p>We follow <ref type="bibr">[40]</ref> to utilize category mIoU and mAP (50% IoU threshold) as the semantic and instance segmentation metrics, respectively. 
We first calculate mIoU/mAP50 for each part category across all test shapes, and then average part mIoUs/mAP50s that belong to each object category to compute the object category mIoU/mAP50.</p></div>
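<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-level averaging used for the category mIoU can be sketched as follows. The function names are hypothetical, labels are given per point as plain lists, and skipping a shape when a part is absent from both prediction and ground truth is an assumption that the text does not state.</p>
<p><![CDATA[
```python
def part_iou(pred, gt, part):
    """IoU of one part on one shape; None if the part occurs in neither labeling."""
    inter = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
    union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
    return inter / union if union else None


def category_miou(shapes, parts):
    """shapes: list of (pred_labels, gt_labels) for one object category.

    First average the IoU of each part over all shapes, then average
    the per-part means to obtain the object category mIoU.
    """
    part_means = []
    for part in parts:
        ious = []
        for pred, gt in shapes:
            iou = part_iou(pred, gt, part)
            if iou is not None:  # assumed: undefined IoUs are skipped
                ious.append(iou)
        if ious:
            part_means.append(sum(ious) / len(ious))
    return sum(part_means) / len(part_means) if part_means else None


# One chair with four points: seat IoU = 1/2, leg IoU = 2/3.
shapes = [(["seat", "seat", "leg", "leg"], ["seat", "leg", "leg", "leg"])]
print(category_miou(shapes, ["seat", "leg"]))
```
]]></p>
<p>Averaging per part before averaging per category keeps large parts from dominating small ones such as buttons or knobs, which would happen if all points were pooled into a single IoU.</p></div>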
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Implementation Details</head><p>For each 3D shape (i.e., ShapeNet <ref type="bibr">[6]</ref> mesh), we use BlenderProc <ref type="bibr">[11]</ref> to render 6 views of RGB-D images and segmentation masks with a resolution of 512 &#215; 512. We unproject the images to the world space to obtain a fused point cloud with colors, normals, and ground truth part labels. The fused point clouds are used as the input for both our method and baseline approaches.</p><p>For our method, we render each input point cloud into K = 10 color images with PyTorch3D <ref type="bibr">[48]</ref>. In few-shot experiments, we utilize 8 point clouds (8 &#215; 10 rendered images with 2D bounding boxes) of each object category for prompt tuning. The threshold &#964; in part instance grouping is empirically set to 0.3.</p><p>Here, the last setting (45 &#215; 8 + 28k) describes a realistic setup, where we have large-scale part annotations for some common categories (17 categories in our case) but only a few shapes for the other categories. We aim to examine whether the 28k data of the 17 categories can help the part segmentation of the other 28 underrepresented categories. All settings are tested on the same test set.</p></div>
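<div xmlns="http://www.tei-c.org/ns/1.0"><p>Unprojecting a depth image to world space, as mentioned at the start of this section, can be sketched with a pinhole camera model. Everything specific here is an assumption for illustration: the intrinsics fx, fy, cx, cy, the convention that depth is measured along a forward-pointing camera z axis, the 4 x 4 camera-to-world matrix, and the use of non-positive depth to mark empty pixels. The conventions of the actual rendering tool may differ.</p>
<p><![CDATA[
```python
def unproject(depth, fx, fy, cx, cy, cam_to_world):
    """Lift a depth image to 3D world-space points (pinhole model).

    depth: list of rows, each a list of depth values; values that are
           zero or negative mark pixels without geometry (assumption).
    cam_to_world: 4 x 4 matrix as nested lists.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            # back-project pixel (u, v) into camera coordinates
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # apply the rigid camera-to-world transform
            world = tuple(
                sum(cam_to_world[r][c] * cam[c] for c in range(4)) for r in range(3)
            )
            points.append(world)
    return points


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
print(unproject([[0.0, 2.0]], 1.0, 1.0, 0.0, 0.0, identity))
```
]]></p>
<p>Fusing the views then amounts to concatenating the point lists of all rendered images, each lifted with its own camera-to-world matrix; colors, normals, and part labels can be carried along per pixel in the same loop.</p></div>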
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Comparison with Existing Methods</head><p>We compare with PointNet++ <ref type="bibr">[43]</ref> and PointNeXt <ref type="bibr">[44]</ref> for semantic segmentation, and compare with PointGroup <ref type="bibr">[23]</ref> and SoftGroup <ref type="bibr">[58]</ref> for instance segmentation. We train the four baseline approaches on the PartNetE dataset by taking point clouds with normals as input. For semantic segmentation, we follow <ref type="bibr">[40]</ref> to sample 10,000 points per shape as network input. For instance segmentation, we sample up to 50,000 points per shape. For each pair of baseline and setting, we train a single network.</p><p>In addition to the four baselines mentioned above, we compare against two methods dedicated to few-shot 3D semantic segmentation: ACD <ref type="bibr">[12]</ref> and Prototype <ref type="bibr">[80]</ref>. In ACD, we decompose the mesh of each 3D shape into approximate convex components with CoACD <ref type="bibr">[66]</ref> and utilize the decomposition results for adding an auxiliary loss to the pipeline of PointNet++. In Prototype, we utilize the learned point features (by PointNeXt backbone) of few-shot shapes to construct 100 prototypes for each part category, which are then used to classify each point of test shapes. See the supplementary material for more details of the baseline approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Evaluation Results</head><p>Table <ref type="table">1</ref> shows the results of semantic segmentation. Our method achieves impressive zero-shot performance on some common object categories (such as bottle, chair, and table), but performs poorly on certain categories (e.g., kettle). This is mainly because the pretrained GLIP model may not understand the meaning of the text prompt (e.g., spout for kettles). After prompt tuning with 8-shot 3D data, our method achieves a 59.4% mIoU. Our method outperforms all baselines on non-overlapping categories (i.e., those not covered by the additional PartNet training data) by a large margin. The two few-shot strategies ACD and Prototype improve the performance of the original backbone, but there are still large gaps compared to our method. Please see Figure <ref type="figure">1</ref> for example results of our method and see the supplementary material for qualitative comparison.</p><p>Table <ref type="table">2</ref> shows the results of instance segmentation. We observe phenomena similar to those in semantic segmentation. Our method achieves 18.0% mAP50 for the zero-shot setting and 44.8% mAP50 for the 8-shot setting, which outperforms all baseline approaches from both 45 &#215; 8 and 45 &#215; 8 + 28k settings. See Figure <ref type="figure">5</ref> for qualitative examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Ablation Studies</head><p>Proposed Components: We ablate the proposed components, and the results are shown in Table <ref type="table">3</ref>. For the first row, we only utilize the pretrained GLIP model. To obtain 3D semantic segmentation, we assign part labels to all visible points within the bounding boxes. The numbers indicate that this strategy is less effective than our proposed 3D voting and grouping module (second row). Moreover, without our proposed module, we are not able to obtain 3D instance segmentation. The second and third rows show the impact of (8-shot) prompt tuning. We observe significant improvements, especially on the Kettle category, as the zero-shot GLIP model fails to understand the meaning of "spout" but adapts to the definition after few-shot prompt tuning.</p><p>The second and fourth rows show the effect of our multi-view feature aggregation module. Without utilizing any extra data for finetuning, we leverage multi-view 3D priors to help the GLIP model better understand the input 3D shape and thus improve performance. After integrating all three modules, we achieve the final performance (last row).</p><p>Variations of Input Point Clouds: Table <ref type="table">4</ref> evaluates the robustness of our method to variations of the input point cloud. We observe that when the input point cloud is partial and does not cover all regions of the object, our method still performs well (second row). We also find that after removing the textures of the ShapeNet models and generating the input point cloud from gray-scale images, our method still achieves good performance, suggesting that textures are less important for recognizing object parts. However, we find that the performance of our method may degrade when the input point cloud becomes sparse. On the one hand, sparse point clouds cause a larger domain gap for 2D renderings of point clouds. On the other hand, the sparsity makes it hard for our super point generation algorithm to produce good results. That being said, we want to point out that dense point clouds are already widely available in daily life (see Section 4.5).</p></div>
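<div xmlns="http://www.tei-c.org/ns/1.0"><p>The GLIP-only ablation above assigns part labels to all visible points that fall inside predicted bounding boxes. A minimal single-view sketch of that idea is given below. It is not the authors' code: the function name label_points_in_boxes, the input layout (projected pixel coordinates plus a visibility mask), and the rule that the highest-scoring box wins when boxes overlap are all assumptions made for illustration.</p>
<ab><![CDATA[
```python
import numpy as np


def label_points_in_boxes(pixels, visible, boxes, box_labels, box_scores,
                          background=-1):
    """Label visible points by the 2D boxes that contain their projection.

    pixels:  (N, 2) array of projected (x, y) pixel coordinates in one view.
    visible: (N,) boolean mask of points visible in that view.
    boxes:   iterable of (x0, y0, x1, y1) boxes with labels and scores.
    Overlaps are resolved in favour of the higher score (an assumption).
    """
    labels = np.full(len(pixels), background, dtype=int)
    best = np.full(len(pixels), -np.inf)
    for (x0, y0, x1, y1), part, score in zip(boxes, box_labels, box_scores):
        inside = (
            visible
            & (pixels[:, 0] >= x0) & (pixels[:, 0] <= x1)
            & (pixels[:, 1] >= y0) & (pixels[:, 1] <= y1)
            & (score > best)
        )
        labels[inside] = part
        best[inside] = score
    return labels
```
]]></ab>
<p>How labels from several views are combined is left out of this sketch, and it yields only semantic labels, which is consistent with the observation above that instance segmentation is not available without the voting and grouping module.</p></div>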
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Number of Shapes in Prompt Tuning:</head><p>We ablate the number of shapes used for prompt tuning, and the results are shown in Figure <ref type="figure">6</ref> (left). We observe that using only a single shape for prompt tuning can already substantially improve the performance of the pretrained GLIP model in some categories (e.g., Kettle). Also, beyond 4 shapes, the gain from increasing the number of shapes slows down.</p><p>We also find that prompt tuning is less effective for object categories that have richer appearance and structure variations (e.g., StorageFurniture).</p><p>Number of 2D Views: We render K = 10 2D views for each input point cloud in our main experiments. We ablate the value of K, and the results are shown in Figure <ref type="figure">6</ref> (right). We observe a significant performance drop when K is reduced to 5 and a mild gain when a larger K is used.</p><p>Early Fusion vs. Late Fusion: In the last paragraph of Section 3.4, we discuss two choices for multi-view feature aggregation: early fusion and late fusion. Table <ref type="table">5</ref> compares these two choices and shows that late fusion even degrades performance, whereas early fusion is helpful.</p><p>GLIP vs. CLIP: We have also considered using other pretrained vision-language models, such as CLIP <ref type="bibr">[45]</ref>. However, we find that the pretrained CLIP model fails to recognize fine-grained object parts and has difficulty generating region-level output. See the supplementary material for details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Real-World Demo</head><p>Thanks to the strong generalizability of the GLIP model, our method can be directly deployed in the real world without a significant domain gap. As shown in Figure <ref type="figure">7</ref>, we use an iPhone 12 Pro Max, equipped with a LiDAR sensor, to capture a video and feed the fused point cloud to our method. We observe performance similar to that in our synthetic experiments. Please note that existing 3D networks are sensitive to the input format. For example, they assume objects are normalized to per-category canonical poses. They also need to overcome a significant domain gap, which makes it hard to deploy them directly in real scenarios. See the supplementary material for more details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion and Limitations</head><p>The current pipeline utilizes predicted bounding boxes from the GLIP model. We notice that GLIPv2 <ref type="bibr">[76]</ref> has 2D segmentation capabilities, but its pretrained model was not released at the time of submission. We acknowledge that it would be more natural to use 2D segmentation results from pretrained models, which are more accurate than bounding boxes. However, we want to point out that it is still non-trivial to obtain 3D instance segmentation even from multi-view 2D segmentation, and all components of our proposed method would still be useful (with necessary adaptations). A bigger concern is that our method cannot handle the interior points of objects. It also suffers from long running time due to point cloud rendering and multiple inferences of the GLIP model. Therefore, using our method to distill the knowledge of 2D VL models and train 3D foundation models is a promising future direction, which may lead to more efficient inference.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Recent commodity-grade 3D scanning devices (e.g., iPhone 12 Pro) can already capture high-quality point clouds (see Figure 7).</p></note>
		</body>
		</text>
</TEI>
