
Title: Multimodal Semi-Supervised Learning for 3D Objects
In recent years, semi-supervised learning has been widely explored and shows excellent data efficiency on 2D data. Due to the scarcity of labeled 3D data, there is an emerging need to improve data efficiency for 3D tasks as well. This paper explores how the coherence of different modalities of 3D data (e.g., point clouds, images, and meshes) can be used to improve data efficiency for both 3D classification and retrieval tasks. We propose a novel multimodal semi-supervised learning framework that introduces an instance-level consistency constraint and a novel multimodal contrastive prototype (M2CP) loss. The instance-level consistency constraint forces the network to generate consistent representations for the same object regardless of its modality. The M2CP loss maintains a multimodal prototype for each class and learns features with small intra-class variation by minimizing the distance of each object's features to its class prototype while maximizing the distance to all other prototypes. Our proposed framework outperforms all state-of-the-art counterparts by a large margin on both classification and retrieval tasks on the ModelNet10 and ModelNet40 datasets.
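As a rough illustration of the M2CP idea, the loss can be viewed as a cross-entropy over feature-to-prototype similarities. The sketch below is a minimal PyTorch rendering under that reading; the L2 normalization and temperature are our assumptions, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def m2cp_loss(features, labels, prototypes, temperature=0.1):
    """Contrastive prototype loss sketch: pull each (L2-normalized) feature
    toward its class prototype and push it away from all other prototypes.

    features:   (N, D) embeddings from any modality (point cloud, image, mesh)
    labels:     (N,) class indices
    prototypes: (C, D) one maintained prototype per class
    """
    features = F.normalize(features, dim=1)
    prototypes = F.normalize(prototypes, dim=1)
    # Similarity of every feature to every class prototype, sharpened by
    # an (assumed) temperature; cross-entropy pulls toward the own class.
    logits = features @ prototypes.t() / temperature  # (N, C)
    return F.cross_entropy(logits, labels)
```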
Journal Name: The British Machine Vision Conference (BMVC)
Sponsoring Org: National Science Foundation
More Like this
  1. Deep learning now offers state-of-the-art accuracy for many prediction tasks. A form of deep learning called deep convolutional neural networks (CNNs) is especially popular on image, video, and time series data. Due to its high computational cost, CNN inference is often a bottleneck in analytics tasks on such data. Thus, much work in the computer architecture, systems, and compilers communities studies how to make CNN inference faster. In this work, we show that by elevating the abstraction level and re-imagining CNN inference as queries, we can bring to bear database-style query optimization techniques to improve CNN inference efficiency. We focus on tasks that perform CNN inference repeatedly on inputs that are only slightly different. We identify two popular CNN tasks with this behavior: occlusion-based explanations (OBE) and object recognition in videos (ORV). OBE is a popular method for “explaining” CNN predictions. It outputs a heatmap over the input to show which regions (e.g., image pixels) mattered most for a given prediction. It leads to many re-inference requests on locally modified inputs. ORV uses CNNs to identify and track objects across video frames. It also leads to many re-inference requests. We cast such tasks in a unified manner as a novel instance of the incremental view maintenance problem and create a comprehensive algebraic framework for incremental CNN inference that reduces computational costs. We produce materialized views of features produced inside a CNN and connect them with a novel multi-query optimization scheme for CNN re-inference. Finally, we also devise novel OBE-specific and ORV-specific approximate inference optimizations exploiting their semantics. We prototype our ideas in Python to create a tool called Krypton that supports both CPUs and GPUs. Experiments with real data and CNNs show that Krypton reduces runtimes by up to 5× (respectively, 35×) to produce exact (respectively, high-quality approximate) results without raising resource requirements.
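For context on item 1, the sketch below shows the naive OBE loop that issues one full re-inference request per occluder position; this is the redundant baseline that incremental systems like Krypton optimize away, not Krypton's algorithm itself. The model interface, patch size, stride, and fill value are illustrative assumptions.

```python
import torch

def occlusion_heatmap(model, image, target_class, patch=16, stride=8, fill=0.0):
    """Naive occlusion-based explanation: slide an occluding patch over the
    image and record how the target-class score drops. Every position is a
    full re-inference on a locally modified input, which is exactly the
    redundancy an incremental-inference system can exploit.

    image: (C, H, W) tensor; returns an (h, w) heatmap of score drops.
    Assumes `model` maps a (1, C, H, W) batch to class logits."""
    model.eval()
    with torch.no_grad():
        base = model(image.unsqueeze(0))[0, target_class].item()
        _, H, W = image.shape
        heat = []
        for y in range(0, H - patch + 1, stride):
            row = []
            for x in range(0, W - patch + 1, stride):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = fill
                score = model(occluded.unsqueeze(0))[0, target_class].item()
                row.append(base - score)  # larger drop = more important region
            heat.append(row)
    return torch.tensor(heat)
```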
  2. The success of supervised learning requires large-scale ground-truth labels, which are very expensive, time-consuming, or may need special skills to annotate. To address this issue, many self-supervised or unsupervised methods have been developed. Unlike most existing self-supervised methods, which learn only 2D image features or only 3D point cloud features, this paper presents a novel and effective self-supervised learning approach to jointly learn both 2D image features and 3D point cloud features by exploiting cross-modality and cross-view correspondences, without using any human-annotated labels. Specifically, 2D image features of rendered images from different views are extracted by a 2D convolutional neural network, and 3D point cloud features are extracted by a graph convolutional neural network. The two types of features are fed into a two-layer fully connected neural network to estimate the cross-modality correspondence. The three networks are jointly trained (i.e., cross-modality) by verifying whether two sampled data of different modalities belong to the same object; meanwhile, the 2D convolutional neural network is additionally optimized by minimizing intra-object distance while maximizing inter-object distance of rendered images in different views (i.e., cross-view). The effectiveness of the learned 2D and 3D features is evaluated by transferring them to five different tasks, including multi-view 2D shape recognition, 3D shape recognition, multi-view 2D shape retrieval, 3D shape retrieval, and 3D part segmentation. Extensive evaluations on all five tasks across different datasets demonstrate the strong generalization and effectiveness of the 2D and 3D features learned by the proposed self-supervised method.
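A minimal sketch of a cross-modality correspondence head of the kind item 2 describes, assuming pre-extracted 2D and 3D feature vectors; the feature dimensions, hidden width, and class/variable names are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalityHead(nn.Module):
    """Two-layer MLP that predicts whether a 2D image feature and a 3D point
    cloud feature describe the same object (the cross-modality pretext task).
    Dimensions are placeholder assumptions."""
    def __init__(self, dim_2d=512, dim_3d=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_2d + dim_3d, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),  # logit for "same object"
        )

    def forward(self, feat_2d, feat_3d):
        return self.mlp(torch.cat([feat_2d, feat_3d], dim=1)).squeeze(1)

# Training signal: positive pairs share an object, negatives are shuffled, e.g.
# loss = nn.functional.binary_cross_entropy_with_logits(logits, same_object)
```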
  3. The task of instance segmentation in videos aims to consistently identify objects at the pixel level throughout an entire video sequence. Existing state-of-the-art methods either follow the tracking-by-detection paradigm and employ multi-stage pipelines, or directly train a complex deep model to process entire video clips as 3D volumes. However, these methods are typically slow and resource-consuming, so they are often limited to offline processing. In this paper, we propose SRNet, a simple and efficient framework for joint segmentation and tracking of object instances in videos. The key to achieving both high efficiency and accuracy in our framework is to formulate the instance segmentation and tracking problem as a unified spatial-relation learning task, where each pixel in the current frame relates to its object center, and each object center relates to its location in the previous frame. This unified formulation allows our framework to perform joint instance segmentation and tracking in a single stage while maintaining low overhead among the different learning tasks. Our framework can handle two different task settings and demonstrates performance comparable to state-of-the-art methods on two different benchmarks while running significantly faster.
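One plausible way to decode the pixel-to-center relation from item 3 at inference time is sketched below: each foreground pixel votes for a center location via its predicted offset and is grouped with the nearest detected center. This is an illustrative reading, not SRNet's published decoding procedure.

```python
import torch

def group_pixels_by_center(offsets, centers, mask):
    """Assign each foreground pixel to its nearest object center using a
    predicted pixel-to-center offset field (one way to realize a
    spatial-relation formulation; an assumption for illustration).

    offsets: (2, H, W) predicted (dy, dx) from each pixel to its center
    centers: (K, 2) detected object centers as (y, x)
    mask:    (H, W) boolean foreground mask
    returns: (H, W) instance id per pixel (-1 for background)
    """
    H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Each pixel's vote for where its object center lies.
    votes = torch.stack([ys + offsets[0], xs + offsets[1]], dim=-1)  # (H, W, 2)
    dists = torch.cdist(votes[mask].float(), centers.float())        # (P, K)
    ids = torch.full((H, W), -1, dtype=torch.long)
    ids[mask] = dists.argmin(dim=1)  # nearest center wins
    return ids
```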
  4. 3D object detection is an important yet demanding task that heavily relies on difficult-to-obtain 3D annotations. To reduce the required amount of supervision, we propose 3DIoUMatch, a novel semi-supervised method for 3D object detection applicable to both indoor and outdoor scenes. We leverage a teacher-student mutual learning framework to propagate information from the labeled to the unlabeled training set in the form of pseudo-labels. However, due to the high task complexity, we observe that the pseudo-labels suffer from significant noise and are thus not directly usable. To that end, we introduce a confidence-based filtering mechanism inspired by FixMatch. We set confidence thresholds based upon the predicted objectness and class probability to filter out low-quality pseudo-labels. While effective, we observe that these two measures do not sufficiently capture localization quality. We therefore propose to use the estimated 3D IoU as a localization metric and set category-aware self-adjusted thresholds to filter out poorly localized proposals. We adopt VoteNet as our backbone detector on indoor datasets, while we use PV-RCNN on the autonomous driving dataset KITTI. Our method consistently improves state-of-the-art methods on both the ScanNet and SUN-RGBD benchmarks by significant margins under all label ratios (including the fully labeled setting). For example, when training using only 10% labeled data on ScanNet, 3DIoUMatch achieves a 7.7 absolute improvement on mAP@0.25 and an 8.5 absolute improvement on mAP@0.5 over the prior art. On KITTI, we are the first to demonstrate semi-supervised 3D object detection, and our method surpasses a fully supervised baseline by 1.8% to 7.6% under different label ratios and categories.
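The filtering step in item 4 could look roughly like the following, with objectness, class probability, and predicted 3D IoU combined as boolean gates; the threshold values and the per-category IoU handling here are placeholders, not the paper's tuned settings.

```python
import torch

def filter_pseudo_labels(objectness, cls_prob, iou_pred, labels,
                         obj_thresh=0.9, cls_thresh=0.9, iou_thresh=None):
    """Confidence-based pseudo-label filtering in the spirit of 3DIoUMatch:
    keep a teacher proposal only if objectness, class probability, and the
    predicted 3D IoU (a localization-quality proxy) all clear their
    thresholds. All threshold values are illustrative assumptions.

    iou_thresh: (C,) per-category thresholds (e.g., self-adjusted per class).
    """
    if iou_thresh is None:
        # Placeholder: one flat per-category threshold for every class.
        iou_thresh = torch.full((int(labels.max()) + 1,), 0.5)
    keep = (objectness > obj_thresh) \
         & (cls_prob > cls_thresh) \
         & (iou_pred > iou_thresh[labels])
    return keep  # boolean mask over proposals
```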
  5. Haliloglu, Turkan (Ed.)
    Cryo-electron tomography (cryo-ET) provides 3D visualization of subcellular components in a near-native state and at sub-molecular resolution in single cells, playing an increasingly important role in structural biology in situ. However, systematic recognition and recovery of macromolecular structures in cryo-ET data remain challenging as a result of the low signal-to-noise ratio (SNR), the small sizes of macromolecules, and the high complexity of the cellular environment. Subtomogram structural classification is an essential step for such tasks. Although acquiring large amounts of subtomograms is no longer an obstacle thanks to advances in the automation of data collection, obtaining the same number of structural labels is both computation- and labor-intensive. On the other hand, existing deep learning based supervised classification approaches are highly demanding on labeled data and have limited ability to learn about new structures rapidly from data containing very few labels of such new structures. In this work, we propose a novel approach for subtomogram classification based on few-shot learning. With our approach, classification of structures unseen in the training data can be conducted given few labeled samples in the test data through instance embedding. Experiments were performed on both simulated and real datasets. Our experimental results show that we can make inferences on new structures given only five labeled samples per class with competitive accuracy (> 0.86 on the simulated dataset with SNR = 0.1), or even one sample with an accuracy of 0.7644. The results on real datasets are also promising, with accuracy > 0.9 under both conditions and even up to 1 on one of the real datasets. Our approach achieves significant improvement compared with the baseline method and has a strong capability of generalizing to other cellular components.
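A generic prototype-style few-shot classifier of the kind item 5 describes (classify each query by its distance to class-mean support embeddings) might look like the sketch below; the normalization and distance choice are our assumptions, not the paper's exact model.

```python
import torch
import torch.nn.functional as F

def few_shot_classify(support_emb, support_labels, query_emb, n_classes):
    """Prototype-style few-shot inference: average the few labeled support
    embeddings per class into prototypes, then assign each query (e.g., an
    embedded subtomogram) to the nearest prototype. A generic sketch of
    classification via instance embedding, under assumed L2-normalized
    Euclidean distance.

    support_emb: (S, D), support_labels: (S,), query_emb: (Q, D)
    returns: (Q,) predicted class per query
    """
    protos = torch.stack([support_emb[support_labels == c].mean(dim=0)
                          for c in range(n_classes)])   # (C, D) class means
    dists = torch.cdist(F.normalize(query_emb, dim=1),
                        F.normalize(protos, dim=1))     # (Q, C)
    return dists.argmin(dim=1)
```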