Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
The success of supervised learning requires large-scale ground truth labels which are very expensive, time- consuming, or may need special skills to annotate. To address this issue, many self- or un-supervised methods are developed. Unlike most existing self-supervised methods to learn only 2D image features or only 3D point cloud features, this paper presents a novel and effective self-supervised learning approach to jointly learn both 2D image features and 3D point cloud features by exploiting cross-modality and cross-view correspondences without using any human annotated labels. Speciﬁcally, 2D image features of rendered images from different views are extracted by a 2D convolutional neural network, and 3D point cloud features are extracted by a graph convolution neural network. Two types of features are fed into a two-layer fully connected neural network to estimate the cross-modality correspondence. The three networks are jointly trained (i.e. cross-modality) by verifying whether two sampled data of different modalities belong to the same object, meanwhile, the 2D convolutional neural network is additionally optimized through minimizing intra-object distance while maximizing inter-object distance of rendered images in different views (i.e. cross-view). The effectiveness of the learned 2D and 3D features is evaluated by transferring them on ﬁve different tasks includingmore »
Cross-modal retrieval aims to learn discriminative and modal-invariant features for data from different modalities. Unlike the existing methods which usually learn from the features extracted by ofﬂine networks, in this paper, we pro- pose an approach to jointly train the components of cross- modal retrieval framework with metadata, and enable the network to ﬁnd optimal features. The proposed end-to-end framework is updated with three loss functions: 1) a novel cross-modal center loss to eliminate cross-modal discrepancy, 2) cross-entropy loss to maximize inter-class variations, and 3) mean-square-error loss to reduce modality variations. In particular, our proposed cross-modal center loss minimizes the distances of features from objects belonging to the same class across all modalities. Extensive experiments have been conducted on the retrieval tasks across multi-modalities including 2D image, 3D point cloud and mesh data. The proposed framework significantly outperforms the state-of-the-art methods for both cross-modal and in-domain retrieval for 3D objects on the ModelNet10 and ModelNet40 datasets.
We propose a semi-supervised learning approach for video classiﬁcation, VideoSSL, using convolutional neural networks (CNN). Like other computer vision tasks, existing supervised video classiﬁcation methods demand a large amount of labeled data to attain good performance. However, annotation of a large dataset is expensive and time consuming. To minimize the dependence on a large annotated dataset, our proposed semi-supervised method trains from a small number of labeled examples and exploits two regulatory signals from unlabeled data. The ﬁrst signal is the pseudo-labels of unlabeled examples computed from the conﬁdences of the CNN being trained. The other is the normalized probabilities, as predicted by an image classiﬁer CNN, that captures the information about appearances of the interesting objects in the video. We show that, under the supervision of these guiding signals from unlabeled examples, a video classiﬁcation CNN can achieve impressive performances utilizing a small fraction of annotated examples on three publicly available datasets: UCF101, HMDB51, and Kinetics.
The recent advances in algorithmic photo-editing and the vulnerability of hospitals to cyberattacks raises the concern about the tampering of medical images. This paper introduces a new large scale dataset of tampered Computed Tomography (CT) scans generated by different methods, LuNoTim-CT dataset, which can serve as the most comprehensive testbed for comparative studies of data security in healthcare. We further propose a deep learning-based framework, ConnectionNet, to automatically detect if a medical image is tampered. The proposed ConnectionNet is able to handle small tampered regions and achieves promising results and can be used as the baseline for studies of medical image tampering detection.