skip to main content


Title: Learning to Detect Mobile Objects from LiDAR Scans Without Labels
Current 3D object detectors for autonomous driving are almost entirely trained on human-annotated data. Although of high quality, the generation of such data is laborious and costly, restricting them to a few specific locations and object types. This paper proposes an alternative approach entirely based on unlabeled data, which can be collected cheaply and in abundance almost everywhere on earth. Our approach leverages several simple common sense heuristics to create an initial set of approximate seed labels. For example, relevant traffic participants are generally not persistent across multiple traversals of the same route, do not fly, and are never under ground. We demonstrate that these seed labels are highly effective to bootstrap a surprisingly accurate detector through repeated self-training without a single human annotated label. Code is available at https://github.com/YurongYou/MODEST.  more » « less
Award ID(s):
2118240
NSF-PAR ID:
10338440
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
IEEE / CVF Computer Vision and Pattern Recognition Conference
Page Range / eLocation ID:
1130 - 1140
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Current 3D object detectors for autonomous driving are almost entirely trained on human-annotated data. Although of high quality, the generation of such data is laborious and costly, restricting them to a few specific locations and object types. This paper proposes an alternative approach entirely based on unlabeled data, which can be collected cheaply and in abundance almost everywhere on earth. Our ap- proach leverages several simple common sense heuristics to create an initial set of approximate seed labels. For ex- ample, relevant traffic participants are generally not per- sistent across multiple traversals of the same route, do not fly, and are never under ground. We demonstrate that these seed labels are highly effective to bootstrap a surpris- ingly accurate detector through repeated self-training with- out a single human annotated label. Code is available at https:// github.com/ YurongYou/ MODEST . 
    more » « less
  2. Meila, Marina ; Zhang, Tong (Ed.)
    The label noise transition matrix, characterizing the probabilities of a training instance being wrongly annotated, is crucial to designing popular solutions to learning with noisy labels. Existing works heavily rely on finding “anchor points” or their approximates, defined as instances belonging to a particular class almost surely. Nonetheless, finding anchor points remains a non-trivial task, and the estimation accuracy is also often throttled by the number of available anchor points. In this paper, we propose an alternative option to the above task. Our main contribution is the discovery of an efficient estimation procedure based on a clusterability condition. We prove that with clusterable representations of features, using up to third-order consensuses of noisy labels among neighbor representations is sufficient to estimate a unique transition matrix. Compared with methods using anchor points, our approach uses substantially more instances and benefits from a much better sample complexity. We demonstrate the estimation accuracy and advantages of our estimates using both synthetic noisy labels (on CIFAR-10/100) and real human-level noisy labels (on Clothing1M and our self-collected human-annotated CIFAR-10). Our code and human-level noisy CIFAR-10 labels are available at https://github.com/UCSC-REAL/HOC. 
    more » « less
  3. null (Ed.)
    The success of supervised learning requires large-scale ground truth labels which are very expensive, time- consuming, or may need special skills to annotate. To address this issue, many self- or un-supervised methods are developed. Unlike most existing self-supervised methods to learn only 2D image features or only 3D point cloud features, this paper presents a novel and effective self-supervised learning approach to jointly learn both 2D image features and 3D point cloud features by exploiting cross-modality and cross-view correspondences without using any human annotated labels. Specifically, 2D image features of rendered images from different views are extracted by a 2D convolutional neural network, and 3D point cloud features are extracted by a graph convolution neural network. Two types of features are fed into a two-layer fully connected neural network to estimate the cross-modality correspondence. The three networks are jointly trained (i.e. cross-modality) by verifying whether two sampled data of different modalities belong to the same object, meanwhile, the 2D convolutional neural network is additionally optimized through minimizing intra-object distance while maximizing inter-object distance of rendered images in different views (i.e. cross-view). The effectiveness of the learned 2D and 3D features is evaluated by transferring them on five different tasks including multi-view 2D shape recognition, 3D shape recognition, multi-view 2D shape retrieval, 3D shape retrieval, and 3D part-segmentation. Extensive evaluations on all the five different tasks across different datasets demonstrate strong generalization and effectiveness of the learned 2D and 3D features by the proposed self-supervised method. 
    more » « less
  4. Korban, Matthew ; Acton, Scott T ; Youngs, Peter ; Foster, Jonathan (Ed.)
    Instructional activity recognition is an analytical tool for the observation of classroom education. One of the primary challenges in this domain is dealing with the intri- cate and heterogeneous interactions between teachers, students, and instructional objects. To address these complex dynamics, we present an innovative activity recognition pipeline designed explicitly for instructional videos, leveraging a multi-semantic attention mechanism. Our novel pipeline uses a transformer network that incorporates several types of instructional seman- tic attention, including teacher-to-students, students-to-students, teacher-to-object, and students-to-object relationships. This com- prehensive approach allows us to classify various interactive activity labels effectively. The effectiveness of our proposed algo- rithm is demonstrated through its evaluation on our annotated instructional activity dataset. 
    more » « less
  5. An important task in human-computer interaction is to rank speech samples according to their expressive content. A preference learning framework is appropriate for obtaining an emotional rank for a set of speech samples. However, obtaining reliable labels for training a preference learning framework is a challenging task. Most existing databases provide sentence-level absolute attribute scores annotated by multiple raters, which have to be transformed to obtain preference labels. Previous studies have shown that evaluators anchor their absolute assessments on previously annotated samples. Hence, this study proposes a novel formulation for obtaining preference learning labels by only considering annotation trends assigned by a rater to consecutive samples within an evaluation session. The experiments show that the use of the proposed anchor-based ordinal labels leads to significantly better performance than models trained using existing alternative labels. 
    more » « less