

Title: 3D SceneFlowNet: Self-supervised 3D Scene Flow Estimation based on Graph CNN
Although deep learning approaches have achieved promising success in 2D optical flow estimation, accurately estimating scene flow in 3D space remains challenging because point clouds inherently lack topological information. In this paper, we address the problem of self-supervised 3D scene flow estimation based on dynamic graph convolutional neural networks (GCNNs), in a model we call 3D SceneFlowNet. To better learn geometric relationships among points, we introduce EdgeConv to learn a pyramid of multi-level features from point clouds, and a self-attention mechanism that fuses the multi-level features to predict the final scene flow. Our trained model efficiently processes a pair of adjacent point clouds as input and predicts 3D scene flow accurately without any supervision. The proposed approach achieves superior performance on both the synthetic ModelNet40 dataset and real LiDAR scans from the KITTI Scene Flow 2015 dataset.
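For concreteness, below is a minimal sketch of an EdgeConv-style layer in the spirit of DGCNN, written in PyTorch. It is an illustrative assumption, not the authors' released code; the k-NN construction, MLP shape, and all names are placeholders.

```python
import torch
import torch.nn as nn

def knn(x, k):
    """x: (B, N, C) point features; returns indices of the k nearest
    neighbors of every point, shape (B, N, k), excluding the point itself."""
    dist = torch.cdist(x, x)                                   # (B, N, N) pairwise distances
    return dist.topk(k + 1, largest=False).indices[..., 1:]    # drop the self-match

class EdgeConv(nn.Module):
    """Edge features h([x_i, x_j - x_i]) max-pooled over a dynamic k-NN graph."""
    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, x):
        B, N, C = x.shape
        idx = knn(x, self.k)                                   # (B, N, k)
        nbrs = torch.gather(
            x.unsqueeze(1).expand(B, N, N, C), 2,
            idx.unsqueeze(-1).expand(B, N, self.k, C))         # (B, N, k, C) neighbor features
        center = x.unsqueeze(2).expand_as(nbrs)                # (B, N, k, C) center features
        edge = torch.cat([center, nbrs - center], dim=-1)      # (B, N, k, 2C) edge features
        return self.mlp(edge).max(dim=2).values                # (B, N, out_dim) max over neighbors
```

Stacking such layers at progressively coarser resolutions would yield the kind of multi-level feature pyramid the abstract describes; the self-attention fusion over those levels is a separate component not shown here.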
Award ID(s):
2104032
NSF-PAR ID:
10426993
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE International Conference on Image Processing (ICIP)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Scene flow depicts the dynamics of a 3D scene, which is critical for applications such as autonomous driving, robot navigation, AR/VR, etc. Conventionally, scene flow is estimated from dense, regular RGB video frames. With the development of depth-sensing technologies, precise 3D measurements have become available via point clouds, which have sparked new research in 3D scene flow. Nevertheless, it remains challenging to extract scene flow from point clouds due to the sparsity and irregularity of typical point cloud sampling patterns. One major issue related to irregular sampling is the randomness during point set abstraction/feature extraction, an elementary step in many flow estimation pipelines. A novel Spatial Abstraction with Attention (SA2) layer is accordingly proposed to alleviate this unstable abstraction problem (see the sketch below). Moreover, a Temporal Abstraction with Attention (TA2) layer is proposed to rectify attention in the temporal domain, benefiting motions over a larger range of scales. Extensive analysis and experiments verify the motivation and significant performance gains of our method, dubbed Flow Estimation via Spatial-Temporal Attention (FESTA), compared to several state-of-the-art scene flow estimation methods.
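    To make the idea of attention-stabilized abstraction concrete, the toy PyTorch layer below replaces max-pooling over a sampled neighborhood with learned attention weights, so the abstraction depends less on which points happened to be sampled. It is a sketch of the general idea, not the SA2/TA2 implementation.

```python
import torch
import torch.nn as nn

class AttnAggregate(nn.Module):
    """Aggregate k neighbor features with learned attention weights instead of
    max-pooling, reducing sensitivity to the random choice of sampled points."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, nbr_feats):
        # nbr_feats: (B, N, k, C) features of the k neighbors of each abstracted point
        w = torch.softmax(self.score(nbr_feats), dim=2)   # (B, N, k, 1) attention weights
        return (w * nbr_feats).sum(dim=2)                 # (B, N, C) weighted aggregation
```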
  2. 3D object recognition accuracy can be improved by learning multi-scale spatial features from 3D geometric representations of objects such as point clouds, 3D models, surfaces, and RGB-D data. Current deep learning approaches learn such features either from structured data representations (voxel grids and octrees) or from unstructured representations (graphs and point clouds). Learning features from structured representations is limited by restrictions on resolution and tree depth, while unstructured representations pose a challenge due to non-uniformity among data samples. In this paper, we propose an end-to-end multi-level learning approach on a multi-level voxel grid to overcome these drawbacks. To demonstrate the utility of the proposed multi-level learning, we use a multi-level voxel representation of 3D objects to perform object recognition, as sketched below. The multi-level voxel representation consists of a coarse voxel grid that contains volumetric information of the 3D object; in addition, each voxel in the coarse grid that contains a portion of the object boundary is subdivided into multiple fine-level voxel grids. The performance of our multi-level learning algorithm for object recognition is comparable to dense voxel representations while using significantly less memory.
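    As a rough illustration of such a two-level representation, the NumPy sketch below builds a coarse occupancy grid and subdivides voxels into fine grids. It is a simplified assumption of the pipeline: it subdivides every occupied voxel, whereas the paper subdivides only boundary voxels, and the resolutions and names are placeholders.

```python
import numpy as np

def two_level_voxels(points, coarse=16, fine=4):
    """points: (N, 3) array normalized to [0, 1)^3. Returns a coarse occupancy
    grid plus a fine occupancy grid for every occupied coarse voxel."""
    cidx = np.floor(points * coarse).astype(int)              # coarse voxel index per point
    coarse_grid = np.zeros((coarse,) * 3, dtype=np.float32)
    coarse_grid[cidx[:, 0], cidx[:, 1], cidx[:, 2]] = 1.0     # coarse occupancy

    fine_grids = {}
    for key in map(tuple, np.unique(cidx, axis=0)):           # subdivide occupied voxels
        local = points[(cidx == key).all(axis=1)]             # points inside this voxel
        local = local * coarse - np.array(key)                # rescale into [0, 1)^3
        fidx = np.clip((local * fine).astype(int), 0, fine - 1)
        grid = np.zeros((fine,) * 3, dtype=np.float32)
        grid[fidx[:, 0], fidx[:, 1], fidx[:, 2]] = 1.0        # fine occupancy
        fine_grids[key] = grid
    return coarse_grid, fine_grids
```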
  3. We present a novel deep neural network architecture for end-to-end scene flow estimation that directly operates on large-scale 3D point clouds. Inspired by Bilateral Convolutional Layers (BCL), we propose novel DownBCL, UpBCL, and CorrBCL operations that restore structural information from unstructured point clouds and fuse information from two consecutive point clouds. Operating on discrete and sparse permutohedral lattice points, our architectural design is parsimonious in computational cost. Our model can efficiently process a pair of point cloud frames at once, with a maximum of 86K points per frame. Our approach achieves state-of-the-art performance on the FlyingThings3D and KITTI Scene Flow 2015 datasets. Moreover, when trained only on synthetic data, our approach generalizes well to real-world data and to different point densities without fine-tuning.
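    The cross-frame fusion that CorrBCL performs can be caricatured by a brute-force correlation between two frames' features. The sketch below is a naive, quadratic-cost stand-in written in PyTorch; the paper performs this efficiently on a sparse permutohedral lattice, and all names here are illustrative.

```python
import torch

def cross_frame_corr(feat1, xyz1, feat2, xyz2, k=8):
    """feat1: (N1, C), xyz1: (N1, 3) for frame t; feat2: (N2, C), xyz2: (N2, 3)
    for frame t+1. Returns per-point correlation with the k nearest cross-frame
    neighbors, shape (N1, k)."""
    idx = torch.cdist(xyz1, xyz2).topk(k, largest=False).indices  # (N1, k) neighbors in frame t+1
    nbrs = feat2[idx]                                             # (N1, k, C) gathered features
    return (feat1.unsqueeze(1) * nbrs).sum(-1)                    # dot-product correlation
```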
  4.
    Recently, 3D scene understanding has attracted attention for many applications; however, annotating a vast amount of 3D data for training is usually expensive and time-consuming. To alleviate the need for ground truth, we propose a self-supervised scheme to learn 4D spatio-temporal features (i.e., 3 spatial dimensions plus 1 temporal dimension) from dynamic point cloud data by predicting the temporal order of sampled and shuffled point cloud clips. Compared to videos, sequential 3D point clouds contain precious geometric and depth information for recognizing activities in 3D space. To learn the 4D spatio-temporal features, we introduce 4D convolutional neural networks to predict the temporal order on a self-created large-scale dataset, NTU-PCLs, derived from the NTU-RGB+D dataset; a sketch of this pretext task follows below. The efficacy of the learned 4D spatio-temporal features is verified on two tasks: 1) self-supervised 3D nearest-neighbor retrieval; and 2) self-supervised representation learning transferred to action recognition on a smaller 3D dataset. Our extensive experiments demonstrate the effectiveness of the proposed self-supervised learning method, which achieves results comparable to fully supervised methods on action recognition on the MSRAction3D dataset.
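    The temporal-order pretext task can be sketched as a small classification problem: shuffle a few clips and train the network to recover the permutation. The PyTorch snippet below assumes three clips and a placeholder feature head; the clip count, head design, and names are assumptions, not the paper's code.

```python
import itertools
import random
import torch
import torch.nn as nn

PERMS = list(itertools.permutations(range(3)))   # 3 clips -> 6 possible orderings

def make_sample(clips):
    """clips: list of 3 tensors, each (T, N, 3), in the true temporal order.
    Returns the shuffled clips and the permutation id as the classification label."""
    label = random.randrange(len(PERMS))
    shuffled = [clips[i] for i in PERMS[label]]
    return torch.stack(shuffled), label          # (3, T, N, 3), int label

class OrderHead(nn.Module):
    """Classify which permutation was applied, given per-clip backbone features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.fc = nn.Linear(3 * feat_dim, len(PERMS))

    def forward(self, clip_feats):               # (B, 3, feat_dim) from a 4D CNN backbone
        return self.fc(clip_feats.flatten(1))    # (B, 6) logits over orderings
```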
  5. 3D CT point clouds reconstructed from the original CT images are naturally represented in real-world coordinates. Compared with CT images, 3D CT point clouds contain invariant geometric features with irregular spatial distributions from multiple viewpoints. This paper rethinks pulmonary nodule detection in CT point cloud representations. We first extract multi-view features from a sparse convolutional (SparseConv) encoder by rotating the point clouds by different angles in world coordinates (see the sketch below). Then, to simultaneously learn discriminative and robust spatial features from various viewpoints, a nodule proposal optimization scheme is proposed to obtain coarse nodule regions by aggregating consistent nodule proposal predictions from the multi-view features. Last, the multi-level features and semantic segmentation features extracted from a SparseConv decoder are concatenated with the multi-view features for final nodule region regression. Experiments on the benchmark LUNA16 dataset demonstrate the feasibility of applying CT point clouds to the lung nodule detection task. Furthermore, we observe that combining multi-view predictions greatly improves the performance of the proposed framework compared to single views, while the interior texture features of nodules from images are more suitable for detecting small nodules.
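    The multi-view rotation idea can be illustrated with a few lines of PyTorch: rotate the point cloud about one axis, run a detector on each copy, and aggregate. Here `detector` is a stand-in, and the plain averaging is a simplification of the paper's proposal-consistency aggregation.

```python
import math
import torch

def rotate_z(points, angle):
    """Rotate an (N, 3) point cloud about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ R.T

def multi_view_predict(detector, points, n_views=4):
    """Run a (placeholder) detector on several rotated copies of the point
    cloud and average the per-view predictions into one prediction."""
    preds = [detector(rotate_z(points, 2 * math.pi * i / n_views))
             for i in range(n_views)]
    return torch.stack(preds).mean(dim=0)
```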