Title: An Unsupervised Approach for Simultaneous Visual Odometry and Single Image Depth Estimation
Visual odometry (VO) and single image depth estimation are critical for robot vision, 3D reconstruction, and camera pose estimation, with applications in autonomous driving, map building, augmented reality, and many other areas. Various supervised learning models have recently been proposed to train VO or single image depth estimation frameworks for each targeted scene to improve performance. However, little effort has been made to learn these separate tasks jointly without requiring the collection of a significant number of labels. This paper proposes a novel unsupervised learning approach that performs VO and single image depth estimation simultaneously. In our framework, each task benefits from the other through joint learning: we correlate the two tasks by enforcing depth consistency between VO and single image depth estimation. Based on the single image depth estimates, we resolve the most common and challenging scaling issue of monocular VO; meanwhile, through training on image sequences, VO enhances single image depth estimation accuracy. The effectiveness of the proposed method is demonstrated through extensive experiments against current state-of-the-art methods on benchmark datasets.
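The record gives no implementation details, but the depth-consistency coupling it describes is commonly realized as a loss between the depth network's prediction for a target frame and the source-frame depth reprojected through the VO pose estimate. A minimal PyTorch-style sketch under that assumption (all names hypothetical, not the paper's code):

```python
import torch

def depth_consistency_loss(depth_t, depth_s_warped, valid_mask=None):
    """Penalize disagreement between the single-image depth prediction for
    frame t (depth_t) and the source-frame depth reprojected into frame t
    via the VO pose estimate (depth_s_warped)."""
    # Normalized absolute difference keeps the penalty scale-invariant.
    diff = (depth_t - depth_s_warped).abs() / (depth_t + depth_s_warped + 1e-7)
    if valid_mask is not None:            # drop occluded / out-of-view pixels
        diff = diff[valid_mask]
    return diff.mean()
```

Tying the VO translation to the same depth scale through a term like this is what lets the single-image branch anchor the otherwise scale-ambiguous monocular VO estimate.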
Award ID(s):
2104032 2334690
PAR ID:
10426989
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE International Joint Conference on Neural Networks (IJCNN)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Single image depth estimation is a critical issue for robot vision, augmented reality, and many other applications when an image sequence is not available. Self-supervised single image depth estimation models aim to predict an accurate disparity map from a single image, without ground-truth supervision or a stereo image pair at inference time. Compared with direct single image depth estimation, a single image stereo algorithm can generate depth from a different camera perspective. In this paper, we propose a novel architecture to infer accurate disparity by leveraging both a spectral-consistency based learning model and a view-prediction based stereo reconstruction algorithm. The direct spectral-consistency based method avoids false-positive matching in smooth regions, while single image stereo preserves more distinct boundaries from another camera perspective. By learning confidence maps and designing a fusion strategy, the disparities from the two approaches can be effectively fused to produce a refined disparity. Extensive experiments and ablations indicate that our method exploits the advantages of both spectral consistency and view prediction, especially in constraining object boundaries and correcting wrongly predicted regions.
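The abstract above does not spell out the fusion step; a plausible minimal sketch of confidence-weighted disparity fusion (names hypothetical, not the paper's code) is:

```python
import torch

def fuse_disparities(disp_spectral, disp_stereo, conf_spectral, conf_stereo):
    """Blend two disparity maps using learned per-pixel confidence maps.
    Confidences are normalized so the two weights sum to one at each pixel."""
    weights = torch.softmax(torch.stack([conf_spectral, conf_stereo]), dim=0)
    return weights[0] * disp_spectral + weights[1] * disp_stereo
```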
  2. We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often assume a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training on large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find that the audio prediction task significantly enhances the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning.
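The multi-modal supervision described for XVO suggests a shared encoder with a pose regressor plus auxiliary decoders. A rough architectural sketch under that assumption (head names and structure hypothetical, not the paper's actual network):

```python
import torch.nn as nn

class MultiTaskVO(nn.Module):
    """Shared encoder feeding a pose regressor and auxiliary heads
    (e.g. segmentation, flow, depth, audio), whose losses would be
    combined with the pose loss during semi-supervised training."""
    def __init__(self, encoder, pose_head, aux_heads):
        super().__init__()
        self.encoder = encoder
        self.pose_head = pose_head                  # regresses 6-DoF relative pose
        self.aux_heads = nn.ModuleDict(aux_heads)   # auxiliary prediction tasks

    def forward(self, frames):
        feats = self.encoder(frames)
        pose = self.pose_head(feats)
        aux = {name: head(feats) for name, head in self.aux_heads.items()}
        return pose, aux
```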
  3. Self-supervised depth estimation has recently demonstrated promising performance compared to supervised methods on challenging indoor scenes. However, most efforts focus on exploiting photometric and geometric consistency via forward and backward image warping, based on monocular videos or stereo image pairs. The influence of defocus blur on depth estimation is neglected, limiting performance for out-of-focus objects and scenes. In this work, we propose the first framework for simultaneous depth estimation from a single image and an image focal stack, using depth-from-defocus and depth-from-focus algorithms. The proposed network learns an optimal depth mapping from the information contained in the blur of a single image, generates a simulated image focal stack and an all-in-focus image, and trains a depth estimator from the image focal stack. In addition to validating our method on both the synthetic NYUv2 dataset and a real DSLR dataset, we collect our own dataset using a DSLR camera and further verify our method on it. Experiments demonstrate that our system surpasses the state-of-the-art supervised depth estimation method by over 4% in accuracy and achieves superb performance among methods without direct supervision on the synthesized NYUv2 dataset, which has rarely been explored.
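Simulating a focal stack from a single image and its depth typically rests on the thin-lens circle-of-confusion model; a standard version of that formula (not necessarily the paper's exact formulation) is sketched below:

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Blur-circle diameter (same units as focal_len) for a point at
    distance `depth` seen through a thin lens focused at `focus_dist`.
    The aperture diameter is focal_len / f_number."""
    aperture = focal_len / f_number
    return (aperture * focal_len * abs(depth - focus_dist)
            / (depth * (focus_dist - focal_len)))
```

Rendering the same scene with several `focus_dist` values yields a simulated focal stack from which a depth-from-focus estimator can be trained.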
  4. One of the grand challenges in computer vision is to recover 3D poses and shapes of multiple human bodies with absolute scale from a single RGB image. The challenge stems from the inherent depth and scale ambiguity of a single view. The state of the art in 3D human pose and shape estimation mainly focuses on estimating 3D joint locations relative to the root joint, defined as the pelvis joint. In this paper, a novel approach called Absolute-ROMP is proposed, which builds upon a one-stage multi-person 3D mesh predictor network, ROMP, to estimate multi-person 3D poses and shapes with absolute scale from a single RGB image. To achieve this, we introduce absolute root joint localization in the camera coordinate frame, which enables the estimation of the 3D mesh coordinates of all persons in the image and their root joint locations normalized by the focal length. Moreover, a CNN and transformer hybrid network, called TransFocal, is proposed to predict the focal length of the image's camera. This enables Absolute-ROMP to obtain absolute depth information for all joints in the camera coordinate frame, further improving accuracy. Absolute-ROMP is evaluated on root joint localization and root-relative 3D pose estimation tasks on publicly available multi-person 3D pose datasets, and TransFocal is evaluated on a dataset created from the Pano360 dataset. Our approach achieves state-of-the-art results on these tasks, outperforming existing methods or achieving competitive performance. Due to its real-time performance, our method is applicable to in-the-wild images and videos.
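Once TransFocal supplies the focal length, recovering absolute camera-frame coordinates is standard pinhole back-projection; a minimal sketch (variable names illustrative, not the paper's code):

```python
def backproject(u, v, z, f, cx, cy):
    """Map pixel (u, v) with absolute depth z to a 3D point in the camera
    frame, given focal length f (pixels) and principal point (cx, cy)."""
    x = (u - cx) * z / f
    y = (v - cy) * z / f
    return x, y, z
```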
  5. Visual odometry (VO) is a method used to estimate the self-motion of a mobile robot using visual sensors. Unlike odometry based on integrating differential measurements, such as those from inertial sensors or wheel encoders, which can accumulate errors, VO is not compromised by drift. However, image-based VO is computationally demanding, limiting its application in use cases with low-latency, low-memory, and low-energy requirements. Neuromorphic hardware offers low-power solutions to many vision and artificial intelligence problems, but designing such solutions is complicated, and they often have to be assembled from scratch. Here we propose the use of vector symbolic architecture (VSA) as an abstraction layer for designing algorithms compatible with neuromorphic hardware. Building on a VSA model for scene analysis, described in our companion paper, we present a modular neuromorphic algorithm that achieves state-of-the-art performance on two-dimensional VO tasks. Specifically, the proposed algorithm stores and updates a working memory of the presented visual environment; based on this working memory, a resonator network estimates the changing location and orientation of the camera. We experimentally validate the neuromorphic VSA-based approach to VO with two benchmarks: one based on an event-camera dataset and the other in a dynamic scene with a robotic task.
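The resonator network mentioned above builds on basic VSA binding and similarity operations. A toy FHRR-style example with element-wise complex phasors (purely illustrative, not the paper's implementation) shows the core mechanics:

```python
import numpy as np

def random_phasor(dim, rng):
    # Unit-magnitude complex vector: a random VSA symbol.
    return np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, dim))

rng = np.random.default_rng(0)
x, y = random_phasor(4096, rng), random_phasor(4096, rng)

bound = x * y                      # bind two symbols
recovered = bound * np.conj(y)     # unbind with the conjugate (inverse) of y
similarity = np.abs(np.vdot(recovered, x)) / len(x)
print(f"similarity to x: {similarity:.3f}")  # ~1.0: x is recovered exactly
```

A resonator network iterates operations like these to factor a bound representation, here the encoding of camera location and orientation, back into its component symbols.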