

Title: Deep Unsupervised Visual Odometry Via Bundle Adjusted Pose Graph Optimization
Unsupervised visual odometry is an active research topic that has attracted extensive attention because it needs no labels and is robust in real-world scenarios. However, camera pose estimation and tracking with deep neural networks still lag behind other tasks such as detection, segmentation, and depth estimation, because the estimated trajectory lacks drift correction and the recovered 3D scene lacks map optimization. In this work, we introduce pose graph and bundle adjustment optimization into the network training process. The optimization iteratively updates both the motion and depth estimates from the deep learning network, and the refined outputs are then enforced to further satisfy the unsupervised photometric and geometric constraints. The integration of pose graph and bundle adjustment is easy to implement and significantly enhances the training effectiveness. Experiments on the KITTI dataset demonstrate that the introduced method achieves a significant improvement in motion estimation compared with other recent unsupervised monocular visual odometry algorithms.
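The abstract refers to photometric and geometric constraints without spelling them out. As a rough illustration only, the sketch below shows the pixel reprojection step that unsupervised visual odometry losses are commonly built on: each target pixel is back-projected with its predicted depth, moved by the predicted relative pose, and projected into the source image. The function name `reproject`, the intrinsics matrix `K`, and the 4x4 pose `T` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every target pixel to source-image coordinates.

    depth: (h, w) predicted depth of the target frame (hypothetical input)
    K:     (3, 3) pinhole camera intrinsics (assumed known)
    T:     (4, 4) relative pose taking target-frame points to the source frame
    Returns an (h, w, 2) array of (u, v) coordinates in the source image.
    """
    h, w = depth.shape
    # Homogeneous pixel grid of the target image, shape (3, h*w)
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project pixels to 3D points in the target camera frame
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Move the points into the source camera frame
    pts_src = T[:3, :3] @ pts + T[:3, 3:4]
    # Project onto the source image plane
    proj = K @ pts_src
    uv = proj[:2] / proj[2:3]
    return uv.T.reshape(h, w, 2)
```

A photometric loss would then sample the source image at these coordinates and compare the result with the target image. The sampling and the loss itself are omitted here, since the abstract does not specify them.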
Award ID(s):
2334246 2105257 2334690 2104032
PAR ID:
10443218
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE International Conference on Robotics and Automation (ICRA)
Page Range / eLocation ID:
6131 to 6137
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  2. Visual odometry (VO) and single image depth estimation are critical for robot vision, 3D reconstruction, and camera pose estimation, with applications in autonomous driving, map building, augmented reality, and many other areas. Various supervised learning models have recently been proposed to train VO or single image depth estimation frameworks for each targeted scene to improve performance. However, little effort has been made to learn these separate tasks together without collecting a significant number of labels. This paper proposes a novel unsupervised learning approach that learns VO and single image depth estimation simultaneously. In our framework, each task benefits from the other through joint learning. We couple the two tasks by enforcing depth consistency between VO and single image depth estimation. Using the single image depth estimate, we can resolve the most common and challenging scale issue of monocular VO. Meanwhile, by training on image sequences, VO can improve the accuracy of single image depth estimation. The effectiveness of the proposed method is demonstrated through extensive experiments against current state-of-the-art methods on benchmark datasets.
  3. We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training on large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find that the audio prediction task significantly enhances the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite using no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning.
  4. Current deep neural network approaches for camera pose estimation rely on scene structure for 3D motion estimation, but this reduces robustness and makes cross-dataset generalization difficult. In contrast, classical approaches to structure from motion estimate 3D motion from optical flow and then compute depth. Their accuracy, however, depends strongly on the quality of the optical flow. To avoid this issue, direct methods have been proposed, which separate 3D motion from depth estimation and compute 3D motion using only image gradients in the form of normal flow. In this paper, we introduce a network, NFlowNet, for normal flow estimation, which is used to enforce robust and direct constraints. In particular, normal flow is used to estimate relative camera pose based on the cheirality (depth positivity) constraint. We achieve this by formulating the optimization problem as a differentiable cheirality layer, which allows for end-to-end learning of camera pose. We perform extensive qualitative and quantitative evaluation of the proposed DiffPoseNet’s sensitivity to noise and its generalization across datasets. We compare our approach to existing state-of-the-art methods on the KITTI, TartanAir, and TUM-RGBD datasets.
  5. Global optical flow estimation is the foundation for obtaining the odometry used to enable aerial robot navigation. However, such a method has to offer low latency and high robustness while also respecting the size, weight, area and power (SWAP) constraints of the robot. Cameras coupled with inertial measurement units (IMUs) have proven to be the best combination for obtaining such low-latency odometry on resource‐constrained aerial robots. Recently, deep learning approaches for visual inertial fusion have gained momentum due to their high accuracy and robustness. However, an equally noteworthy benefit of these techniques for robotics is their inherent scalability (adaptation to different sized aerial robots) and unification (the same method works on different sized aerial robots). To this end, we present a deep learning approach called PRGFlow for obtaining global optical flow and then loosely fusing it with an IMU for full 6‐DoF (Degrees of Freedom) relative pose estimation (which is then integrated to obtain odometry). The network is evaluated on the MSCOCO dataset, and the dead‐reckoned odometry is evaluated on multiple real‐flight trajectories without any fine‐tuning or re‐training. A detailed benchmark comparing different network architectures and loss functions to enable scalability is also presented. It is shown that the method outperforms classical feature matching methods by 2 under noisy data. The supplementary material and code can be found at http://prg.cs.umd.edu/PRGFlow.

     