Title: MonoNav: MAV Navigation via Monocular Depth Estimation and Reconstruction
A major challenge in deploying the smallest Micro Aerial Vehicle (MAV) platforms (< 100 g) is their inability to carry sensors that provide high-resolution metric depth information (e.g., LiDAR or stereo cameras). Current systems rely on end-to-end learning or heuristic approaches that map images directly to control inputs and struggle to fly fast in unknown environments. In this work, we ask the following question: using only a monocular camera, optical odometry, and offboard computation, can we create metrically accurate maps that let us leverage the powerful path-planning and navigation approaches employed by larger state-of-the-art robotic systems to achieve robust autonomy in unknown environments? We present MonoNav: a fast 3D reconstruction and navigation stack for MAVs that leverages recent advances in depth-prediction neural networks to enable metrically accurate 3D scene reconstruction from a stream of monocular images and poses. MonoNav uses off-the-shelf pre-trained monocular depth estimation and fusion techniques to construct a map, then searches over motion primitives to plan a collision-free trajectory to the goal. In extensive hardware experiments, we demonstrate how MonoNav enables the Crazyflie (a 37 g MAV) to navigate quickly (0.5 m/s) in cluttered indoor environments. We evaluate MonoNav against a state-of-the-art end-to-end approach and find that it reduces the collision rate in navigation by a factor of 4. This increased safety comes at the cost of conservatism: a 22% reduction in goal completion.
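A minimal Python sketch of the kind of pipeline the abstract describes: predicted metric depth is fused into a TSDF volume (here via Open3D), and a small motion-primitive library is screened against the fused map. The depth model, intrinsics, primitive library, and occupancy extraction are placeholder assumptions, not MonoNav's actual interfaces.

    import numpy as np
    import open3d as o3d

    # Placeholder TSDF map; voxel sizes are illustrative, not MonoNav's settings.
    volume = o3d.pipelines.integration.ScalableTSDFVolume(
        voxel_length=0.05, sdf_trunc=0.15,
        color_type=o3d.pipelines.integration.TSDFVolumeColorType.RGB8)

    def integrate_frame(rgb, pose, depth_model, intrinsics):
        """Fuse one posed monocular frame. rgb: (H, W, 3) uint8; pose: 4x4
        camera-to-world; depth_model: any metric depth predictor (assumed)."""
        depth = depth_model(rgb).astype(np.float32)          # (H, W) depth in meters
        rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
            o3d.geometry.Image(rgb), o3d.geometry.Image(depth),
            depth_scale=1.0, depth_trunc=5.0, convert_rgb_to_intensity=False)
        volume.integrate(rgbd, intrinsics, np.linalg.inv(pose))  # world-to-camera extrinsic

    def pick_primitive(primitives, goal, occupied, voxel=0.05):
        """Return the collision-free primitive ending closest to the goal.
        `occupied` is a set of voxel-index tuples extracted from the fused map."""
        best, best_cost = None, np.inf
        for traj in primitives:                              # traj: (N, 3) waypoints
            if any(tuple(np.floor(p / voxel).astype(int)) in occupied for p in traj):
                continue                                     # primitive collides
            cost = np.linalg.norm(traj[-1] - goal)
            if cost < best_cost:
                best, best_cost = traj, cost
        return best                                          # None => stop safely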
Award ID(s):
2044149
PAR ID:
10509999
Publisher / Repository:
Proceedings of the International Symposium on Experimental Robotics (ISER)
Journal Name:
International Symposium on Experimental Robotics (ISER)
Format(s):
Medium: X
Location:
Melia, Thailand
Sponsoring Org:
National Science Foundation
More Like this
  1. Billard, A.; Asfour, T.; Khatib, O. (Ed.)
    Underwater navigation presents several challenges, including unstructured unknown environments, a lack of reliable localization systems (e.g., GPS), and poor visibility. Furthermore, good-quality obstacle-detection sensors for underwater robots are scarce and costly, and many sensors, such as RGB-D cameras and LiDAR, work only in air. To enable reliable mapless underwater navigation despite these challenges, we propose a low-cost end-to-end navigation system, based on a monocular camera and a fixed single-beam echo-sounder, that efficiently navigates an underwater robot to waypoints while avoiding nearby obstacles. Our proposed method is based on Proximal Policy Optimization (PPO), which takes as input the current relative goal information, estimated depth images, echo-sounder readings, and previously executed actions, and outputs 3D robot actions on a normalized scale. End-to-end training was done in simulation, where we adopted domain randomization (varying underwater conditions and visibility) to learn a policy robust to noise and changes in visibility conditions. Experiments in simulation and the real world demonstrated that our proposed method successfully and resiliently navigates a low-cost underwater robot in unknown underwater environments. The implementation is publicly available at https://github.com/dartmouthrobotics/deeprl-uw-robot-navigation. (A hedged sketch of this observation-to-action policy appears after this list.)
  2. Despite achieving impressive gains in accuracy, most existing monocular 3D human mesh reconstruction methods require large-scale 2D/3D ground truth for supervision, which limits their applicability to the ubiquitous unlabeled in-the-wild data. To alleviate the reliance on 2D/3D ground truth, we present a self-supervised 3D human pose and shape reconstruction framework that relies only on self-consistency between intermediate representations of images and projected 2D predictions. Specifically, we extract 2D joints and depth maps from monocular images as proxy inputs, which provide complementary clues for inferring accurate 3D human meshes. Furthermore, to reduce the impact of noisy and ambiguous inputs while concentrating on high-quality information, we design an uncertainty-aware module that automatically learns the reliability of the inputs at the body-joint level based on the consistency between the 2D joints and the depth map (see the weighting sketch after this list). Experiments on benchmark datasets show that our approach outperforms other state-of-the-art methods at similar supervision levels.
  3. Real-time navigation in non-trivial environments by micro aerial vehicles (MAVs) predominantly relies on modeling the MAV with idealized geometry, such as a sphere. Simplified, conservative representations increase the likelihood that a planner fails to identify valid paths, and that likelihood grows the more a robot's geometry differs from the idealized version. Few current approaches consider these situations; we are unaware of any that do so using perception-space representations. This work introduces the egocan, a perception-space obstacle representation using line-of-sight free-space estimates, and 3D Gap, a perception-space approach to gap finding for identifying goal-directed, collision-free directions of travel through 3D space (a simplified gap-finding sketch follows this list). Both are integrated, with real-time considerations in mind, to define the local planner module of a hierarchical navigation system. The result is Aerial Local Planning in Perception Space (AeriaLPiPS). AeriaLPiPS is shown to be capable of safely navigating a MAV with non-idealized geometry through various environments, including those impassable by traditional real-time approaches. The open-source implementation of this work is available at github.com/ivaROS/AeriaLPiPS.
  4. In monocular visual-inertial navigation, it is desirable to initialize the system as quickly and robustly as possible. State-of-the-art initialization methods typically construct a linear system to find a closed-form solution from image features and inertial measurements and then refine the states with a nonlinear optimization. These methods generally require a few seconds of data, which can, however, be shortened to less than a second by adding constraints from a robust but only up-to-scale monocular depth network in the nonlinear optimization. To further accelerate this process, in this work we instead leverage the scale-less depth measurements in the linear initialization step performed prior to the nonlinear one, which requires only a single depth image for the first frame. Importantly, we show that the typical independent estimation of all feature states in the closed-form solution can be modeled as estimating only the scale and bias parameters of the learned depth map (see the scale-and-bias fitting sketch after this list). As such, our formulation yields a smaller minimal problem than the state of the art, which can be seamlessly integrated into RANSAC for robust estimation. Experiments show that our method achieves state-of-the-art initialization performance in simulation as well as on popular real-world datasets (TUM-VI and EuRoC MAV). On the TUM-VI dataset, in both simulation and the real world, we demonstrate superior initialization performance with only a 0.3 s window of data, the smallest ever reported, and validate that our method initializes more often, more robustly, and more accurately in challenging scenarios.
  5. We introduce the first end-to-end learning-based solution to near-field Photometric Stereo (PS), where the light sources are close to the object of interest. This setup is especially useful for reconstructing large immobile objects. Our method is fast, producing a mesh from 52 images at 512x384 resolution in about 1 second on a commodity GPU, potentially unlocking several AR/VR applications. Existing approaches rely on optimization coupled with a far-field PS network operating on pixels or small patches. The optimization makes these approaches slow and memory-intensive (requiring 17 GB of GPU and 27 GB of CPU memory), while operating only on pixels or patches makes them highly susceptible to noise and calibration errors. To address these issues, we develop a recursive multi-resolution scheme that estimates surface normal and depth maps of the whole image at each step. The predicted depth map at each scale is then used to estimate the 'per-pixel lighting' for the next scale (a sketch of this computation follows this list). This design makes our approach almost 45x faster and 2 degrees more accurate (11.3 vs. 13.3 degrees mean angular error) than the state-of-the-art near-field PS reconstruction technique, which uses iterative optimization.
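For the underwater navigation system in item 1, this is a hedged PyTorch sketch of an observation-to-action actor of the kind the abstract describes; the layer sizes, input shapes, and names are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class NavActor(nn.Module):
        """Maps (depth image, relative goal, echo-sounder range, previous
        actions) to a normalized 3D action, as in a PPO actor network."""
        def __init__(self):
            super().__init__()
            self.depth_cnn = nn.Sequential(              # encode estimated depth image
                nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
                nn.Flatten())
            self.head = nn.Sequential(                   # fuse all observations
                nn.LazyLinear(128), nn.ReLU(),
                nn.Linear(128, 3), nn.Tanh())            # actions in [-1, 1]^3

        def forward(self, depth, goal, echo, prev_actions):
            z = self.depth_cnn(depth)                    # (B, F) depth features
            x = torch.cat([z, goal, echo, prev_actions], dim=1)
            return self.head(x)

    # Example shapes (all assumed): depth (B,1,64,64), goal (B,3),
    # echo (B,1), prev_actions (B,6) for the last two 3D actions.
    actor = NavActor()
    a = actor(torch.zeros(2, 1, 64, 64), torch.zeros(2, 3),
              torch.zeros(2, 1), torch.zeros(2, 6))      # -> (2, 3)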
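For item 2, a minimal sketch of the consistency-based reliability idea: weight each body joint by the agreement between the depth map sampled at the detected 2D joint and that joint's predicted depth. The Gaussian kernel and bandwidth are assumptions for illustration, not the paper's learned module.

    import torch

    def joint_reliability(joints_2d, depth_map, pred_joint_depth, sigma=0.1):
        """joints_2d: (B, J, 2) pixel coords; depth_map: (B, 1, H, W);
        pred_joint_depth: (B, J) depth of each projected mesh joint.
        Returns per-joint weights in (0, 1]."""
        B, J, _ = joints_2d.shape
        u = joints_2d[..., 0].long().clamp(0, depth_map.shape[-1] - 1)
        v = joints_2d[..., 1].long().clamp(0, depth_map.shape[-2] - 1)
        sampled = depth_map[torch.arange(B)[:, None], 0, v, u]    # (B, J)
        err = sampled - pred_joint_depth                          # consistency residual
        return torch.exp(-err ** 2 / (2 * sigma ** 2))            # soft down-weighting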
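For item 3, a simplified 1D slice of perception-space gap finding: treat one row of a depth image as per-bearing free range and keep angular intervals wide enough for the vehicle. This illustrates the idea only, not the egocan or 3D Gap implementation; thresholds are assumed.

    import numpy as np

    def find_gaps(ranges, bearings, min_range=2.0, robot_radius=0.3):
        """ranges, bearings: (N,) free range (m) and bearing (rad) per ray.
        Returns (start, end) ray-index pairs of admissible gaps."""
        free = ranges > min_range            # rays with enough line-of-sight free space
        gaps, start = [], None
        for i, f in enumerate(free):
            if f and start is None:
                start = i                    # gap opens
            elif not f and start is not None:
                gaps.append((start, i - 1))  # gap closes
                start = None
        if start is not None:
            gaps.append((start, len(free) - 1))
        # keep gaps whose chord at min_range exceeds the robot's width
        return [(a, b) for a, b in gaps
                if 2 * min_range * np.sin(abs(bearings[b] - bearings[a]) / 2)
                > 2 * robot_radius]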
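For item 4, a hedged sketch of the scale-and-bias formulation: rather than estimating every feature depth, fit d ≈ s·d_net + b between network depths and up-to-scale feature depths inside a simple RANSAC loop with a two-point minimal sample. The tolerance and residual model are assumptions.

    import numpy as np

    def fit_scale_bias(d_net, d_meas, iters=100, tol=0.05, seed=0):
        """d_net: (N,) network depths at features; d_meas: (N,) measured depths.
        Returns (scale, bias) robust to outliers via 2-point RANSAC."""
        rng = np.random.default_rng(seed)
        n = len(d_net)
        best_inliers = None
        for _ in range(iters):
            i, j = rng.choice(n, size=2, replace=False)    # minimal 2-point sample
            A = np.array([[d_net[i], 1.0], [d_net[j], 1.0]])
            if abs(np.linalg.det(A)) < 1e-9:
                continue                                   # degenerate sample
            s, b = np.linalg.solve(A, np.array([d_meas[i], d_meas[j]]))
            inliers = np.abs(s * d_net + b - d_meas) < tol
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers = inliers
        if best_inliers is None:                           # all samples degenerate
            best_inliers = np.ones(n, dtype=bool)
        # least-squares refinement on the inlier set
        A = np.stack([d_net[best_inliers], np.ones(best_inliers.sum())], axis=1)
        s, b = np.linalg.lstsq(A, d_meas[best_inliers], rcond=None)[0]
        return s, b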
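For item 5, a minimal sketch of the 'per-pixel lighting' step: back-project the current depth estimate and compute, for a nearby point light, each pixel's incident light direction and inverse-square falloff. The pinhole back-projection and isotropic light model are simplifying assumptions.

    import numpy as np

    def per_pixel_lighting(depth, K_inv, light_pos):
        """depth: (H, W) meters; K_inv: inverse 3x3 intrinsics;
        light_pos: (3,) light position in camera coordinates.
        Returns per-pixel unit light directions and attenuations."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
        rays = np.stack([u, v, np.ones_like(u)], axis=-1) @ K_inv.T
        points = rays * depth[..., None]                          # back-projected surface
        to_light = light_pos - points                             # per-pixel light vector
        dist = np.linalg.norm(to_light, axis=-1, keepdims=True)
        direction = to_light / np.maximum(dist, 1e-8)             # unit incident direction
        attenuation = 1.0 / np.maximum(dist ** 2, 1e-8)           # inverse-square falloff
        return direction, attenuation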