Title: Visual odometry with neuromorphic resonator networks
Visual odometry (VO) is a method used to estimate the self-motion of a mobile robot using visual sensors. Unlike odometry based on integrating differential measurements, such as those from inertial sensors or wheel encoders, which can accumulate errors, VO is not compromised by drift. However, image-based VO is computationally demanding, limiting its application in use cases with low-latency, low-memory and low-energy requirements. Neuromorphic hardware offers low-power solutions to many vision and artificial intelligence problems, but designing such solutions is complicated, and they often have to be assembled from scratch. Here we propose the use of vector symbolic architecture (VSA) as an abstraction layer to design algorithms compatible with neuromorphic hardware. Building from a VSA model for scene analysis, described in our companion paper, we present a modular neuromorphic algorithm that achieves state-of-the-art performance on two-dimensional VO tasks. Specifically, the proposed algorithm stores and updates a working memory of the presented visual environment. Based on this working memory, a resonator network estimates the changing location and orientation of the camera. We experimentally validate the neuromorphic VSA-based approach to VO with two benchmarks: one based on an event-camera dataset and the other in a dynamic scene with a robotic task.
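The resonator network mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a generic, non-spiking resonator network over random complex phasor vectors, with three flat codebooks standing in for hypothetical factors (for example horizontal shift, vertical shift and pattern identity). The function names, the vector dimension and the codebook sizes are all illustrative choices. Binding is taken to be element-wise multiplication, and unbinding is multiplication by the complex conjugate.

```python
import numpy as np


def random_phasors(rng, n_items, dim):
    """Codebook of random unit-magnitude complex vectors, shape (n_items, dim)."""
    return np.exp(1j * rng.uniform(-np.pi, np.pi, size=(n_items, dim)))


def to_phasor(v):
    """Project every component onto the unit circle, keeping only its phase."""
    return np.exp(1j * np.angle(v))


def resonator_factorize(s, codebooks, n_iter=50):
    """Estimate which entry of each codebook was bound into the vector s."""
    # Start every estimate from the superposition of all candidates.
    estimates = [to_phasor(cb.sum(axis=0)) for cb in codebooks]
    for _ in range(n_iter):
        for i, cb in enumerate(codebooks):
            # Unbind the current estimates of all other factors from s.
            others = np.ones_like(s)
            for j, est in enumerate(estimates):
                if j != i:
                    others = others * est
            unbound = s * np.conj(others)
            # Clean-up step: similarity to each codebook entry, then a
            # similarity-weighted superposition projected back to phasors.
            similarity = cb.conj() @ unbound  # shape (n_items,)
            estimates[i] = to_phasor(similarity @ cb)
    # Read out the best-matching codebook entry for every factor.
    return [int(np.argmax(np.abs(cb.conj() @ est)))
            for cb, est in zip(codebooks, estimates)]


# Illustrative sizes (assumptions, not values from the paper).
rng = np.random.default_rng(0)
dim, n_items = 2048, 8
codebooks = [random_phasors(rng, n_items, dim) for _ in range(3)]

# Bind one entry from each codebook into a single composite vector.
true_idx = [3, 5, 1]
s = codebooks[0][3] * codebooks[1][5] * codebooks[2][1]

recovered = resonator_factorize(s, codebooks)
print(recovered)
```

The sketch searches 8 × 8 × 8 combinations without enumerating them: each factor estimate is refined in turn while the others are held fixed. The hierarchical, partitioned network and the spiking phasor neurons described in the companion paper are not modelled here.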
Award ID(s):
2211387
PAR ID:
10531559
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Nature
Date Published:
Journal Name:
Nature Machine Intelligence
Volume:
6
Issue:
6
ISSN:
2522-5839
Page Range / eLocation ID:
653 to 663
Subject(s) / Keyword(s):
Visual Odometry; Vector Symbolic Architecture
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Analysing a visual scene by inferring the configuration of a generative model is widely considered the most flexible and generalizable approach to scene understanding. Yet, one major problem is the computational challenge of the inference procedure, involving a combinatorial search across object identities and poses. Here we propose a neuromorphic solution exploiting three key concepts: (1) a computational framework based on vector symbolic architectures (VSAs) with complex-valued vectors, (2) the design of hierarchical resonator networks to factorize the non-commutative transforms translation and rotation in visual scenes and (3) the design of a multi-compartment spiking phasor neuron model for implementing complex-valued resonator networks on neuromorphic hardware. The VSA framework uses vector binding operations to form a generative image model in which binding acts as the equivariant operation for geometric transformations. A scene can therefore be described as a sum of vector products, which can then be efficiently factorized by a resonator network to infer objects and their poses. The hierarchical resonator network features a partitioned architecture in which vector binding is equivariant for horizontal and vertical translation within one partition and for rotation and scaling within the other partition. The spiking neuron model allows mapping the resonator network onto efficient and low-power neuromorphic hardware. Our approach is demonstrated on synthetic scenes composed of simple two-dimensional shapes undergoing rigid geometric transformations and colour changes. A companion paper demonstrates the same approach in real-world application scenarios for machine vision and robotics.
  2. Visual odometry (VO) and single image depth estimation are critical for robot vision, 3D reconstruction, and camera pose estimation, with applications in autonomous driving, map building, augmented reality and many other areas. Various supervised learning models have recently been proposed to train VO or single image depth estimation frameworks for each targeted scene in order to improve performance. However, little effort has been made to learn these separate tasks together without requiring the collection of a significant number of labels. This paper proposes a novel unsupervised learning approach that learns VO and single image depth estimation simultaneously. In our framework, each task benefits from the other through joint learning. We correlate the two tasks by enforcing depth consistency between VO and single image depth estimation. Based on the single image depth estimation, we can resolve the most common and challenging scaling issue of monocular VO. Meanwhile, through training on a sequence of images, VO can enhance the accuracy of single image depth estimation. The effectiveness of our proposed method is demonstrated through extensive experiments comparing it with current state-of-the-art methods on the benchmark datasets.
  3. We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches, which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find that the audio prediction task significantly enhances the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning.
  4. Camera tracking is an essential building block in a myriad of HCI applications. For example, commercial VR devices are equipped with dedicated hardware, such as laser-emitting beacon stations, to enable accurate tracking of VR headsets. However, this hardware remains costly. On the other hand, low-cost solutions such as IMU sensors and visual markers exist, but they suffer from large tracking errors. In this work, we bring high accuracy and low cost together to present MoiréBoard, a new 3-DOF camera position tracking method that leverages a seemingly irrelevant visual phenomenon, the moiré effect. Based on a systematic analysis of the moiré effect under camera projection, MoiréBoard requires neither power nor camera calibration. It can be easily made at a low cost (e.g., through 3D printing) and is ready to use with any stock mobile device that has a camera. Its tracking algorithm is computationally efficient and able to run at a high frame rate. Although it is simple to implement, it tracks devices with high accuracy, comparable to state-of-the-art commercial VR tracking systems.
  5. We describe the design and performance of a high-fidelity wearable head-, body-, and eye-tracking system that offers significant improvement over previous such devices. This device’s sensors include a binocular eye tracker, an RGB-D scene camera, a high-frame-rate scene camera, and two visual odometry sensors, for a total of ten cameras, which we synchronize and record from with a data rate of over 700 MB/s. The sensors are operated by a mini-PC optimized for fast data collection, and powered by a small battery pack. The device records a subject’s eye, head, and body positions, simultaneously with RGB and depth data from the subject’s visual environment, measured with high spatial and temporal resolution. The headset weighs only 1.4 kg, and the backpack with batteries 3.9 kg. The device can be comfortably worn by the subject, allowing a high degree of mobility. Together, this system overcomes many limitations of previous such systems, allowing high-fidelity characterization of the dynamics of natural vision. 