Title: TöRF: Time-of-Flight Radiance Fields for Dynamic Scene View Synthesis
Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with low reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors now available on modern smartphones.
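The "limited unambiguous depth range" mentioned in the abstract can be illustrated with the standard phase-to-depth relation for a continuous-wave ToF camera. The sketch below is a minimal illustration and not the paper's image formation model. The function names and the 30 MHz modulation frequency are assumptions chosen for the example.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def unambiguous_range(mod_freq_hz: float) -> float:
    """Largest depth measurable before the phase wraps: c / (2 f)."""
    return C / (2.0 * mod_freq_hz)


def phase_from_depth(depth_m: float, mod_freq_hz: float) -> float:
    """Phase shift of the returned signal, wrapped to [0, 2*pi).

    Light travels to the surface and back (2 * depth), so the phase is
    2*pi*f * (2*depth / c) = 4*pi*f*depth / c.
    """
    return (4.0 * math.pi * mod_freq_hz * depth_m / C) % (2.0 * math.pi)


def depth_from_phase(phase_rad: float, mod_freq_hz: float) -> float:
    """Invert the wrapped phase; only valid inside the unambiguous range."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)


if __name__ == "__main__":
    f = 30e6  # hypothetical modulation frequency, not taken from the paper
    d_max = unambiguous_range(f)
    near = 1.0
    far = near + d_max  # one full phase wrap farther away
    print(f"unambiguous range: {d_max:.3f} m")
    print(f"recovered depth for {near:.3f} m: "
          f"{depth_from_phase(phase_from_depth(near, f), f):.3f} m")
    print(f"recovered depth for {far:.3f} m: "
          f"{depth_from_phase(phase_from_depth(far, f), f):.3f} m")
```

Two surfaces separated by exactly one unambiguous range produce the same wrapped phase, so a depth map derived from phase alone cannot tell them apart. This is one reason modeling the raw measurements, as the abstract describes, can be preferable to working with processed depth maps.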
Award ID(s):
2008464
NSF-PAR ID:
10309507
Journal Name:
Advances in neural information processing systems
Volume:
34
ISSN:
1049-5258
Sponsoring Org:
National Science Foundation
More Like this
  1. Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with low reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors that are now available on modern smartphones.
  2. Scene reconstruction using Monodepth2 (monocular depth inference), which provides depth maps from a single RGB camera, yields outputs filled with noise and inconsistencies. Instance segmentation using a Mask R-CNN (Region-Based Convolutional Neural Network) deep model can provide object segmentation results in 2D but lacks 3D information. In this paper we propose to integrate the results of instance segmentation via Mask R-CNN, CAD-model car shape alignment, and depth from Monodepth2 together with classical dynamic vision techniques to create a high-level semantic model with separability, robustness, consistency, and saliency. The model is useful for virtualized rendering, semantic augmented reality, and automatic driving. Experimental results are provided to validate the approach.
  3. We present 3DVNet, a novel multi-view stereo (MVS) depth-prediction method that combines the advantages of previous depth-based and volumetric MVS approaches. Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions, resulting in highly accurate predictions which agree on the underlying scene geometry. Unlike existing depth-prediction techniques, our method uses a volumetric 3D convolutional neural network (CNN) that operates in world space on all depth maps jointly. The network can therefore learn meaningful scene-level priors. Furthermore, unlike existing volumetric MVS techniques, our 3D CNN operates on a feature-augmented point cloud, allowing for effective aggregation of multi-view information and flexible iterative refinement of depth maps. Experimental results show our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics on the ScanNet dataset, as well as a selection of scenes from the TUM-RGBD and ICL-NUIM datasets. This shows that our method is both effective and generalizes to new settings.
  4. The Georgia Tech Miniature Autonomous Blimp (GT-MAB) needs localization algorithms to navigate to waypoints in an indoor environment without leveraging an external motion capture system. Indoor aerial robots often require a motion capture system for localization or employ simultaneous localization and mapping (SLAM) algorithms for navigation. The proposed strategy for GT-MAB localization can be accomplished using lightweight sensors on a weight-constrained platform like the GT-MAB. We train an end-to-end convolutional neural network (CNN) that predicts the horizontal position and heading of the GT-MAB using video collected by an onboard monocular RGB camera. The height of the GT-MAB, on the other hand, is estimated from measurements by a time-of-flight (ToF) single-beam laser sensor. The monocular camera and the single-beam laser sensor are sufficient for the localization algorithm to localize the GT-MAB in real time, achieving averaged 3D positioning errors of less than 20 cm and averaged heading errors of less than 3 degrees. With the accuracy of our proposed localization method, we are able to use simple proportional-integral-derivative controllers to control the GT-MAB for waypoint navigation. Experimental results on waypoint following are provided, which demonstrate the use of a CNN as the primary localization method for estimating the pose of an indoor robot that successfully enables navigation to specified waypoints.
  5. We explore the possibility of using a single monocular camera to forecast the time to collision between a suitcase-shaped robot being pushed by its user and other nearby pedestrians. We develop a purely image-based deep learning approach that directly estimates the time to collision without relying on explicit geometric depth estimates or velocity information to predict future collisions. While previous work has focused on detecting immediate collision in the context of navigating Unmanned Aerial Vehicles, the detection was limited to a binary variable (i.e., collision or no collision). We propose a more fine-grained approach to collision forecasting by predicting the exact time to collision in terms of milliseconds, which is more helpful for collision avoidance in the context of dynamic path planning. To evaluate our method, we have collected a novel dataset of over 13,000 indoor video segments, each showing a trajectory of at least one person ending in close proximity (a near collision) with the camera mounted on a mobile suitcase-shaped platform. Using this dataset, we do extensive experimentation on different temporal windows as input using an exhaustive list of state-of-the-art convolutional neural networks (CNNs). Our results show that our proposed multi-stream CNN is the best model for predicting time to near-collision. The average prediction error of our time to near-collision is 0.75 seconds across the test videos. The project webpage can be found at https://aashi7.github.io/NearCollision.html.