skip to main content

This content will become publicly available on March 1, 2023

Title: InDepth: Real-time Depth Inpainting for Mobile Augmented Reality
Mobile Augmented Reality (AR) demands realistic rendering of virtual content that seamlessly blends into the physical environment. For this reason, AR headsets and recent smartphones are increasingly equipped with Time-of-Flight (ToF) cameras to acquire depth maps of a scene in real-time. ToF cameras are cheap and fast, however, they suffer from several issues that affect the quality of depth data, ultimately hampering their use for mobile AR. Among them, scale errors of virtual objects - appearing much bigger or smaller than what they should be - are particularly noticeable and unpleasant. This article specifically addresses these challenges by proposing InDepth, a real-time depth inpainting system based on edge computing. InDepth employs a novel deep neural network (DNN) architecture to improve the accuracy of depth maps obtained from ToF cameras. The DNN fills holes and corrects artifacts in the depth maps with high accuracy and eight times lower inference time than the state of the art. An extensive performance evaluation in real settings shows that InDepth reduces the mean absolute error by a factor of four with respect to ARCore DepthLab. Finally, a user study reveals that InDepth is effective in rendering correctly-scaled virtual objects, outperforming DepthLab.
Authors:
; ; ; ; ;
Award ID(s):
2046072 1908051 1903136
Publication Date:
NSF-PAR ID:
10320717
Journal Name:
Proceedings of the ACM on interactive mobile wearable and ubiquitous technologies
Volume:
6
Issue:
1
ISSN:
2474-9567
Sponsoring Org:
National Science Foundation
More Like this
  1. Though virtual reality (VR) has been advanced to certain levels of maturity in recent years, the general public, especially the population of the blind and visually impaired (BVI), still cannot enjoy the benefit provided by VR. Current VR accessibility applications have been developed either on expensive head-mounted displays or with extra accessories and mechanisms, which are either not accessible or inconvenient for BVI individuals. In this paper, we present a mobile VR app that enables BVI users to access a virtual environment on an iPhone in order to build their skills of perception and recognition of the virtual environment andmore »the virtual objects in the environment. The app uses the iPhone on a selfie stick to simulate a long cane in VR, and applies Augmented Reality (AR) techniques to track the iPhone’s real-time poses in an empty space of the real world, which is then synchronized to the long cane in the VR environment. Due to the use of mixed reality (the integration of VR & AR), we call it the Mixed Reality cane (MR Cane), which provides BVI users auditory and vibrotactile feedback whenever the virtual cane comes in contact with objects in VR. Thus, the MR Cane allows BVI individuals to interact with the virtual objects and identify approximate sizes and locations of the objects in the virtual environment. We performed preliminary user studies with blind-folded participants to investigate the effectiveness of the proposed mobile approach and the results indicate that the proposed MR Cane could be effective to help BVI individuals in understanding the interaction with virtual objects and exploring 3D virtual environments. The MR Cane concept can be extended to new applications of navigation, training and entertainment for BVI individuals without more significant efforts.« less
  2. Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with lowmore »reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors that are now available on modern smartphones.« less
  3. Neural networks can represent and accurately reconstruct radiance fields for static 3D scenes (e.g., NeRF). Several works extend these to dynamic scenes captured with monocular video, with promising performance. However, the monocular setting is known to be an under-constrained problem, and so methods rely on data-driven priors for reconstructing dynamic content. We replace these priors with measurements from a time-of-flight (ToF) camera, and introduce a neural representation based on an image formation model for continuous-wave ToF cameras. Instead of working with processed depth maps, we model the raw ToF sensor measurements to improve reconstruction quality and avoid issues with lowmore »reflectance regions, multi-path interference, and a sensor's limited unambiguous depth range. We show that this approach improves robustness of dynamic scene reconstruction to erroneous calibration and large motions, and discuss the benefits and limitations of integrating RGB+ToF sensors now available on modern smartphones.« less
  4. Current collaborative augmented reality (AR) systems establish a common localization coordinate frame among users by exchanging and comparing maps comprised of feature points. However, relative positioning through map sharing struggles in dynamic or feature-sparse environments. It also requires that users exchange identical regions of the map, which may not be possible if they are separated by walls or facing different directions. In this paper, we present Cappella11Like its musical inspiration, Cappella utilizes collaboration among agents to forgo the need for instrumentation, an infrastructure-free 6-degrees-of-freedom (6DOF) positioning system for multi-user AR applications that uses motion estimates and range measurements between usersmore »to establish an accurate relative coordinate system. Cappella uses visual-inertial odometry (VIO) in conjunction with ultra-wideband (UWB) ranging radios to estimate the relative position of each device in an ad hoc manner. The system leverages a collaborative particle filtering formulation that operates on sporadic messages exchanged between nearby users. Unlike visual landmark sharing approaches, this allows for collaborative AR sessions even if users do not share the same field of view, or if the environment is too dynamic for feature matching to be reliable. We show that not only is it possible to perform collaborative positioning without infrastructure or global coordinates, but that our approach provides nearly the same level of accuracy as fixed infrastructure approaches for AR teaming applications. Cappella consists of an open source UWB firmware and reference mobile phone application that can display the location of team members in real time using mobile AR. We evaluate Cappella across mul-tiple buildings under a wide variety of conditions, including a contiguous 30,000 square foot region spanning multiple floors, and find that it achieves median geometric error in 3D of less than 1 meter.« less
  5. Edge-assisted Augmented Reality (AR) which offloads computeintensive Deep Neural Network (DNN)-based AR tasks to edge servers faces an important design challenge: how to pick the DNN model out of many choices proposed for each AR task for offloading. For each AR task, e.g., depth estimation, many DNN-based models have been proposed over time that vary in accuracy and complexity. In general, more accurate models are also more complex; they are larger and have longer inference time. Thus choosing a larger model in offloading can provide higher accuracy for the offloaded frames but also incur longer turnaround time, during which themore »AR app has to reuse the estimation result from the last offloaded frame, which can lead to lower average accuracy. In this paper, we experimentally study this design tradeoff using depth estimation as a case study. We design optimal offloading schedule and further consider the impact of numerous factors such as on-device fast tracking, frame downsizing and available network bandwidth. Our results show that for edge-assisted monocular depth estimation, with proper frame downsizing and fast tracking, compared to small models, the improved accuracy of large models can offset its longer turnaround time to provide higher average estimation accuracy across frames under both LTE and 5G mmWave.« less