skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 10:00 PM ET on Friday, December 8 until 2:00 AM ET on Saturday, December 9 due to maintenance. We apologize for the inconvenience.

This content will become publicly available on February 1, 2024

Title: Multi-Camera Lighting Estimation for Photorealistic Front-Facing Mobile Augmented Reality
Lighting understanding plays an important role in virtual object composition, including mobile augmented reality (AR) applications. Prior work often targets recovering lighting from the physical environment to support photorealistic AR rendering. Because the common workflow is to use a back-facing camera to capture the physical world for overlaying virtual objects, we refer to this usage pattern as back-facing AR. However, existing methods often fall short in supporting emerging front-facing mobile AR applications, e.g., virtual try-on where a user leverages a front-facing camera to explore the effect of various products (e.g., glasses or hats) of different styles. This lack of support can be attributed to the unique challenges of obtaining 360° HDR environment maps, an ideal format of lighting representation, from the front-facing camera and existing techniques. In this paper, we propose to leverage dual-camera streaming to generate a high-quality environment map by combining multi-view lighting reconstruction and parametric directional lighting estimation. Our preliminary results show improved rendering quality using a dual-camera setup for front-facing AR compared to a commercial solution.  more » « less
Award ID(s):
1815619 2105564 2236987
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
HotMobile '23: Proceedings of the 24th International Workshop on Mobile Computing Systems and Applications
Page Range / eLocation ID:
68 to 73
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. An accurate understanding of omnidirectional environment lighting is crucial for high-quality virtual object rendering in mobile augmented reality (AR). In particular, to support reflective rendering, existing methods have leveraged deep learning models to estimate or have used physical light probes to capture physical lighting, typically represented in the form of an environment map. However, these methods often fail to provide visually coherent details or require additional setups. For example, the commercial framework ARKit uses a convolutional neural network that can generate realistic environment maps; however the corresponding reflective rendering might not match the physical environments. In this work, we present the design and implementation of a lighting reconstruction framework called LITAR that enables realistic and visually-coherent rendering. LITAR addresses several challenges of supporting lighting information for mobile AR. First, to address the spatial variance problem, LITAR uses two-field lighting reconstruction to divide the lighting reconstruction task into the spatial variance-aware near-field reconstruction and the directional-aware far-field reconstruction. The corresponding environment map allows reflective rendering with correct color tones. Second, LITAR uses two noise-tolerant data capturing policies to ensure data quality, namely guided bootstrapped movement and motion-based automatic capturing. Third, to handle the mismatch between the mobile computation capability and the high computation requirement of lighting reconstruction, LITAR employs two novel real-time environment map rendering techniques called multi-resolution projection and anchor extrapolation. These two techniques effectively remove the need of time-consuming mesh reconstruction while maintaining visual quality. Lastly, LITAR provides several knobs to facilitate mobile AR application developers making quality and performance trade-offs in lighting reconstruction. We evaluated the performance of LITAR using a small-scale testbed experiment and a controlled simulation. Our testbed-based evaluation shows that LITAR achieves more visually coherent rendering effects than ARKit. Our design of multi-resolution projection significantly reduces the time of point cloud projection from about 3 seconds to 14.6 milliseconds. Our simulation shows that LITAR, on average, achieves up to 44.1% higher PSNR value than a recent work Xihe on two complex objects with physically-based materials. 
    more » « less
  2. Neural Radiance Field (NeRF) based rendering has attracted growing attention thanks to its state-of-the-art (SOTA) rendering quality and wide applications in Augmented and Virtual Reality (AR/VR). However, immersive real-time (> 30 FPS) NeRF based rendering enabled interactions are still limited due to the low achievable throughput on AR/VR devices. To this end, we first profile SOTA efficient NeRF al- gorithms on commercial devices and identify two primary causes of the aforementioned inefficiency: (1) the uniform point sampling and (2) the dense accesses and computations of the required embeddings in NeRF. Furthermore, we propose RT-NeRF, which to the best of our knowledge is the first algorithm-hardware co-design acceleration of NeRF. Specifically, on the algorithm level, RT-NeRF integrates an efficient rendering pipeline for largely alleviating the inefficiency due to the commonly adopted uniform point sampling method in NeRF by directly computing the geometry of pre-existing points. Additionally, RT-NeRF leverages a coarse-grained view-dependent computing ordering scheme for eliminating the (unnecessary) pro- cessing of invisible points. On the hardware level, our proposed RT-NeRF accelerator (1) adopts a hybrid encoding scheme to adap- tively switch between a bitmap- or coordinate-based sparsity encoding format for NeRF’s sparse embeddings, aiming to maximize the storage savings and thus reduce the required DRAM accesses while supporting efficient NeRF decoding; and (2) integrates both a high-density sparse search unit and a dual-purpose bi-direction adder & search tree to coordinate the two aforementioned encod- ing formats. Extensive experiments on eight datasets consistently validate the effectiveness of RT-NeRF, achieving a large throughput improvement (e.g., 9.7×∼3,201×) while maintaining the rendering quality as compared with SOTA efficient NeRF solutions. 
    more » « less
  3. Mobile Augmented Reality (AR) demands realistic rendering of virtual content that seamlessly blends into the physical environment. For this reason, AR headsets and recent smartphones are increasingly equipped with Time-of-Flight (ToF) cameras to acquire depth maps of a scene in real-time. ToF cameras are cheap and fast, however, they suffer from several issues that affect the quality of depth data, ultimately hampering their use for mobile AR. Among them, scale errors of virtual objects - appearing much bigger or smaller than what they should be - are particularly noticeable and unpleasant. This article specifically addresses these challenges by proposing InDepth, a real-time depth inpainting system based on edge computing. InDepth employs a novel deep neural network (DNN) architecture to improve the accuracy of depth maps obtained from ToF cameras. The DNN fills holes and corrects artifacts in the depth maps with high accuracy and eight times lower inference time than the state of the art. An extensive performance evaluation in real settings shows that InDepth reduces the mean absolute error by a factor of four with respect to ARCore DepthLab. Finally, a user study reveals that InDepth is effective in rendering correctly-scaled virtual objects, outperforming DepthLab. 
    more » « less
  4. As augmented and virtual reality (AR/VR) technology matures, a method is desired to represent real-world persons visually and aurally in a virtual scene with high fidelity to craft an immersive and realistic user experience. Current technologies leverage camera and depth sensors to render visual representations of subjects through avatars, and microphone arrays are employed to localize and separate high-quality subject audio through beamforming. However, challenges remain in both realms. In the visual domain, avatars can only map key features (e.g., pose, expression) to a predetermined model, rendering them incapable of capturing the subjects’ full details. Alternatively, high-resolution point clouds can be utilized to represent human subjects. However, such three-dimensional data is computationally expensive to process. In the realm of audio, sound source separation requires prior knowledge of the subjects’ locations. However, it may take unacceptably long for sound source localization algorithms to provide this knowledge, which can still be error-prone, especially with moving objects. These challenges make it difficult for AR systems to produce real-time, high-fidelity representations of human subjects for applications such as AR/VR conferencing that mandate negligible system latency. We present Acuity, a real-time system capable of creating high-fidelity representations of human subjects in a virtual scene both visually and aurally. Acuity isolates subjects from high-resolution input point clouds. It reduces the processing overhead by performing background subtraction at a coarse resolution, then applying the detected bounding boxes to fine-grained point clouds. Meanwhile, Acuity leverages an audiovisual sensor fusion approach to expedite sound source separation. The estimated object location in the visual domain guides the acoustic pipeline to isolate the subjects’ voices without running sound source localization. Our results demonstrate that Acuity can isolate multiple subjects’ high-quality point clouds with a maximum latency of 70 ms and average throughput of over 25 fps, while separating audio in less than 30 ms. We provide the source code of Acuity at: 
    more » « less
  5. In this paper, we study how to support high-quality immer- sive multiplayer VR on commodity mobile devices. First, we perform a scaling experiment that shows simply replicating the prior-art 2-layer distributed VR rendering architecture to multiple players cannot support more than one player due to the linear increase in network bandwidth requirement. Second, we propose to exploit the similarity of background environment (BE) frames to reduce the bandwidth needed for prefetching BE frames from the server, by caching and reusing similar frames. We nd that there is often little sim- ilarly between the BE frames of even adjacent locations in the virtual world due to a “near-object” e ect. We propose a novel technique that splits the rendering of BE frames between the mobile device and the server that drastically enhances the similarity of the BE frames and reduces the network load from frame caching. Evaluation of our imple- mentation on top of Unity and Google Daydream shows our new VR framework, Coterie, reduces per-player network requirement by 10.6X-25.7X and easily supports 4 players for high-resolution VR apps on Pixel 2 over 802.11ac, with 60 FPS and under 16ms responsiveness. 
    more » « less