Title: Acuity: Creating Realistic Digital Twins Through Multi-resolution Pointcloud Processing and Audiovisual Sensor Fusion
As augmented and virtual reality (AR/VR) technology matures, a method is desired to represent real-world persons visually and aurally in a virtual scene with high fidelity to craft an immersive and realistic user experience. Current technologies leverage camera and depth sensors to render visual representations of subjects through avatars, and microphone arrays are employed to localize and separate high-quality subject audio through beamforming. However, challenges remain in both realms. In the visual domain, avatars can only map key features (e.g., pose, expression) to a predetermined model, rendering them incapable of capturing the subjects’ full details. Alternatively, high-resolution point clouds can be utilized to represent human subjects. However, such three-dimensional data is computationally expensive to process. In the realm of audio, sound source separation requires prior knowledge of the subjects’ locations. However, it may take unacceptably long for sound source localization algorithms to provide this knowledge, which can still be error-prone, especially with moving objects. These challenges make it difficult for AR systems to produce real-time, high-fidelity representations of human subjects for applications such as AR/VR conferencing that mandate negligible system latency. We present Acuity, a real-time system capable of creating high-fidelity representations of human subjects in a virtual scene both visually and aurally. Acuity isolates subjects from high-resolution input point clouds. It reduces the processing overhead by performing background subtraction at a coarse resolution, then applying the detected bounding boxes to fine-grained point clouds. Meanwhile, Acuity leverages an audiovisual sensor fusion approach to expedite sound source separation. The estimated object location in the visual domain guides the acoustic pipeline to isolate the subjects’ voices without running sound source localization. 
Our results demonstrate that Acuity can isolate multiple subjects’ high-quality point clouds with a maximum latency of 70 ms and average throughput of over 25 fps, while separating audio in less than 30 ms. We provide the source code of Acuity at: https://github.com/nesl/Acuity.
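The two ideas in the abstract (coarse-resolution background subtraction whose bounding box is then applied to the fine-grained cloud, and a visual position estimate that steers the audio pipeline without sound source localization) can be illustrated with a minimal Python sketch. This is not taken from the Acuity source code. All function names, the voxel-occupancy test, the single axis-aligned box, and the delay computation are assumptions chosen for illustration; a real system would track one box per subject and handle sensor noise.

```python
from math import floor, sqrt

def voxel_keys(points, voxel):
    """Return the set of integer voxel indices occupied by the given points."""
    return {tuple(floor(c / voxel) for c in p) for p in points}

def coarse_foreground_box(coarse_frame, coarse_background, voxel):
    """Background subtraction at coarse resolution (hypothetical helper).

    Voxels occupied in the downsampled frame but not in the downsampled
    background are treated as foreground. Returns one axis-aligned box
    (lo, hi) in metric coordinates around them, or None if nothing moved.
    """
    fg = voxel_keys(coarse_frame, voxel) - voxel_keys(coarse_background, voxel)
    if not fg:
        return None
    lo = tuple(min(k[i] for k in fg) * voxel for i in range(3))
    hi = tuple((max(k[i] for k in fg) + 1) * voxel for i in range(3))
    return lo, hi

def crop(points, box):
    """Apply the coarse box to the full-resolution cloud."""
    lo, hi = box
    return [p for p in points if all(lo[i] <= p[i] < hi[i] for i in range(3))]

def box_center(box):
    """Use the box center as a rough estimate of the subject's position."""
    lo, hi = box
    return tuple((a + b) / 2 for a, b in zip(lo, hi))

def steering_delays(source, mics, speed_of_sound=343.0):
    """Relative arrival delay (seconds) of sound from `source` at each microphone.

    A delay-and-sum beamformer would compensate for these delays to align
    the channels; here the source position comes from the visual pipeline
    instead of a sound source localization step.
    """
    dists = [sqrt(sum((s - m) ** 2 for s, m in zip(source, mic))) for mic in mics]
    nearest = min(dists)
    return [(d - nearest) / speed_of_sound for d in dists]

# Synthetic scene: a flat floor (background) plus a dense patch of subject points.
background = [(x * 0.5, y * 0.5, 0.0) for x in range(8) for y in range(8)]
subject = [(1.0 + 0.01 * i, 1.0, 1.0 + 0.01 * j) for i in range(20) for j in range(20)]
frame = background + subject

# Detect on every 4th point only, then crop the full-resolution frame.
box = coarse_foreground_box(frame[::4], background[::4], voxel=0.25)
fine_subject = crop(frame, box)
delays = steering_delays(box_center(box), mics=[(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)])
```

The design point is that the expensive set difference runs only on the downsampled clouds, while the full-resolution frame is touched once by a cheap bounds check. Note that a box crop alone also keeps any background points that fall inside the box, so this sketch simplifies whatever per-point filtering a production pipeline would apply.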
Award ID(s):
2211301
PAR ID:
10464627
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
IoTDI '23: Proceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation
Page Range / eLocation ID:
79 to 92
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The perception of distance is a complex process that often involves sensory information beyond that of just vision. In this work, we investigated whether depth perception based on auditory information can be calibrated, a process by which the perceptual accuracy of depth judgments can be improved by providing feedback and then performing corrective actions. We further investigated whether perceptual learning through carryover effects of calibration occurs at different levels of a virtual environment’s visibility based on different levels of virtual lighting. Users performed an auditory depth judgment task over several trials in which they walked to where they perceived a sound to be, yielding absolute estimates of perceived distance. This task was performed in three sequential phases: pretest, calibration, posttest. Feedback on the perceptual accuracy of distance estimates was only provided in the calibration phase, allowing us to study the calibration of auditory depth perception. We employed a 2 (Visibility of virtual environment) × 3 (Phase) × 5 (Target Distance) multi-factorial design, manipulating the phase and target distance as within-subjects factors, and the visibility of the virtual environment as a between-subjects factor. Our results revealed that users generally tend to underestimate aurally perceived distances in VR, similar to the distance compression effects that commonly occur in visual distance perception in VR. We found that auditory depth estimates, obtained using an absolute measure, can be calibrated to become more accurate through feedback and corrective action. In terms of environment visibility, we find that environments visible enough to reveal their extent may contain visual information that users attune to in scaling aurally perceived depth.
  2. The prevalent point cloud compression (PCC) standards of today are utilized to encode various types of point cloud data, allowing for reasonable bandwidth and storage usage. With increasing demand for high-fidelity three-dimensional (3D) models for a large variety of applications, including immersive visual communication, augmented reality (AR) and virtual reality (VR), navigation, autonomous driving, and smart cities, point clouds are seeing growing usage and development. However, with the advancements in 3D modelling and sensing, the amount of data required to accurately depict such representations and models is likewise ballooning, leading to the development and standardization of the point cloud compression standards. In this article, we provide an overview of some topical and popular MPEG point cloud compression (PCC) standards. We discuss the development and applications of the Geometry-based PCC (G-PCC) and Video-based PCC (V-PCC) standards as they grow in importance in an era of virtual reality and machine learning. Finally, we conclude our article by describing the future research directions and applications of the PCC standards of today.
  3. Mobile Augmented Reality (AR) provides immersive experiences by aligning virtual content (holograms) with a view of the real world. When a user places a hologram, it is usually expected that, like a real object, it remains in the same place. However, positional errors frequently occur due to inaccurate environment mapping and device localization, to a large extent determined by the properties of natural visual features in the scene. In this demonstration we present SceneIt, the first visual environment rating system for mobile AR based on predictions of hologram positional error magnitude. SceneIt allows users to determine if virtual content placed in their environment will drift noticeably out of position, without requiring them to place that content. It shows that the severity of positional error for a given visual environment is predictable, and that this prediction can be calculated with sufficiently high accuracy and low latency to be useful in mobile AR applications.
  4. Photorealistic avatars have become essential for immersive applications in virtual reality (VR) and augmented reality (AR), enabling lifelike interactions in areas such as training simulations, telemedicine, and virtual collaboration. These avatars bridge the gap between the physical and digital worlds, improving the user experience through realistic human representation. However, existing avatar creation techniques face significant challenges, including high costs, long creation times, and limited utility in virtual applications. Manual methods, such as MetaHuman, require extensive time and expertise, while automatic approaches, such as NeRF-based pipelines, often lack efficiency and detailed facial expression fidelity, and cannot be rendered at a speed sufficient for real-time applications. Combining several cutting-edge techniques, we introduce an end-to-end 3D Gaussian Splatting (3DGS) avatar creation pipeline that leverages monocular video input to create a scalable and efficient photorealistic avatar directly compatible with the Unity game engine. Our pipeline incorporates a novel Gaussian splatting technique with customized preprocessing that enables the use of "in the wild" monocular video capture, detailed facial expression reconstruction, and embedding within a fully rigged avatar model. Additionally, we present a Unity-integrated Gaussian Splatting Avatar Editor, offering a user-friendly environment for VR/AR application development. Experimental results validate the effectiveness of our preprocessing pipeline in standardizing custom data for 3DGS training and demonstrate the versatility of Gaussian avatars in Unity, highlighting the scalability and practicality of our approach.
  5. Intelligent Virtual Agents (IVAs) have received enormous attention in recent years due to significant improvements in voice communication technologies and the convergence of different research fields such as Machine Learning, Internet of Things, and Virtual Reality (VR). Interactive conversational IVAs can appear in different forms, such as voice-only or with embodied audio-visual representations showing, for example, human-like contextually related or generic three-dimensional bodies. In this paper, we analyzed the benefits of different forms of virtual agents in the context of a VR exhibition space. Our results suggest large benefits of both embodied and thematically related audio-visual representations of IVAs. We discuss implications and suggestions for content developers to design believable virtual agents in the context of such installations.