skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Synthesizing scene-aware virtual reality teleport graphs
We present a novel approach for synthesizing scene-aware virtual reality teleport graphs, which facilitate navigation in indoor virtual environments by suggesting desirable teleport positions. Our approach analyzes panoramic views at candidate teleport positions by extracting scene perception graphs, which encode scene perception relationships between the observer and the surrounding objects, and predict how desirable the views at these positions are. We train a graph convolutional model to predict the scene perception scores of different teleport positions. Based on such predictions, we apply an optimization approach to sample a set of desirable teleport positions while considering other navigation properties such as coverage and connectivity to synthesize a teleport graph. Using teleport graphs, users can navigate virtual environments efficaciously. We demonstrate our approach for synthesizing teleport graphs for common indoor scenes. By conducting a user study, we validate the efficacy and desirability of navigating virtual environments via the synthesized teleport graphs. We also extend our approach to cope with different constraints, user preferences, and practical scenarios.  more » « less
Award ID(s):
1942531
PAR ID:
10321074
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
ACM Transactions on Graphics
Volume:
40
Issue:
6
ISSN:
0730-0301
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce multimodal neural acoustic fields for synthesizing spatial sound and enabling the creation of immersive auditory experiences from novel viewpoints and in completely unseen new environments, both virtual and real. Extending the concept of neural radiance fields to acoustics, we develop a neural network-based model that maps an environment's geometric and visual features to its audio characteristics. Specifically, we introduce a novel hybrid transformer-convolutional neural network to accomplish two core tasks: capturing the reverberation characteristics of a scene from audio-visual data, and generating spatial sound in an unseen new environment from signals recorded at sparse positions and orientations within the original scene. By learning to represent spatial acoustics in a given environment, our approach enables creation of realistic immersive auditory experiences, thereby enhancing the sense of presence in augmented and virtual reality applications. We validate the proposed approach on both synthetic and real-world visual-acoustic data and demonstrate that our method produces nonlinear acoustic effects such as reverberations, and improves spatial audio quality compared to existing methods. Furthermore, we also conduct subjective user studies and demonstrate that the proposed framework significantly improves audio perception in immersive mixed reality applications. 
    more » « less
  2. Each view of our environment captures only a subset of our immersive surroundings. Yet, our visual experience feels seamless. A puzzle for human neuroscience is to determine what cognitive mechanisms enable us to overcome our limited field of view and efficiently anticipate new views as we sample our visual surroundings. Here, we tested whether memory-based predictions of upcoming scene views facilitate efficient perceptual judgments across head turns. We tested this hypothesis using immersive, head-mounted virtual reality (VR). After learning a set of immersive real-world environments, participants (n = 101 across 4 experiments) were briefly primed with a single view from a studied environment and then turned left or right to make a perceptual judgment about an adjacent scene view. We found that participants’ perceptual judgments were faster when they were primed with images from the same (vs. neutral or different) environments. Importantly, priming required memory: it only occurred in learned (vs. novel) environments, where the link between adjacent scene views was known. Further, consistent with a role in supporting active vision, priming only occurred in the direction of planned head turns and only benefited judgments for scene views presented in their learned spatiotopic positions. Taken together, we propose that memory-based predictions facilitate rapid perception across large-scale visual actions, such as head and body movements, and may be critical for efficient behavior in complex immersive environments. 
    more » « less
  3. Neural Radiance Fields (NeRFs) have been remarkably successful at synthesizing novel views of 3D scenes by optimizing a volumetric scene function. This scene function models how optical rays bring color information from a 3D object to the camera pixels. Radio frequency (RF) or audio signals can also be viewed as a vehicle for delivering information about the environment to a sensor. However, unlike camera pixels, an RF/audio sensor receives a mixture of signals that contain many environmental reflections (also called “multipath”). Is it still possible to infer the environment using such multipath signals? We show that with redesign, NeRFs can be taught to learn from multipath signals, and thereby “see” the environment. As a grounding application, we aim to infer the indoor floorplan of a home from sparse WiFi measurements made at multiple locations inside the home. Although a difficult inverse problem, our implicitly learnt floorplans look promising, and enables forward applications, such as indoor signal prediction and basic ray tracing. 
    more » « less
  4. Visual scenes are often remembered as if they were observed from a different viewpoint. Some scenes are remembered as farther than they appeared, and others as closer. These memory distortions—also known as boundary extension and contraction—are strikingly consistent for a given scene, but their cause remains unknown. We tested whether these distortions can be explained by an inferential process that adjusts scene memories toward high-probability views, using viewing depth as a test case. We first carried out a large-scale analysis of depth maps of natural indoor scenes to quantify the statistical probability of views in depth. We then assessed human observers’ memory for these scenes at various depths and found that viewpoint judgments were consistently biased toward the modal depth, even when just a few seconds elapsed between viewing and reporting. Thus, scenes closer than the modal depth showed a boundary-extension bias (remembered as farther-away), and scenes farther than the modal depth showed a boundary-contraction bias (remembered as closer). By contrast, scenes at the modal depth did not elicit a consistent bias in either direction. This same pattern of results was observed in a follow-up experiment using tightly controlled stimuli from virtual environments. Together, these findings show that scene memories are biased toward statistically probable views, which may serve to increase the accuracy of noisy or incomplete scene representations. 
    more » « less
  5. Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection, to name a few. Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene, which adds computational overhead to the approach. This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction. Using two transformer-based components, we first sample a possible scene graph structure from detected objects and their visual features. We then perform predicate classification on the sampled edges to generate the final scene graph. This approach allows us to efficiently generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while offering competitive performance to unbiased SGG approaches. 
    more » « less