Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
Free, publicly-accessible full text available March 10, 2026.
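As an illustration of this kind of evaluation, the sketch below queries a commercial VLM with an AR screenshot and asks whether it perceives virtual content, then aggregates answers into a perception TPR. It is a minimal sketch rather than the paper's actual pipeline; the model name, prompt wording, and answer parsing are assumptions.

```python
# Minimal sketch of VLM-based AR scene perception (not the paper's pipeline).
# Assumes the OpenAI Python SDK; model name and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "This image may contain augmented-reality (virtual) content. "
    "Answer YES or NO: does the image contain any virtual objects? "
    "If YES, briefly describe them."
)

def perceives_virtual_content(image_path: str) -> bool:
    """Ask the VLM whether an image contains virtual content."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Perception True Positive Rate over images known to contain virtual
# content: TPR = TP / (TP + FN).
def perception_tpr(ar_image_paths: list[str]) -> float:
    tp = sum(perceives_virtual_content(p) for p in ar_image_paths)
    return tp / len(ar_image_paths)
```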
-
In Augmented Reality (AR), virtual content enhances user experience by providing additional information. However, improperly positioned or designed virtual content can be detrimental to task performance, as it can impair users' ability to accurately interpret real-world information. In this paper, we examine two types of task-detrimental virtual content: obstruction attacks, in which virtual content prevents users from seeing real-world objects, and information manipulation attacks, in which virtual content interferes with users' ability to accurately interpret real-world information. We provide a mathematical framework to characterize these attacks and create a custom open-source dataset for attack evaluation. To address these attacks, we introduce ViDDAR (Vision language model-based Task-Detrimental content Detector for Augmented Reality), a comprehensive full-reference system that leverages Vision Language Models (VLMs) and advanced deep learning techniques to monitor and evaluate virtual content in AR environments, employing a user-edge-cloud architecture to balance performance with low latency. To the best of our knowledge, ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings. Our evaluation results demonstrate that ViDDAR effectively understands complex scenes and detects task-detrimental content, achieving up to 92.15% obstruction detection accuracy with a detection latency of 533 ms, and 82.46% information manipulation detection accuracy with a latency of 9.62 s.
Free, publicly-accessible full text available March 10, 2026.
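The obstruction-attack check lends itself to a simple full-reference formulation: compare the real objects visible in a reference view against the AR view and flag any that virtual content covers. The sketch below is only a schematic of that idea, not ViDDAR itself; the box format and coverage threshold are assumptions.

```python
# Schematic full-reference obstruction check (not ViDDAR's actual detector).
# Boxes are (x1, y1, x2, y2); the coverage threshold is illustrative.
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def coverage_fraction(obj: Box, virtual: Box) -> float:
    """Fraction of a real object's box covered by a virtual-content box."""
    ix1, iy1 = max(obj[0], virtual[0]), max(obj[1], virtual[1])
    ix2, iy2 = min(obj[2], virtual[2]), min(obj[3], virtual[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (obj[2] - obj[0]) * (obj[3] - obj[1])
    return inter / area if area > 0 else 0.0

def obstructed_objects(real_boxes: List[Box], virtual_boxes: List[Box],
                       threshold: float = 0.5) -> List[int]:
    """Indices of real objects mostly hidden behind virtual content."""
    return [
        i for i, obj in enumerate(real_boxes)
        if any(coverage_fraction(obj, v) >= threshold for v in virtual_boxes)
    ]
```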
-
In this work, we introduce SEESys, the first system to provide online pose error estimation for Simultaneous Localization and Mapping (SLAM). Unlike prior offline error estimation approaches, the SEESys framework efficiently collects real-time system features and delivers accurate pose error magnitude estimates with low latency, enabling real-time quality-of-service information for downstream applications. To achieve this goal, we develop a SLAM run-time status monitor (RTS monitor) that performs feature collection with minimal overhead, along with a multi-modality attention-based Deep SLAM Error Estimator (DeepSEE) for error estimation. We train and evaluate SEESys using both public SLAM benchmarks and a diverse set of synthetic datasets, achieving an RMSE of 0.235 cm in pose error estimation, which is 15.8% lower than the baseline. Additionally, we conduct a case study showcasing SEESys in a real-world scenario, in which it drives a real-time audio error advisory system for human operators of a SLAM-enabled device. The results demonstrate that SEESys provides error estimates with an average end-to-end latency of 37.3 ms, and that the audio error advisory reduces pose tracking error by 25%.
Free, publicly-accessible full text available November 8, 2025.
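To make the estimator's shape concrete, here is a toy attention-based regressor that maps a window of run-time features to a scalar pose error magnitude; it is a sketch under assumed feature dimensions and hyperparameters, not the published DeepSEE architecture.

```python
# Toy attention-based pose-error regressor (not the published DeepSEE model).
# Feature dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class ToyErrorEstimator(nn.Module):
    def __init__(self, n_features: int = 16, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4,
                                          batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, n_features) window of run-time features
        h = self.embed(x)
        h, _ = self.attn(h, h, h)  # self-attention over the time window
        return self.head(h.mean(dim=1)).squeeze(-1)  # error magnitude

def rmse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    return torch.sqrt(torch.mean((pred - target) ** 2))

# Example: a batch of 8 windows, each with 30 time steps of 16 features.
model = ToyErrorEstimator()
x = torch.randn(8, 30, 16)
print(model(x).shape)  # torch.Size([8])
```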
-
In Augmented Reality (AR), improper virtual content placement can obstruct real-world elements, causing confusion and degrading the experience. To address this, we present LOBSTAR (Language model-based OBSTruction detection for Augmented Reality), the first system to leverage a vision language model (VLM) to detect key objects and prevent obstructions in AR. We evaluated LOBSTAR on both real-world and virtual-scene images and developed a mobile app for AR content obstruction detection. Our results demonstrate that LOBSTAR effectively understands scenes and detects obstructive content with well-designed VLM prompts, achieving up to 96% accuracy with a detection latency of 580 ms on a mobile app.
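A prompt along the following lines illustrates how a VLM can be asked directly about obstruction. Here `query_vlm` is a hypothetical helper standing in for whichever VLM API is used, and the prompt wording is an assumption rather than LOBSTAR's actual prompt.

```python
# Illustrative obstruction-detection prompt; query_vlm is a hypothetical
# stand-in for a real VLM call, and this is not LOBSTAR's actual prompt.
import json

OBSTRUCTION_PROMPT = """You are inspecting an augmented-reality view.
1. List the key real-world objects a user would need to see.
2. State whether any virtual content obstructs them.
Respond as JSON: {"key_objects": [...], "obstructed": true/false}"""

def detect_obstruction(image_bytes: bytes, query_vlm) -> dict:
    """Send the AR frame and prompt to a VLM and parse its verdict."""
    raw = query_vlm(image=image_bytes, prompt=OBSTRUCTION_PROMPT)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a conservative verdict on a malformed reply.
        return {"key_objects": [], "obstructed": "obstruct" in raw.lower()}
```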
-
Virtual reality (VR) simulations have been adopted to provide controllable environments for running augmented reality (AR) experiments in diverse scenarios. However, little research has explored the impact of AR applications on users, especially their attention patterns, or whether VR simulations accurately replicate these effects. In this work, we propose to analyze user attention patterns via eye tracking during XR usage. To represent applications that provide both helpful guidance and irrelevant information, we built a Sudoku Helper app that includes visual hints and potential distractions during puzzle solving. We conducted two user studies, one in AR and one in VR, each with 19 different users, in which we collected eye tracking data, conducted gaze-based analysis, and trained machine learning (ML) models to predict user attentional states and attention control ability. Our results show that the AR app had a statistically significant attention-enhancing effect, increasing the proportion of time spent in fixation, while the VR app reduced fixation time and left users less focused. These results indicate a discrepancy between VR simulations and the AR experience. Our ML models achieve 99.3% and 96.3% accuracy in predicting user attention control ability in AR and VR, respectively. A noticeable performance drop when transferring models trained on one medium to the other further highlights the gap between the AR experience and its VR simulation.
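As a sketch of the gaze-based analysis, the snippet below derives a fixation proportion from raw gaze samples using a simple dispersion threshold and feeds such features to an off-the-shelf classifier. The window size, threshold, and feature set are assumptions, not the study's exact method.

```python
# Sketch of gaze-feature extraction and attention classification
# (thresholds and features are assumptions, not the study's exact method).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fixation_proportion(gaze_xy: np.ndarray, window: int = 10,
                        dispersion_thresh: float = 0.01) -> float:
    """Fraction of gaze windows whose spatial dispersion is small enough
    to count as a fixation. gaze_xy: (n_samples, 2) normalized coords."""
    n = len(gaze_xy) // window
    fixated = 0
    for i in range(n):
        w = gaze_xy[i * window:(i + 1) * window]
        dispersion = np.ptp(w[:, 0]) + np.ptp(w[:, 1])  # x + y spread
        fixated += dispersion < dispersion_thresh
    return fixated / max(n, 1)

# Per-session features (fixation proportion, saccade stats, ...) and
# attention-control labels; placeholder data for illustration only.
X = np.random.rand(38, 4)
y = np.random.randint(0, 2, 38)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```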
-
Augmented reality (AR) platforms now support persistent, markerless experiences, in which virtual content appears in the same place relative to the real world across multiple devices and sessions. However, optimizing environments for these experiences remains challenging: virtual content stability is determined by the performance of device pose tracking, which depends on recognizable environment features, yet environment texture can impair human perception of virtual content. Low-contrast 'invisible textures' have recently been proposed as a solution, but they may result in poor tracking performance under dynamic device motion. Here, we examine the use of invisible textures in detail, starting with the first evaluation in a realistic AR scenario. We then consider scenarios with more dynamic device motion and conduct extensive game engine-based experiments to develop a method for optimizing invisible textures. For texture optimization in real environments, we introduce MoMAR, the first system to analyze motion data from multiple AR users, which generates guidance using situated visualizations. We show that MoMAR can be deployed on five different devices while maintaining an average frame rate above 59 fps. We demonstrate MoMAR in a realistic case study, in which our optimized environment texture allowed users to complete a task significantly faster (p = 0.003) than a complex texture.
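To illustrate what a low-contrast 'invisible texture' might look like, the snippet below synthesizes a near-uniform image whose feature-bearing noise sits within a narrow intensity band; the base gray level and contrast amplitude are illustrative assumptions, not MoMAR's optimized values.

```python
# Generate a low-contrast "invisible texture": random detail for tracking
# features, confined to a narrow intensity band so it is hard to perceive.
# Base level and amplitude are illustrative, not MoMAR's optimized values.
import numpy as np
from PIL import Image

def make_invisible_texture(size: int = 1024, base: int = 200,
                           amplitude: int = 4, seed: int = 0) -> Image.Image:
    rng = np.random.default_rng(seed)
    noise = rng.integers(-amplitude, amplitude + 1, size=(size, size))
    pixels = np.clip(base + noise, 0, 255).astype(np.uint8)
    return Image.fromarray(pixels, mode="L")

make_invisible_texture().save("invisible_texture.png")
```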
-
3D object detection (OD) is a crucial element of scene understanding. However, most existing 3D OD models have been tailored to light detection and ranging (LiDAR) and RGB-D point cloud data, leaving their performance on commonly available visual-inertial simultaneous localization and mapping (VI-SLAM) point clouds unexamined. In this paper, we create and release two datasets: VIP500, containing 4772 VI-SLAM point clouds covering 500 different object and environment configurations, and VIP500-D, an accompanying set of 20 RGB-D point clouds for the object classes and shapes in VIP500. We then use these datasets to quantify the differences between VI-SLAM point clouds and dense RGB-D point clouds, as well as the discrepancies between VI-SLAM point clouds generated under different object and environment characteristics. Finally, we evaluate the performance of three leading OD models on the diverse data in our VIP500 dataset, revealing the promise of OD models trained on VI-SLAM data; we also examine the extent to which object and environment characteristics impact performance, along with the underlying causes.
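One simple way to quantify the gap between a sparse VI-SLAM cloud and a dense RGB-D cloud of the same object is a symmetric nearest-neighbor (chamfer-style) distance, sketched below; this is a generic measure, not necessarily the metric used in the paper.

```python
# Chamfer-style distance between a sparse VI-SLAM point cloud and a dense
# RGB-D cloud; a generic comparison, not necessarily the paper's metric.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(cloud_a: np.ndarray, cloud_b: np.ndarray) -> float:
    """Symmetric mean nearest-neighbor distance between (n, 3) clouds."""
    a_to_b = cKDTree(cloud_b).query(cloud_a)[0].mean()
    b_to_a = cKDTree(cloud_a).query(cloud_b)[0].mean()
    return float(a_to_b + b_to_a) / 2.0

# Example with placeholder data: a sparse cloud vs. a dense one.
vislam = np.random.rand(500, 3)    # sparse VI-SLAM-like cloud
rgbd = np.random.rand(20000, 3)    # dense RGB-D-like cloud
print(f"chamfer distance: {chamfer_distance(vislam, rgbd):.4f}")
```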