-
Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
Free, publicly-accessible full text available March 10, 2026
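For readers unfamiliar with the metric, the True Positive Rate quoted above is the fraction of genuinely positive samples that are flagged as positive, TP / (TP + FN). The following is a minimal sketch of that calculation only; the function name `true_positive_rate` and the toy labels are illustrative assumptions and are not taken from the study or the DiverseAR dataset.

```python
def true_positive_rate(labels, predictions):
    """Return TP / (TP + FN) over samples whose ground-truth label is True.

    `labels` and `predictions` are equal-length sequences of booleans.
    Both names are hypothetical and used only for illustration.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    # True positives: positive samples that were predicted positive.
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    # False negatives: positive samples that were predicted negative.
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp + fn == 0:
        raise ValueError("TPR is undefined without positive samples")
    return tp / (tp + fn)


# Toy data (made up): four scenes contain virtual content, one does not.
toy_labels = [True, True, True, True, False]
toy_predictions = [True, True, True, False, True]
toy_tpr = true_positive_rate(toy_labels, toy_predictions)
```

Note that the false positive on the last toy sample does not affect the TPR, which is why TPR is usually read alongside a complementary metric.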
-
In Augmented Reality (AR), virtual content enhances user experience by providing additional information. However, improperly positioned or designed virtual content can be detrimental to task performance, as it can impair users' ability to accurately interpret real-world information. In this paper we examine two types of task-detrimental virtual content: obstruction attacks, in which virtual content prevents users from seeing real-world objects, and information manipulation attacks, in which virtual content interferes with users' ability to accurately interpret real-world information. We provide a mathematical framework to characterize these attacks and create a custom open-source dataset for attack evaluation. To address these attacks, we introduce ViDDAR (Vision language model-based Task-Detrimental content Detector for Augmented Reality), a comprehensive full-reference system that leverages Vision Language Models (VLMs) and advanced deep learning techniques to monitor and evaluate virtual content in AR environments, employing a user-edge-cloud architecture to balance performance with low latency. To the best of our knowledge, ViDDAR is the first system to employ VLMs for detecting task-detrimental content in AR settings. Our evaluation results demonstrate that ViDDAR effectively understands complex scenes and detects task-detrimental content, achieving up to 92.15% obstruction detection accuracy with a detection latency of 533 ms, and an 82.46% information manipulation content detection accuracy with a latency of 9.62 s.
Free, publicly-accessible full text available March 10, 2026
-
Inaccurate spatial tracking in extended reality (XR) headsets can cause virtual object jitter, misalignment, and user discomfort, limiting the headsets’ potential for immersive content and natural interactions. We develop a modular testbed to evaluate the tracking performance of commercial XR headsets, incorporating system calibration, tracking data acquisition, and result analysis, and allowing the integration of external cameras and IMU sensors for comparison with open-source VI-SLAM algorithms. Using this testbed, we quantitatively assessed spatial tracking accuracy under various user movements and environmental conditions for the latest XR headsets, Apple Vision Pro and Meta Quest 3. The Apple Vision Pro outperformed the Meta Quest 3, reducing relative pose error (RPE) and absolute pose error (APE) by 33.9% and 14.6%, respectively. While both headsets achieved sub-centimeter RPE in most cases, they exhibited APE exceeding 10 cm in challenging scenarios, highlighting the need for further improvements in reliability and accuracy.
Free, publicly-accessible full text available December 4, 2025
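To make the two tracking metrics concrete, here is a simplified, translation-only sketch of APE and RPE. It assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame, and it ignores rotation; a full evaluation would normally handle alignment and orientation as well. The function names and the toy trajectory are illustrative assumptions, not the testbed's actual code.

```python
import math


def ape_rmse(estimated, ground_truth):
    """Absolute pose error (translation only): RMSE of per-pose position error.

    Both arguments are equal-length lists of (x, y, z) tuples, assumed to be
    already aligned in time and reference frame.
    """
    errors = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return math.sqrt(sum(d * d for d in errors) / len(errors))


def rpe_rmse(estimated, ground_truth, delta=1):
    """Relative pose error (translation only): RMSE of the difference between
    estimated and true displacement over `delta` steps. This captures local
    drift and is insensitive to a constant offset."""
    errors = []
    for i in range(len(estimated) - delta):
        est_step = tuple(b - a for a, b in zip(estimated[i], estimated[i + delta]))
        gt_step = tuple(b - a for a, b in zip(ground_truth[i], ground_truth[i + delta]))
        errors.append(math.dist(est_step, gt_step))
    return math.sqrt(sum(d * d for d in errors) / len(errors))


# Toy trajectory (made up): the estimate has a constant 0.1 m offset in y.
gt_path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
est_path = [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0), (2.0, 0.1, 0.0)]
toy_ape = ape_rmse(est_path, gt_path)
toy_rpe = rpe_rmse(est_path, gt_path)
```

In the toy data, a constant offset yields a non-zero APE but a zero RPE, which illustrates how a headset can show small relative error while its absolute error grows much larger.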
-
In this work, we introduce SEESys, the first system to provide online pose error estimation for Simultaneous Localization and Mapping (SLAM). Unlike prior offline error estimation approaches, the SEESys framework efficiently collects real-time system features and delivers accurate pose error magnitude estimates with low latency. This enables real-time quality-of-service information for downstream applications. To achieve this goal, we develop a SLAM system run-time status monitor (RTS monitor) that performs feature collection with minimal overhead, along with a multi-modality attention-based Deep SLAM Error Estimator (DeepSEE) for error estimation. We train and evaluate SEESys using both public SLAM benchmarks and a diverse set of synthetic datasets, achieving an RMSE of 0.235 cm for pose error estimation, which is 15.8% lower than the baseline. Additionally, we conduct a case study showcasing SEESys in a real-world scenario, where it is applied to a real-time audio error advisory system for human operators of a SLAM-enabled device. The results demonstrate that SEESys provides error estimates with an average end-to-end latency of 37.3 ms, and the audio error advisory reduces pose tracking error by 25%.
Free, publicly-accessible full text available November 8, 2025
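As a reading aid for the reported figures, the sketch below shows how an RMSE between predicted and actual pose error magnitudes can be computed, and how a "percent lower than the baseline" figure is usually derived. It assumes "lower" means a relative reduction, (baseline - new) / baseline; under that assumption, an RMSE of 0.235 cm that is 15.8% lower implies a baseline of roughly 0.279 cm. The function names are hypothetical and this is not the SEESys implementation.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


# Baseline implied by the abstract's numbers, assuming a relative reduction.
implied_baseline_cm = 0.235 / (1 - 0.158)
```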
-
In Augmented Reality (AR), improper virtual content placement can obstruct real-world elements, causing confusion and degrading the experience. To address this, we present LOBSTAR (Language model-based OBSTruction detection for Augmented Reality), the first system leveraging a vision language model (VLM) to detect key objects and prevent obstructions in AR. We evaluated LOBSTAR using both real-world and virtual-scene images and developed a mobile app for AR content obstruction detection. Our results demonstrate that LOBSTAR effectively understands scenes and detects obstructive content with well-designed VLM prompts, achieving up to 96% accuracy and a detection latency of 580 ms on a mobile app.
-
Virtual reality (VR) simulations have been adopted to provide controllable environments for running augmented reality (AR) experiments in diverse scenarios. However, insufficient research has explored the impact of AR applications on users, especially their attention patterns, and whether VR simulations accurately replicate these effects. In this work, we propose to analyze user attention patterns via eye tracking during XR usage. To represent applications that provide both helpful guidance and irrelevant information, we built a Sudoku Helper app that includes visual hints and potential distractions during the puzzle-solving period. We conducted two user studies with 19 different users each in AR and VR, in which we collected eye tracking data, conducted gaze-based analysis, and trained machine learning (ML) models to predict user attentional states and attention control ability. Our results show that the AR app had a statistically significant impact on enhancing attention by increasing the fixated proportion of time, while the VR app reduced fixated time and made the users less focused. Results indicate that there is a discrepancy between VR simulations and the AR experience. Our ML models achieve 99.3% and 96.3% accuracy in predicting user attention control ability in AR and VR, respectively. A noticeable performance drop when transferring models trained on one medium to the other further highlights the gap between the AR experience and the VR simulation of it.
-
Augmented reality (AR) platforms now support persistent, markerless experiences, in which virtual content appears in the same place relative to the real world, across multiple devices and sessions. However, optimizing environments for these experiences remains challenging; virtual content stability is determined by the performance of device pose tracking, which depends on recognizable environment features, but environment texture can impair human perception of virtual content. Low-contrast 'invisible textures' have recently been proposed as a solution, but may result in poor tracking performance when combined with dynamic device motion. Here, we examine the use of invisible textures in detail, starting with the first evaluation in a realistic AR scenario. We then consider scenarios with more dynamic device motion, and conduct extensive game engine-based experiments to develop a method for optimizing invisible textures. For texture optimization in real environments, we introduce MoMAR, the first system to analyze motion data from multiple AR users, which generates guidance using situated visualizations. We show that MoMAR can be deployed while maintaining an average frame rate > 59 fps, for five different devices. We demonstrate the use of MoMAR in a realistic case study; our optimized environment texture allowed users to complete a task significantly faster (p = 0.003) than a complex texture.
-
3D object detection (OD) is a crucial element in scene understanding. However, most existing 3D OD models have been tailored to work with light detection and ranging (LiDAR) and RGB-D point cloud data, leaving their performance on commonly available visual-inertial simultaneous localization and mapping (VI-SLAM) point clouds unexamined. In this paper, we create and release two datasets: VIP500, 4772 VI-SLAM point clouds covering 500 different object and environment configurations, and VIP500-D, an accompanying set of 20 RGB-D point clouds for the object classes and shapes in VIP500. We then use these datasets to quantify the differences between VI-SLAM point clouds and dense RGB-D point clouds, as well as the discrepancies between VI-SLAM point clouds generated with different object and environment characteristics. Finally, we evaluate the performance of three leading OD models on the diverse data in our VIP500 dataset, revealing the promise of OD models trained on VI-SLAM data; we examine the extent to which both object and environment characteristics impact performance, along with the underlying causes.
-
The traditional freehand placement of an external ventricular drain (EVD) relies on empirical craniometric landmarks to guide the craniostomy and subsequent passage of the EVD catheter. The diameter and trajectory of the craniostomy physically limit the possible trajectories that can be achieved during the passage of the catheter. In this study, the authors implemented a mixed reality–guided craniostomy procedure to evaluate the benefit of an optimally drilled craniostomy to the accurate placement of the catheter. Optical marker–based tracking using an OptiTrack system was used to register the brain ventricular hologram and drilling guidance for craniostomy using a HoloLens 2 mixed reality headset. A patient-specific 3D-printed skull phantom embedded with intracranial camera sensors was developed to automatically calculate the EVD accuracy for evaluation. User trials consisted of one blind and one mixed reality–assisted craniostomy followed by a routine, unguided EVD catheter placement for each of two different drill bit sizes. A total of 49 participants were included in the study (mean age 23.4 years, 59.2% female). The mean distance from the catheter target improved from 18.6 ± 12.5 mm to 12.7 ± 11.3 mm (p = 0.0008) using mixed reality guidance for trials with a large drill bit and from 19.3 ± 12.7 mm to 10.1 ± 8.4 mm with a small drill bit (p < 0.0001). Accuracy using mixed reality was improved using a smaller diameter drill bit compared with a larger bit (p = 0.039). Overall, the majority of the participants were positive about the helpfulness of mixed reality guidance and the overall mixed reality experience. Appropriate indications and use cases for the application of mixed reality guidance to neurosurgical procedures remain an area of active inquiry. 
While prior studies have demonstrated the benefit of mixed reality–guided catheter placement using predrilled craniostomies, the authors demonstrate that real-time quantitative and visual feedback of a mixed reality–guided craniostomy procedure can independently improve procedural accuracy and represents an important tool for trainee education and eventual clinical implementation.