
This content will become publicly available on November 8, 2025

Title: SEESys: Online Pose Error Estimation System for Visual SLAM
In this work, we introduce SEESys, the first system to provide online pose error estimation for Simultaneous Localization and Mapping (SLAM). Unlike prior offline error estimation approaches, the SEESys framework efficiently collects real-time system features and delivers accurate pose error magnitude estimates with low latency. This enables real-time quality-of-service information for downstream applications. To achieve this goal, we develop a SLAM system run-time status monitor (RTS monitor) that performs feature collection with minimal overhead, along with a multi-modality attention-based Deep SLAM Error Estimator (DeepSEE) for error estimation. We train and evaluate SEESys using both public SLAM benchmarks and a diverse set of synthetic datasets, achieving a pose error estimation RMSE of 0.235 cm, which is 15.8% lower than the baseline. Additionally, we conduct a case study showcasing SEESys in a real-world scenario, where it is applied to a real-time audio error advisory system for human operators of a SLAM-enabled device. The results demonstrate that SEESys provides error estimates with an average end-to-end latency of 37.3 ms, and the audio error advisory reduces pose tracking error by 25%.
Award ID(s):
2231975 2312760 2046072
Publisher / Repository:
ACM SenSys 2024
Subject(s) / Keyword(s):
SLAM, pose tracking, tracking error, error estimate, edge computing, deep learning
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Virtual content instability caused by device pose tracking error remains a prevalent issue in markerless augmented reality (AR), especially on smartphones and tablets. However, when examining environments which will host AR experiences, it is challenging to determine where those instability artifacts will occur; we rarely have access to ground truth pose to measure pose error, and even if pose error is available, traditional visualizations do not connect that data with the real environment, limiting their usefulness. To address these issues, we present SiTAR (Situated Trajectory Analysis for Augmented Reality), the first situated trajectory analysis system for AR that incorporates estimates of pose tracking error. We start by developing the first uncertainty-based pose error estimation method for visual-inertial simultaneous localization and mapping (VI-SLAM), which allows us to obtain pose error estimates without ground truth; we achieve an average accuracy of up to 96.1% and an average F1 score of up to 0.77 in our evaluations on four VI-SLAM datasets. Next, we present our SiTAR system, implemented for ARCore devices, combining a backend that supplies uncertainty-based pose error estimates with a frontend that generates situated trajectory visualizations. Finally, we evaluate the efficacy of SiTAR in realistic conditions by testing three visualization techniques in an in-the-wild study with 15 users and 13 diverse environments; this study reveals the impact that both environment scale and the properties of the surfaces present can have on user experience and task performance.
  2. Visual-inertial SLAM is essential for robot navigation in GPS-denied environments, e.g., indoors and underground. Conventionally, the performance of visual-inertial SLAM is evaluated with open-loop analysis, with a focus on the drift level of SLAM systems. In this paper, we raise the question of the importance of visual estimation latency in closed-loop navigation tasks, such as accurate trajectory tracking. To understand the impact of both drift and latency on visual-inertial SLAM systems, a closed-loop benchmarking simulation is conducted, where a robot is commanded to follow a desired trajectory using the feedback from visual-inertial estimation. By extensively evaluating the trajectory tracking performance of representative state-of-the-art visual-inertial SLAM systems, we reveal the importance of latency reduction in the visual estimation module of these systems. The findings suggest directions for future improvements to visual-inertial SLAM.
  3. This paper aims to select features that contribute most to the pose estimation in VO/VSLAM. Unlike existing feature selection works that are focused on efficiency only, our method significantly improves the accuracy of pose tracking, while introducing little overhead. By studying the impact of feature selection on least squares pose optimization, we demonstrate the applicability of improving accuracy via good feature selection. To that end, we introduce the Max-logDet metric to guide the feature selection, which is connected to the conditioning of the least squares pose optimization problem. We then describe an efficient algorithm for approximately solving the NP-hard Max-logDet problem. Integrating Max-logDet feature selection into a state-of-the-art visual SLAM system leads to accuracy improvements with low overhead, as demonstrated via evaluation on a public benchmark.
  4. Immersive virtual tours based on 360-degree cameras, showing famous outdoor scenery, are becoming more and more desirable due to travel costs, pandemics and other constraints. To feel immersed, a user must receive the view that accurately corresponds to her position and orientation in the virtual space as she moves through it, and this requires the cameras’ orientations to be known. Outdoor tour contexts have numerous, ultra-sparse cameras deployed across a wide area, making camera pose estimation challenging. As a result, pose estimation techniques like SLAM, which require mobile or dense cameras, are not applicable. In this paper we present a novel strategy called 360ViewPET, which automatically estimates the relative poses of two stationary, ultra-sparse (15 meters apart) 360-degree cameras using one equirectangular image taken by each camera. Our experiments show that it achieves accurate pose estimation, with a mean error as low as 0.9 degrees.
  5. As augmented and virtual reality (AR/VR) technology matures, a method is desired to represent real-world persons visually and aurally in a virtual scene with high fidelity to craft an immersive and realistic user experience. Current technologies leverage camera and depth sensors to render visual representations of subjects through avatars, and microphone arrays are employed to localize and separate high-quality subject audio through beamforming. However, challenges remain in both realms. In the visual domain, avatars can only map key features (e.g., pose, expression) to a predetermined model, rendering them incapable of capturing the subjects’ full details. Alternatively, high-resolution point clouds can be utilized to represent human subjects. However, such three-dimensional data is computationally expensive to process. In the realm of audio, sound source separation requires prior knowledge of the subjects’ locations. However, it may take unacceptably long for sound source localization algorithms to provide this knowledge, which can still be error-prone, especially with moving objects. These challenges make it difficult for AR systems to produce real-time, high-fidelity representations of human subjects for applications such as AR/VR conferencing that mandate negligible system latency. We present Acuity, a real-time system capable of creating high-fidelity representations of human subjects in a virtual scene both visually and aurally. Acuity isolates subjects from high-resolution input point clouds. It reduces the processing overhead by performing background subtraction at a coarse resolution, then applying the detected bounding boxes to fine-grained point clouds. Meanwhile, Acuity leverages an audiovisual sensor fusion approach to expedite sound source separation. The estimated object location in the visual domain guides the acoustic pipeline to isolate the subjects’ voices without running sound source localization. 
Our results demonstrate that Acuity can isolate multiple subjects’ high-quality point clouds with a maximum latency of 70 ms and average throughput of over 25 fps, while separating audio in less than 30 ms. We provide the source code of Acuity at: 