

This content will become publicly available on October 6, 2026

Title: Vision-Language Modeling for Scene Understanding and Reasoning of Vehicle-to-X Interactions
In complex traffic environments, understanding how a focal vehicle interacts (e.g., maneuvers) with various traffic elements (e.g., other vehicles, pedestrians, and road infrastructure), i.e., vehicle-to-X interactions (VXIs), is essential for developing advanced driving support and intelligent vehicles. To enable VXI scene understanding, reasoning, and decision support (e.g., suggesting a cautious maneuver in response to a pedestrian crossing the street), this work leverages recent advances in multi-modality large language models (MLLMs). We develop VXI-SUR, a novel VXI Scene Understanding and Reasoning system based on vision-language modeling. VXI-SUR takes in a visual VXI scene and generates structured textual responses that interpret the scene and suggest an appropriate decision (e.g., braking, slowing down). Within VXI-SUR, we design a VXI memory mechanism with both scene and knowledge augmentation, and enable scene-knowledge co-learning to capture complex correspondences across scenes and decisions. We have performed extensive and comprehensive evaluations of VXI-SUR on an open-source dataset with ∼17k VXI scenes, and our experimental studies corroborate its VXI awareness, description preciseness, semantic matching, and overall quality in understanding and reasoning about complex VXI scenes.
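As a rough illustration of the input-to-output contract described in this abstract, the following minimal Python sketch shows how a vision-language model backend could be queried for a structured VXI interpretation and decision. The `query_vlm` callable, the prompt wording, and the JSON schema are hypothetical assumptions for illustration, not the paper's actual interface or output format.

```python
# Minimal sketch: querying a vision-language model for a structured VXI
# interpretation and driving decision. `query_vlm` is a placeholder for
# whatever multi-modal LLM backend is used; the JSON schema below is an
# illustrative assumption, not the paper's exact output format.
import json
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class VXIResponse:
    scene_summary: str           # textual interpretation of the VXI scene
    traffic_elements: List[str]  # e.g., ["pedestrian crossing", "lead vehicle"]
    decision: str                # e.g., "brake", "slow down", "proceed"

PROMPT = (
    "Describe the interactions between the focal vehicle and surrounding "
    "traffic elements in this scene, then suggest one driving decision. "
    'Answer as JSON: {"scene_summary": ..., "traffic_elements": [...], '
    '"decision": ...}'
)

def interpret_scene(image_path: str,
                    query_vlm: Callable[[str, str], str]) -> VXIResponse:
    """Run the (hypothetical) VLM backend and parse its structured reply."""
    raw = query_vlm(image_path, PROMPT)
    fields = json.loads(raw)
    return VXIResponse(**fields)
```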
Award ID(s):
2239897
PAR ID:
10646109
Author(s) / Creator(s):
Publisher / Repository:
Proceedings of IEEE 22nd International Conference on Mobile Ad Hoc and Smart Systems (MASS)
Date Published:
Page Range / eLocation ID:
174 to 182
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Learning the human-mobility interaction (HMI) in interactive scenes (e.g., how a vehicle turns at an intersection in response to traffic lights and other oncoming vehicles) can enhance the safety, efficiency, and resilience of smart mobility systems (e.g., autonomous vehicles) and many other ubiquitous computing applications. Towards ubiquitous and understandable HMI learning, this paper considers both spoken language (e.g., human textual annotations) and unspoken language (e.g., visual and sensor-based behavioral mobility information related to the HMI scenes) as information modalities from real-world HMI scenarios. We aim to extract the important but possibly implicit HMI concepts (as named entities) from the textual annotations (provided by human annotators) through a novel human-language and sensor-data co-learning design. To this end, we propose CG-HMI, a novel Cross-modality Graph fusion approach for extracting important Human-Mobility Interaction concepts from co-learning of textual annotations as well as visual and behavioral sensor data. To fuse both unspoken and spoken languages, we have designed a unified representation called the human-mobility interaction graph (HMIG) for each modality related to the HMI scenes, i.e., textual annotations, visual video frames, and behavioral sensor time series (e.g., from on-board or smartphone inertial measurement units). The HMIG nodes in these modalities correspond to the textual words (tokenized for ease of processing) related to HMI concepts, the detected traffic participant/environment categories, and the vehicle maneuver behavior types determined from the behavioral sensor time series. To extract the inter- and intra-modality semantic correspondences and interactions in the HMIG, we have designed a novel graph interaction fusion approach with differentiable pooling-based graph attention. The resulting graph embeddings are then processed to identify and retrieve the HMI concepts within the annotations, which can benefit downstream human-computer interaction and ubiquitous computing applications. We have developed and implemented CG-HMI in a system prototype, and performed extensive studies upon three real-world HMI datasets (two on car driving and the third on e-scooter riding). We have corroborated the excellent performance (on average 13.11% higher accuracy than the other baselines in terms of precision, recall, and F1 measure) and effectiveness of CG-HMI in recognizing and extracting the important HMI concepts through cross-modality learning. Our CG-HMI studies also provide real-world implications (e.g., road safety and driving behaviors) about the interactions between drivers and other traffic participants.
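As a loose illustration of the graph-attention component described in this abstract, the sketch below encodes one modality's HMIG-style node graph with a two-layer GAT in PyTorch Geometric. The layer sizes, mean pooling, and random toy inputs are assumptions; the paper's differentiable pooling-based cross-modality fusion is not reproduced here.

```python
# Minimal sketch of per-modality graph attention over an HMIG-style graph,
# assuming node features and edges are already extracted for one modality
# (text tokens, detected participants, or maneuver types).
import torch
from torch_geometric.nn import GATConv, global_mean_pool

class ModalityGAT(torch.nn.Module):
    def __init__(self, in_dim: int, hid_dim: int = 64, heads: int = 4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hid_dim, heads=heads)
        self.gat2 = GATConv(hid_dim * heads, hid_dim, heads=1)

    def forward(self, x, edge_index, batch):
        h = torch.relu(self.gat1(x, edge_index))
        h = self.gat2(h, edge_index)
        return global_mean_pool(h, batch)   # one embedding per HMIG

# Toy usage: 10 nodes with 32-dim features in a single graph.
x = torch.randn(10, 32)
edge_index = torch.randint(0, 10, (2, 30))
batch = torch.zeros(10, dtype=torch.long)
emb = ModalityGAT(32)(x, edge_index, batch)  # shape: (1, 64)
```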
  2. Simulation forms the backbone of modern self-driving development. Simulators help develop, test, and improve driving systems without putting humans, vehicles, or their environment at risk. However, simulators face a major challenge: They rely on realistic, scalable, yet interesting content. While recent advances in rendering and scene reconstruction make great strides in creating static scene assets, modeling their layout, dynamics, and behaviors remains challenging. In this work, we turn to language as a source of supervision for dynamic traffic scene generation. Our model, LCTGen, combines a large language model with a transformer-based decoder architecture that selects likely map locations from a dataset of maps, and produces an initial traffic distribution, as well as the dynamics of each vehicle. LCTGen outperforms prior work in both unconditional and conditional traffic scene generation in terms of realism and fidelity. 
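The following toy PyTorch sketch illustrates the general idea of conditioning a decoder on a text embedding to score candidate map locations and emit initial agent states. It is not LCTGen's actual architecture; all dimensions, attention heads, and the (x, y, heading, speed) state layout are illustrative assumptions.

```python
# Toy sketch of text-conditioned traffic scene generation: attend over
# candidate map patches with a text embedding, pick the best-matching patch,
# and decode initial agent states from it.
import torch
import torch.nn as nn

class TextConditionedSceneDecoder(nn.Module):
    def __init__(self, d_model: int = 128, n_agents: int = 8):
        super().__init__()
        self.n_agents = n_agents
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.map_score = nn.Linear(d_model, 1)              # fit of each map patch to the text
        self.agent_head = nn.Linear(d_model, n_agents * 4)  # (x, y, heading, speed) per agent

    def forward(self, text_emb, map_feats):
        # text_emb: (B, 1, d) from a language model; map_feats: (B, M, d) candidate map patches
        fused, _ = self.attn(map_feats, text_emb, text_emb)
        scores = self.map_score(fused).squeeze(-1)           # (B, M) map-selection logits
        best = fused[torch.arange(fused.size(0)), scores.argmax(dim=1)]
        agents = self.agent_head(best).view(-1, self.n_agents, 4)
        return scores, agents

# Toy usage: batch of 2 text embeddings, 5 candidate map patches each.
scores, agents = TextConditionedSceneDecoder()(torch.randn(2, 1, 128),
                                               torch.randn(2, 5, 128))
```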
  3. Understanding the intentions of vehicles in the surrounding traffic is crucial for an autonomous vehicle to successfully accomplish its driving tasks in complex traffic scenarios such as forced merging on highways. In this paper, we consider a behavioral model that incorporates both the social behaviors and the personal objectives of the interacting drivers. Leveraging this model, we develop a receding-horizon control-based decision-making strategy that estimates the other drivers' intentions online using Bayesian filtering and incorporates predictions of nearby vehicles' behaviors under uncertain intentions. The effectiveness of the proposed decision-making strategy is demonstrated and evaluated through simulation studies, in comparison with a game-theoretic controller and a real-world traffic dataset.
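A minimal NumPy sketch of the Bayesian intention-estimation step described in this abstract: maintain a belief over a discrete set of driver intentions and update it with the likelihood of the observed action under each intention. The two intentions and the Gaussian-shaped deceleration likelihood are illustrative assumptions, not the paper's model.

```python
# One recursive Bayes update over discrete driver intentions:
# P(intention | action) ∝ P(action | intention) * P(intention).
import numpy as np

INTENTIONS = ["yield", "proceed"]

def update_belief(belief: np.ndarray, observed_action: float, likelihood) -> np.ndarray:
    lik = np.array([likelihood(observed_action, i) for i in INTENTIONS])
    posterior = lik * belief
    return posterior / posterior.sum()

# Example: a strong deceleration is more likely under a yielding driver
# (assumed mean decelerations, in m/s^2, purely for illustration).
def decel_likelihood(decel, intention):
    mean = 2.0 if intention == "yield" else 0.2
    return np.exp(-0.5 * (decel - mean) ** 2)

belief = np.array([0.5, 0.5])
belief = update_belief(belief, observed_action=1.8, likelihood=decel_likelihood)
```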
  4. Scene understanding is a key technical challenge within the autonomous driving domain. It requires a deep semantic understanding of the entities and relations found within complex physical and social environments that is both accurate and complete. In practice, this can be accomplished by representing entities in a scene and their relations as a knowledge graph (KG). This scene knowledge graph may then be utilized for the task of entity prediction, leading to improved scene understanding. In this paper, we will define and formalize this problem as Knowledge-based Entity Prediction (KEP). KEP aims to improve scene understanding by predicting potentially unrecognized entities by leveraging heterogeneous, high-level semantic knowledge of driving scenes. An innovative neuro-symbolic solution for KEP is presented, based on knowledge-infused learning, which 1) introduces a dataset agnostic ontology to describe driving scenes, 2) uses an expressive, holistic representation of scenes with knowledge graphs, and 3) proposes an effective, non-standard mapping of the KEP problem to the problem of link prediction (LP) using knowledge-graph embeddings (KGE). Using real, complex and high-quality data from urban driving scenes, we demonstrate its effectiveness by showing that the missing entities may be predicted with high precision (0.87 Hits@1) while significantly outperforming the non-semantic/rule-based baselines. 
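To make the mapping from entity prediction to link prediction concrete, the sketch below ranks candidate entities for a (scene, includes, ?) query with a TransE-style score over knowledge-graph embeddings. The relation name, embedding dimension, and random embeddings are assumptions; the paper's KGE model and scoring function may differ.

```python
# TransE-style link prediction for knowledge-based entity prediction:
# rank candidate tail entities t for a query (scene, includes, ?).
import torch

def transe_score(h, r, t):
    """Lower ||h + r - t|| means the triple is more plausible (higher score)."""
    return -torch.norm(h + r - t, dim=-1)

dim = 50
scene_emb = torch.randn(dim)        # embedding of the driving scene
includes_rel = torch.randn(dim)     # embedding of an assumed "includes" relation
candidates = torch.randn(100, dim)  # embeddings of candidate entities

scores = transe_score(scene_emb, includes_rel, candidates)
top5 = torch.topk(scores, k=5).indices  # most plausible unrecognized entities
```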
  5. Understanding and learning the actor-to-X interactions (AXIs), such as those between a focal vehicle (the actor) and other traffic participants (e.g., other vehicles and pedestrians) as well as traffic environments (e.g., the city or road map), is essential for developing decision-making models and simulating autonomous driving. Existing practices in imitation learning (IL) for autonomous driving simulation, despite advances in model learnability, have not accounted for fusing and differentiating the heterogeneous AXIs in complex road environments. Furthermore, how to explain the hierarchical structures within the complex AXIs remains largely under-explored. To meet these challenges, we propose HGIL, an interaction-aware and hierarchically-explainable Heterogeneous Graph-based Imitation Learning approach for autonomous driving simulation. We have designed a novel heterogeneous interaction graph (HIG) to provide local and global representation, as well as awareness, of the AXIs. Integrating the HIG as the state embedding, we have designed a hierarchically-explainable generative adversarial imitation learning approach, with local sub-graph and global cross-graph attention, to capture the interaction behaviors and driving decision-making processes. Our data-driven simulation and explanation studies based on the Argoverse v2 dataset (with a total of 40,000 driving scenes) have corroborated the accuracy (e.g., lower displacement errors compared to state-of-the-art (SOTA) approaches) and explainability of HGIL in learning and capturing the complex AXIs.
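The sketch below shows the core adversarial-imitation step that approaches like HGIL build on: a discriminator is trained to separate expert (state, action) pairs from policy rollouts, and its output supplies the imitation reward. The MLP discriminator and tensor shapes are assumptions; HGIL's heterogeneous-graph state encoder and sub-graph/cross-graph attention are omitted.

```python
# Minimal GAIL-style discriminator update: classify expert vs. policy
# (state, action) pairs; the discriminator output then defines the reward
# used by the policy-optimization step (not shown).
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # logit: expert vs. policy

disc = Discriminator(state_dim=64, action_dim=2)
bce = nn.BCEWithLogitsLoss()
expert_s, expert_a = torch.randn(32, 64), torch.randn(32, 2)
policy_s, policy_a = torch.randn(32, 64), torch.randn(32, 2)
loss = bce(disc(expert_s, expert_a), torch.ones(32, 1)) + \
       bce(disc(policy_s, policy_a), torch.zeros(32, 1))
# A common imitation reward for the policy is -log(1 - sigmoid(D(s, a))).
```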