Multimodal Contextualized Semantic Parsing from Speech

Voas, Jordan; Mooney, Raymond; Harwath, David

Citation Details

his paper introduces Semantic Parsing in Contextual Environments (SPICE), a task aimed at improving artificial agents’ contextual awareness by integrating multimodal inputs with prior contexts. Unlike traditional semantic parsing, SPICE provides a structured and interpretable framework for dynamically updating an agent’s knowledge with new information, reflecting the complexity of human communication. To support this task, the authors develop the VG-SPICE dataset, which challenges models to construct visual scene graphs from spoken conversational exchanges, emphasizing the integration of speech and visual data. They also present the Audio-Vision Dialogue Scene Parser (AViD-SP), a model specifically designed for VG-SPICE. Both the dataset and model are released publicly, with the goal of advancing multimodal information processing and integration. more »

Award ID(s):: 2505865

PAR ID:: 10631956

Author(s) / Creator(s):: Voas, Jordan; Mooney, Raymond; Harwath, David

Publisher / Repository:: https://doi.org/10.48550/arXiv.2406.06438

Date Published:: 2024-06-10

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
The DOI is not currently available.

More Like this