Title: Lexical Event Models for Multimodal Dialogues
To understand multimodal interactions between humans, or between humans and machines, it is minimally necessary to identify the content of the agents' communicative acts in the dialogue. This content can involve overt linguistic expressions (speech or writing), content-bearing gestures, or the integration of both. But it must be interpreted relative to a deeper understanding of each agent's Theory of Mind (their mental states, desires, and intentions) in the context of the dialogue as it dynamically unfolds. This, in turn, can require identifying and tracking nonverbal behaviors, such as gaze, body posture, facial expressions, and actions, all of which contribute to understanding how expressions are contextualized in the dialogue and interpreted relative to the epistemic attitudes of each agent. In this paper, we adopt Generative Lexicon's approach to event structure to provide a lexical semantics for ontic and epistemic actions as used in Bolander's interpretation of Dynamic Epistemic Logic, an approach we call Lexical Event Modeling (LEM). This allows for the compositional construction of epistemic models of a dialogue state. We demonstrate how veridical and false-belief scenarios are treated compositionally within this model.
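The formal backbone here is Dynamic Epistemic Logic, in which agents' beliefs are represented over a set of possible worlds and updated by ontic and epistemic actions. The toy sketch below is not the paper's LEM implementation; it is a minimal illustration, with invented names, of the kind of epistemic model being composed: a Sally-Anne-style false-belief scenario encoded as a two-agent Kripke model, where a private ontic action changes the world, one agent's beliefs track the change, and the other agent is left with a false belief.

```python
# Toy illustration (not the paper's LEM implementation): a Sally-Anne style
# false-belief scenario encoded as a two-agent Kripke model in the spirit of
# Dynamic Epistemic Logic.

worlds = {
    "w_basket": {"marble_in_basket": True},
    "w_box": {"marble_in_basket": False},
}

def believes(agent, prop, world, access):
    """agent believes prop at `world` iff prop holds in every world the
    agent considers possible from `world`."""
    reachable = [v for (u, v) in access[agent] if u == world]
    return bool(reachable) and all(worlds[v][prop] for v in reachable)

# Initial situation: the marble is in the basket and both agents know it.
actual = "w_basket"
access = {
    "sally": {("w_basket", "w_basket")},
    "anne": {("w_basket", "w_basket")},
}

# Private ontic action: Anne moves the marble to the box while Sally is out.
# The actual world changes; Anne's beliefs track the change, Sally's do not.
actual = "w_box"
access = {
    "sally": {("w_box", "w_basket")},  # Sally still points at the old state
    "anne": {("w_box", "w_box")},
}

print(believes("sally", "marble_in_basket", actual, access))  # True: a false belief
print(believes("anne", "marble_in_basket", actual, access))   # False: tracks reality
```

The paper's contribution, as described in the abstract, is to derive such updates compositionally from the lexical event structure of the actions involved rather than hand-coding them as above.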
Award ID(s):
2019805
PAR ID:
10586895
Author(s) / Creator(s):
Publisher / Repository:
Springer Nature Switzerland
Date Published:
Page Range / eLocation ID:
174 to 192
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Humans easily interpret expressions that describe unfamiliar situations composed from familiar parts ("greet the pink brontosaurus by the ferris wheel"). Modern neural networks, by contrast, struggle to interpret novel compositions. In this paper, we introduce a new benchmark, gSCAN, for evaluating compositional generalization in situated language understanding. Going beyond a related benchmark that focused on syntactic aspects of generalization, gSCAN defines a language grounded in the states of a grid world, facilitating novel evaluations of acquiring linguistically motivated rules. For example, agents must understand how adjectives such as 'small' are interpreted relative to the current world state or how adverbs such as 'cautiously' combine with new verbs. We test a strong multi-modal baseline model and a state-of-the-art compositional method, finding that, in most cases, they fail dramatically when generalization requires systematic compositional rules.
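As a concrete and purely hypothetical illustration (not drawn from the gSCAN release) of what "interpreted relative to the current world state" means: the referent of "the small circle" below depends on which other circles happen to be present.

```python
# Minimal, hypothetical illustration of a context-dependent adjective:
# "small" picks out the smallest object of the named kind currently present,
# so the same phrase denotes different objects in different world states.

from dataclasses import dataclass

@dataclass
class GridObject:
    kind: str        # e.g. "circle", "square"
    size: int        # integer size, as in grid-world object descriptions
    position: tuple  # (row, column)

def resolve_small(kind, world):
    """Return the smallest object of `kind` in the current world state."""
    candidates = [o for o in world if o.kind == kind]
    return min(candidates, key=lambda o: o.size) if candidates else None

world_a = [GridObject("circle", 2, (0, 0)), GridObject("circle", 4, (3, 1))]
world_b = [GridObject("circle", 3, (2, 2)), GridObject("circle", 1, (1, 4))]

# "the small circle" refers to a different object in each world state.
print(resolve_small("circle", world_a).position)  # (0, 0)
print(resolve_small("circle", world_b).position)  # (1, 4)
```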
  2. Analyzing different modalities of expression can provide insights into the ways that humans interpret, label, and react to images. Such insights have the potential not only to advance our understanding of how humans coordinate these expressive modalities but also to enhance existing methodologies for common AI tasks such as image annotation and classification. We conducted an experiment that co-captured the facial expressions, eye movements, and spoken language data that observers produce while examining images of varying emotional content and responding to description-oriented vs. affect-oriented questions about those images. We analyzed the facial expressions produced by the observers in order to determine the connection between those expressions and an image's emotional content. We also explored the relationship between the valence of an image and the verbal responses to that image, and how that relationship relates to the nature of the prompt, using low-level lexical features and more complex affective features extracted from the observers' verbal responses. Finally, in order to integrate this multimodal data, we extended an existing bitext alignment framework to create meaningful pairings between narrated observations about images and the image regions indicated by eye movement data. The resulting annotations of image regions with words from observers' responses demonstrate the potential of bitext alignment for multimodal data integration and, from an application perspective, for annotation of open-domain images. In addition, we found that while responses to affect-oriented questions appear useful for image understanding, their holistic nature seems less helpful for image region annotation. 
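The pairing step can be pictured with a much simpler, hypothetical stand-in than the bitext alignment framework the paper actually extends: associate each time-stamped word of a verbal response with the image region fixated during an overlapping interval.

```python
# Hypothetical stand-in for the multimodal pairing idea (the paper itself
# extends a bitext alignment framework): link each time-stamped word from a
# spoken response to the image region fixated during an overlapping interval.

def overlaps(a, b):
    """True if intervals a = (start, end) and b = (start, end) overlap."""
    return a[0] < b[1] and b[0] < a[1]

# (word, start_ms, end_ms) from a transcribed verbal response
words = [("smiling", 200, 650), ("child", 700, 1000), ("beach", 1400, 1800)]

# (region_id, start_ms, end_ms) from eye-movement fixations
fixations = [("face_region", 100, 900), ("background_region", 1300, 1900)]

pairs = [
    (word, region)
    for word, ws, we in words
    for region, fs, fe in fixations
    if overlaps((ws, we), (fs, fe))
]
print(pairs)
# [('smiling', 'face_region'), ('child', 'face_region'), ('beach', 'background_region')]
```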
  3. Understanding students' multi-party epistemic and topic-based dialogue contributions, that is, how students present knowledge in group-based chat interactions during collaborative game-based learning, offers valuable insights into group dynamics and learning processes. However, manually annotating these contributions is labor-intensive and challenging. To address this, we develop an automated method for recognizing dialogue acts from text chat data of small groups of middle school students interacting in a collaborative game-based learning environment. Our approach uses dual contrastive learning and label-aware data augmentation to fine-tune large language models' underlying embedding representations within a supervised learning framework for epistemic and topic-based dialogue act classification. Results show that our method achieves a performance improvement of 4% to 8% over baseline methods in two key classification scenarios. These findings highlight the potential of automated dialogue act recognition to support understanding of how meaning-making occurs by focusing on the development and evolution of knowledge in group discourse, ultimately providing teachers with actionable insights to better support student learning.
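The abstract does not spell out the dual contrastive objective, so the sketch below substitutes a generic supervised contrastive loss over utterance embeddings, pulling same-label dialogue acts together and pushing different labels apart (PyTorch; all names and data are illustrative, not the paper's code).

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive objective (a stand-in, not the paper's
    dual contrastive method): same-label embeddings attract, others repel."""
    z = F.normalize(embeddings, dim=1)                # unit-norm embeddings
    sim = z @ z.T / temperature                       # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))   # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)   # keep the sum below finite
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = positives.sum(dim=1).clamp(min=1)
    per_anchor = -(log_prob * positives.float()).sum(dim=1) / pos_counts
    return per_anchor[positives.any(dim=1)].mean()

# Toy batch: four utterance embeddings with dialogue-act labels.
emb = torch.randn(4, 8)
labels = torch.tensor([0, 0, 1, 1])
print(supervised_contrastive_loss(emb, labels))
```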
  4. Earlier epistemic planning systems for multi-agent domains generate plans that contain various types of actions, such as ontic, sensing, or announcement actions. However, none of these systems consider untruthful announcements; that is, none can generate plans that contain a lying or misleading announcement. In this paper, we present a novel epistemic planner, called EFP3.0, for multi-agent domains with untruthful announcements. The planner is similar to the systems EFP and EFP2.0 in that it is a forward-search planner and can deal with unlimited nested beliefs and common knowledge by employing a Kripke-based state representation and implementing an update-model-based transition function. Unlike EFP, EFP3.0 employs a specification language that uses edge-conditioned update models for reasoning about the effects of actions in multi-agent domains. We describe the basics of EFP3.0, conduct experimental evaluations of the system against state-of-the-art epistemic planners, and discuss potential improvements to its scalability and efficiency.
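EFP3.0's Kripke-based states and edge-conditioned update models are beyond a short excerpt, but the control flow of a forward-search epistemic planner can be sketched abstractly. The code below is purely illustrative (invented names, a toy transition function standing in for the update-model semantics), not EFP3.0 itself.

```python
# Schematic forward search over abstract epistemic states: breadth-first
# exploration, where transition(state, action) returns the successor state
# (or None if the action is inapplicable) and the search stops at the goal.

from collections import deque

def forward_search(initial_state, actions, transition, goal_holds):
    frontier = deque([(initial_state, [])])
    visited = {initial_state}
    while frontier:
        state, plan = frontier.popleft()
        if goal_holds(state):
            return plan
        for action in actions:
            nxt = transition(state, action)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None  # no plan found

# Toy domain: a frozenset of belief atoms stands in for a Kripke-based state.
def transition(state, action):
    if action == "sense_p" and "knows_p" not in state:
        return state | frozenset({"knows_p"})       # sensing action
    if action == "announce_p" and "knows_p" in state:
        return state | frozenset({"peer_knows_p"})  # announcement action
    return None

plan = forward_search(frozenset(), ["announce_p", "sense_p"], transition,
                      lambda s: "peer_knows_p" in s)
print(plan)  # ['sense_p', 'announce_p']
```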
  5. Natural language generators for task-oriented dialogue must effectively realize system dialogue actions and their associated semantics. In many applications, it is also desirable for generators to control the style of an utterance. To date, work on task-oriented neural generation has primarily focused on semantic fidelity rather than achieving stylistic goals, while work on style has been done in contexts where it is difficult to measure content preservation. Here we present three different sequence-to-sequence models and carefully test how well they disentangle content and style. We use a statistical generator, PERSONAGE, to synthesize a new corpus of over 88,000 restaurant-domain utterances whose style varies according to models of personality, giving us total control over both the semantic content and the stylistic variation in the training data. We then vary the amount of explicit stylistic supervision given to the three models. We show that our most explicit model can simultaneously achieve high fidelity to both semantic and stylistic goals: this model adds a context vector of 36 stylistic parameters as input to the hidden state of the encoder at each time step, showing the benefits of explicit stylistic supervision even when the amount of training data is large.
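The abstract's most explicit model feeds a 36-parameter stylistic context vector to the encoder at every time step. The sketch below shows one common way to wire up that kind of conditioning, concatenating the style vector to each encoder input step of a GRU; it is an illustrative variant in PyTorch, not the paper's actual architecture or code.

```python
import torch
import torch.nn as nn

class StyleConditionedEncoder(nn.Module):
    """Illustrative sketch: a GRU encoder whose input at every time step is
    the token embedding concatenated with a fixed vector of 36 stylistic
    parameters, so style information is visible at each step."""

    def __init__(self, vocab_size, embed_dim=128, style_dim=36, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim + style_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids, style_vector):
        # token_ids: (batch, seq_len); style_vector: (batch, style_dim)
        emb = self.embed(token_ids)                                     # (B, T, E)
        style = style_vector.unsqueeze(1).expand(-1, emb.size(1), -1)   # (B, T, S)
        return self.gru(torch.cat([emb, style], dim=-1))

encoder = StyleConditionedEncoder(vocab_size=5000)
tokens = torch.randint(0, 5000, (2, 10))   # a toy batch of token ids
style = torch.rand(2, 36)                  # e.g. personality-model parameters
outputs, hidden = encoder(tokens, style)
print(outputs.shape, hidden.shape)  # (2, 10, 256) and (1, 2, 256)
```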