To understand multimodal interactions between humans, or between humans and machines, it is minimally necessary to identify the content of the agents’ communicative acts in the dialogue. This can involve overt linguistic expressions (speech or writing), content-bearing gestures, or the integration of both. But this content must be interpreted relative to a deeper understanding of each agent’s Theory of Mind (their mental states, desires, and intentions) in the context of the dialogue as it dynamically unfolds. This, in turn, can require identifying and tracking nonverbal behaviors, such as gaze, body posture, facial expressions, and actions, all of which contribute to understanding how expressions are contextualized in the dialogue and interpreted relative to the epistemic attitudes of each agent. In this paper, we adopt Generative Lexicon’s approach to event structure to provide a lexical semantics for ontic and epistemic actions as used in Bolander’s interpretation of Dynamic Epistemic Logic, in a framework called Lexical Event Modeling (LEM). This allows for the compositional construction of epistemic models of a dialogue state. We demonstrate how veridical and false-belief scenarios are treated compositionally within this model.
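To make the false-belief case concrete, the following sketch hand-rolls a tiny Dynamic Epistemic Logic style state before and after a private ontic action. The agents, worlds, and propositions are purely illustrative assumptions; this is not the LEM implementation, only a minimal picture of how an unobserved action leaves one agent with a stale belief.

```python
# Minimal sketch of an epistemic state and a private ontic action in the
# style of Dynamic Epistemic Logic. Agents, worlds, and propositions are
# illustrative; this is not the paper's LEM framework itself.
class EpistemicModel:
    def __init__(self, worlds, relations, actual):
        self.worlds = worlds          # world name -> dict of propositional facts
        self.relations = relations    # agent -> set of (w, v) indistinguishability pairs
        self.actual = actual          # designated actual world

    def believes(self, agent, prop, value):
        """Agent believes prop=value iff it holds in every world the agent
        considers possible from the actual world."""
        accessible = {v for (u, v) in self.relations[agent] if u == self.actual}
        return all(self.worlds[w].get(prop) == value for w in accessible)

# Initial, veridical state: both agents know the block is in the box.
m0 = EpistemicModel(
    worlds={"w0": {"block_in_box": True}},
    relations={"A": {("w0", "w0")}, "B": {("w0", "w0")}},
    actual="w0",
)
print(m0.believes("B", "block_in_box", True))   # True: shared, veridical belief

# Private ontic action: A moves the block while B does not observe it.
# After the (hand-rolled) update, the real world has changed, but B still
# considers the pre-action world possible.
m1 = EpistemicModel(
    worlds={"w1": {"block_in_box": False},      # world after the move
            "w0": {"block_in_box": True}},      # stale world B still entertains
    relations={"A": {("w1", "w1")},
               "B": {("w1", "w0"), ("w0", "w0")}},
    actual="w1",
)
print(m1.believes("A", "block_in_box", False))  # True: A tracked the move
print(m1.believes("B", "block_in_box", True))   # True: B now holds a false belief
```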
Common Ground Tracking in Multimodal Dialogue
Within dialogue modeling research in AI and NLP, considerable attention has been devoted to “dialogue state tracking” (DST): updating the representation of the speaker’s needs at each turn in the dialogue by taking the past dialogue moves and history into account. Less studied, but just as important to dialogue modeling, is “common ground tracking” (CGT), which identifies the shared belief space held by all participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and “questions under discussion” (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward the construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.
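As a rough illustration of how predicted dialogue moves can cascade into common-ground updates, the sketch below applies a simplified set of update rules over a QUD set, an evidence bank, and a common-ground bank. The move labels, proposition format, and rules are assumptions for illustration only, not the paper’s exact closure-rule set.

```python
# Simplified sketch of cascading predicted moves into common-ground updates.
# Move labels and update rules are illustrative, not the paper's closure rules.
from dataclasses import dataclass, field

@dataclass
class GroundState:
    quds: set = field(default_factory=set)           # propositions under discussion
    evidence: set = field(default_factory=set)       # asserted, not yet accepted
    common_ground: set = field(default_factory=set)  # jointly accepted propositions

    def apply(self, move: str, prop: str):
        if move == "STATEMENT":
            # An assertion raises the issue and places it on the table as evidence.
            self.quds.add(prop)
            self.evidence.add(prop)
        elif move == "ACCEPT":
            # Group acceptance promotes the proposition to common ground
            # and resolves the corresponding QUD.
            if prop in self.evidence:
                self.common_ground.add(prop)
                self.evidence.discard(prop)
                self.quds.discard(prop)
        elif move == "DOUBT":
            # Doubt keeps the issue open but withdraws it from the evidence bank.
            self.evidence.discard(prop)
            self.quds.add(prop)

state = GroundState()
for move, prop in [("STATEMENT", "red_block = 10g"),
                   ("DOUBT", "red_block = 10g"),
                   ("STATEMENT", "red_block = 10g"),
                   ("ACCEPT", "red_block = 10g")]:
    state.apply(move, prop)

print(state.common_ground)  # {'red_block = 10g'}
```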
- Award ID(s): 2326985
- PAR ID: 10539987
- Publisher / Repository: ACL
- Date Published:
- Format(s): Medium: X
- Location: Turin, Italy
- Sponsoring Org: National Science Foundation
More Like this
Multimodal dialogue involving multiple participants presents complex computational challenges, primarily due to the rich interplay of diverse communicative modalities, including speech, gesture, action, and gaze. These modalities interact in complex ways that traditional dialogue systems often struggle to accurately track and interpret. To address these challenges, we extend the textual enrichment strategy of Dense Paraphrasing (DP) by translating each nonverbal modality into linguistic expressions. By normalizing multimodal information into a language-based form, we hope both to simplify the representation of situated dialogues and to enhance their computational understanding. We show the effectiveness of the dense paraphrased language form by evaluating instruction-tuned Large Language Models (LLMs) against the Common Ground Tracking (CGT) problem using a publicly available collaborative problem-solving dialogue dataset. Instead of using multimodal LLMs, the dense paraphrasing technique represents the dialogue information from multiple modalities in a compact and structured machine-readable text format that can be directly processed by language-only models. We leverage the capability of LLMs to transform machine-readable paraphrases into human-readable paraphrases, and show that this process can further improve results on the CGT task. Overall, the results show that augmenting the context with dense paraphrasing effectively facilitates the LLMs' alignment of information from multiple modalities and substantially improves the performance of common ground reasoning over the baselines. Our proposed pipeline with original utterances as input context already achieves results comparable to the baseline that used decontextualized utterances containing rich coreference information; when also using the decontextualized input, our pipeline substantially improves common ground reasoning over the baselines. We discuss the potential of DP to create a robust model that can effectively interpret and integrate the subtleties of multimodal communication, thereby improving dialogue system performance in real-world settings.
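The sketch below illustrates, under assumed annotation fields and templates, how gesture and action annotations might be rendered as text and concatenated into a prompt for a language-only model. It is not the paper’s DP pipeline; the turn schema and wording are hypothetical.

```python
# Illustrative sketch of rendering nonverbal annotations as text for a
# language-only LLM. The annotation schema and templates are assumptions,
# not the paper's Dense Paraphrasing implementation.
def dense_paraphrase(turn: dict) -> str:
    """Render speech, gesture, and action annotations as one linguistic string."""
    parts = [f'{turn["speaker"]} says: "{turn["utterance"]}"']
    if turn.get("gesture"):
        parts.append(f'{turn["speaker"]} points at {turn["gesture"]["target"]}')
    if turn.get("action"):
        parts.append(f'{turn["speaker"]} {turn["action"]}')
    return "; ".join(parts) + "."

turns = [
    {"speaker": "P1", "utterance": "This one is ten grams.",
     "gesture": {"target": "the red block"}, "action": None},
    {"speaker": "P2", "utterance": "Okay, put it on the scale.",
     "gesture": None, "action": "places the red block on the scale"},
]

# The paraphrased turns become plain-text context for an instruction-tuned LLM.
context = "\n".join(dense_paraphrase(t) for t in turns)
prompt = context + "\n\nWhich propositions do the participants now accept as common ground?"
print(prompt)
```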
Collaborative tasks often begin with partial task knowledge and incomplete initial plans from each partner. To complete these tasks, agents need to engage in situated communication with their partners and coordinate their partial plans toward a complete plan that achieves the joint task goal. While such collaboration seems effortless in a human-human team, it is highly challenging for human-AI collaboration. To address this limitation, this paper takes a step toward collaborative plan acquisition, where humans and agents strive to learn from and communicate with each other to acquire a complete plan for joint tasks. Specifically, we formulate a novel problem in which agents predict the missing task knowledge for themselves and for their partners based on rich perceptual and dialogue history. We extend a situated dialogue benchmark for symmetric collaborative tasks in a 3D blocks world and investigate computational strategies for plan acquisition. Our empirical results suggest that predicting the partner's missing knowledge is more viable than predicting one's own. We show that explicitly modeling the partner's dialogue moves and mental states produces better and more stable results than omitting them. These results provide insight for future AI agents that can predict what knowledge their partner is missing and proactively communicate it, helping the partner reach a common understanding of the joint task.
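A minimal sketch of how the prediction problem might be posed is shown below: score each candidate plan edge for whether it is missing from the partner’s knowledge, conditioned on an encoding of the shared interaction history. The architecture, feature dimensions, and edge encoding are illustrative assumptions, not the paper’s model.

```python
# Illustrative sketch of partner-knowledge prediction as binary classification
# over candidate plan edges. Architecture and dimensions are assumptions.
import torch
import torch.nn as nn

class PartnerKnowledgePredictor(nn.Module):
    def __init__(self, history_dim=256, edge_dim=64, hidden=128):
        super().__init__()
        # Scores whether a candidate plan edge is missing from the partner's
        # knowledge, given a pooled encoding of dialogue and perception history.
        self.scorer = nn.Sequential(
            nn.Linear(history_dim + edge_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, history_enc, edge_enc):
        # history_enc: (batch, history_dim) encoded interaction history
        # edge_enc:    (batch, edge_dim)    embedding of one candidate plan edge
        return torch.sigmoid(self.scorer(torch.cat([history_enc, edge_enc], dim=-1)))

model = PartnerKnowledgePredictor()
p_missing = model(torch.randn(4, 256), torch.randn(4, 64))
print(p_missing.shape)  # torch.Size([4, 1]): probability each edge is missing
```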
Social and Functional Pressures in Vocal Alignment: Differences for Human and Voice-AI Interlocutors
Increasingly, people are having conversational interactions with voice-AI systems, such as Amazon's Alexa. Do the same social and functional pressures that mediate alignment toward human interlocutors also predict alignment patterns toward voice-AI? We designed an interactive dialogue task to investigate this question. Each trial consisted of scripted, interactive turns between a participant and a model talker (pre-recorded from either a natural production or voice-AI): first, participants produced target words in a carrier phrase; then, the model talker responded with an utterance containing the target word. The interlocutor responses varied by (1) communicative affect (social) and (2) correctness (functional). Finally, participants repeated the carrier phrase. Degree of phonetic alignment was assessed acoustically between the target word in the model's response and the participant's response. Results indicate that social and functional factors distinctly mediate alignment toward AI and humans. Findings are discussed with reference to theories of alignment and human-computer interaction.
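One common way such alignment is quantified acoustically is a difference-in-distance measure: did the participant’s post-exposure production move closer to the model talker? The sketch below computes that quantity over toy feature vectors; the features and values are illustrative and this is not the study’s exact analysis.

```python
# Illustrative difference-in-distance (DID) measure of phonetic alignment.
# Features (word duration in s, mean F0 in Hz) are toy values; z-scoring of
# features is omitted for brevity.
import math

def acoustic_distance(a, b):
    """Euclidean distance between two acoustic feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def alignment_did(pre, post, model):
    """Positive values mean the participant moved toward the model talker."""
    return acoustic_distance(pre, model) - acoustic_distance(post, model)

pre_production  = (0.42, 210.0)   # before hearing the model talker
post_production = (0.38, 220.0)   # after hearing the model talker
model_talker    = (0.35, 228.0)

print(alignment_did(pre_production, post_production, model_talker))  # > 0: alignment
```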
Information that is shared across brains is encoded in idiosyncratic fine-scale functional topographies. Hyperalignment captures shared information by projecting pattern vectors for neural responses and connectivities into a common, high-dimensional information space, rather than by aligning topographies in a canonical anatomical space. Individual transformation matrices project information from individual anatomical spaces into the common model information space, preserving the geometry of pairwise dissimilarities between pattern vectors, and model cortical topography as mixtures of overlapping, individual-specific topographic basis functions, rather than as contiguous functional areas. The fundamental property of brain function that is preserved across brains is information content, rather than the functional properties of local features that support that content. In this Perspective, we present the conceptual framework that motivates hyperalignment, describe its computational underpinnings for jointly modeling a common information space and idiosyncratic cortical topographies, and discuss the implications for understanding the structure of cortical functional architecture.
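The core alignment step can be illustrated with an orthogonal Procrustes fit, the basic building block of hyperalignment: an orthogonal transform maps one subject’s response patterns into a reference space while preserving pairwise distances between patterns. The sketch below aligns one simulated subject to a reference and omits the iterative construction of the common model space; the data and dimensions are synthetic.

```python
# Illustrative single-subject hyperalignment step using orthogonal Procrustes.
# Synthetic data; the full procedure iterates over subjects to build the
# common model space.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
n_timepoints, n_vertices = 300, 500

reference = rng.standard_normal((n_timepoints, n_vertices))  # common-space responses
# Simulate a subject whose topography is an orthogonal remixing of the reference.
q_true, _ = np.linalg.qr(rng.standard_normal((n_vertices, n_vertices)))
subject = reference @ q_true + 0.1 * rng.standard_normal((n_timepoints, n_vertices))

# Solve for the orthogonal transform mapping the subject into the reference
# space; orthogonality preserves the geometry of pairwise dissimilarities.
R, _ = orthogonal_procrustes(subject, reference)
aligned = subject @ R

print(np.linalg.norm(subject - reference))  # large mismatch before alignment
print(np.linalg.norm(aligned - reference))  # much smaller after alignment
```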