Human communication is a collaborative process. Speakers, on top of conveying their own intent, adjust their content and wording to their listeners, taking into account the listeners' background knowledge, personalities, and physical capabilities. Towards building AI agents with similar abilities in language communication, we propose the Pragmatic Rational Speaker (PRS), a framework extending the Rational Speech Act (RSA) framework. The PRS attempts to learn the speaker-listener disparity and adjust its speech accordingly, by adding a lightweight disparity adjustment layer to working memory on top of the speaker’s long-term memory system. Because the long-term memory is fixed, the PRS only needs to update its working memory to learn and adapt to different types of listeners. To validate our framework, we create a dataset that simulates different types of speaker-listener disparities in the context of referential games. Our empirical results demonstrate that the PRS is able to shift its output towards language that listeners are able to understand, significantly improving the collaborative task outcome.
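The following is a minimal sketch of the general idea, not the PRS implementation: a standard RSA pragmatic speaker for a toy referential game, with an optional per-utterance reweighting. The lexicon, the `understood` scores, and all function names are assumptions made for illustration; in the abstract the disparity adjustment is learned, whereas here it is set by hand.

```python
def literal_listener(lexicon, utterance):
    """L0(object | utterance): normalize the truth values of one lexicon row."""
    row = lexicon[utterance]
    total = sum(row)
    return [v / total for v in row] if total > 0 else [0.0] * len(row)


def pragmatic_speaker(lexicon, target, alpha=1.0, understood=None):
    """S1(utterance | target), proportional to L0(target | utterance) ** alpha.

    `understood` is a hypothetical per-utterance weight in [0, 1] standing in
    for a disparity adjustment; it is hand-set here, not learned.
    """
    utterances = list(lexicon)
    if understood is None:
        understood = {u: 1.0 for u in utterances}
    scores = {}
    for u in utterances:
        p = literal_listener(lexicon, u)[target]
        scores[u] = understood[u] * (p ** alpha)
    z = sum(scores.values())
    return {u: s / z for u, s in scores.items()}


# Toy game (made up): objects are 0 = red circle, 1 = red square, 2 = blue square.
LEXICON = {
    "red": [1, 1, 0],
    "circle": [1, 0, 0],
    "square": [0, 1, 1],
}

# Without adjustment the speaker prefers the more informative word "circle".
plain = pragmatic_speaker(LEXICON, target=0)

# A listener assumed not to understand shape words pushes the speaker to "red".
adjusted = pragmatic_speaker(
    LEXICON, target=0, understood={"red": 1.0, "circle": 0.0, "square": 0.0}
)
```

In this toy setting the base speaker model is left unchanged and only the weights vary per listener, which loosely mirrors the fixed long-term memory and adaptable working memory described above.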
This content will become publicly available on December 2, 2025
DIM: Dyadic Interaction Modeling for Social Behavior Generation
Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work considers a listener as a reactive agent with reflexive behaviors to the speaker’s voice and facial motions. The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers’ and listeners’ motions through masking and contrastive learning to learn representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations through VQ-VAE. The pre-trained model is further fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state of the art according to quantitative measures capturing the diversity and realism of generated motions. Qualitative results demonstrate the superior capabilities of the proposed approach in generating diverse and realistic expressions, eye blinks, and head gestures.
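As a rough illustration of the discrete latent representations mentioned above, the sketch below shows only the nearest-codebook lookup used in vector quantization. It is not the DIM code: the codebook values, the two-dimensional "motion features", and the `quantize` function are assumptions made for illustration, and the training side of a VQ-VAE (encoder, decoder, codebook and commitment losses) is omitted.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical codebook of three 2-D codes and three encoded motion frames.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

# Each continuous frame becomes a discrete token (a codebook index).
tokens = [quantize(f, CODEBOOK)[0] for f in frames]
```

Once motions are tokens, a generator can predict a distribution over codebook indices and sample from it, which is one common way to obtain non-deterministic outputs.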
- Award ID(s): 2211550
- PAR ID: 10620786
- Editor(s): Leonardis, Aleš; Ricci, Elisa; Roth, Stefan; Russakovsky, Olga; Sattler, Torsten; Varol, Gul
- Publisher / Repository: Springer Nature Switzerland
- Date Published:
- Page Range / eLocation ID: 484 to 503
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communications in the context of value alignment, in which collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users’ values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems.
-
This study examines messages that problem disclosers and supportive listeners enact during conversations about everyday stressors. We coded 402 dyadic interactions between strangers, friends, and romantic couples using Stiles’s (1992) verbal response modes (VRM) and Burleson’s (1982) verbal person centeredness (PC) typology to explore whether (a) listener and discloser utterances coalesce into types of speaking turns, (b) listener turn types vary in person-centered quality, (c) listener turns relate to discloser responses, and (d) discloser responses relate to listener turns. Analyses revealed a typology for both listener and discloser turns: acknowledgment, advisement, question, elaboration, hedged disclosure, and reflection. The relative proportion of those types varied as a function of conversational role and relationship context, and these speech acts varied only minimally in PC. Configural frequency analyses revealed four greater-than-chance contingencies across data sets. The discussion highlights implications for a dyadic and dynamic understanding of supportive communication.
-
Purpose: The “bubble noise” technique has recently been introduced as a method to identify the regions in time–frequency maps (i.e., spectrograms) of speech that are especially important for listeners in speech recognition. This technique identifies regions of “importance” that are specific to the speech stimulus and the listener, thus permitting these regions to be compared across different listener groups. For example, in cross-linguistic and second-language (L2) speech perception, this method identifies differences in regions of importance in accomplishing decisions of phoneme category membership. This research note describes the application of bubble noise to the study of language learning for 3 different language pairs: Hindi-English bilinguals' perception of the /v/–/w/ contrast in American English, native English speakers' perception of the tense/lax contrast for Korean fricatives and affricates, and native English speakers' perception of Mandarin lexical tone. Conclusion: We demonstrate that this technique provides insight into what information in the speech signal is important for native/first-language listeners compared to nonnative/L2 listeners. Furthermore, the method can be used to examine whether L2 speech perception training is effective in bringing the listener's attention to the important cues.
-
Participants in a conversation must carefully monitor the turn-management (speaking and listening) willingness of other conversational partners and adjust their turn-changing behaviors accordingly to have smooth conversation. Many studies have focused on developing actual turn-changing (i.e., next speaker or end-of-turn) models that can predict whether turn-keeping or turn-changing will occur. Participants' verbal and non-verbal behaviors have been used as input features for predictive models. To the best of our knowledge, these studies only model the relationship between participant behavior and turn-changing. Thus, there is no model that takes into account participants' willingness to acquire a turn (turn-management willingness). In this paper, we address the challenge of building such models to predict the willingness of both speakers and listeners. Firstly, we find that dissonance exists between willingness and actual turn-changing. Secondly, we propose predictive models that are based on trimodal inputs, including acoustic, linguistic, and visual cues distilled from conversations. Additionally, we study the impact of modeling willingness to help improve the task of turn-changing prediction. To do so, we introduce a dyadic conversation corpus with annotated scores of speaker/listener turn-management willingness. Our results show that using all three modalities (i.e., acoustic, linguistic, and visual cues) of the speaker and listener is critically important for predicting turn-management willingness. Furthermore, explicitly adding willingness as a prediction task improves the performance of turn-changing prediction. Moreover, turn-management willingness prediction becomes more accurate when this joint prediction of turn-management willingness and turn-changing is performed by using multi-task learning techniques.
