Understanding why certain individuals work well (or poorly) together as a team is a key research focus in the psychological and behavioral sciences and a fundamental problem for team-based organizations. Nevertheless, we have a limited ability to predict the social and work-related dynamics that will emerge from a given combination of team members. In this work, we model vocal turn-taking behavior within conversations as a parametric stochastic process on a network composed of the team members. More precisely, we model the dynamics of exchanging the 'speaker token' among team members as a random walk on a graph, driven by both individual-level features and the conversation history. We fit our model to conversational turn-taking data extracted from audio recordings of multinational student teams during undergraduate engineering design internships. Using these real-world data, we validate the explanatory power of our model and unveil statistically significant differences in speaking behavior between team members of different nationalities.
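A minimal sketch of such a history-driven walk, assuming an illustrative parametrisation (per-member propensity weights plus an exponential-decay recency boost, combined through a softmax); the function name and the specific functional form are assumptions for illustration, not the authors' fitted model:

```python
import numpy as np

def next_speaker_probs(propensity, history, decay=0.5):
    """Illustrative transition kernel for the 'speaker token' random walk.

    propensity : per-member individual-level weights (np.ndarray)
    history    : indices of members who held the token, oldest first
    """
    # Recency term: members who spoke recently get an exponentially decayed boost.
    recency = np.zeros_like(propensity, dtype=float)
    for age, member in enumerate(reversed(history)):
        recency[member] += decay ** age
    scores = np.exp(propensity + recency)
    scores[history[-1]] = 0.0  # the token must pass to a different member
    return scores / scores.sum()
```

With equal individual propensities, a member who spoke earlier in the conversation remains more likely to receive the token back than one who has not spoken at all.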
Multimodal Turn Analysis and Prediction for Multi-party Conversations
This paper presents a computational study to analyze and predict turns (i.e., turn-taking and turn-keeping) in multiparty conversations. Specifically, we use a high-fidelity hybrid data acquisition system to capture a large-scale set of multimodal natural conversational behaviors of interlocutors in three-party conversations, including gaze, head movements, body movements, and speech. Based on the inter-pausal units (IPUs) extracted from this in-house dataset, we propose a transformer-based computational model that predicts turns from the interlocutor states (speaking/back-channeling/silence) and the gaze targets. Our model robustly achieves more than 80% accuracy, and its generalizability was extensively validated through cross-group experiments. We also introduce a novel computational metric, the "relative engagement level" (REL) of IPUs, and validate its statistically significant difference between turn-keeping IPUs and turn-taking IPUs, and between different conversational groups. Our experimental results further show that the patterns of the interlocutor states are a more effective cue than gaze behaviors for predicting turns in multiparty conversations.
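Inter-pausal units are conventionally defined as stretches of one speaker's speech separated by silences longer than a fixed threshold. A minimal sketch of that segmentation step, assuming a 200 ms threshold (a common convention; the paper's exact threshold is not stated here):

```python
def extract_ipus(voiced_intervals, pause_threshold=0.2):
    """Merge a speaker's voiced intervals into inter-pausal units (IPUs).

    voiced_intervals : sorted list of (start, end) times in seconds
    pause_threshold  : silences shorter than this do not break an IPU
    """
    ipus = [list(voiced_intervals[0])]
    for start, end in voiced_intervals[1:]:
        if start - ipus[-1][1] < pause_threshold:
            ipus[-1][1] = end          # short pause: extend the current IPU
        else:
            ipus.append([start, end])  # long pause: start a new IPU
    return [tuple(u) for u in ipus]
```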
- Award ID(s):
- 2005430
- PAR ID:
- 10532343
- Publisher / Repository:
- ACM
- Date Published:
- ISBN:
- 9798400700552
- Page Range / eLocation ID:
- 436 to 444
- Subject(s) / Keyword(s):
- Multi-party conversations; conversational gesture understanding; multimodal interaction; machine learning; human-human interaction; empirical studies
- Format(s):
- Medium: X
- Location:
- Paris France
- Sponsoring Org:
- National Science Foundation
More Like this
-
Participants in a conversation must carefully monitor the turn-management (speaking and listening) willingness of the other conversational partners and adjust their turn-changing behaviors accordingly to keep the conversation smooth. Many studies have focused on developing turn-changing (i.e., next-speaker or end-of-turn) models that predict whether turn-keeping or turn-changing will occur. Participants' verbal and non-verbal behaviors have been used as input features for these predictive models. To the best of our knowledge, existing studies only model the relationship between participant behavior and turn-changing; there is no model that accounts for participants' willingness to acquire a turn (turn-management willingness). In this paper, we address the challenge of building such models to predict the willingness of both speakers and listeners. First, we find that a dissonance exists between willingness and actual turn-changing. Second, we propose predictive models based on trimodal inputs: acoustic, linguistic, and visual cues distilled from conversations. We also study whether modeling willingness helps improve turn-changing prediction. To do so, we introduce a dyadic conversation corpus with annotated scores of speaker/listener turn-management willingness. Our results show that using all three modalities of both the speaker and the listener is critically important for predicting turn-management willingness. Furthermore, explicitly adding willingness as a prediction task improves the performance of turn-changing prediction, and turn-management willingness prediction becomes more accurate when willingness and turn-changing are predicted jointly using multi-task learning techniques.
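The joint-prediction idea can be sketched as a weighted multi-task objective: a classification term for turn-changing plus a regression term for the willingness score. The weighting scheme, loss choices, and function name below are illustrative assumptions, not the paper's exact setup:

```python
import math

def multitask_loss(turn_logit, turn_label, will_pred, will_true, alpha=0.5):
    """Joint loss: turn-changing classification + willingness regression.

    turn_logit : raw model score for turn-changing (higher = more likely)
    turn_label : ground truth, 1 = turn-changing, 0 = turn-keeping
    will_pred / will_true : predicted and annotated willingness scores
    alpha      : weight balancing the two task losses
    """
    p = 1.0 / (1.0 + math.exp(-turn_logit))  # sigmoid
    bce = -(turn_label * math.log(p) + (1 - turn_label) * math.log(1 - p))
    mse = (will_pred - will_true) ** 2
    return alpha * bce + (1 - alpha) * mse
```

Sharing parameters while minimizing this combined loss is what lets the willingness task regularize, and be regularized by, turn-changing prediction.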
-
We present a database for automatic understanding of Social Engagement in MultiParty Interaction (SEMPI). Social engagement is an important social signal characterizing the level of participation of an interlocutor in a conversation; it involves maintaining attention and establishing connection and rapport. Machine understanding of social engagement can enable an autonomous agent to better gauge the state of human participation and involvement and select optimal actions in human-machine social interaction. Recently, video-mediated interaction platforms, e.g., Zoom, have become very popular. The ease of use and increased accessibility of video calls have made them a preferred medium for multiparty conversations, including support groups and group therapy sessions. To create this dataset, we first collected a set of publicly available video calls posted on YouTube. We then segmented the videos by speech turn and cropped them to generate single-participant videos. We developed a questionnaire for assessing the level of social engagement by listeners in a conversation, probing the nonverbal behaviors relevant to social engagement, including back-channeling, gaze, and expressions. Using Prolific, a crowd-sourcing platform, we had three people annotate each of 3,505 videos of 76 listeners, reaching a moderate-to-high inter-rater agreement of 0.693. This resulted in a database with aggregated engagement scores from the annotators. We developed a baseline multimodal pipeline using state-of-the-art pre-trained models to track the level of engagement, achieving a concordance correlation coefficient (CCC) of 0.454. The results demonstrate the utility of the database for future applications in video-mediated human-machine interaction and human-human social skill assessment. Our dataset and code are available at https://github.com/ihp-lab/SEMPI.
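The CCC reported above is Lin's concordance correlation coefficient, a standard agreement measure between predicted and annotated scores. A minimal implementation (using population variances, the usual convention for this metric):

```python
import numpy as np

def ccc(pred, true):
    """Lin's concordance correlation coefficient between two score series."""
    x, y = np.asarray(pred, float), np.asarray(true, float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    # Penalizes both low correlation and shifts in mean or scale.
    return 2.0 * cov / (x.var() + y.var() + (mx - my) ** 2)
```

Unlike plain Pearson correlation, CCC drops below 1 when predictions are correlated with but systematically offset from the annotations, which is why it is preferred for continuous engagement tracking.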
-
Effective storytelling relies on engagement and interaction. This work develops an automated software platform for telling stories to children and investigates the impact of two design choices on children's engagement and willingness to interact with the system: story distribution and the use of complex gesture. A storyteller condition compares stories told in a third-person narrator voice with those distributed between a narrator and first-person story characters. Basic gestures are used in all of our storytellings, but, in a second factor, some are augmented with gestures that indicate conversational turn changes, reference other characters, and prompt children to ask questions. An analysis of eye gaze indicates that children attend more to the story when a distributed storytelling model is used. Gesture prompts appear to encourage children to ask questions, something the children did, but at a relatively low rate. Interestingly, the children most frequently asked "why" questions. Gaze switching happened more quickly when the story characters began to speak than for narrator turns. These results have implications for future agent-based storytelling system research.