
Title: Generative Multimodal Models of Nonverbal Synchrony in Close Relationships
Positive interpersonal relationships require shared understanding along with a sense of rapport. A key facet of rapport is mirroring and convergence of facial expression and body language, known as nonverbal synchrony. We examined nonverbal synchrony in a study of 29 heterosexual romantic couples, in which audio, video, and bracelet accelerometer data were recorded during three conversations. We extracted facial expression, body movement, and acoustic-prosodic features to train neural network models that predicted the nonverbal behaviors of one partner from those of the other. Recurrent models (LSTMs) outperformed feed-forward neural networks and other chance baselines. The models learned behaviors encompassing facial responses, speech-related facial movements, and head movement. However, they did not capture fleeting or periodic behaviors, such as nodding, head turning, and hand gestures. Notably, a preliminary analysis of clinical measures showed greater association with our model outputs than with correlation of raw signals. We discuss potential uses of these generative models as a research tool to complement current analytical methods, along with real-world applications (e.g., as a tool in therapy).
Award ID(s):
1745442 1660894
Publication Date:
Journal Name:
The 13th IEEE International Conference on Automatic Face and Gesture Recognition
Page Range or eLocation-ID:
195 to 202
Sponsoring Org:
National Science Foundation
More Like this
  1. Displaying emotional states is an important part of nonverbal communication that can facilitate successful interactions. Facial expressions have been studied for their emotional expression, but this work looks at the capacity of body movements to convey different emotions. This work first generates a large set of nonverbal behaviors with a variety of torso and arm properties on a humanoid robot, Quori. Participants in a user study evaluated how much each movement displayed each of eight different emotions. Results indicate that specific movement properties are associated with particular emotions, such as leaning backward and arms held high displaying surprise and leaning forward displaying sadness. Understanding the emotions associated with certain movements can allow for the design of more appropriate behaviors during interactions with humans and could improve people’s perception of the robot.
  2. This study focuses on the individual and joint contributions of two nonverbal channels (i.e., face and upper body) in avatar-mediated virtual environments. 140 dyads were randomly assigned to communicate with each other via platforms that differentially activated or deactivated facial and bodily nonverbal cues. The availability of facial expressions had a positive effect on interpersonal outcomes. More specifically, dyads that were able to see their partner’s facial movements mapped onto their avatars liked each other more, formed more accurate impressions about their partners, and described their interaction experiences more positively compared to those unable to see facial movements. However, the latter was only true when their partner’s bodily gestures were also available and not when only facial movements were available. Dyads showed greater nonverbal synchrony when they could see their partner’s bodily and facial movements. This study also employed machine learning to explore whether nonverbal cues could predict interpersonal attraction. These classifiers predicted high and low interpersonal attraction at an accuracy rate of 65%. These findings highlight the relative significance of facial cues compared to bodily cues on interpersonal outcomes in virtual environments and lend insight into the potential of automatically tracked nonverbal cues to predict interpersonal attitudes.

  3. This research work explores different machine learning techniques for recognizing the existence of rapport between two people engaged in a conversation, based on their facial expressions. First, using artificially generated pairs of correlated data signals, a coupled gated recurrent unit (cGRU) neural network is developed to measure the extent of similarity between the temporal evolution of pairs of time-series signals. By pre-selecting their covariance values (between 0.1 and 1.0), pairs of coupled sequences are generated. Using the developed cGRU architecture, this covariance between the signals is successfully recovered. Using this and various other coupled architectures, tests for rapport (measured by the extent of mirroring and mimicking of behaviors) are conducted on real-life datasets. On fifty-nine (N = 59) pairs of interactants in an interview setting, a transformer-based coupled architecture performs the best in determining the existence of rapport. To test for generalization, the models were applied to never-before-seen data collected 14 years prior, also to predict the existence of rapport. The coupled transformer model again performed the best for this transfer learning task, determining which pairs of interactants had rapport and which did not. The experiments and results demonstrate the advantages of coupled architectures for predicting an interactional process such as rapport, even in the presence of limited data.
  4. Raynal, Ann M. ; Ranney, Kenneth I. (Ed.)
    Most research in technologies for the Deaf community has focused on translation using either video or wearable devices. Sensor-augmented gloves have been reported to yield higher gesture recognition rates than camera-based systems; however, they cannot capture information expressed through head and body movement. Gloves are also intrusive and inhibit users in their pursuit of normal daily life, while cameras can raise concerns over privacy and are ineffective in the dark. In contrast, RF sensors are non-contact, non-invasive and do not reveal private information even if hacked. Although RF sensors are unable to measure facial expressions or hand shapes, which would be required for complete translation, this paper aims to exploit near real-time ASL recognition using RF sensors for the design of smart Deaf spaces. In this way, we hope to enable the Deaf community to benefit from advances in technologies that could generate tangible improvements in their quality of life. More specifically, this paper investigates near real-time implementation of machine learning and deep learning architectures for the purpose of sequential ASL signing recognition. We utilize a 60 GHz RF sensor which transmits a frequency-modulated continuous-wave (FMCW) waveform. RF sensors can acquire a unique source of information that is inaccessible to optical or wearable devices: namely, a visual representation of the kinematic patterns of motion via the micro-Doppler signature. Micro-Doppler refers to frequency modulations that appear about the central Doppler shift, which are caused by rotational or vibrational motions that deviate from principal translational motion. In prior work, we showed that fractal complexity computed from RF data could be used to discriminate signing from daily activities and that RF data could reveal linguistic properties, such as coarticulation.
We have also shown that machine learning can be used to discriminate with 99% accuracy the signing of native Deaf ASL users from that of copysigning (or imitation signing) by hearing individuals. Therefore, imitation signing data is not effective for directly training deep models. But adversarial learning can be used to transform imitation signing to resemble native signing, or, alternatively, physics-aware generative models can be used to synthesize ASL micro-Doppler signatures for training deep neural networks. With such approaches, we have achieved over 90% recognition accuracy of 20 ASL signs. In natural environments, however, near real-time implementations of classification algorithms are required, as well as an ability to process data streams in a continuous and sequential fashion. In this work, we focus on extensions of our prior work towards this aim, and compare the efficacy of various approaches for embedding deep neural networks (DNNs) on platforms such as a Raspberry Pi or Jetson board. We examine methods for optimizing the size and computational complexity of DNNs for embedded micro-Doppler analysis, methods for network compression, and their resulting sequential ASL recognition performance.
  5. The overall goal of our research is to develop a system of intelligent multimodal affective pedagogical agents that are effective for different types of learners (Adamo et al., 2021). While most of the research on pedagogical agents tends to focus on the cognitive aspects of online learning and instruction, this project explores the less-studied role of affective (or emotional) factors. We aim to design believable animated agents that can convey realistic, natural emotions through speech, facial expressions, and body gestures and that can react to the students’ detected emotional states with emotional intelligence. Within the context of this goal, the specific objective of the work reported in the paper was to examine the extent to which the agents’ facial micro-expressions affect students’ perception of the agents’ emotions and their naturalness. Micro-expressions are very brief facial expressions that occur when a person either deliberately or unconsciously conceals an emotion being felt (Ekman & Friesen, 1969). Our assumption is that if the animated agents display facial micro-expressions in addition to macro-expressions, they will convey higher expressive richness and naturalness to the viewer, as “the agents can possess two emotional streams, one based on interaction with the viewer and the other based on their own internal state, or situation” (Queiroz et al. 2014, p.2). The work reported in the paper involved two studies with human subjects. The objectives of the first study were to examine whether people can recognize micro-expressions (in isolation) in animated agents, and whether there are differences in recognition based on the agent’s visual style (e.g., stylized versus realistic).
The objectives of the second study were to investigate whether people can recognize the animated agents’ micro-expressions when integrated with macro-expressions; the extent to which the presence of micro- + macro-expressions affects the perceived expressivity and naturalness of the animated agents; the extent to which exaggerating the micro-expressions (e.g., increasing the amplitude of the animated facial displacements) affects emotion recognition and perceived agent naturalness and emotional expressivity; and whether there are differences based on the agent’s design characteristics. In the first study, 15 participants watched eight micro-expression animations representing four different emotions (happy, sad, fear, surprised). Four animations featured a stylized agent and four a realistic agent. For each animation, subjects were asked to identify the agent’s emotion conveyed by the micro-expression. In the second study, 234 participants watched three sets of eight animation clips (24 clips in total, 12 clips per agent). Four animations for each agent featured the character performing macro-expressions only, four animations for each agent featured the character performing macro- + micro-expressions without exaggeration, and four animations for each agent featured the agent performing macro- + micro-expressions with exaggeration. Participants were asked to recognize the true emotion of the agent and rate the emotional expressivity and naturalness of the agent in each clip using a 5-point Likert scale. We have collected all the data and completed the statistical analysis. Findings and discussion, implications for research and practice, and suggestions for future work will be reported in the full paper. References: Adamo, N., Benes, B., Mayer, R., Lei, X., Meyer, Z., & Lawson, A. (2021). Multimodal Affective Pedagogical Agents for Different Types of Learners. In: Russo D., Ahram T., Karwowski W., Di Bucchianico G., Taiar R. (eds) Intelligent Human Systems Integration 2021.
IHSI 2021. Advances in Intelligent Systems and Computing, 1322. Springer, Cham. Ekman, P., & Friesen, W. V. (1969, February). Nonverbal leakage and clues to deception. Psychiatry, 32(1), 88–106. Queiroz, R. B., Musse, S. R., & Badler, N. I. (2014). Investigating Macroexpressions and Microexpressions in Computer Graphics Animated Faces. Presence, 23(2), 191–208.
