Search for: All records

Creators/Authors contains: "Yin, Yufeng"

Note: Clicking a Digital Object Identifier (DOI) link takes you to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the embargo (administrative interval).

  1. Individual variability in expressive behavior is a major challenge for emotion recognition systems. Personalized emotion recognition adapts machine learning models to individual behaviors, improving recognition performance and overcoming the limitations of generalized systems. However, existing audiovisual emotion recognition datasets either contain very few data points per speaker or cover a limited number of speakers. This scarcity severely limits the development and assessment of personalized models, hindering their ability to learn and adapt to individual expressive styles. This paper introduces EmoCeleb: a large-scale, weakly labeled emotion dataset generated via cross-modal labeling. EmoCeleb comprises over 150 hours of audiovisual content from approximately 1,500 speakers, with a median of 50 utterances per speaker, providing a rich resource for developing and benchmarking personalized emotion recognition methods, including those that require substantial data per individual, such as set learning approaches. We also propose SetPeER: a novel personalized emotion recognition architecture based on set learning. SetPeER captures individual expressive styles by learning representative speaker features from limited data, achieving strong performance with as few as eight utterances per speaker and overcoming the limitations of previous approaches that struggle to learn from little data per individual (a minimal set-encoder sketch follows this list). Through extensive experiments on EmoCeleb and the established MSP-Podcast and MSP-Improv benchmarks, we demonstrate the effectiveness of our dataset and the superior performance of SetPeER over existing emotion recognition methods. Our work paves the way for more robust and accurate personalized emotion recognition systems.
    Free, publicly-accessible full text available January 1, 2026
  2. We present a database for automatic understanding of Social Engagement in MultiParty Interaction (SEMPI). Social engagement is an important social signal characterizing an interlocutor's level of participation in a conversation; it involves maintaining attention and establishing connection and rapport. Machine understanding of social engagement can enable an autonomous agent to better gauge the state of human participation and involvement and to select optimal actions in human-machine social interaction. Recently, video-mediated interaction platforms, e.g., Zoom, have become very popular: the ease of use and increased accessibility of video calls have made them a preferred medium for multiparty conversations, including support groups and group therapy sessions. To create this dataset, we first collected a set of publicly available video calls posted on YouTube, segmented them by speech turn, and cropped them to generate single-participant videos. We developed a questionnaire for assessing listeners' level of social engagement in a conversation, probing the relevant nonverbal behaviors, including back-channeling, gaze, and expressions. Using Prolific, a crowd-sourcing platform, we had three annotators rate each of 3,505 videos of 76 listeners, reaching a moderate-to-high inter-rater agreement of 0.693 and yielding a database with aggregated engagement scores. We developed a baseline multimodal pipeline using state-of-the-art pre-trained models to track the level of engagement, achieving a CCC score of 0.454 (see the CCC sketch after this list). The results demonstrate the utility of the database for future applications in video-mediated human-machine interaction and human-human social skill assessment. Our dataset and code are available at https://github.com/ihp-lab/SEMPI.
    Free, publicly-accessible full text available November 4, 2025
  3. There are individual differences in expressive behaviors, driven by cultural norms and personality. This between-person variation can reduce emotion recognition performance, so personalization is an important step toward improving the generalization and robustness of speech emotion recognition. In this paper, to achieve unsupervised personalized emotion recognition, we first pre-train an encoder with learnable speaker embeddings in a self-supervised manner to learn robust speech representations conditioned on speakers. Second, we propose an unsupervised method that compensates for label distribution shifts by finding similar speakers in the training set and leveraging their label distributions (a sketch of this idea follows the list). Extensive experimental results on the MSP-Podcast corpus indicate that our method consistently outperforms strong personalization baselines and achieves state-of-the-art performance for valence estimation.
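
The records above describe techniques without implementation detail, so the sketches below are illustrative only. First, for the set learning idea in record 1: a minimal Deep Sets-style encoder that mean-pools a small set of utterance embeddings (e.g., eight per speaker) into a speaker embedding and conditions an emotion classifier on it. All names, dimensions, and the pooling choice are assumptions, not the published SetPeER architecture.

```python
# Minimal Deep Sets-style sketch of set learning for personalized
# emotion recognition. Module names, dimensions, and mean pooling
# are illustrative assumptions, not the published SetPeER design.
import torch
import torch.nn as nn

class SetEncoder(nn.Module):
    """Pools a variable-size set of utterance embeddings into one
    permutation-invariant speaker embedding."""
    def __init__(self, utt_dim: int = 768, spk_dim: int = 128):
        super().__init__()
        # phi: per-utterance transform; rho: post-pooling transform
        self.phi = nn.Sequential(nn.Linear(utt_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256))
        self.rho = nn.Sequential(nn.Linear(256, spk_dim), nn.ReLU())

    def forward(self, utt_embs: torch.Tensor) -> torch.Tensor:
        # utt_embs: (set_size, utt_dim), e.g. 8 utterances per speaker
        pooled = self.phi(utt_embs).mean(dim=0)  # mean-pool over the set
        return self.rho(pooled)                  # (spk_dim,)

class PersonalizedClassifier(nn.Module):
    """Classifies a target utterance conditioned on the speaker
    embedding computed from a small context set."""
    def __init__(self, utt_dim: int = 768, spk_dim: int = 128,
                 n_classes: int = 4):
        super().__init__()
        self.set_encoder = SetEncoder(utt_dim, spk_dim)
        self.head = nn.Linear(utt_dim + spk_dim, n_classes)

    def forward(self, target: torch.Tensor,
                context: torch.Tensor) -> torch.Tensor:
        spk = self.set_encoder(context)             # speaker style
        return self.head(torch.cat([target, spk]))  # emotion logits

# Usage with random features standing in for real utterance embeddings.
model = PersonalizedClassifier()
context = torch.randn(8, 768)  # 8 context utterances from one speaker
target = torch.randn(768)      # utterance to classify
logits = model(target, context)
```

Because the speaker embedding is a mean over the set, it is invariant to utterance order and accepts any set size, which is the property that lets such a model adapt from as few as eight utterances.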
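For record 2, the reported 0.454 is a Concordance Correlation Coefficient (CCC), a standard agreement metric for continuous ratings. The function below implements Lin's standard definition; the synthetic engagement scores are made up for the usage example.

```python
# Lin's Concordance Correlation Coefficient (CCC), the metric
# reported for the SEMPI engagement baseline.
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """2*cov(x, y) / (var(x) + var(y) + (mean_x - mean_y)^2)."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

# Illustrative check with synthetic engagement scores in [0, 1].
rng = np.random.default_rng(0)
truth = rng.uniform(0, 1, size=100)
noisy = truth + rng.normal(0, 0.2, size=100)
print(round(ccc(truth, noisy), 3))  # high agreement -> CCC near 1
```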
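For record 3, a hedged sketch of the label-shift compensation idea: find the training speakers whose embeddings are most similar to the test speaker, then shift predictions toward those speakers' label statistics. The exact procedure in the paper may differ; the function name, the cosine-similarity retrieval, and the mean-shift rule here are all illustrative assumptions.

```python
# Hypothetical sketch: unsupervised compensation for label
# distribution shift via similar training speakers.
import numpy as np

def compensate_valence(
    preds: np.ndarray,              # raw valence predictions, test speaker
    test_emb: np.ndarray,           # test speaker embedding, shape (d,)
    train_embs: np.ndarray,         # (n_speakers, d) training embeddings
    train_label_means: np.ndarray,  # (n_speakers,) mean valence per speaker
    k: int = 5,
) -> np.ndarray:
    # Cosine similarity between the test speaker and each training speaker.
    sims = train_embs @ test_emb / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(test_emb) + 1e-8
    )
    nearest = np.argsort(sims)[-k:]  # k most similar training speakers
    target_mean = train_label_means[nearest].mean()
    # Shift predictions so their mean matches the similar speakers' mean.
    return preds + (target_mean - preds.mean())
```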