Title: Combining Multimodal Analyses of Students' Emotional and Cognitive States to Understand Their Learning Behaviors
The incorporation of technology into primary and secondary education has facilitated the creation of curricula that utilize computational tools for problem-solving. In Open-Ended Learning Environments (OELEs), students participate in learning-by-modeling activities that enhance their understanding of Science, Technology, Engineering, and Mathematics (STEM) and computational concepts. This research presents an innovative multimodal emotion recognition approach that analyzes facial expressions and speech data to identify pertinent learning-centered emotions, such as engagement, delight, confusion, frustration, and boredom. Utilizing sophisticated machine learning models, including the High-Speed Face Emotion Recognition (HSEmotion) model for visual data and wav2vec 2.0 for auditory data, our method is refined with a modality verification step and a fusion layer for accurate emotion classification. The multimodal technique significantly increases emotion detection accuracy, achieving an overall accuracy of 87% and an F1-score of 84%. The study also correlates these emotions with model-building strategies in collaborative settings, with statistical analyses indicating distinct emotional patterns associated with effective and ineffective strategy use in model construction and debugging tasks. These findings underscore the role of adaptive learning environments in fostering students' emotional and cognitive development.
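To make the pipeline concrete, below is a minimal late-fusion sketch of the architecture the abstract describes: per-modality classifiers feed a verification gate and a fusion layer. The feature dimensions, the stand-in linear heads (in place of the actual HSEmotion and wav2vec 2.0 front ends), and the gating rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

EMOTIONS = ["engagement", "delight", "confusion", "frustration", "boredom"]

class FusionClassifier(nn.Module):
    def __init__(self, face_dim=512, audio_dim=768, n_classes=len(EMOTIONS)):
        super().__init__()
        # Stand-ins for the HSEmotion (face) and wav2vec 2.0 (audio) heads.
        self.face_head = nn.Linear(face_dim, n_classes)
        self.audio_head = nn.Linear(audio_dim, n_classes)
        # Fusion layer over the concatenated per-modality probabilities.
        self.fusion = nn.Linear(2 * n_classes, n_classes)

    def forward(self, face_feat, audio_feat, face_ok=True, audio_ok=True):
        # Modality verification (illustrative rule): a modality that fails
        # checks (no face detected, silent audio) contributes nothing.
        p_face = torch.softmax(self.face_head(face_feat), dim=-1)
        p_audio = torch.softmax(self.audio_head(audio_feat), dim=-1)
        if not face_ok:
            p_face = torch.zeros_like(p_face)
        if not audio_ok:
            p_audio = torch.zeros_like(p_audio)
        return self.fusion(torch.cat([p_face, p_audio], dim=-1))

model = FusionClassifier()
logits = model(torch.randn(4, 512), torch.randn(4, 768))  # 4 video segments
print([EMOTIONS[i] for i in logits.argmax(dim=-1).tolist()])
```

Zeroing out an unverified modality is one simple way to realize the verification step; the paper's exact mechanism is not specified in the abstract.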
Award ID(s):
2112635
PAR ID:
10637578
Author(s) / Creator(s):
Publisher / Repository:
International Conference on Computers in Education
Date Published:
Journal Name:
International Conference on Computers in Education
ISSN:
3078-4360
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Accessibility efforts for d/Deaf and hard of hearing (DHH) learners in video-based learning have mainly focused on captions and interpreters, with limited attention to learners' emotional awareness--an important yet challenging skill for effective learning. Current emotion technologies are designed to support learners' emotional awareness and social needs; however, little is known about whether and how DHH learners could benefit from these technologies. Our study explores how DHH learners perceive and use emotion data from two collection approaches, self-reported and automatic emotion recognition (AER), in video-based learning. By comparing the use of these technologies between DHH (N=20) and hearing learners (N=20), we identified key differences in their usage and perceptions: 1) DHH learners enhanced their emotional awareness by rewatching the video to self-report their emotions and called for alternative methods for self-reporting emotion, such as using sign language or expressive emoji designs; and 2) while the AER technology could be useful for detecting emotional patterns in learning experiences, DHH learners expressed more concerns about the accuracy and intrusiveness of the AER data. Our findings provide novel design implications for improving the inclusiveness of emotion technologies to support DHH learners, such as leveraging DHH peer learners' emotions to elicit reflections. 
  2. Expressive behaviors conveyed during daily interactions are difficult to determine, because they often consist of a blend of different emotions. The complexity of expressive human communication is an important challenge for building and evaluating automatic systems that can reliably predict emotions. Emotion recognition systems are often trained with limited databases, where the emotions are either elicited or recorded by actors. These approaches do not necessarily reflect real emotions, creating a mismatch when the same emotion recognition systems are applied to practical applications. Developing rich emotional databases that reflect the complexity in the externalization of emotion is an important step toward building better models to recognize emotions. This study presents the MSP-Face database, a natural audiovisual database obtained from video-sharing websites, where multiple individuals discuss various topics, expressing their opinions and experiences. The natural recordings convey a broad range of emotions that are difficult to obtain with other data collection protocols. A key feature of the corpus is its two complementary sets. The first set includes videos annotated with emotional labels using a crowd-sourcing protocol (9,370 recordings – 24 hrs, 41 min). The second set includes similar videos without emotional labels (17,955 recordings – 45 hrs, 57 min), offering an ideal infrastructure to explore semi-supervised and unsupervised machine-learning algorithms on natural emotional videos. This study describes the process of collecting and annotating the corpus. It also provides baselines over this new database using unimodal (audio, video) and multimodal emotion recognition systems.
  3. The problem of predicting emotional attributes from speech has often focused on predicting a single value from a sentence or short speaking turn. These methods often ignore that natural emotions are both dynamic and dependent on context. To model the dynamic nature of emotions, we can treat the prediction of emotion from speech as a time-series problem. We refer to the problem of predicting these emotional traces as dynamic speech emotion recognition. Previous studies in this area have used models that treat all emotional traces as coming from the same underlying distribution. Since emotions are dependent on contextual information, these methods might obscure the context of an emotional interaction. This paper uses a neural process model with a segment-level speech emotion recognition (SER) model for this problem. This type of model leverages information from the time-series and predictions from the SER model to learn a prior that defines a distribution over emotional traces. Our proposed model performs 21% better than a bidirectional long short-term memory (BiLSTM) baseline when predicting emotional traces for valence.
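    For context, here is a minimal sketch of the kind of BiLSTM baseline the abstract compares against, regressing a per-frame valence trace from segment-level acoustic features. The feature dimension, layer sizes, and loss are illustrative assumptions, and the neural process model itself is not reproduced here.

```python
import torch
import torch.nn as nn

class ValenceTracer(nn.Module):
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)    # one valence value per step

    def forward(self, x):                       # x: (batch, time, feat_dim)
        out, _ = self.lstm(x)                   # (batch, time, 2 * hidden)
        return self.head(out).squeeze(-1)       # (batch, time) emotional trace

model = ValenceTracer()
feats = torch.randn(2, 100, 40)                 # e.g., 100 frames of 40-d features
trace = model(feats)                            # predicted valence over time
loss = nn.functional.mse_loss(trace, torch.zeros_like(trace))  # dummy target
```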
  4. Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With exponentially growing technology, there is a wide range of emerging applications that require recognition of the user's emotional state. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models for the audio, video, and text modalities are structured and fine-tuned on the MELD dataset. A transformer-based cross-modality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network architecture achieves up to 65% accuracy, which significantly surpasses any of the unimodal models. We provide multiple evaluation techniques applied to our work to show that our model is robust and can even outperform state-of-the-art models on MELD.
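    As a rough illustration of the EmbraceNet-style combination step the abstract builds on, the sketch below docks each modality to a common size and then draws every fused dimension from exactly one modality. The encoder dimensions, stand-in modality inputs, and uniform modality choice are assumptions; only the output size (MELD's seven emotion classes) comes from the dataset.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbraceFusion(nn.Module):
    def __init__(self, dims=(256, 512, 768), embrace_dim=128, n_classes=7):
        super().__init__()
        # One "docking" projection per modality (audio, video, text stand-ins).
        self.dock = nn.ModuleList(nn.Linear(d, embrace_dim) for d in dims)
        self.classify = nn.Linear(embrace_dim, n_classes)  # 7 MELD emotions

    def forward(self, feats):  # feats: list of (batch, dim_m) embeddings
        docked = torch.stack([F.relu(proj(x))
                              for proj, x in zip(self.dock, feats)])
        m, b, c = docked.shape
        # EmbraceNet-style selection: each fused dimension is taken from
        # exactly one modality (a uniform choice is shown here).
        idx = torch.randint(0, m, (b, c))
        mask = F.one_hot(idx, m).permute(2, 0, 1).float()  # (m, b, c)
        return self.classify((docked * mask).sum(dim=0))

model = EmbraceFusion()
logits = model([torch.randn(8, 256), torch.randn(8, 512), torch.randn(8, 768)])
```

    The per-dimension sampling is what makes this combination robust to a missing modality: its probability can simply be set to zero so the remaining modalities fill every dimension.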
  5. The growth of technologies promising to infer emotions raises political and ethical concerns, including concerns regarding their accuracy and transparency. A marginalized perspective in these conversations is that of data subjects potentially affected by emotion recognition. Taking social media as one emotion recognition deployment context, we conducted interviews with data subjects (i.e., social media users) to investigate their notions about accuracy and transparency in emotion recognition and interrogate stated attitudes towards these notions and related folk theories. We find that data subjects see accurate inferences as uncomfortable and as threatening their agency, pointing to privacy and ambiguity as desired design principles for social media platforms. While some participants argued that contemporary emotion recognition must be accurate, others raised concerns about possibilities for contesting the technology and called for better transparency. Furthermore, some challenged the technology altogether, highlighting that emotions are complex, relational, performative, and situated. In interpreting our findings, we identify new folk theories about accuracy and meaningful transparency in emotion recognition. Overall, our analysis shows an unsatisfactory status quo for data subjects that is shaped by power imbalances and a lack of reflexivity and democratic deliberation within platform governance. 