
Title: Human-Guided Modality Informativeness for Affective States
This paper studies the hypothesis that not all modalities are always needed to predict affective states. We explore this hypothesis in the context of recognizing three affective states that have shown a relation to a future onset of depression: positive, aggressive, and dysphoric. In particular, we investigate three modalities that are important for face-to-face conversations: the visual, language, and acoustic modalities. We first perform a human study to better understand which subset of modalities people find informative when recognizing the three affective states. As a second contribution, we explore how these human annotations can guide automatic affect recognition systems to be more interpretable without degrading their predictive performance. Our studies show that humans can reliably annotate modality informativeness. Further, we observe that guided models significantly improve interpretability, i.e., they attend to modalities similarly to how humans rate the modality informativeness, while at the same time showing a slight increase in predictive performance.
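One way such human guidance could enter a model (a minimal sketch, not the authors' implementation): learn a per-modality attention over vision, language, and acoustic features, and add an auxiliary loss that pulls the attention distribution toward the normalized human informativeness ratings. All module names, feature dimensions, and the loss weight below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedModalityFusion(nn.Module):
    """Late fusion over vision/language/acoustic features with a modality-attention
    vector that can be supervised by human informativeness ratings."""

    def __init__(self, dims=(512, 768, 128), hidden=256, num_classes=3):
        super().__init__()
        # Project each modality (vision, language, acoustic) into a shared space.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.score = nn.Linear(hidden, 1)           # per-modality attention score
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats):
        # feats: list of three tensors, one per modality, each (batch, dim_m)
        h = torch.stack([torch.tanh(p(f)) for p, f in zip(self.proj, feats)], dim=1)
        attn = F.softmax(self.score(h).squeeze(-1), dim=-1)   # (batch, 3)
        fused = (attn.unsqueeze(-1) * h).sum(dim=1)           # (batch, hidden)
        return self.classifier(fused), attn

def guided_loss(logits, labels, attn, human_ratings, lam=0.5):
    """Cross-entropy plus a KL term pulling modality attention toward the
    (normalized) human modality-informativeness ratings."""
    ce = F.cross_entropy(logits, labels)
    target = human_ratings / human_ratings.sum(dim=-1, keepdim=True)
    guide = F.kl_div(attn.clamp_min(1e-8).log(), target, reduction="batchmean")
    return ce + lam * guide
```

Setting the guidance weight to zero recovers an unguided late-fusion baseline, which makes it straightforward to compare attention patterns and predictive performance with and without human guidance.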
Authors:
Award ID(s):
1721667
Publication Date:
NSF-PAR ID:
10384092
Journal Name:
ACM International Conference on Multimodal Interaction
Page Range or eLocation-ID:
728 to 734
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Image texture, the relative spatial arrangement of intensity values in an image, encodes valuable information about the scene. As it stands, much of this potential information remains untapped. Understanding how to decipher textural details would afford another method of extracting knowledge of the physical world from images. In this work, we attempt to bridge the gap in research between quantitative texture analysis and the visual perception of textures. The impact of changes in image texture on human observers' ability to perform signal detection and localization tasks in complex digital images is not understood. We examine this critical question by studying task-based human observer performance in detecting and localizing signals in tomographic breast images. We have also investigated how these changes impact the formation of second-order image texture. We used digital breast tomosynthesis (DBT), an FDA-approved tomographic X-ray breast imaging method, as the modality of choice to show our preliminary results. Our human observer studies involve localization ROC (LROC) studies for low-contrast mass detection in DBT. Simulated images are used as they offer the benefit of known ground truth. Our results show that changes in system geometry or processing lead to changes in image texture magnitudes. We show that the variations in several well-known texture features estimated in digital images correlate with human observer detection–localization performance for signals embedded in them. This insight can enable efficient and practical techniques to identify the best imaging system designs, algorithms, or filtering tools by examining the changes in these texture features. This concept linking texture feature estimates and task-based image quality assessment can be extended to several other imaging modalities and applications as well. It can also offer feedback in system and algorithm design with a goal to improve perceptual benefits. Broader impact can be in a wide array of areas including imaging system design, image processing, data science, machine learning, computer vision, and perceptual and vision science. Our results also point to the caution that must be exercised in using these texture features as image-based radiomic features or as predictive markers for risk assessment, as they are sensitive to system or image processing changes.

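    The second-order texture features mentioned above (e.g., contrast and homogeneity) are commonly estimated from gray-level co-occurrence matrices; the snippet below is an illustrative sketch using scikit-image on a synthetic slice, not the study's pipeline, and the feature choices and parameters are assumptions.

    ```python
    import numpy as np
    from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19

    def glcm_features(image_8bit, distances=(1,), angles=(0, np.pi / 2)):
        """Second-order (GLCM) texture features such as contrast and homogeneity
        for a single 2-D 8-bit image slice."""
        glcm = graycomatrix(image_8bit, distances=distances, angles=angles,
                            levels=256, symmetric=True, normed=True)
        return {prop: graycoprops(glcm, prop).mean()
                for prop in ("contrast", "homogeneity", "energy", "correlation")}

    # Hypothetical usage on a synthetic stand-in for a simulated DBT slice:
    slice_ = (np.random.rand(128, 128) * 255).astype(np.uint8)
    print(glcm_features(slice_))
    ```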
  2. Humans routinely extract important information from images and videos, relying on their gaze. In contrast, computational systems still have difficulty annotating important visual information in a human-like manner, in part because human gaze is often not included in the modeling process. Human input is also particularly relevant for processing and interpreting affective visual information. To address this challenge, we captured human gaze, spoken language, and facial expressions simultaneously in an experiment with visual stimuli characterized by subjective and affective content. Observers described the content of complex emotional images and videos depicting positive and negative scenarios, as well as their feelings about the imagery being viewed. We explore patterns across these modalities, for example by comparing the affective nature of participant-elicited linguistic tokens with image valence. Additionally, we expand a framework for generating automatic alignments between the gaze and spoken language modalities for visual annotation of images. Multimodal alignment is challenging due to the varying temporal offsets between the modalities. We explore alignment robustness when images have affective content and whether image valence influences alignment results. We also study whether word-frequency-based filtering impacts results, with both the unfiltered and filtered scenarios performing better than baseline comparisons, and with filtering resulting in a substantial decrease in alignment error rate. We provide visualizations of the resulting annotations from multimodal alignment. This work has implications for areas such as image understanding, media accessibility, and multimodal data fusion.
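    A toy sketch of the basic idea behind gaze-speech alignment (not the paper's framework): pair each gaze fixation with the spoken word whose onset is nearest in time, discarding pairs whose temporal offset is too large. The data structures, field names, and threshold are assumptions.

    ```python
    from bisect import bisect_left

    def align_fixations_to_words(fixations, words, max_offset=1.0):
        """Pair each gaze fixation (time, region label) with the spoken word whose
        onset is nearest in time, skipping pairs farther apart than max_offset seconds."""
        onsets = [w["onset"] for w in words]   # assumed sorted in ascending order
        pairs = []
        for t, region in fixations:
            i = bisect_left(onsets, t)
            candidates = [j for j in (i - 1, i) if 0 <= j < len(words)]
            j = min(candidates, key=lambda k: abs(onsets[k] - t))
            if abs(onsets[j] - t) <= max_offset:
                pairs.append((region, words[j]["token"]))
        return pairs

    # Hypothetical usage:
    fixations = [(0.4, "dog"), (1.9, "child")]
    words = [{"onset": 0.5, "token": "puppy"}, {"onset": 2.0, "token": "kid"}]
    print(align_fixations_to_words(fixations, words))  # [('dog', 'puppy'), ('child', 'kid')]
    ```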
  3. Abstract

    Most of the research in the field of affective computing has focused on detecting and classifying human emotions through electroencephalogram (EEG) or facial expressions. Designing multimedia content to evoke certain emotions has been largely motivated by manual ratings provided by users. Here we present insights from the correlation of affective features between three modalities, namely affective multimedia content, EEG, and facial expressions. Interestingly, low-level audio-visual features, such as the contrast and homogeneity of the video and the tone of the audio in the movie clips, are most correlated with changes in facial expressions and EEG. We also detect the regions of the human face and the brain (in addition to the EEG frequency bands) that are most representative of affective responses. The computational modeling between the three modalities showed a high correlation between features from these regions and user-reported affective labels. Finally, the correlation between different layers of convolutional neural networks with EEG and face images as input provides insights into human affect. Together, these findings will assist in (1) designing more effective multimedia content to engage or influence viewers, (2) understanding the brain/body biomarkers of affect, and (3) developing new brain-computer interfaces as well as facial-expression-based algorithms to read viewers' emotional responses.

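    The kind of correlation analysis reported above can be illustrated with a minimal sketch (synthetic data, not the study's code): Pearson correlation between a per-clip low-level feature, such as video contrast, and a per-clip affective response measure, such as an EEG band-power proxy.

    ```python
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)

    # Synthetic per-clip features: video contrast and an EEG alpha-band power proxy.
    video_contrast = rng.normal(size=200)
    eeg_alpha_power = 0.6 * video_contrast + rng.normal(scale=0.8, size=200)

    r, p = pearsonr(video_contrast, eeg_alpha_power)
    print(f"Pearson r = {r:.2f}, p = {p:.3g}")
    ```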
  4. Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior downstream performance. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication), because the pre-trained models do not have the components needed to accept the two extra modalities of vision and acoustics. In this paper, we propose an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet, a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
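    The gating mechanism summarized above can be sketched roughly as follows (a simplified PyTorch reimplementation, not the authors' released code): per-token visual and acoustic features produce a gated displacement that is scaled and added to each word's BERT/XLNet hidden state. The feature dimensions and the exact scaling rule are assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultimodalAdaptationGate(nn.Module):
        """Simplified MAG-style layer: shifts each token's language hidden state by a
        gated combination of aligned visual and acoustic features."""

        def __init__(self, d_text=768, d_visual=47, d_acoustic=74, beta=1.0, eps=1e-6):
            super().__init__()
            self.gate_v = nn.Linear(d_text + d_visual, d_text)
            self.gate_a = nn.Linear(d_text + d_acoustic, d_text)
            self.proj_v = nn.Linear(d_visual, d_text)
            self.proj_a = nn.Linear(d_acoustic, d_text)
            self.beta, self.eps = beta, eps

        def forward(self, h_text, visual, acoustic):
            # h_text: (batch, seq, d_text); visual/acoustic are aligned per token.
            g_v = torch.relu(self.gate_v(torch.cat([h_text, visual], dim=-1)))
            g_a = torch.relu(self.gate_a(torch.cat([h_text, acoustic], dim=-1)))
            shift = g_v * self.proj_v(visual) + g_a * self.proj_a(acoustic)
            # Scale the shift so it cannot overwhelm the language representation.
            alpha = torch.clamp(
                self.beta * h_text.norm(dim=-1, keepdim=True)
                / (shift.norm(dim=-1, keepdim=True) + self.eps),
                max=1.0,
            )
            return F.layer_norm(h_text + alpha * shift, h_text.shape[-1:])
    ```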
  5. Mitrovic, A.; Bosch, N. (Eds.)
    Emoji are commonly used in social media to convey attitudes and emotions. While popular, their use in educational contexts has been sparsely studied. This paper reports on students' use of emoji in an online course forum in which students annotate and discuss course material in the margins of the online textbook. For this study, instructors created 11 custom emoji-hashtag pairs that enabled students to quickly communicate affects and reactions they experienced while interacting with the course material, for example inviting discussion about a topic, declaring a topic interesting, or requesting assistance with a topic. We analyze emoji usage by over 1,800 students enrolled in multiple offerings of the same course across multiple academic terms. The data show that some emoji frequently appear together in posts associated with the same paragraphs, suggesting that students use emoji in this way to communicate complex affective states. We explore the use of computational models for predicting emoji at the post level, even when posts lack emoji. This capability can allow instructors to infer information about students' affective states during their "at home" interactions with course readings. Finally, we show that partitioning the emoji into distinct groups, rather than trying to predict individual emoji, can both be of pedagogical value to instructors and improve the predictive performance of our approach using the BERT language model. Our procedure can be generalized to other courses and for the benefit of other instructors.
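    Post-level emoji-group prediction of the kind described above can be framed as multi-label text classification; the sketch below shows the interface with a BERT model from the transformers library. The group labels are invented for illustration, and the classification head is not fine-tuned, so the outputs only demonstrate the plumbing.

    ```python
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Hypothetical emoji groups (the paper's actual partition is not reproduced here).
    GROUPS = ["confused/help", "interesting", "important", "agree", "surprised"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased",
        num_labels=len(GROUPS),
        problem_type="multi_label_classification",  # sigmoid + BCE loss during fine-tuning
    )

    post = "I don't understand why the proof in this paragraph needs induction."
    inputs = tokenizer(post, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)[0]

    # Classifier weights are randomly initialized here, so these scores are meaningless
    # until the model is fine-tuned on labeled forum posts.
    for group, p in zip(GROUPS, probs):
        print(f"{group}: {p.item():.2f}")
    ```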