Exploration of Acoustic and Lexical Cues for the INTERSPEECH 2020 Computational Paralinguistic Challenge
In this paper, we investigate various acoustic and lexical features for the INTERSPEECH 2020 Computational Paralinguistic Challenge. For the acoustic analysis, we show that the proposed FV-MFCC feature is very promising: it has strong predictive power on its own and provides complementary information when fused with other acoustic features. For the lexical representation, we find that the corpus-dependent TF.IDF feature is by far the best representation. We also explore several model fusion techniques to combine the different modalities, and propose novel SVM models that aggregate chunk-level predictions into narrative-level predictions based on chunk-level decision functionals. Finally, we discuss the potential for improving prediction by combining the lexical and acoustic modalities, and find that fusing them does not lead to consistent improvements for elderly Arousal but substantially improves Valence. Our methods significantly outperform the official baselines on the test set in the Mask and Elderly Sub-challenges in which we participated. We obtain UARs of 75.1%, 54.3%, and 59.0% on the Mask, Elderly Arousal, and Valence prediction tasks, respectively.
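The chunk-to-narrative aggregation described above can be pictured as a two-stage SVM pipeline. Below is a minimal sketch (illustrative, not the authors' released code) of that idea using scikit-learn: a chunk-level SVM is trained on chunk features (random stand-ins here for FV-MFCC or TF.IDF vectors), its decision values are pooled into per-narrative functionals (mean, max, min, std), and a second SVM predicts the narrative-level label from those functionals. All data, variable names, and the specific choice of functionals are assumptions made for illustration.

```python
# Minimal sketch (illustrative, not the authors' implementation): two-stage SVMs that
# pool chunk-level decision values into narrative-level predictions via functionals.
import numpy as np
from sklearn.svm import SVC

def narrative_functionals(decision_values, narrative_ids):
    """Pool chunk-level decision values into per-narrative functionals
    (mean, max, min, std over each narrative's chunks)."""
    decision_values = np.asarray(decision_values)
    narrative_ids = np.asarray(narrative_ids)
    ids = sorted(set(narrative_ids.tolist()))
    feats = []
    for nid in ids:
        d = decision_values[narrative_ids == nid]
        d = d.reshape(d.shape[0], -1)  # handles binary (1 column) or multi-class scores
        feats.append(np.concatenate([d.mean(0), d.max(0), d.min(0), d.std(0)]))
    return np.vstack(feats), np.asarray(ids)

# Synthetic stand-in data (real inputs would be FV-MFCC or TF.IDF chunk features).
rng = np.random.default_rng(0)
n_chunks, n_narratives, dim = 200, 25, 32
narr_id = rng.integers(0, n_narratives, size=n_chunks)   # chunk -> narrative mapping
y_narr = rng.integers(0, 2, size=n_narratives)           # narrative-level labels
X_chunk = rng.normal(size=(n_chunks, dim))               # chunk-level feature vectors
y_chunk = y_narr[narr_id]                                 # chunks inherit narrative label

# Stage 1: chunk-level SVM on the chunk features.
chunk_svm = SVC(kernel="linear", C=1.0)
chunk_svm.fit(X_chunk, y_chunk)

# Stage 2: narrative-level SVM trained on pooled decision functionals.
F, ids = narrative_functionals(chunk_svm.decision_function(X_chunk), narr_id)
narr_svm = SVC(kernel="linear", C=1.0)
narr_svm.fit(F, y_narr[ids])
print(narr_svm.predict(F[:5]))
```

In practice the stage-1 decision values used to train stage 2 would come from cross-validation rather than from re-scoring the training chunks, but the pooling step is the same.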
- Award ID(s): 2034791
- Publication Date:
- NSF-PAR ID: 10282648
- Journal Name: INTERSPEECH 2020
- Page Range or eLocation-ID: 2092 to 2096
- Sponsoring Org: National Science Foundation