Speech emotion recognition (SER) is a challenging task due to the limited availability of real-world labeled datasets. Since it is easier to find unlabeled data, the use of self-supervised learning (SSL) has become an attractive alternative. This study proposes new pre-text tasks for SSL to improve SER. While our target application is SER, the proposed pre-text tasks include audio-visual formulations, leveraging the relationship between acoustic and facial features. Our proposed approach introduces three new unimodal and multimodal pre-text tasks that are carefully designed to learn better representations for predicting emotional cues from speech. Task 1 predicts energy variations (high or low) from a speech sequence. Task 2 uses speech features to predict facial activation (high or low) based on facial landmark movements. Task 3 performs a multi-class emotion recognition task on emotional labels obtained from combinations of action units (AUs) detected across a video sequence. We pre-train a network with 60.92 hours of unlabeled data, fine-tuning the model for the downstream SER task. The results on the CREMA-D dataset show that the model pre-trained on the proposed domain-specific pre-text tasks significantly improves the precision (up to 5.1%), recall (up to 4.5%), and F1-scores (up to 4.9%) of our SER system.
more »
« less
This content will become publicly available on November 24, 2025
On Learning Discriminative Features from Synthesized Data for Self-supervised Fine-Grained Visual Recognition
Self-Supervised Learning (SSL) has become a prominent approach for acquiring visual representations across various tasks, yet its application in fine-grained visual recognition (FGVR) is challenged by the intricate task of distinguishing subtle differences between categories. To overcome this, we introduce an novel strategy that boosts SSL's ability to extract critical discriminative features vital for FGVR. This approach creates synthesized data pairs to guide the model to focus on discriminative features critical for FGVR during SSL. We start by identifying non-discriminative features using two main criteria: features with low variance that fail to effectively separate data and those deemed less important by Grad-CAM induced from the SSL loss. We then introduce perturbations to these non-discriminative features while preserving discriminative ones. A decoder is employed to reconstruct images from both perturbed and original feature vectors to create data pairs. An encoder is trained on such generated data pairs to become invariant to variations in non-discriminative dimensions while focusing on discriminative features, thereby improving the model's performance in FGVR tasks. We demonstrate the promising FGVR performance of the proposed approach through extensive evaluation on a wide variety of datasets.
more »
« less
- Award ID(s):
- 1956313
- PAR ID:
- 10593477
- Publisher / Repository:
- 2024 European Conference on Computer Vision (ECCV)
- Date Published:
- Page Range / eLocation ID:
- 101 to 117
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)While contrastive approaches of self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing views from different data points (negative pairs), recent \emph{non-contrastive} SSL (e.g., BYOL and SimSiam) show remarkable performance {\it without} negative pairs, with an extra learnable predictor and a stop-gradient operation. A fundamental question rises: why they do not collapse into trivial representation? In this paper, we answer this question via a simple theoretical study and propose a novel approach, \ourmethod{}, that \emph{directly} sets the linear predictor based on the statistics of its inputs, rather than trained with gradient update. On ImageNet, it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm and outperforms linear predictor by 2.5 in 300-epoch training (and 5 in 60-epoch). \ourmethod{} is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay all come into play. Our simple theory recapitulates the results of real-world ablation studies in both STL-10 and ImageNet. Code is released.more » « less
-
Survival analysis is a crucial statistical technique used to estimate the anticipated duration until a specific event occurs. However, current methods often involve discretizing the time scale and struggle with managing absent features within the data. This becomes especially pertinent since events can transpire at any given point, rendering event analysis a continuous concern. Additionally, the presence of missing attributes within tabular data is widespread. By leveraging recent developments of Transformer and Self-Supervised Learning (SSL), we introduce SSL-SurvFormer. This entails a continuously monotonic Transformer network, empowered by SSL pre-training, that is designed to address the challenges presented by continuous events and absent features in survival prediction. Our proposed continuously monotonic Transformer model facilitates accurate estimation of survival probabilities, thereby bypassing the need for temporal discretization. Additionally, our SSL pre-training strategy incorporates data transformation to adeptly manage missing information. The SSL pre-training encompasses two tasks: mask prediction, which identifies positions of absent features, and reconstruction, which endeavors to recover absent elements based on observed ones. Our empirical evaluations conducted across a variety of datasets, including FLCHAIN, METABRIC, and SUPPORT, consistently highlight the superior performance of SSL-SurvFormer in comparison to existing methods. Additionally, SSL-SurvFormer demonstrates effectiveness in handling missing values, a critical aspect often encountered in real-world datasets.more » « less
-
Self-supervised learning (SSL) aims to learn meaningful representations from unlabeled data. Orthogonal Low-rank Embedding (OLE) shows promise for SSL by enhancing intra-class similarity in a low-rank subspace and promoting inter-class dissimilarity in a high-rank subspace, making it particularly suitable for multi-view learning tasks. However, directly applying OLE to SSL poses significant challenges: (1) the virtually infinite number of "classes" in SSL makes achieving the OLE objective impractical, leading to representational collapse; and (2) low-rank constraints may fail to distinguish between positively and negatively correlated features, further undermining learning. To address these issues, we propose SSOLE (Self-Supervised Orthogonal Low-rank Embedding), a novel framework that integrates OLE principles into SSL by (1) decoupling the low-rank and high-rank enforcement to align with SSL objectives; and (2) applying low-rank constraints to feature deviations from their mean, ensuring better alignment of positive pairs by accounting for the signs of cosine similarities. Our theoretical analysis and empirical results demonstrate that these adaptations are crucial to SSOLE’s effectiveness. Moreover, SSOLE achieves competitive performance across SSL benchmarks without relying on large batch sizes, memory banks, or dual-encoder architectures, making it an efficient and scalable solution for self-supervised tasks. Code is available at https://github.com/husthuaan/ssole.more » « less
-
Deep neural networks are efficient learning machines which leverage upon a large amount of manually labeled data for learning discriminative features. However, acquiring substantial amount of supervised data, especially for videos can be a tedious job across various computer vision tasks. This necessitates learning of visual features from videos in an unsupervised setting. In this paper, we propose a computationally simple, yet effective, framework to learn spatio-temporal feature embedding from unlabeled videos. We train a Convolutional 3D Siamese network using positive and negative pairs mined from videos under certain probabilistic assumptions. Experimental results on three datasets demonstrate that our proposed framework is able to learn weights which can be used for same as well as cross dataset and tasks.more » « less
An official website of the United States government
