<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>DeepEmoCluster: A Semi-Supervised Framework for Latent Cluster Representation of Speech Emotions</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/06/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10287545</idno>
					<idno type="doi">10.1109/ICASSP39728.2021.9414035</idno>
					<title level='j'>IEEE international conference on acoustics, speech and signal processing (ICASSP 2021)</title>
					<author>Wei-Cheng Lin</author><author>Kusha Sridhar</author><author>Carlos Busso</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Semi-supervised learning (SSL) is an appealing approach to resolve the generalization problem of speech emotion recognition (SER) systems. By utilizing large amounts of unlabeled data, SSL is able to gain extra information about the prior distribution of the data, which typically leads to better and more robust recognition performance. Existing SSL approaches for SER include variations of encoder-decoder model structures such as autoencoders (AEs) and variational autoencoders (VAEs), where it is difficult to interpret the learning mechanism behind the latent space. In this study, we introduce a new SSL framework, which we refer to as the DeepEmoCluster framework, for attribute-based SER tasks. The DeepEmoCluster framework is an end-to-end model with mel-spectrogram inputs, which combines a self-supervised pseudo labeling classification network with a supervised emotional attribute regressor. The approach encourages the model to learn latent representations by maximizing the emotional separation of K-means clusters. Our experimental results based on the MSP-Podcast corpus indicate that the DeepEmoCluster framework achieves competitive prediction performance in a fully supervised scheme, outperforming baseline methods in most conditions. The approach can be further improved by incorporating an extra unlabeled set. Moreover, our experimental results explicitly show that the latent clusters have emotional dependencies, enriching the geometric interpretation of the clusters.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Recognizing human emotional states is crucial to modern human-computer interaction (HCI) systems <ref type="bibr">[1]</ref>. Speech emotion recognition (SER) also plays an important role in various fields such as education <ref type="bibr">[2]</ref> and healthcare <ref type="bibr">[3]</ref>. Although recent advances in deep learning have led to high performance in areas such as image classification <ref type="bibr">[4]</ref> and automatic speech recognition (ASR) <ref type="bibr">[5]</ref>, SER remains a challenging problem with often poor generalization across conditions (e.g., environment, recording settings, or speakers) <ref type="bibr">[6,</ref><ref type="bibr">7]</ref>. One of the major factors leading to low performance is the limited amount of available training data. Compared to image or speech recognition tasks, speech emotion corpora are relatively small (e.g., the IEMOCAP corpus for SER <ref type="bibr">[8]</ref> &#8776; 12 hrs versus the LibriSpeech corpus for ASR <ref type="bibr">[9]</ref> &#8776; 1,000 hrs). To alleviate this limitation, studies have investigated domain adversarial techniques that utilize information from multiple corpora, adopting either an additional domain classifier with reverse gradient layers <ref type="bibr">[10]</ref> or a critic network <ref type="bibr">[11]</ref> to obtain a more generalized intermediate representation, reducing mismatches between domains. Other studies have explored data augmentation methods to improve the robustness of SER models. More specifically, generative adversarial networks (GANs) have been applied to artificially synthesize emotional samples, reducing the sparsity in the distribution of the train set <ref type="bibr">[12,</ref><ref type="bibr">13]</ref>. 
However, these adversarial training approaches may suffer from unstable convergence issues or be limited by the size of the datasets. Another appealing approach is semi-supervised learning (SSL). The key concept of SSL is to leverage large amounts of unlabeled data to improve the robustness and performance of the supervised task <ref type="bibr">[14]</ref>. Unlike labeled data, unlabeled resources are often easy and inexpensive to collect. Therefore, it is straightforward to obtain a large unlabeled set that augments the often small labeled set. SSL methods learn additional structure of the input distribution from this unlabeled data <ref type="bibr">[15]</ref>. One approach for utilizing unlabeled data in SSL is to rely on the cluster assumption, which states that if two samples in the input space belong to the same cluster, they are likely to belong to the same class or region in the target space. DeepCluster <ref type="bibr">[16]</ref> follows this assumption, achieving great success in extracting discriminative features with completely unsupervised representation learning. It utilizes a self-supervised training technique to learn cluster-based pseudo class labels, achieving competitive performance compared to supervised models.</p><p>Inspired by the DeepCluster framework, this study proposes a novel formulation for SER, where we modify the original unsupervised structure into a semi-supervised framework. The approach derives emotional speech clusters, hence its name: DeepEmoCluster. Specifically, we implement an additional supervised emotional attribute-based regressor (i.e., arousal, dominance and valence) that is jointly trained with the unsupervised cluster classifier, encouraging the model to learn emotionally discriminative content under a maximum latent cluster separation constraint. 
The proposed model is an end-to-end (E2E) convolutional neural network (CNN) architecture (i.e., VGG-16 <ref type="bibr">[17]</ref>), where the input feature is a 128D mel-spectrogram. We evaluate our proposed framework on the MSP-Podcast corpus <ref type="bibr">[18]</ref>, using the concordance correlation coefficient (CCC) as the metric to evaluate model performance. The results show that the DeepEmoCluster framework achieves competitive prediction performance under a fully supervised scheme for arousal, valence and dominance, outperforming the baseline models in most conditions. We find that the DeepEmoCluster framework can further improve the prediction performance under the semi-supervised setting, where the information gained from the unlabeled data is helpful for the supervised task. As part of the evaluation, we explore the optimal number of clusters needed in the DeepEmoCluster framework. The results indicate that the number of clusters should be fine-tuned depending on the size of the unlabeled set and the emotional attribute. Finally, our latent cluster analysis demonstrates that the DeepEmoCluster approach creates latent clusters that depend on the emotional content, enriching the geometric interpretation of the clusters.</p><p>The key contribution of this paper is the new SSL DeepEmoCluster framework for the SER task, which is a semi-supervised variant of the DeepCluster framework <ref type="bibr">[16]</ref>. By leveraging unlabeled data and jointly training with supervised networks, the DeepEmoCluster approach improves attribute-based SER performance with explicit geometric interpretations in the latent representation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">RELATED WORK</head><p>One recent popular trend in SER is to build E2E learning systems. E2E models do not require predefined handcrafted features and can directly extract emotionally relevant features from either time-domain waveforms <ref type="bibr">[19]</ref> or frequency-domain raw spectrograms <ref type="bibr">[20]</ref>. These approaches often rely on deep neural networks (DNNs). Satt et al. <ref type="bibr">[20]</ref> demonstrated that an E2E SER model with raw spectrogram inputs could be easily combined with a noise reduction solution (e.g., harmonic filtering), enabling SER systems to withstand low-SNR environments. Li et al. <ref type="bibr">[21]</ref> proposed a multitask model with a self-attention mechanism based on spectrogram features. Their method achieved state-of-the-art performance on the IEMOCAP dataset <ref type="bibr">[8]</ref>. Trigeorgis et al. <ref type="bibr">[19]</ref> built their E2E model from raw waveforms using a CNN-BLSTM structure. The key component was the 1D convolutions operating on the discrete-time waveforms, which can be considered a feature refinement stage that removes background noise. Their results showed that features derived from the E2E model outperformed handcrafted acoustic features (e.g., eGeMAPS <ref type="bibr">[22]</ref>). In this study, we also employ an E2E model with mel-spectrogram inputs to better represent emotional cues.</p><p>Various studies have explored SSL and latent representation learning approaches to utilize unlabeled data for SER tasks. Deng et al. <ref type="bibr">[23]</ref> presented the semi-supervised autoencoder (SSAE), which combined an unsupervised deep denoising autoencoder (DDAE) with a supervised learning objective. They introduced an extra class label for the unlabeled data, forcing models to learn a bottleneck latent representation by incorporating prior information from unlabeled data. 
Following a similar concept, Latif et al. <ref type="bibr">[24]</ref> proposed a variational autoencoder (VAE) framework for deriving a latent representation from speech signals. In contrast to the DDAE, VAEs learn a probability distribution representing the inputs in a latent space instead of compressing them into a bottleneck layer. Parthasarathy and Busso <ref type="bibr">[25,</ref><ref type="bibr">26]</ref> adopted the ladder network framework (&#915;-model in Rasmus et al. <ref type="bibr">[27]</ref>) to improve attribute-based SER performance. The &#915;-model imposes consistency regularization on skip connections between a noisy encoder and a decoder, aiming to obtain intermediate representations that are invariant to noise perturbations. These approaches rely on an encoder-decoder structure, where the bottleneck latent representation is not explicitly trained to have strong geometric properties by pulling apart data with different emotional content. Therefore, it is difficult to understand the mechanism behind the latent space or the bottleneck layer. This study aims to create a meaningful latent representation using an SSL approach based on the DeepCluster framework <ref type="bibr">[16]</ref>, which imposes strong geometric constraints (i.e., maximum emotional separation between clusters) without requiring decoder networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">RESOURCES</head><p>This study uses the MSP-Podcast corpus <ref type="bibr">[18]</ref>, which consists of emotionally rich spontaneous speech recordings collected from publicly available podcasts released under a Creative Commons license. The content of the recordings covers various topics such as interviews, sports, academic talks, entertainment and politics. We process these recordings with a speaker diarization tool to segment the podcasts into smaller speaking turns between 2.75 and 11 seconds in duration. We employ a number of pre-processing steps to select speaking turns with a high signal-to-noise ratio (SNR) and single-speaker content without overlapped speech. We remove turns with background noise, music and telephone-quality speech. To balance the emotional content of the corpus and obtain emotionally rich sentences, we run the speech segments through emotion retrieval algorithms based on the strategy suggested in Mariooryad et al. <ref type="bibr">[28]</ref>.</p><p>This study uses version 1.6 of the corpus, which consists of 50,362 sentences (83h 29m). We use Amazon Mechanical Turk (AMT) to annotate the sentences in the corpus using a variation of the crowdsourcing protocol discussed in Burmania et al. <ref type="bibr">[29]</ref>. The sentences are annotated for their primary and secondary emotional content (categorical classes), as well as the emotional attributes arousal (calm versus active), valence (negative versus positive) and dominance (weak versus strong). This study uses the emotional attributes, formulating the SER task as a regression problem. Each annotator assessed the emotional content on a seven-point Likert-type scale using self-assessment manikins (SAMs). Each sentence in the corpus has five or more annotations, and the ground-truth labels are obtained by averaging the scores across subjects. 
The test set has 10,124 samples from 50 speakers, the development set has 5,958 samples from 40 speakers, and the train set has 34,280 samples from the remaining speakers. This partition aims to keep the sets speaker independent. There are around 500,000 additional speech segments that have not been retrieved or annotated. These speaking turns form the unlabeled data in the corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">PROPOSED DEEPEMOCLUSTER APPROACH</head><p>Figure <ref type="figure">1</ref> shows the proposed DeepEmoCluster approach, an SSL framework that creates meaningful emotion-dependent clusters as a latent representation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Acoustic Features and Pre-processing</head><p>We use the librosa toolkit <ref type="bibr">[30]</ref> to extract the 128D mel-spectrogram as the input acoustic representation of our end-to-end SER model. First, the magnitude spectrogram of the waveform is calculated with a window size of 32 ms (512 sample points) and a 16 ms overlap between windows. Then, we map the magnitude spectrogram into 128 mel-scale filters. We perform z-normalization on these features, where the mean and standard deviation used in the normalization are estimated over the train set. After obtaining the normalized 128D mel-spectrogram for each sentence, we split it into smaller data chunks (i.e., sub-images of the original spectrogram). This study follows the chunk segmentation approach proposed by Lin and Busso <ref type="bibr">[31]</ref>, which splits sentences of different durations into a fixed number of chunks of fixed duration. We achieve this segmentation by dynamically adjusting the overlap between chunks for each sentence. The parameters of this segmentation are the desired chunk length wc and the maximum duration of a sentence in the dataset Tmax. Based on these parameters, Equation 1 defines the number of chunks C per sentence. Equation 2 provides the target chunk step size &#8710;ci for a given sentence i with duration Ti. We can then split a sentence into a fixed number of C chunks without relying on zero-padding. Figure <ref type="figure">2</ref> shows a visual example of the chunking process.</p><p>The input of our model is the set of split sub-images of the spectrogram, which have the same image size. During training, the same sentence-level emotional label is assigned to all the sub-images obtained from the original spectrogram. 
While more sophisticated methods can be used to combine the chunk-based decisions <ref type="bibr">[31]</ref>, in this study we directly average the prediction outputs of the sub-images to derive a final prediction for the sentence.</p></div>
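The chunking scheme above can be sketched as follows. This is a minimal illustration under our reading of the dynamic-overlap idea in Lin and Busso [31]: each sentence is covered by a fixed number of equal-width sub-images whose hop is stretched or shrunk to fit the sentence length. The function name `split_into_chunks` and the exact rounding are our own choices, not taken from the paper.

```python
import numpy as np

def split_into_chunks(spec, chunk_frames, num_chunks):
    """Split a (n_mels, T) spectrogram into `num_chunks` equal-width
    sub-images, dynamically adjusting the overlap so the chunks cover
    the whole sentence without zero-padding (assumes T >= chunk_frames)."""
    total_frames = spec.shape[1]
    # Target step size: spread the chunk starts evenly over the sentence.
    step = (total_frames - chunk_frames) / (num_chunks - 1)
    starts = [int(round(i * step)) for i in range(num_chunks)]
    return np.stack([spec[:, s:s + chunk_frames] for s in starts])

# Example: an 11-chunk split of a ~3.2 s sentence (200 frames at a 16 ms hop)
chunks = split_into_chunks(np.zeros((128, 200)), chunk_frames=62, num_chunks=11)
```

With wc = 1 s and Tmax = 11 s, every sentence yields C = 11 overlapping one-second sub-images; shorter sentences simply overlap more.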
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">The DeepEmoCluster Framework</head><p>The proposed semi-supervised DeepEmoCluster framework consists of three networks: 1) the feature encoder fenc(&#8226;), 2) the cluster classifier fcltr(&#8226;), and 3) the emotional regressor femo(&#8226;) (Fig. <ref type="figure">1</ref>). The first network of the DeepEmoCluster approach is the feature encoder, which extracts discriminative information from the chunks. This block is implemented with the VGG-16 architecture <ref type="bibr">[17]</ref> with batch normalization. The CNN-based network extracts the chunk-level feature representations hj from the input mel-spectrograms xj. The second network is the cluster classifier, an unsupervised (or self-supervised) network that classifies the pseudo-class labels given by the K-means clustering of the latent feature space. The number of classes depends on the number of K-means clusters, and these clustering pseudo labels are re-assigned after every training epoch. The third network is the emotional regressor, a supervised regression network that predicts the target emotional attribute (i.e., arousal, dominance and valence). The addition of this supervised network brings emotional dependencies into the latent space.</p><p>The semi-supervised training scheme of the proposed DeepEmoCluster framework is divided into two stages within a single training epoch. First, data points from the unlabeled set are considered for the cluster classifier (i.e., unsupervised path). During this step, we freeze the femo(&#8226;) model and the gradients only backpropagate through the fenc(&#8226;) and fcltr(&#8226;) networks. Second, the data-label pairs from the labeled set are jointly used to backpropagate the gradients through the entire model. 
During this training stage, we consider the network as a multitask learning model with the loss function given in Equation <ref type="formula">3</ref>, where the first term is the concordance correlation coefficient (CCC) loss for the regression task, and the second term is the cross-entropy (CE) loss computed for the self-supervised cluster classifier. The parameter &#955; controls the importance of the unsupervised task.</p></div>
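As a concrete illustration of the pseudo-label refresh, the sketch below re-clusters chunk-level encoder features hj with a plain K-means pass at the start of an epoch. This is only a numpy toy (Lloyd's algorithm with a fixed number of iterations and random initialization); the actual K-means variant and training code used by the authors live in their released repository and are not reproduced here.

```python
import numpy as np

def kmeans_pseudo_labels(features, k, iters=50, seed=0):
    """Assign pseudo-class labels to encoder features via K-means.
    In DeepEmoCluster these labels are recomputed after every epoch
    and used as targets for the cluster classifier's CE loss."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen feature vectors.
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature to its nearest centroid (squared L2 distance).
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # Update centroids; leave a centroid unchanged if its cluster is empty.
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return labels
```

The resulting `labels` feed the cross-entropy term in Equation 3, while the labeled mini-batches drive the CCC term through the full network.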
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">EXPERIMENTAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental Settings</head><p>We split the 128D mel-spectrogram of every sentence into C = 11 sub-images by setting the desired chunk window size wc to 1 sec (Eq. 1). Note that the maximum sentence duration in the MSP-Podcast dataset is 11 secs <ref type="bibr">[18]</ref>. The detailed architecture of the feature encoder fenc(&#8226;) is shown in Table <ref type="table">1</ref>, which follows the VGG-16 structure <ref type="bibr">[17]</ref>. The CNN blocks in Table <ref type="table">1</ref> consist of a 2D-CNN with batch normalization and a ReLU activation function.</p><p>The cluster classifier and emotional regressor have the same structure, but they are constructed with fully connected layers (ReLU activation) and a task-dependent output layer (i.e., a softmax output for the cluster classifier and a linear output for the emotional regressor). The weighting factor &#955; of the loss function (Eq. 3) is set to 1, and the number of K-means clusters is a fine-tuned parameter that depends on the size of the dataset. We discuss its value in Section 5.3. We train our models with a batch size of 64, using the Adam optimizer for the emotional regressor (lr=0.0005) and the stochastic gradient descent (SGD) optimizer for the feature encoder and cluster classifier (lr=0.001). We save the best models with an early stopping criterion based on the validation loss. All models are implemented in PyTorch. We compare against three baseline models in this study: CNN-regressor, CNN-AE and CNN-VAE. The CNN-regressor is a regular CNN-based emotional regression model. More specifically, the model contains the feature encoder and emotional regressor networks, so it can only be applied to fully supervised learning. Instead of using the cluster classifier (i.e., pseudo labeling), the CNN-AE and CNN-VAE include an additional decoder network that reconstructs the inputs as an unsupervised task. 
By combining a supervised task with the autoencoder, these two baselines can also be extended to semi-supervised frameworks, providing appropriate baselines for our proposed DeepEmoCluster framework. The decoder network mirrors the layers of the encoder (i.e., the feature encoder in Table <ref type="table">1</ref>), where transposed convolutional layers are used to reconstruct the original input feature maps. The major difference between the CNN-AE and CNN-VAE is that the CNN-AE baseline directly reconstructs the outputs from the bottleneck, whereas the CNN-VAE baseline reconstructs the input from the reparametrized latent vectors. We evaluate model performance using the CCC. We randomly split the original test set into four subsets of the same size and run five trials for all models, reporting the average over these results (i.e., 4 subsets &#215; 5 trials = 20 results per model per attribute). We implement this strategy to conduct statistical analysis using a two-tailed t-test over the 20 results. We assert statistical significance when the p-value is &lt;0.05.</p></div>
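For reference, the CCC used as the evaluation metric (and, as 1 &#8722; CCC, as the regression loss in Equation 3) can be computed as in the short sketch below; the standard definition is used, and the function name `ccc` is ours.

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance correlation coefficient:
    2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2.0 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)
```

Unlike the Pearson correlation, the CCC also penalizes mean and scale shifts, so a prediction with a constant offset scores strictly below a perfectly matching one even though the two are perfectly correlated.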
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Evaluation with Fully Supervised Learning</head><p>In this section, we compare the fully supervised performance of the proposed DeepEmoCluster framework with the three baseline models. All the models are trained on the labeled set without considering any unlabeled data. We fix the number of clusters to 10 for all emotional attributes in the DeepEmoCluster framework. Table <ref type="table">2</ref> shows the CCC performance of all models for the three emotional attributes. DeepEmoCluster achieves the best prediction scores for arousal and dominance, significantly outperforming all the baseline models. Although the valence result of the DeepEmoCluster does not reach the highest performance, it is still competitive. These results indicate that discretizing the latent representation by encouraging maximum separation between data points can benefit a continuous regression task, since it reduces the complexity of the learned latent space, which is constrained to discrete clusters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Evaluation of Unlabeled Set Size and Number of Clusters</head><p>Since the proposed DeepEmoCluster framework is semi-supervised, we can utilize unlabeled data to further improve model performance. Table <ref type="table">3</ref> reports the CCC performance as we add more unlabeled data (0, 15K and 40K unlabeled segments) under a semi-supervised training scheme. We use 10 clusters for the DeepEmoCluster approach. Table <ref type="table">3</ref> shows that training with additional unlabeled data leads to improved predictions, especially for the valence attribute. This result validates the effectiveness of our SSL approach, which better represents features by utilizing the information in the unlabeled data.</p><p>We notice that as we increase the size of the unlabeled set, the performance does not keep increasing significantly. Therefore, we hypothesize that the appropriate number of clusters depends on the size of the unlabeled set. To validate this hypothesis, we fix the unlabeled data to 40K segments and evaluate the approach with different numbers of clusters. Table <ref type="table">4</ref> shows an increasing trend in the prediction performance for dominance and valence as we increase the number of clusters in the DeepEmoCluster framework. However, the prediction of arousal achieves the best result with only 10 clusters. This result suggests that the number of clusters is a tuning parameter that depends on the size of the unlabeled set and the emotional attribute.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Cluster Analysis</head><p>The DeepEmoCluster framework offers a high degree of geometric interpretability, which enables us to explicitly show how the learned emotional latent space contributes to the performance improvements. One approach to validate whether the emotional content plays a role in the latent clusters is to visually compare the emotional</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">CONCLUSIONS</head><p>We presented the DeepEmoCluster framework, a new semi-supervised approach that learns better latent representations for SER from labeled and unlabeled sets. The DeepEmoCluster approach constructs a latent feature space based on clusters that depend on the emotional content. We achieve this goal by adding a supervised emotional regression task. The maximum latent cluster separation constraint during the joint training procedure enables our approach to achieve the best prediction scores for arousal and dominance, and competitive performance for valence, under a fully supervised training scheme. By applying a semi-supervised training strategy, the DeepEmoCluster approach further improves model performance, reaching the best overall recognition accuracy for emotional attributes. Future work includes strengthening the connections between the latent clusters and the target emotions, for example by adopting an additional information-theoretic loss function, such as mutual information, that directly associates the cluster classifier with the emotional regressor. Another promising future direction is to build a multimodal DeepEmoCluster based on video, audio and language, forming meaningful behavioral emotional clusters.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>This study was funded by the National Science Foundation (NSF) under grants CNS-2016719 and CAREER IIS-1453781. GitHub link: https://github.com/winston-lin-wei-cheng/DeepEmoClusters</p></note>
		</body>
		</text>
</TEI>
