Title: Chunk-Level Speech Emotion Recognition: A General Framework of Sequence-to-One Dynamic Temporal Modeling
A critical issue in current speech-based sequence-to-one learning tasks, such as speech emotion recognition (SER), is the dynamic temporal modeling of speech sentences with different durations. The goal is to extract an informative representation vector of the sentence from acoustic feature sequences of varying length. Traditional methods rely on static descriptions such as statistical functions or a universal background model (UBM), which are not capable of characterizing dynamic temporal changes. Recent advances in deep learning architectures provide promising results, directly extracting sentence-level representations from frame-level features. However, conventional cropping and padding techniques that deal with varied length sequences are not optimal, since they truncate or artificially add sentence-level information. Therefore, we propose a novel dynamic chunking approach, which maps the original sequences of different lengths into a fixed number of chunks that have the same duration by adjusting their overlap. This simple chunking procedure creates a flexible framework that can incorporate different feature extractions and sentence-level temporal aggregation approaches to cope, in a principled way, with different sequence-to-one tasks. Our experimental results based on three databases demonstrate that the proposed framework provides: 1) improvement in recognition accuracy, 2) robustness toward different temporal length predictions, and 3) high computational efficiency.
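The chunking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `chunk_starts` and `dynamic_chunks`, the evenly spaced integer start positions, and the requirement that a sentence be at least one chunk long are all assumptions made for illustration. The sketch only shows how a fixed number of equal-length chunks can cover sentences of different lengths when the spacing between chunks (and hence their overlap) depends on the sentence length.

```python
def chunk_starts(num_frames, chunk_len, num_chunks):
    """Return evenly spaced start indices for a fixed number of chunks.

    Assumption (not from the source): the sentence has at least
    `chunk_len` frames, so no padding is needed.
    """
    if chunk_len <= 0 or num_chunks <= 0:
        raise ValueError("chunk_len and num_chunks must be positive")
    if num_frames < chunk_len:
        raise ValueError("sentence is shorter than one chunk")
    if num_chunks == 1:
        return [0]
    # Total distance the chunk window can travel across the sentence.
    span = num_frames - chunk_len
    # Integer arithmetic keeps the last start exactly at `span`.
    return [(i * span) // (num_chunks - 1) for i in range(num_chunks)]


def dynamic_chunks(frames, chunk_len, num_chunks):
    """Split a frame sequence into `num_chunks` chunks of `chunk_len` frames.

    Short sentences yield heavily overlapping chunks; longer sentences
    yield less overlap (or gaps, if num_chunks * chunk_len < len(frames)).
    """
    starts = chunk_starts(len(frames), chunk_len, num_chunks)
    return [frames[s:s + chunk_len] for s in starts]


if __name__ == "__main__":
    short_sentence = list(range(6))    # 6 frames
    long_sentence = list(range(10))    # 10 frames
    print(dynamic_chunks(short_sentence, chunk_len=4, num_chunks=3))
    print(dynamic_chunks(long_sentence, chunk_len=4, num_chunks=3))
```

In this sketch both sentences produce three chunks of four frames each, so a downstream model would always receive an input of the same shape; only the overlap between neighboring chunks differs.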
Award ID(s):
2016719 1453781
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE Transactions on Affective Computing
Page Range / eLocation ID:
1 to 1
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Speech emotion recognition (SER) plays an important role in multiple fields such as healthcare, human-computer interaction (HCI), and security and defense. Emotional labels are often annotated at the sentence-level (i.e., one label per sentence), resulting in a sequence-to-one recognition problem. Traditionally, studies have relied on statistical descriptions, which are computed over time from low level descriptors (LLDs), creating a fixed dimension sentence-level feature representation regardless of the duration of the sentence. However, sentence-level features lack temporal information, which limits the performance of SER systems. Recently, new deep learning architectures have been proposed to model temporal data. An important question is how to extract emotion-relevant features with temporal information. This study proposes a novel data processing approach that extracts a fixed number of small chunks over sentences of different durations by changing the overlap between these chunks. The approach is flexible, providing an ideal framework to combine gated network or attention mechanisms with long short-term memory (LSTM) networks. Our experimental results based on the MSP-Podcast dataset demonstrate that the proposed method not only significantly improves recognition accuracy over alternative temporal-based models relying on LSTM, but also leads to computational efficiency.
  2. With the requirements of natural language applications, multi-task sequence labeling methods have some immediate benefits over single-task sequence labeling methods. Recently, many state-of-the-art multi-task sequence labeling methods have been proposed, but many issues remain to be resolved, including (C1) exploring a more general relationship between tasks, (C2) extracting the task-shared knowledge purely and (C3) merging the task-shared knowledge for each task appropriately. To address the above challenges, we propose MTAA, a symmetric multi-task sequence labeling model, which performs an arbitrary number of tasks simultaneously. Furthermore, MTAA extracts the shared knowledge among tasks by adversarial learning and integrates the proposed multi-representation fusion attention mechanism for merging feature representations. We evaluate MTAA on two widely used data sets: CoNLL2003 and OntoNotes5.0. Experimental results show that our proposed model outperforms the latest methods on the named entity recognition and the syntactic chunking task by a large margin, and achieves state-of-the-art results on the part-of-speech tagging task.
  3. Speech processing is highly incremental. It is widely accepted that human listeners continuously use the linguistic context to anticipate upcoming concepts, words, and phonemes. However, previous evidence supports two seemingly contradictory models of how a predictive context is integrated with the bottom-up sensory input: Classic psycholinguistic paradigms suggest a two-stage process, in which acoustic input initially leads to local, context-independent representations, which are then quickly integrated with contextual constraints. This contrasts with the view that the brain constructs a single coherent, unified interpretation of the input, which fully integrates available information across representational hierarchies, and thus uses contextual constraints to modulate even the earliest sensory representations. To distinguish these hypotheses, we tested magnetoencephalography responses to continuous narrative speech for signatures of local and unified predictive models. Results provide evidence that listeners employ both types of models in parallel. Two local context models uniquely predict some part of early neural responses, one based on sublexical phoneme sequences, and one based on the phonemes in the current word alone; at the same time, even early responses to phonemes also reflect a unified model that incorporates sentence-level constraints to predict upcoming phonemes. Neural source localization places the anatomical origins of the different predictive models in nonidentical parts of the superior temporal lobes bilaterally, with the right hemisphere showing a relative preference for more local models. These results suggest that speech processing recruits both local and unified predictive models in parallel, reconciling previous disparate findings. Parallel models might make the perceptual system more robust, facilitate processing of unexpected inputs, and serve a function in language acquisition. 
    MEG Data: MEG data is in FIFF format and can be opened with MNE-Python. Data has been directly converted from the acquisition device native format without any preprocessing. Events contained in the data indicate the stimuli in numerical order. Subjects R2650 and R2652 heard stimulus 11b instead of 11.
    Predictor Variables: The original audio files are copyrighted and cannot be shared, but the make_audio folder contains which can be used to extract the exact clips from the commercially available audiobook (ISBN 978-1480555280). The predictors directory contains all the predictors used in the original study as pickled eelbrain objects. They can be loaded in Python with the eelbrain.load.unpickle function. The TextGrids directory contains the TextGrids aligned to the audio files.
    Source Localization: The file contains files needed for source localization. Structural brain models used in the published analysis are reconstructed by scaling the FreeSurfer fsaverage brain (distributed with FreeSurfer) based on each subject's `MRI scaling parameters.cfg` file. This can be done using the `mne.scale_mri` function. Each subject's MEG folder contains a `subject-trans.fif` file which contains the coregistration between MEG sensor space and (scaled) MRI space, which is used to compute the forward solution.
  4. Deep learning models have been studied to forecast human events using vast volumes of data, yet they still cannot be trusted in certain applications such as healthcare and disaster assistance due to the lack of interpretability. Providing explanations for event predictions not only helps practitioners understand the underlying mechanism of prediction behavior but also enhances the robustness of event analysis. Improving the transparency of event prediction models is challenging given the following factors: (i) multilevel features exist in event data which creates a challenge to cross-utilize different levels of data; (ii) features across different levels and time steps are heterogeneous and dependent; and (iii) static model-level interpretations cannot be easily adapted to event forecasting given the dynamic and temporal characteristics of the data. Recent interpretation methods have proven their capabilities in tasks that deal with graph-structured or relational data. In this paper, we present a Contextualized Multilevel Feature learning framework, CMF, for interpretable temporal event prediction. It consists of a predictor for forecasting events of interest and an explanation module for interpreting model predictions. We design a new context-based feature fusion method to integrate multiple levels of heterogeneous features. We also introduce a temporal explanation module to determine sequences of text and subgraphs that have crucial roles in a prediction. We conduct extensive experiments on several real-world datasets of political and epidemic events. We demonstrate that the proposed method is competitive compared with the state-of-the-art models while possessing favorable interpretation capabilities. 
  5. Cross-modal effects provide a model framework for investigating hierarchical inter-areal processing, particularly under conditions where unimodal cortical areas receive contextual feedback from other modalities. Here, using complementary behavioral and brain imaging techniques, we investigated the functional networks participating in face and voice processing during gender perception, a high-level feature of voice and face perception. Within the framework of a signal detection decision model, maximum likelihood conjoint measurement (MLCM) was used to estimate the contributions of the face and voice to gender comparisons between pairs of audio-visual stimuli in which the face and voice were independently modulated. Top–down contributions were varied by instructing participants to make judgments based on the gender of either the face, the voice or both modalities (N = 12 for each task). Estimated face and voice contributions to the judgments of the stimulus pairs were not independent; both contributed to all tasks, but their respective weights varied over a 40-fold range due to top–down influences. Models that best described the modal contributions required the inclusion of two different top–down interactions: (i) an interaction that depended on gender congruence across modalities (i.e., difference between face and voice modalities for each stimulus); (ii) an interaction that depended on the within modalities’ gender magnitude. The significance of these interactions was task dependent. Specifically, gender congruence interaction was significant for the face and voice tasks while the gender magnitude interaction was significant for the face and stimulus tasks. Subsequently, we used the same stimuli and related tasks in a functional magnetic resonance imaging (fMRI) paradigm (N = 12) to explore the neural correlates of these perceptual processes, analyzed with Dynamic Causal Modeling (DCM) and Bayesian Model Selection. Results revealed changes in effective connectivity between the unimodal Fusiform Face Area (FFA) and Temporal Voice Area (TVA) in a fashion that paralleled the face and voice behavioral interactions observed in the psychophysical data. These findings explore the role of multiple parallel unimodal feedback pathways in perception.