-
The field of speech emotion recognition (SER) aims to create scientifically rigorous systems that can reliably characterize emotional behaviors expressed in speech. A key aspect of building SER systems is obtaining emotional data that is both reliable and reproducible for practitioners. However, academic researchers encounter difficulties in accessing or collecting naturalistic, large-scale, reliable emotional recordings. Also, the best practices for data collection are not necessarily described or shared when presenting emotional corpora. To address this issue, the paper proposes the creation of an affective naturalistic database consortium (AndC) that can encourage multidisciplinary cooperation among researchers and practitioners in the field of affective computing. This paper’s contribution is twofold. First, it proposes the design of the AndC with a customizable-standard framework for intelligently-controlled emotional data collection. The focus is on leveraging naturalistic spontaneous recordings available on audio-sharing websites. Second, it presents as a case study the development of a naturalistic large-scale Taiwanese Mandarin podcast corpus using the customizable-standard intelligently-controlled framework. The AndC will enable research groups to effectively collect data using the provided pipeline and to contribute alternative algorithms or data collection protocols.
-
Modeling cross-lingual speech emotion recognition (SER) has become more prevalent because of its diverse applications. Existing studies have mostly focused on technical approaches that adapt the feature, domain, or label across languages, without considering in detail the similarities between the languages. This study focuses on domain adaptation in cross-lingual scenarios using phonetic constraints. This work is framed in a twofold manner. First, we analyze emotion-specific phonetic commonality across languages by identifying common vowels that are useful for SER modeling. Second, we leverage these common vowels as an anchoring mechanism to facilitate cross-lingual SER. We consider American English and Taiwanese Mandarin as a case study to demonstrate the potential of our approach. This work uses two in-the-wild natural emotional speech corpora: MSP-Podcast (American English) and BIIC-Podcast (Taiwanese Mandarin). The proposed unsupervised cross-lingual SER model using these phonetic anchors outperforms the baselines, reaching an unweighted average recall (UAR) of 58.64%.
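For readers unfamiliar with the metric reported above, UAR is the plain mean of the per-class recalls, so each emotion class counts equally no matter how many samples it has. The sketch below is only an illustration of that definition: the function name and the toy labels are assumptions made for this example and are not taken from the paper or its corpora.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (every class weighted equally)."""
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not data from the paper).
y_true = ["neutral", "neutral", "neutral", "neutral", "happy", "angry"]
y_pred = ["neutral", "neutral", "neutral", "happy", "happy", "neutral"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.4f}, accuracy = {accuracy:.4f}")
```

On imbalanced data the two numbers diverge: the missed minority class pulls UAR below plain accuracy, which is one reason the metric is common in emotion recognition, where class frequencies are usually uneven.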
-
Advancing speech emotion recognition (SER) depends highly on the source used to train the model, i.e., the emotional speech corpora. By permuting different design parameters, researchers have released versions of corpora that attempt to provide a better-quality source for training SER. In this work, we focus on studying communication modes of collection. In particular, we analyze the patterns of emotional speech collected during interpersonal conversations or monologues. While it is well known that conversation provides a better protocol for eliciting authentic emotion expressions, there is a lack of systematic analyses to determine whether conversational speech provides a “better-quality” source. Specifically, we examine this research question from three perspectives: perceptual differences, acoustic variability, and SER model learning. Our analyses on the MSP-Podcast corpus show that: 1) raters’ consistency for conversation recordings is higher when evaluating categorical emotions, 2) the perceptions and acoustic patterns observed in conversations have properties that are better aligned with expected trends discussed in the emotion literature, and 3) a more robust SER model can be trained from conversational data. This work provides initial evidence that samples of conversations may provide a better-quality source than samples from monologues for building an SER model.