Title: Phonetic Anchor-Based Transfer Learning to Facilitate Unsupervised Cross-Lingual Speech Emotion Recognition
Modeling cross-lingual speech emotion recognition (SER) has become more prevalent because of its diverse applications. Existing studies have mostly focused on technical approaches that adapt the feature, domain, or label across languages, without considering in detail the similarities between the languages. This study focuses on domain adaptation in cross-lingual scenarios using phonetic constraints. This work is framed in a twofold manner. First, we analyze emotion-specific phonetic commonality across languages by identifying common vowels that are useful for SER modeling. Second, we leverage these common vowels as an anchoring mechanism to facilitate cross-lingual SER. We consider American English and Taiwanese Mandarin as a case study to demonstrate the potential of our approach. This work uses two in-the-wild natural emotional speech corpora: MSP-Podcast (American English) and BIIC-Podcast (Taiwanese Mandarin). The proposed unsupervised cross-lingual SER model using these phonetic anchors outperforms the baselines with an unweighted average recall (UAR) of 58.64%.
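A note on the reported metric: unweighted average recall (UAR) is recall averaged over the emotion classes, so every class contributes equally regardless of how often it occurs. The short sketch below is not the authors' code; the labels are invented purely to illustrate that UAR equals macro-averaged recall (here computed with scikit-learn).

# Minimal illustration of unweighted average recall (UAR).
# Hypothetical labels; not data from MSP-Podcast or BIIC-Podcast.
from sklearn.metrics import recall_score

y_true = ["neutral", "happy", "sad", "angry", "neutral", "happy", "sad", "angry"]
y_pred = ["neutral", "happy", "neutral", "angry", "neutral", "sad", "sad", "angry"]

uar = recall_score(y_true, y_pred, average="macro")  # macro-averaged recall == UAR
print(f"UAR = {uar:.2%}")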
Award ID(s):
2016719
PAR ID:
10441289
Journal Name:
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)
Page Range / eLocation ID:
1 to 5
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    The field of speech emotion recognition (SER) aims to create scientifically rigorous systems that can reliably characterize emotional behaviors expressed in speech. A key aspect for building SER systems is to obtain emotional data that is both reliable and reproducible for practitioners. However, academic researchers encounter difficulties in accessing or collecting naturalistic, large-scale, reliable emotional recordings. Also, the best practices for data collection are not necessarily described or shared when presenting emotional corpora. To address this issue, the paper proposes the creation of an affective naturalistic database consortium (AndC) that can encourage multidisciplinary cooperation among researchers and practitioners in the field of affective computing. This paper’s contribution is twofold. First, it proposes the design of the AndC with a customizable-standard framework for intelligently-controlled emotional data collection. The focus is on leveraging naturalistic spontaneous recordings available on audio-sharing websites. Second, it presents as a case study the development of a naturalistic large-scale Taiwanese Mandarin podcast corpus using the customizable-standard intelligently-controlled framework. The AndC will enable research groups to effectively collect data using the provided pipeline and to contribute with alternative algorithms or data collection protocols.
  2. Abstract

    Multilingual speakers can find speech recognition in everyday environments like restaurants and open-plan offices particularly challenging. In a world where speaking multiple languages is increasingly common, effective clinical and educational interventions will require a better understanding of how factors like multilingual contexts and listeners’ language proficiency interact with adverse listening environments. For example, word and phrase recognition is facilitated when competing voices speak different languages. Is this due to a “release from masking” from lower-level acoustic differences between languages and talkers, or higher-level cognitive and linguistic factors? To address this question, we created a “one-man bilingual cocktail party” selective attention task using English and Mandarin speech from one bilingual talker to reduce low-level acoustic cues. In Experiment 1, 58 listeners more accurately recognized English targets when distracting speech was Mandarin compared to English. Bilingual Mandarin–English listeners experienced significantly more interference and intrusions from the Mandarin distractor than did English listeners, exacerbated by challenging target-to-masker ratios. In Experiment 2, 29 Mandarin–English bilingual listeners exhibited linguistic release from masking in both languages. Bilinguals experienced greater release from masking when attending to English, confirming an influence of linguistic knowledge on the “cocktail party” paradigm that is separate from primarily energetic masking effects. Effects of higher-order language processing and expertise emerge only in the most demanding target-to-masker contexts. The “one-man bilingual cocktail party” establishes a useful tool for future investigations and characterization of communication challenges in the large and growing worldwide community of Mandarin–English bilinguals.
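    As an illustration of the target-to-masker ratio (TMR) manipulation described above, the sketch below (not the study's stimulus-generation code; the waveforms are synthetic placeholders) scales a masker signal so that the target-to-masker RMS ratio equals a chosen value in dB before mixing.

    # Illustrative only: mix a target and a masker at a chosen TMR in dB.
    # Synthetic signals stand in for the bilingual talker's recordings.
    import numpy as np

    def mix_at_tmr(target, masker, tmr_db):
        """Scale the masker so the target-to-masker RMS ratio equals tmr_db."""
        n = min(len(target), len(masker))
        target, masker = target[:n], masker[:n]
        rms_t = np.sqrt(np.mean(target ** 2))
        rms_m = np.sqrt(np.mean(masker ** 2))
        masker_gain = rms_t / (rms_m * 10 ** (tmr_db / 20.0))
        return target + masker_gain * masker

    sr = 16000
    t = np.arange(sr) / sr
    target = 0.1 * np.sin(2 * np.pi * 220 * t)                     # placeholder "target speech"
    masker = 0.05 * np.random.default_rng(0).standard_normal(sr)   # placeholder "masker speech"
    mixture = mix_at_tmr(target, masker, tmr_db=-4.0)              # a demanding TMR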

     
  3. Southern American English is spoken in a large geographic region in the United States. Its characteristics include back-vowel fronting (e.g., in goose, foot, and goat), which has been ongoing since the mid-nineteenth century; meanwhile, the low back vowels (in lot and thought) have recently merged in some areas. We investigate these five vowels in the Digital Archive of Southern Speech, a legacy corpus of linguistic interviews with sixty-four speakers born 1886-1956. We extracted 89,367 vowel tokens and used generalized additive mixed-effects models to test for socially-driven changes to both their relative phonetic placements and the shapes of their formant trajectories. Our results reinforce previous descriptions of Southern vowels while contributing additional phonetic detail about their trajectories. Goose-fronting is a change in progress, with greatest fronting after coronal consonants. Goat is quite dynamic; it lowers and fronts in apparent time. Generally, women have more fronted realizations than men. Foot is largely monophthongal, and stable across time. Lot and thought are distinct and unmerged, occupying different regions of the vowel space. While their relative positions change across generations, all five vowels show a remarkable consistency in formant trajectory shapes across time. This study’s results reveal social and phonetic details about the back vowels of Southerners born in the late nineteenth and early twentieth centuries: goose-fronting was well underway, goat-fronting was beginning, but foot remained backed, and the low back vowels were unmerged.
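    To give a rough sense of the modeling approach (generalized additive models fit to formant trajectories), the hypothetical sketch below uses the pygam package with synthetic F2 values; it omits the per-speaker random effects that the study's generalized additive mixed-effects models would include, and none of the numbers come from the corpus.

    # Hypothetical sketch: smooth terms for position-in-vowel and birth year
    # predicting F2 of GOOSE tokens. Synthetic data only; random effects omitted.
    import numpy as np
    from pygam import LinearGAM, s

    rng = np.random.default_rng(1)
    n = 1000
    time = rng.uniform(0.0, 1.0, n)            # normalized time within the vowel
    birth_year = rng.uniform(1886, 1956, n)    # speaker year of birth
    f2 = 1200 + 250 * time + 2.5 * (birth_year - 1886) + rng.normal(0, 60, n)

    X = np.column_stack([time, birth_year])
    gam = LinearGAM(s(0) + s(1)).fit(X, f2)    # smooths over time and birth year
    gam.summary()                              # inspect the fitted smooth terms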

     
  4. Africa has over 2000 indigenous languages, but they are under-represented in NLP research due to a lack of datasets. In recent years, there has been progress in developing labelled corpora for African languages. However, these corpora are often available in a single domain and may not generalize to other domains. In this paper, we focus on the task of sentiment classification for cross-domain adaptation. We create a new dataset, NollySenti, based on Nollywood movie reviews for five languages widely spoken in Nigeria (English, Hausa, Igbo, Nigerian-Pidgin, and Yorùbá). We provide an extensive empirical evaluation using classical machine learning methods and pre-trained language models. Leveraging transfer learning, we compare the performance of cross-domain adaptation from the Twitter domain and cross-lingual adaptation from the English language. Our evaluation shows that transfer from English in the same target domain leads to more than 5% improvement in accuracy compared to transfer from Twitter in the same language. To further mitigate the domain difference, we leverage machine translation (MT) from English to other Nigerian languages, which leads to a further improvement of 7% over cross-lingual evaluation. While MT to low-resource languages is often of low quality, we show through human evaluation that most of the translated sentences preserve the sentiment of the original English reviews.
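    To make the "classical machine learning methods" mentioned above concrete, here is a minimal, hypothetical baseline (not the NollySenti code; the review snippets are invented): TF-IDF features feeding a logistic-regression sentiment classifier.

    # Illustrative TF-IDF + logistic regression sentiment baseline.
    # Toy review snippets; not drawn from the NollySenti dataset.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = ["a wonderful film", "boring and far too long",
                   "great acting and story", "terrible plot"]
    train_labels = ["positive", "negative", "positive", "negative"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                        LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    print(clf.predict(["a great and wonderful story"]))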
  5. Abstract

    This study examines the putative benefits of explicit phonetic instruction, high variability phonetic training, and their effects on adult nonnative speakers’ Mandarin tone productions. Monolingual first language (L1) English speakers (n = 80), intermediate second language (L2) Mandarin learners (n = 40), and L1 Mandarin speakers (n = 40) took part in a multiday Mandarin-like artificial language learning task. Participants were asked to repeat a syllable–tone combination immediately after hearing it. Half of all participants were exposed to speech from 1 talker (low variability) while the other half heard speech from 4 talkers (high variability). Half of the L1 English participants were given daily explicit instruction on Mandarin tone contours, while the other half were not. Tone accuracy was measured by L1 Mandarin raters (n = 104) who classified productions according to their perceived tonal category. Explicit instruction of tone contours facilitated L1 English participants’ production of rising and falling tone contours. High variability input alone had no main effect on participants’ productions but interacted with explicit instruction to improve participants’ productions of high-level tone contours. These results motivate an L2 tone production training approach that consists of explicit tone instruction followed by gradual exposure to more variable speech.

     