- PAR ID:
- 10477829
- Publisher / Repository:
- ISCA: International Speech Communication Assoc.
- Date Published:
- Journal Name:
- ISCA INTERSPEECH-2023
- Edition / Version:
- 1
- Page Range / eLocation ID:
- 984 to 988
- Subject(s) / Keyword(s):
- Index Terms: Wav2vec 2.0, Transformer, goodness of pronunciation, phoneme, prosody, suprasegmental.
- Format(s):
- Medium: X Size: 1MB
- Size(s):
- 1MB
- Location:
- Dublin, Ireland
- Sponsoring Org:
- National Science Foundation
More Like this
-
Various aspects of second language (L2) speakers’ pronunciation can be considered in the oral assessment of speaker proficiency. Over time, both segmentals and suprasegmentals have been examined for their roles in judgments of accented speech. Descriptors in the rating criteria often include speaker’s intelligibility (i.e., the actual understanding of the utterance) or comprehensibility (i.e., ease of understanding) (Derwing & Munro, 2005). This paper discusses the current issues and rating criteria in L2 pronunciation assessment, and describes the prominent characteristics of L2 intelligibility. It also offers recommendations to inform assessment practices and curriculum development in L2 classrooms in the context of Global Englishes.
-
Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced on the fly by an ensemble of the online model, which ensures that our model is robust to pseudo-label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and 2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared to the state-of-the-art MDD systems, our MDD solution produces a more accurate and consistent phonetic error diagnosis. In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system recognition outputs show a strong correlation with human perception, based on accentedness and intelligibility.
-
Issues of intelligibility may arise amongst English learners when acquiring new words and phrases in North American academic settings, perhaps in part due to limited linguistic data available to the learner for understanding language use patterns. To this end, this paper examines the effects of Data‐Driven Learning for Pronunciation (DDLfP) on lexical stress and prominence in the US academic context. 65 L2 English learners in North American universities completed a diagnostic and pretest with listening and speaking items before completing four online lessons and a posttest on academic words and formulas (i.e., multi‐word sequences). Experimental group participants (n = 40) practiced using an audio corpus of highly proficient L2 speakers while comparison group participants (n = 25) were given teacher‐created pronunciation materials. Logistic regression results indicated that the group who used the corpus significantly increased their recognition of prominence in academic formulas. In the spoken tasks, both groups improved in their lexical stress pronunciation, but only the DDLfP learners improved their production of prominence in academic formulas. Learners reported that they valued DDLfP efforts for pronunciation learning across contexts and speakers. Findings have implications for teachers of L2 pronunciation and support the use of corpora for language teaching and learning.
-
While a range of measures based on speech production, language, and perception are possible (Manun et al., 2020) for the prediction and estimation of speech intelligibility, what constitutes second language (L2) intelligibility remains under-defined. Prosodic and temporal features (i.e., stress, speech rate, rhythm, and pause placement) have been shown to impact listener perception (Kang et al., 2020). Still, their relationship with highly intelligible speech is yet unclear. This study aimed to characterize L2 speech intelligibility. Acoustic analyses, including PRAAT and Python scripts, were conducted on 405 speech samples (30 s) from 102 L2 English speakers with a wide variety of backgrounds, proficiency levels, and intelligibility levels. The results indicate that highly intelligible speakers of English employ between 2 and 4 syllables per second and that higher or lower speeds are less intelligible. Silent pauses between 0.3 and 0.8 s were associated with the highest levels of intelligibility. Rhythm, measured by Δ syllable length of all content syllables, was marginally associated with intelligibility. Finally, lexical stress accuracy did not interfere substantially with intelligibility until less than 70% of the polysyllabic words were incorrect. These findings inform the fields of first and second language research as well as language education and pathology.
-
Radek Skarnitzl & Jan Volín (Ed.) Unfamiliar native and non-native accents can cause word recognition challenges, particularly in noisy environments, but few studies have incorporated quantitative pronunciation distance metrics to explain intelligibility differences across accents. Here, intelligibility was measured for 18 talkers (two from each of three native, one bilingual, and five non-native accents) in three listening conditions (quiet and two noise conditions). Two variations of the Levenshtein pronunciation distance metric, which quantifies phonemic differences from a reference accent, were assessed for their ability to predict intelligibility. An unweighted Levenshtein distance metric was the best intelligibility predictor; talker accent further predicted performance. Accuracy did not fall along a native/non-native divide. Thus, phonemic differences from the listener’s home accent primarily determine intelligibility, but other accent-specific pronunciation features, including suprasegmental characteristics, must be quantified to fully explain intelligibility across talkers and listening conditions. These results have implications for pedagogical practices and speech perception theories.
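
As a rough illustration of the kind of metric this abstract describes, the sketch below computes an unweighted Levenshtein distance between two phoneme sequences, where every insertion, deletion, and substitution costs 1. It is not the authors' implementation. The function names, the example transcriptions, and the length normalization are assumptions introduced here for illustration only.

```python
from typing import Sequence


def phoneme_levenshtein(reference: Sequence[str], variant: Sequence[str]) -> int:
    """Unweighted edit distance between two phoneme sequences.

    Each insertion, deletion, or substitution counts as 1.
    """
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j phonemes of the variant.
    prev = list(range(len(variant) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, var_ph in enumerate(variant, start=1):
            cost = 0 if ref_ph == var_ph else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference phoneme
                curr[j - 1] + 1,     # insert a variant phoneme
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def normalized_distance(reference: Sequence[str], variant: Sequence[str]) -> float:
    """Distance divided by the longer sequence length.

    This normalization is an assumption of the sketch, not taken from the source.
    """
    longest = max(len(reference), len(variant))
    if longest == 0:
        return 0.0
    return phoneme_levenshtein(reference, variant) / longest


if __name__ == "__main__":
    # Hypothetical transcriptions of one word in a reference accent and another accent.
    reference = ["w", "ɔ", "t", "ɚ"]
    variant = ["w", "ɔ", "ʔ", "ə"]
    print(phoneme_levenshtein(reference, variant))
    print(normalized_distance(reference, variant))
```

Averaging such per-word distances over a word list would give one number per talker that can be compared against measured intelligibility. A weighted variant would replace the fixed substitution cost of 1 with a cost that depends on how similar the two phonemes are.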