Abstract:
One of the basic goals of second language (L2) speech research is to understand the perception-production link, or the relationship between L2 speech perception and L2 speech production. Although many studies have examined the link, they have done so with strikingly different conceptual foci and methods. Even studies that appear to use similar perception and production tasks often present nontrivial differences in task characteristics and implementation. This conceptual and methodological variation makes meaningful synthesis of perception-production findings difficult, and it also complicates the process of developing new perception-production models that specifically address how the link changes throughout L2 learning. In this study, we scrutinize theoretical and methodological issues in perception-production research and offer recommendations for advancing theory and practice in this domain. We focus on L2 sound learning because most work in the area has focused on segmental contrasts.
Journal Name:
Studies in Second Language Acquisition
Page Range or eLocation-ID:
1 to 26
Sponsoring Org:
National Science Foundation
More Like this
  1. Birdsong has long been a subject of extensive research in the fields of ethology as well as neuroscience. Neural and behavioral mechanisms underlying song acquisition and production in male songbirds are particularly well studied, mainly because birdsong shares some important features with human speech such as critical dependence on vocal learning. However, birdsong, like human speech, primarily functions as communication signals. The mechanisms of song perception and recognition should also be investigated to attain a deeper understanding of the nature of complex vocal signals. Although relatively less attention has been paid to song receivers compared to signalers, recent studies on female songbirds have begun to reveal the neural basis of song preference. Moreover, there are other studies of song preference in juvenile birds which suggest possible functions of preference in social context including the sensory phase of song learning. Understanding the behavioral and neural mechanisms underlying the formation, maintenance, expression, and alteration of such song preference in birds will potentially give insight into the mechanisms of speech communication in humans. To pursue this line of research, however, it is necessary to understand current methodological challenges in defining and measuring song preference. In addition, consideration of ultimate questions can also be important for laboratory researchers in designing experiments and interpreting results. Here we summarize the current understanding of song preference in female and juvenile songbirds in the context of Tinbergen’s four questions, incorporating results ranging from ethological field research to the latest neuroscience findings. We also discuss problems and remaining questions in this field and suggest some possible solutions and future directions.
  2. Purpose: The “bubble noise” technique has recently been introduced as a method to identify the regions in time–frequency maps (i.e., spectrograms) of speech that are especially important for listeners in speech recognition. This technique identifies regions of “importance” that are specific to the speech stimulus and the listener, thus permitting these regions to be compared across different listener groups. For example, in cross-linguistic and second-language (L2) speech perception, this method identifies differences in regions of importance in accomplishing decisions of phoneme category membership. This research note describes the application of bubble noise to the study of language learning for 3 different language pairs: Hindi-English bilinguals' perception of the /v/–/w/ contrast in American English, native English speakers' perception of the tense/lax contrast for Korean fricatives and affricates, and native English speakers' perception of Mandarin lexical tone. Conclusion: We demonstrate that this technique provides insight on what information in the speech signal is important for native/first-language listeners compared to nonnative/L2 listeners. Furthermore, the method can be used to examine whether L2 speech perception training is effective in bringing the listener's attention to the important cues.
  3. Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced by an ensemble of the online model on-the-fly, which ensures that our model is robust to pseudo label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and 2.48% MDD F1 score improvement over a labeled-samples-only finetuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared to the state-of-the-art MDD systems, our MDD solution produces a more accurate and consistent phonetic error diagnosis. In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system recognition outputs show a strong correlation with human perception, based on accentedness and intelligibility.
  4. Recent studies have introduced methods for learning acoustic word embeddings (AWEs)—fixed-size vector representations of words which encode their acoustic features. Despite the widespread use of AWEs in speech processing research, they have only been evaluated quantitatively in their ability to discriminate between whole word tokens. To better understand the applications of AWEs in various downstream tasks and in cognitive modeling, we need to analyze the representation spaces of AWEs. Here we analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages. We first show that these AWEs preserve some information about words’ absolute duration and speaker. At the same time, the representation space of these AWEs is organized such that the distance between words’ embeddings increases with those words’ phonetic dissimilarity. Finally, the AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access. We argue this is a promising result and encourage further evaluation of AWEs as a potentially useful tool in cognitive science, which could provide a link between speech processing and lexical memory.
  5. Successful listening in a second language (L2) involves learning to identify the relevant acoustic–phonetic dimensions that differentiate between words in the L2, and then use these cues to access lexical representations during real-time comprehension. This is a particularly challenging goal to achieve when the relevant acoustic–phonetic dimensions in the L2 differ from those in the L1, as is the case for the L2 acquisition of Mandarin, a tonal language, by speakers of non-tonal languages like English. Previous work shows tone in L2 is perceived less categorically (Shen and Froud, 2019) and weighted less in word recognition (Pelzl et al., 2019) than in L1. However, little is known about the link between categorical perception of tone and use of tone in real-time L2 word recognition at the level of the individual learner. This study presents evidence from 30 native and 29 L1-English speakers of Mandarin who completed a real-time spoken word recognition task and a tone identification task. Results show that L2 learners differed from native speakers in both the extent to which they perceived tone categorically as well as in their ability to use tonal cues to distinguish between words in real-time comprehension. Critically, learners who reliably distinguished between words differing by tone alone in the word recognition task also showed more categorical perception of tone on the identification task. Moreover, within this group, performance on the two tasks was strongly correlated. This provides the first direct evidence showing that the ability to perceive tone categorically is related to the weighting of tonal cues during spoken word recognition, thus contributing to a better understanding of the link between phonemic and lexical processing, which has been argued to be a key component in the L2 acquisition of tone (Wong and Perrachione, 2007).