This study examines the perceptual trade-off between knowledge of a language’s statistical regularities and reliance on the acoustic signal during L2 spoken word recognition. We test how early learners track and make use of segmental and suprasegmental cues and their relative frequencies during non-native word recognition. English learners of Mandarin were taught an artificial tonal language in which a tone’s informativeness for word identification varied according to neighborhood density. The stimuli mimicked Mandarin’s uneven distribution of syllable+tone combinations by varying syllable frequency and the probability of particular tones co-occurring with a particular syllable. Use of statistical regularities was measured by four-alternative forced-choice judgments and by eye fixations to target and competitor symbols. Half of the participants were trained on one speaker (that is, low speaker variability), while the other half were trained on four speakers. After four days of learning, the results confirmed that tones are processed according to their informativeness. Eye movements to the newly learned symbols demonstrated that L2 learners use tonal probabilities at an early stage of word recognition, regardless of speaker variability. The amount of variability in the signal, however, influenced the time course of recovery from incorrect anticipatory looks: participants exposed to low speaker variability recovered from incorrect probability-based predictions of tone more rapidly than participants exposed to greater variability. These results motivate two conclusions: early L2 learners track the distribution of segmental and suprasegmental co-occurrences and make predictions accordingly during spoken word recognition; and when the acoustic input is more variable because of multi-speaker input, listeners rely more on their knowledge of tone-syllable co-occurrence frequency distributions and less on the incoming acoustic signal.
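As a rough illustration of the co-occurrence statistics these learners are assumed to track, the sketch below computes how strongly a syllable predicts a particular tone from token frequencies of syllable+tone combinations; the syllables, counts, and function name are invented for this example and are not the study’s actual stimuli.

```python
from collections import Counter

# Hypothetical token frequencies of syllable+tone combinations in an artificial
# lexicon: some syllables combine with many tones (tone weakly predictable),
# others with few (tone highly predictable).
cooccurrence = Counter({
    ("ma", 1): 40, ("ma", 2): 30, ("ma", 3): 20, ("ma", 4): 10,
    ("pi", 1): 90, ("pi", 4): 10,
})

def tone_probability(syllable: str, tone: int) -> float:
    """P(tone | syllable): how informative the syllable is about the tone."""
    total = sum(freq for (syl, _), freq in cooccurrence.items() if syl == syllable)
    return cooccurrence[(syllable, tone)] / total if total else 0.0

# A learner tracking these statistics could anticipate tone 1 on "pi" (P = 0.9)
# far more confidently than tone 1 on "ma" (P = 0.4).
for syl, tone in [("pi", 1), ("ma", 1)]:
    print(f"P(tone {tone} | {syl}) = {tone_probability(syl, tone):.2f}")
```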
- Award ID(s): 1655126
- NSF-PAR ID: 10319177
- Date Published:
- Journal Name: Journal of the International Neuropsychological Society
- Volume: 27
- Issue: 1
- ISSN: 1355-6177
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
This study uses a response mouse-tracking paradigm to examine the role of sub-phonemic information in online lexical ambiguity resolution of continuous speech. We examine listeners’ sensitivity to the sub-phonemic information that is specific to the ambiguous internal open juncture /s/-stop sequences in American English (e.g., “place kin” vs. “play skin”), that is, voice onset time (VOT) indicating different degrees of aspiration (e.g., long VOT for “kin” vs. short VOT for “skin”) in connected speech contexts. A cross-splicing method was used to create two-word sequences (e.g., “place kin” or “play skin”) with matching VOTs (long for “kin”; short for “skin”) or mismatching VOTs (short for “kin”; long for “skin”). Participants (n = 20) heard the two-word sequences, while looking at computer displays with the second word in the left/right corner (“KIN” and “SKIN”). Then, listeners’ click responses and mouse movement trajectories were recorded. Click responses show significant effects of VOT manipulation, while mouse trajectories do not. Our results show that stop-release information, whether temporal or spectral, can (mis)guide listeners’ interpretation of the possible location of a word boundary between /s/ and a following stop, even when other aspects in the acoustic signal (e.g., duration of /s/) point to the alternative segmentation. Taken together, our results suggest that segmentation and lexical access are highly attuned to bottom-up phonetic information; our results have implications for a model of spoken language recognition with position-specific representations available at the prelexical level and also allude to the possibility that detailed phonetic information may be stored in the listeners’ lexicons.
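As a rough sketch of how cross-spliced stimuli with matching or mismatching VOTs might be assembled, the snippet below swaps the stop-release (VOT) region between two recordings at hand-annotated sample boundaries; the stand-in waveforms, sample indices, and function name are assumptions for illustration, not the materials used in the study.

```python
import numpy as np

# Stand-in waveforms; real stimuli would be loaded from the recorded two-word
# sequences. The sample indices marking each VOT region are invented here but
# would come from hand annotation of the recordings.
rng = np.random.default_rng(0)
place_kin = rng.standard_normal(44100)   # carrier with a long-VOT "kin" release
play_skin = rng.standard_normal(44100)   # donor with a short-VOT "skin" release
kin_release = (30870, 33075)             # (start, end) samples of the "kin" VOT region
skin_release = (31500, 32200)            # (start, end) samples of the "skin" VOT region

def cross_splice(carrier, carrier_release, donor, donor_release):
    """Replace the carrier's VOT region with the donor's, leaving the rest intact."""
    c0, c1 = carrier_release
    d0, d1 = donor_release
    return np.concatenate([carrier[:c0], donor[d0:d1], carrier[c1:]])

# Mismatching-VOT stimulus: the "place kin" frame carrying the short "skin" release.
place_kin_short_vot = cross_splice(place_kin, kin_release, play_skin, skin_release)
```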
-
Although the application of deep learning to automatic speech recognition (ASR) has resulted in dramatic reductions in word error rate (WER) for languages with abundant training data, ASR for languages with few resources has yet to benefit from deep learning to the same extent. In this paper, we investigate various methods of acoustic modeling and data augmentation with the goal of improving the accuracy of a deep learning ASR framework for a low-resource language with a high baseline word error rate. We compare several methods of generating synthetic acoustic training data via voice transformation and signal distortion, and we explore several strategies for integrating this data into the acoustic training pipeline. We evaluate our methods on an indigenous language of North America with minimal training resources. We show that training initially via transfer learning from an existing high-resource language acoustic model, refining weights using a heavily concentrated synthetic dataset, and finally fine-tuning to the target language using limited synthetic data reduces WER by 15% over just transfer learning using deep recurrent methods. Further, we show improvements over traditional frameworks by 19% using a similar multistage training with deep convolutional approaches.
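As a minimal sketch of the signal-distortion side of such augmentation, the snippet below generates speed-perturbed and noise-added copies of an utterance with numpy and scipy; the perturbation rates, SNR value, and function names are illustrative assumptions and do not reproduce the paper’s voice-transformation pipeline or its multistage training schedule.

```python
import numpy as np
from scipy.signal import resample

def speed_perturb(wave: np.ndarray, rate: float) -> np.ndarray:
    """Resample to simulate faster or slower speech (rate > 1 speeds up)."""
    return resample(wave, int(len(wave) / rate))

def add_noise(wave: np.ndarray, snr_db: float, rng=np.random.default_rng(0)) -> np.ndarray:
    """Add white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wave + rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)

# Each original utterance yields several distorted copies for acoustic training.
utterance = np.random.default_rng(1).standard_normal(16000)  # stand-in for real audio
augmented = [speed_perturb(utterance, r) for r in (0.9, 1.1)] + [add_noise(utterance, 15)]
```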
-
Learning to process speech in a foreign language involves learning new representations for mapping the auditory signal to linguistic structure. Behavioral experiments suggest that even listeners that are highly proficient in a non-native language experience interference from representations of their native language. However, much of the evidence for such interference comes from tasks that may inadvertently increase the salience of native language competitors. Here we tested for neural evidence of proficiency and native language interference in a naturalistic story listening task. We studied electroencephalography responses of 39 native speakers of Dutch (14 male) to an English short story, spoken by a native speaker of either American English or Dutch. We modeled brain responses with multivariate temporal response functions, using acoustic and language models. We found evidence for activation of Dutch language statistics when listening to English, but only when it was spoken with a Dutch accent. This suggests that a naturalistic, monolingual setting decreases the interference from native language representations, whereas an accent in the listener's own native language may increase native language interference, by increasing the salience of the native language and activating native language phonetic and lexical representations. Brain responses suggest that such interference stems from words from the native language competing with the foreign language in a single word recognition system, rather than being activated in a parallel lexicon. We further found that secondary acoustic representations of speech (after 200 ms latency) decreased with increasing proficiency. This may reflect improved acoustic–phonetic models in more proficient listeners.
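As a rough illustration of the temporal response function approach mentioned above, the sketch below fits a single-channel TRF by ridge regression over time-lagged copies of a simulated acoustic predictor; the lag range, regularization strength, and simulated data are assumptions for illustration, not the study’s actual analysis pipeline.

```python
import numpy as np

def lagged_design(stimulus: np.ndarray, n_lags: int) -> np.ndarray:
    """Stack delayed copies of a 1-D stimulus feature (e.g., the acoustic envelope)."""
    X = np.zeros((len(stimulus), n_lags))
    for lag in range(n_lags):
        X[lag:, lag] = stimulus[: len(stimulus) - lag]
    return X

def fit_trf(stimulus: np.ndarray, eeg: np.ndarray, n_lags: int = 40, ridge: float = 1.0):
    """Return TRF weights: one coefficient per lag (0 .. n_lags - 1 samples)."""
    X = lagged_design(stimulus, n_lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(n_lags), X.T @ eeg)

# Simulated data: the "EEG" is the envelope filtered by a short causal kernel plus noise.
rng = np.random.default_rng(0)
envelope = rng.standard_normal(5000)
eeg = np.convolve(envelope, [0, 0, 1.0, 0.5, 0.25], mode="same") + rng.standard_normal(5000)
weights = fit_trf(envelope, eeg)  # approximately recovers the kernel at lags 0-2
```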
Significance Statement: Behavioral experiments suggest that native language knowledge interferes with foreign language listening, but such effects may be sensitive to task manipulations, as tasks that increase metalinguistic awareness may also increase native language interference. This highlights the need for studying non-native speech processing using naturalistic tasks. We measured neural responses unobtrusively while participants listened for comprehension and characterized the influence of proficiency at multiple levels of representation. We found that salience of the native language, as manipulated through speaker accent, affected activation of native language representations: significant evidence for activation of native language (Dutch) categories was only obtained when the speaker had a Dutch accent, whereas no significant interference was found for a speaker with a native (American) accent.
-
Abstract: Multilingual speakers can find speech recognition in everyday environments like restaurants and open-plan offices particularly challenging. In a world where speaking multiple languages is increasingly common, effective clinical and educational interventions will require a better understanding of how factors like multilingual contexts and listeners’ language proficiency interact with adverse listening environments. For example, word and phrase recognition is facilitated when competing voices speak different languages. Is this due to a “release from masking” from lower-level acoustic differences between languages and talkers, or higher-level cognitive and linguistic factors? To address this question, we created a “one-man bilingual cocktail party” selective attention task using English and Mandarin speech from one bilingual talker to reduce low-level acoustic cues. In Experiment 1, 58 listeners more accurately recognized English targets when distracting speech was Mandarin compared to English. Bilingual Mandarin–English listeners experienced significantly more interference and intrusions from the Mandarin distractor than did English listeners, exacerbated by challenging target-to-masker ratios. In Experiment 2, 29 Mandarin–English bilingual listeners exhibited linguistic release from masking in both languages. Bilinguals experienced greater release from masking when attending to English, confirming an influence of linguistic knowledge on the “cocktail party” paradigm that is separate from primarily energetic masking effects. Effects of higher-order language processing and expertise emerge only in the most demanding target-to-masker contexts. The “one-man bilingual cocktail party” establishes a useful tool for future investigations and characterization of communication challenges in the large and growing worldwide community of Mandarin–English bilinguals.
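As a small illustration of the target-to-masker ratio manipulation central to both experiments, the sketch below mixes a target and a distractor signal at a requested TMR in dB; the stand-in signals and the function name are placeholders rather than the actual stimuli.

```python
import numpy as np

def mix_at_tmr(target: np.ndarray, masker: np.ndarray, tmr_db: float) -> np.ndarray:
    """Scale the masker so RMS(target) / RMS(scaled masker) matches the requested TMR."""
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    gain = rms(target) / (rms(masker) * 10 ** (tmr_db / 20))
    return target + gain * masker

rng = np.random.default_rng(0)
english_target = rng.standard_normal(16000)   # stand-in for the attended English sentence
mandarin_masker = rng.standard_normal(16000)  # stand-in for the Mandarin distractor
mixture = mix_at_tmr(english_target, mandarin_masker, tmr_db=-4.0)  # negative TMR = harder
```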