

Title: Evaluation of Off-the-shelf Speech Recognizers on Different Accents in a Dialogue Domain
We evaluate several publicly available off-the-shelf (commercial and research) automatic speech recognition (ASR) systems on dialogue agent-directed English speech from speakers with General American vs. non-American accents. Our results show that the performance of the ASR systems for non-American accents is considerably worse than for General American accents. Depending on the recognizer, the absolute difference in performance between General American accents and all non-American accents combined can vary approximately from 2% to 12%, with relative differences varying approximately between 16% and 49%. This drop in performance becomes even larger when we consider specific categories of non-American accents, indicating a need for more diligent collection of, and training on, non-native English speaker data in order to narrow this performance gap. There are performance differences across ASR systems, and while the same general pattern holds, with more errors for non-American accents, there are some accents for which the best recognizer differs from the one in the overall case. We expect these results to be useful for dialogue system designers in developing more robust, inclusive dialogue systems, and for ASR providers in taking into account performance requirements for different accents.
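The gap above is reported both as an absolute and as a relative difference in recognition performance. A minimal sketch of how such figures can be computed, assuming word error rate (WER) as the performance measure, relative differences taken with respect to the General American error rate, and toy transcripts in place of the paper's data:

    # Minimal sketch (toy data, not the paper's): per-group WER and the
    # absolute/relative gap between General American and non-American speech.
    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / reference word count."""
        ref, hyp = reference.split(), hypothesis.split()
        # Standard word-level edit distance (Levenshtein) via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + cost)   # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    def group_wer(pairs):
        """Average WER over (reference, hypothesis) pairs for one accent group."""
        return sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)

    # Hypothetical recognizer output for the two groups (invented examples).
    general_american = [("book a table for two", "book a table for too")]
    non_american     = [("book a table for two", "look a cable for you")]

    wer_us, wer_non = group_wer(general_american), group_wer(non_american)
    absolute_gap = wer_non - wer_us                                   # cf. the 2%-12% range above
    relative_gap = absolute_gap / wer_us if wer_us else float("inf")  # cf. the 16%-49% range above
    print(f"absolute gap {absolute_gap:.1%}, relative gap {relative_gap:.1%}")

Under these conventions, a 2-point absolute gap over a 12.5% General American WER would, for instance, correspond to a 16% relative gap.
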
Award ID(s): 1852583
PAR ID: 10406314
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The way listeners perceive speech sounds is largely determined by the language(s) they were exposed to as a child. For example, native speakers of Japanese have a hard time discriminating between American English /ɹ/ and /l/, a phonetic contrast that has no equivalent in Japanese. Such effects are typically attributed to knowledge of sounds in the native language, but quantitative models of how these effects arise from linguistic knowledge are lacking. One possible source for such models is Automatic Speech Recognition (ASR) technology. We implement models based on two types of systems from the ASR literature—hidden Markov models (HMMs) and the more recent, and more accurate, neural network systems—and ask whether, in addition to showing better performance, the neural network systems also provide better models of human perception. We find that while both types of systems can account for Japanese natives’ difficulty with American English /ɹ/ and /l/, only the neural network system successfully accounts for Japanese natives’ facility with Japanese vowel length contrasts. Our work provides a new example, in the domain of speech perception, of an often observed correlation between task performance and similarity to human behavior. 
  2. Online testing for behavioral research has become an increasingly used tool. Although more researchers have been using online data collection methods, few studies have assessed the replicability of findings for speech intelligibility tasks. Here we assess intelligibility in quiet and two noise-added conditions for several different accents of English (Midland American, Standard Southern British, Scottish, German-accented, Mandarin-accented, Japanese-accented, and Hindi-English bilingual). Participants were tested in person at a museum-based laboratory and online. Results showed little to no difference between the two settings for the easier noise condition and in quiet, but large performance differences in the most difficult noise condition, with an advantage for the participants tested online. Technology-based variables did not appear to drive the setting effect, but experimenter presence may have influenced response strategy for the in-person group, and differences in demographics could have provided advantages for the online group. Additional research should continue to investigate how setting, demographic factors, experimenter presence, and motivational factors interact to determine performance in speech perception experiments.
  3. This paper evaluates the performance of widely-used open-source automatic speech recognition systems in transcribing primarily African American English-speaking children’s speech for educational applications. We investigate the performance of the Whisper, HuBERT, and Wav2Vec2 ASR systems as well as the capability of the transformer-based language model, BERT, for automatically grading the student’s oral responses to assessment prompts through use of the generated ASR transcripts. We achieve a 95% oral response scoring accuracy through the methods described. We also show a thorough analysis of ASR system performance over a diverse set of metrics going beyond the standard word error rate. 
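    The item above reports ASR performance over a diverse set of metrics beyond the standard word error rate. A minimal sketch of such a comparison, assuming the jiwer package and invented reference/hypothesis transcripts rather than that paper's systems or data (the BERT-based response scoring it describes is not shown here):

        # Minimal sketch (invented transcripts): several transcript-level error
        # metrics computed with the jiwer package.
        import jiwer

        reference  = ["the quick brown fox jumps over the lazy dog"]
        hypothesis = ["the quick brown fox jump over a lazy dog"]

        metrics = {
            "WER": jiwer.wer(reference, hypothesis),   # word error rate
            "MER": jiwer.mer(reference, hypothesis),   # match error rate
            "WIL": jiwer.wil(reference, hypothesis),   # word information lost
            "CER": jiwer.cer(reference, hypothesis),   # character error rate
        }
        for name, value in metrics.items():
            print(f"{name}: {value:.3f}")
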
  4. Human listeners use specific cues to recognize speech, and recent experiments have shown that certain time-frequency regions of individual utterances are more important to their correct identification than others. A model that could identify such cues or regions from clean speech would facilitate speech recognition and speech enhancement by focusing on those important regions. Thus, in this paper we present a model that can predict the regions of individual utterances that are important to an automatic speech recognition (ASR) “listener” by learning to add as much noise as possible to these utterances while still permitting the ASR to correctly identify them. This work utilizes a continuous speech recognizer to recognize multi-word utterances and builds upon our previous work that performed the same process for an isolated word recognizer. Our experimental results indicate that our model can apply noise to obscure 90.5% of the spectrogram while leaving recognition performance nearly unchanged.
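    Item 4 above describes learning to add as much noise as possible while the recognizer still identifies the utterance correctly. A conceptual sketch of that trade-off as a small optimization problem, in which a randomly initialized linear layer and a random feature vector stand in for the paper's continuous ASR system and real spectrograms (both are placeholders, and the penalty weight is arbitrary):

        # Conceptual sketch only: learn per-bin noise amounts that are as large as
        # possible while a stand-in "recognizer" keeps its clean-input decision.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F
        from torch.optim import Adam

        torch.manual_seed(0)
        n_bins, n_classes = 64, 10
        spectrogram = torch.randn(n_bins)            # placeholder utterance features
        recognizer = nn.Linear(n_bins, n_classes)    # placeholder "ASR listener"
        for p in recognizer.parameters():            # the recognizer itself stays fixed
            p.requires_grad_(False)
        target = recognizer(spectrogram).argmax()    # decision on the clean input

        mask_logits = torch.zeros(n_bins, requires_grad=True)   # per-bin noise gate
        optimizer = Adam([mask_logits], lr=0.05)

        for step in range(200):
            mask = torch.sigmoid(mask_logits)        # 0 = clean bin, 1 = fully noised
            noisy = spectrogram * (1 - mask) + torch.randn(n_bins) * mask
            logits = recognizer(noisy)
            # Keep the decision on the noisy input equal to the clean-input target
            # while pushing the mask (the amount of added noise) as high as possible.
            loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0)) - 0.1 * mask.mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        obscured = torch.sigmoid(mask_logits).mean().item()
        print(f"fraction of bins obscured: {obscured:.1%}")
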
  5. Phrase-level prosodic prominence in American English is understood, in the AM tradition, to be marked by pitch accents. While such prominences are characterized via tonal labels in ToBI (e.g. H*), their cues are not exclusively in the pitch domain: timing, loudness and voice quality are known to contribute to prominence perception. All of these cues occur with a wide degree of variability in naturally produced speech, and this variation may be informative. In this study, we advance towards a system of explicit labelling of individual cues to prosodic structure, here focusing on phrase-level prominence. We examine correlations between the presence of a set of 6 cues to prominence (relating to segment duration, loudness, and non-modal phonation, in addition to f0) and pitch accent labels in a corpus of ToBI-labelled American English speech. Results suggest that tokens with more cues are more likely to receive a pitch accent label. 
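    Item 5 above asks whether tokens carrying more prominence cues are more likely to receive a pitch accent label. A minimal sketch of that kind of tabulation over invented tokens (the cue counts and labels below are hypothetical, not the ToBI-labelled corpus used in the study):

        # Minimal sketch (invented data): share of pitch-accented tokens as a
        # function of how many of the six prominence cues are present.
        from collections import defaultdict

        # Each token: (number of cues present out of 6, carries a pitch accent label)
        tokens = [
            (0, False), (1, False), (1, True), (2, False), (2, True),
            (3, True), (3, False), (4, True), (5, True), (6, True),
        ]

        accented_by_cue_count = defaultdict(list)
        for n_cues, accented in tokens:
            accented_by_cue_count[n_cues].append(accented)

        for n_cues in sorted(accented_by_cue_count):
            group = accented_by_cue_count[n_cues]
            rate = sum(group) / len(group)
            print(f"{n_cues} cues: {rate:.0%} pitch-accented ({len(group)} tokens)")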