

Title: Fair or Fare? Understanding Automated Transcription Error Bias in Social Media and Videoconferencing Platforms
As remote work and learning increase in popularity, individuals, especially those with hearing impairments or who speak English as a second language, may depend on automated transcriptions to participate in business, school, entertainment, or basic communication. In this work, we investigate the automated transcription accuracy of seven popular social media and videoconferencing platforms with respect to personal characteristics of their users, including gender, age, race, first language, speech rate, fundamental frequency (F0), and speech readability. We performed this investigation on a new corpus of 194 hours of English monologues by 846 TED talk speakers. Our results show the presence of significant bias, with transcripts less accurate for speakers who are male or non-native English speakers. We also observe differences in accuracy among platforms for different types of speakers. These results indicate that, while platforms have improved their automatic captioning, much work remains to make captions accessible to a wider variety of speakers and listeners.
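Transcription accuracy studies of this kind are typically scored with word error rate (WER), the edit distance between the reference transcript and the automatic one, normalized by the reference length. The paper's exact scoring pipeline is not described here, so the following is only an illustrative sketch of the standard metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

A bias analysis would compute this per speaker and per platform, then compare WER distributions across demographic groups.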
Award ID(s):
1955227
PAR ID:
10531479
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Proceedings of the International AAAI Conference on Web and Social Media
Date Published:
Journal Name:
Proceedings of the International AAAI Conference on Web and Social Media
Volume:
18
ISSN:
2162-3449
Page Range / eLocation ID:
367 to 380
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. While a range of measures based on speech production, language, and perception are possible (Manun et al., 2020) for the prediction and estimation of speech intelligibility, what constitutes second language (L2) intelligibility remains under-defined. Prosodic and temporal features (i.e., stress, speech rate, rhythm, and pause placement) have been shown to impact listener perception (Kang et al., 2020). Still, their relationship with highly intelligible speech remains unclear. This study aimed to characterize L2 speech intelligibility. Acoustic analyses, including PRAAT and Python scripts, were conducted on 405 speech samples (30 s) from 102 L2 English speakers with a wide variety of backgrounds, proficiency levels, and intelligibility levels. The results indicate that highly intelligible speakers of English employ between 2 and 4 syllables per second and that higher or lower speeds are less intelligible. Silent pauses between 0.3 and 0.8 s were associated with the highest levels of intelligibility. Rhythm, measured by Δ syllable length of all content syllables, was marginally associated with intelligibility. Finally, lexical stress accuracy did not interfere substantially with intelligibility until less than 70% of the polysyllabic words were incorrect. These findings inform the fields of first and second language research as well as language education and pathology.
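The rate and pause thresholds reported above can be checked programmatically once syllable counts and pause durations have been extracted (e.g., with Praat). A minimal sketch, assuming those measurements are already available (the function name and 2–4 syll/s and 0.3–0.8 s bands come from the abstract's findings, not from the authors' released code):

```python
def articulation_metrics(n_syllables, speaking_time_s, pauses_s):
    """Summarize speech rate and pause behavior against the bands
    associated with high intelligibility in the study above."""
    rate = n_syllables / speaking_time_s  # syllables per second
    in_band = [p for p in pauses_s if 0.3 <= p <= 0.8]
    return {
        "syllables_per_sec": rate,
        "rate_in_optimal_band": 2.0 <= rate <= 4.0,
        "share_pauses_in_optimal_band": (
            len(in_band) / len(pauses_s) if pauses_s else None
        ),
    }
```

For example, a 30 s sample containing 90 syllables articulates at 3 syllables per second, inside the reported optimal band.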
  2. Purpose: We examined which measures of complexity are most informative when studying language produced in interaction. Specifically, using these measures, we explored whether native and nonnative speakers modified the higher level properties of their production beyond the acoustic–phonetic level based on the language background of their conversation partner. Method: Using a subset of production data from the Wildcat Corpus that used Diapix, an interactive picture matching task, to elicit production, we compared English language production at the dyad and individual level across three different pair types: eight native pairs (English–English), eight mixed pairs (four English–Chinese and four English–Korean), and eight nonnative pairs (four Chinese–Chinese and four Korean–Korean). Results: At both the dyad and individual levels, native speakers produced longer and more clausally dense speech. They also produced fewer silent pauses and fewer linguistic mazes relative to nonnative speakers. Speakers did not modify their production based on the language background of their interlocutor. Conclusions: The current study examines higher level properties of language production in true interaction. Our results suggest that speakers' productions were determined by their own language background and were independent of that of their interlocutor. Furthermore, these measures demonstrated promise for capturing syntactic characteristics of language produced in true dialogue. Supplemental Material: https://doi.org/10.23641/asha.24712956
  3. Speech recognition by both humans and machines frequently fails in non-optimal yet common situations. For example, word recognition error rates for second-language (L2) speech can be high, especially under conditions involving background noise. At the same time, both human and machine speech recognition sometimes shows remarkable robustness against signal- and noise-related degradation. Which acoustic features of speech explain this substantial variation in intelligibility? Current approaches align speech to text to extract a small set of pre-defined spectro-temporal properties from specific sounds in particular words. However, variation in these properties leaves much cross-talker variation in intelligibility unexplained. We examine an alternative approach utilizing a perceptual similarity space acquired using self-supervised learning. This approach encodes distinctions between speech samples without requiring pre-defined acoustic features or speech-to-text alignment. We show that L2 English speech samples are less tightly clustered in the space than L1 samples, reflecting variability in English proficiency among L2 talkers. Critically, distances in this similarity space are perceptually meaningful: L1 English listeners have lower recognition accuracy for L2 speakers whose speech is more distant in the space from L1 speech. These results indicate that perceptual similarity may form the basis for an entirely new speech and language analysis approach.
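The two quantities this abstract relies on, how tightly a group of samples clusters in the embedding space and how far an individual talker sits from the L1 region, reduce to simple geometry once embeddings exist. A minimal sketch, assuming embedding vectors are already computed by some self-supervised model (the function names and distance choice are illustrative, not the authors' implementation):

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dispersion(vectors):
    """Cluster tightness: mean distance of each sample to the group centroid.
    Larger values mean the group is less tightly clustered."""
    c = centroid(vectors)
    return sum(euclidean(v, c) for v in vectors) / len(vectors)

def distance_to_l1(l1_vectors, l2_vector):
    """How far one L2 talker's embedding lies from the center of L1 speech."""
    return euclidean(centroid(l1_vectors), l2_vector)
```

Under the abstract's finding, `dispersion` would be larger for the L2 group than the L1 group, and `distance_to_l1` would correlate negatively with listeners' recognition accuracy.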
  4. Aims and objectives: This paper analyzes the extent to which new speakers are participating in an ongoing phonological change in Diné Bizaad (Navajo). The implications of these patterns are discussed as they relate to theories of new speakers and language change. Methodology design: I apply a variationist methodology to analyze the pronunciation of lateral affricates from speakers representing different generations and language learning contexts. I focus on comparing new speakers, who report acquiring the language primarily through school or in a language program, with their age-equivalent peers. Data and analysis: The data come from interviews recorded with 51 bilingual Diné Bizaad-English participants, ages 18–75. This includes four new speakers. The analysis focuses on variation in the lateral affricates in connected speech samples and an oral translation task. Findings/conclusion: Results reveal that new speakers diverge from other younger participants in their lack of participation in an ongoing change in the affricates. Instead, new speakers more closely resemble middle-aged and older speakers. Originality: This study applies the new speaker framework to an Indigenous North American language, an under-represented sociolinguistic context within the literature. These findings provide a counterexample to the more frequent finding of new speakers linguistically diverging from older, traditional speakers. Significance/implications: These results are interpreted as arising due to literacy practices, language usage networks, and community values. The orthographic representation of the affricates is thought to inhibit sound change. At the same time, due to their more formal language learning background, new speakers have developed a self-monitored speech style oriented toward the prestigious, older speakers. A lack of peer group language usage is thought to prevent the development of linguistically or ideologically distinct new speaker varieties. The confluence of these factors means that instead of constituting agents of language change, new speakers are more similar to older participants.
  5. Does knowledge of language transfer spontaneously across language modalities? For example, do English speakers who have no command of a sign language spontaneously project grammatical constraints from English to linguistic signs? Here, we address this question by examining the constraints on doubling. We first demonstrate that doubling (e.g., panana, generally, ABB) is amenable to two conflicting parses (identity vs. reduplication), depending on the level of analysis (phonology vs. morphology). We next show that speakers with no command of a sign language spontaneously project these two parses to novel ABB signs in American Sign Language. Moreover, the chosen parse (for signs) is constrained by the morphology of spoken language. Hebrew speakers can project the morphological parse when doubling indicates diminution, but English speakers only do so when doubling indicates plurality, in line with the distinct morphological properties of their spoken languages. These observations suggest that doubling in speech and signs is constrained by a common set of linguistic principles that are algebraic, amodal and abstract.