skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on November 1, 2025

Title: Direct articulatory observation reveals phoneme recognition performance characteristics of a self-supervised speech model
Variability in speech pronunciation is widely observed across different linguistic backgrounds, which impacts modern automatic speech recognition performance. Here, we evaluate the performance of a self-supervised speech model in phoneme recognition using direct articulatory evidence. Findings indicate significant differences in phoneme recognition, especially in front vowels, between American English and Indian English speakers. To gain a deeper understanding of these differences, we conduct real-time MRI-based articulatory analysis, revealing distinct velar region patterns during the production of specific front vowels. This underscores the need to deepen the scientific understanding of self-supervised speech model variances to advance robust and inclusive speech technology.  more » « less
Award ID(s):
2311676
PAR ID:
10574485
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Publisher / Repository:
AIP
Date Published:
Journal Name:
JASA Express Letters
Volume:
4
Issue:
11
ISSN:
2691-1191
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Muresan, Smaranda; Nakov, Preslav; Villavicencio, Aline (Ed.)
    Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definition of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of phoneme inventory with raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms. 
    more » « less
  2. N/A (Ed.)
    This study is focused on understanding and quantifying the change in phoneme and prosody information encoded in the Self-Supervised Learning (SSL) model, brought by an accent identification (AID) fine-tuning task. This problem is addressed based on model probing. Specifically, we conduct a systematic layer-wise analysis of the representations of the Transformer layers on a phoneme correlation task, and a novel word-level prosody prediction task. We compare the probing performance of the pre-trained and fine-tuned SSL models. Results show that the AID fine-tuning task steers the top 2 layers to learn richer phoneme and prosody representation. These changes share some similarities with the effects of fine-tuning with an Automatic Speech Recognition task. In addition, we observe strong accent-specific phoneme representations in layer 9. To sum up, this study provides insights into the understanding of SSL features and their interactions with fine-tuning tasks. 
    more » « less
  3. In this paper, we study speech development in children using longitudinal acoustic and articulatory data. Data were collected yearly from grade 1 to grade 4 from four female and four male children. We analyze acoustic and articulatory properties of four corner vowels: /æ/, /i/, /u/, and /A/, each occurring in two different words (different surrounding contexts). Acoustic features include formant frequencies and subglottal resonances (SGRs). Articulatory features include tongue curvature degree (TCD) and tongue curvature position (TCP). Based on the analyses, we observe the emergence of sex-based differences starting from grade 2. Similar to adults, the SGRs divide the vowel space into high, low, front, and back regions at least as early as grade 2. On average, TCD is correlated with vowel height and TCP with vowel frontness. Children in our study used varied articulatory configurations to achieve similar acoustic targets. 
    more » « less
  4. Stand-alone devices for tactile speech reception serve a need as communication aids for persons with profound sensory impairments as well as in applications such as human-computer interfaces and remote communication when the normal auditory and visual channels are compromised or overloaded. The current research is concerned with perceptual evaluations of a phoneme-based tactile speech communication device in which a unique tactile code was assigned to each of the 24 consonants and 15 vowels of English. The tactile phonemic display was conveyed through an array of 24 tactors that stimulated the dorsal and ventral surfaces of the forearm. Experiments examined the recognition of individual words as a function of the inter-phoneme interval (Study 1) and two-word phrases as a function of the inter-word interval (Study 2). Following an average training period of 4.3 hrs on phoneme and word recognition tasks, mean scores for the recognition of individual words in Study 1 ranged from 87.7% correct to 74.3% correct as the inter-phoneme interval decreased from 300 to 0 ms. In Study 2, following an average of 2.5 hours of training on the two-word phrase task, both words in the phrase were identified with an accuracy of 75% correct using an inter-word interval of 1 sec and an inter-phoneme interval of 150 ms. Effective transmission rates achieved on this task were estimated to be on the order of 30 to 35 words/min. 
    more » « less
  5. Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced by an ensemble of the online model on-the-fly, which ensures that our model is robust to pseudo label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and 2.48% MDD F1 score improvement over a labeled-samples-only finetuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared to the state-of-the-art MDD systems, our MDD solution produces a more accurate and consistent phonetic error diagnosis. In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system recognition outputs show a strong correlation with human perception, based on accentedness and intelligibility. 
    more » « less