

This content will become publicly available on June 17, 2026

Title: Human and automated assessments of children's pronunciation
Patterns, types, and causes of errors in children's pronunciation can be more variable than in adults' speech. In school settings, different specialists work with children depending on their needs, including speech-language pathology (SLP) professionals and English as a second language (ESL) teachers. Because children's speech is so variable, it is often difficult to identify which specialist is better suited to address a child's needs. Computers excel at pattern recognition and can be quickly trained to identify a wide array of pronunciation issues, making them strong candidates to help with the difficult problem of identifying the appropriate specialist. As part of a larger project to create an automated pronunciation diagnostic tool to help identify which specialist a child may need, we created a pronunciation test for children between 5 and 7 years old. We recorded 26 children with a variety of language backgrounds and SLP needs and then compared automatic evaluations of their pronunciation to human evaluations. While the human evaluations showed high agreement with one another, the automatic mispronunciation detection (MPD) system agreed with the human raters on less than 50% of phonemes overall. However, the MPD system showed consistent, albeit low, agreement across four subgroups of participants with no clear biases. Given this performance, we recommend further research on children's pronunciation and on specialized MPD systems that account for their unique speech characteristics and developmental patterns.
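As a rough illustration of the agreement comparison described above, the sketch below computes phoneme-level agreement between human and automatic mispronunciation judgments, both as raw percent agreement and as chance-corrected Cohen's kappa. The label convention (1 = mispronounced, 0 = correct) and the toy data are invented for illustration; the paper's actual annotation scheme and data are not reproduced here.

```python
# Hypothetical sketch: phoneme-level agreement between human raters and an
# MPD system. Labels and data are illustrative, not the paper's actual data.
from sklearn.metrics import cohen_kappa_score

human = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # e.g., majority vote of human raters, per phoneme
mpd   = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]  # automatic MPD decisions, per phoneme

percent_agreement = sum(h == m for h, m in zip(human, mpd)) / len(human)
kappa = cohen_kappa_score(human, mpd)   # agreement corrected for chance

print(f"Raw agreement: {percent_agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```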
Award ID(s): 2016984
PAR ID: 10647025
Author(s) / Creator(s): ; ; ;
Publisher / Repository: Iowa State University Digital Press
Date Published:
Journal Name: Proceedings of the annual Pronunciation in Second Language Learning and Teaching Conference
ISSN: 2380-9566
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Autonomous educational social robots can be used to help promote literacy skills in young children. Such robots, which emulate the emotive, perceptual, and empathic abilities of human teachers, can replicate some of the benefits of one-on-one human tutoring, in part by leveraging individual students' behavior and task-performance data to infer sophisticated models of their knowledge. These student models are then used to provide personalized educational experiences by, for example, determining the optimal sequencing of curricular material. In this paper, we introduce an integrated system for autonomously analyzing and assessing children's speech and pronunciation in the context of an interactive word game between a social robot and a child. We present a novel game environment and its computational formulation, an integrated pipeline for capturing and analyzing children's speech in real time, and an autonomous robot that models children's word pronunciation via Gaussian Process Regression (GPR), augmented with an Active Learning protocol that informs the robot's behavior. We show that the system is capable of autonomously assessing children's pronunciation ability, with ground truth determined by a post-experiment evaluation by human raters. We also compare phoneme- and word-level GPR models and discuss the trade-offs of each approach in modeling children's pronunciation. Finally, we describe and analyze a pipeline for automatic analysis of children's speech and pronunciation, including an evaluation of Speech Ace as a tool for future development of autonomous, speech-based language tutors.
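For readers unfamiliar with the modeling approach this abstract names, here is a minimal sketch of Gaussian Process Regression paired with a variance-based active-learning query, using scikit-learn. The features, kernel, and acquisition rule are assumptions chosen for illustration and may differ from the authors' actual system.

```python
# Sketch: GPR over pronunciation scores plus an uncertainty-driven query step.
# All feature and data choices here are hypothetical.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_seen = rng.uniform(0, 1, size=(8, 1))  # e.g., a scalar word-difficulty feature
y_seen = rng.uniform(0, 1, size=8)       # observed pronunciation scores

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X_seen, y_seen)

# Active learning: probe the candidate word whose predicted score is most
# uncertain (highest posterior standard deviation).
X_candidates = np.linspace(0, 1, 50).reshape(-1, 1)
_, std = gpr.predict(X_candidates, return_std=True)
next_word = X_candidates[np.argmax(std)]
print(f"Next word to probe: feature value {next_word[0]:.2f}")
```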
  2. We compared everyday language input to young congenitally-blind children with no additional disabilities (N=15, 6–30 mo., M:16 mo.) and demographically-matched sighted peers (N=15, 6–31 mo., M:16 mo.). By studying whether the language input of blind children differs from that of their sighted peers, we aimed to determine whether, in principle, the language acquisition patterns observed in blind and sighted children could be explained by aspects of the speech they hear. Children wore LENA recorders to capture the auditory language environment in their homes. Speech in these recordings was then analyzed with a mix of automated and manually-transcribed measures across various subsets and dimensions of language input. These included measures of quantity (adult words), interaction (conversational turns and child-directed speech), linguistic properties (lexical diversity and mean length of utterance), and conceptual features (talk centered around the here-and-now; talk focused on visual referents that would be inaccessible to the blind but not sighted children). Overall, we found broad similarity across groups in the quantitative, interactive, and linguistic properties of speech. The only exception was that blind children's language environments contained slightly but significantly more talk about past/future/hypothetical events than sighted children's input; both groups received equivalent quantities of "visual" speech input. The findings challenge the notion that blind children's language input diverges substantially from sighted children's; while the input is highly variable across children, it is not systematically so across groups, across nearly all measures. The findings suggest instead that blind children and sighted children alike receive input that readily supports their language development, with open questions remaining regarding how this input may be differentially leveraged by language learners in early childhood.
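Two of the transcript measures this abstract names, mean length of utterance (MLU) and lexical diversity, are straightforward to compute from transcribed utterances; a toy sketch follows. Note the simplifying assumptions: MLU is computed in words here (it is classically computed in morphemes), lexical diversity is a plain type-token ratio, and the utterances are invented.

```python
# Illustrative sketch of two transcript measures: MLU (in words) and
# lexical diversity as a type-token ratio. Utterances are invented.
utterances = [
    "look at the ball",
    "the ball is red",
    "do you want it",
]

tokens = [w.lower() for u in utterances for w in u.split()]
mlu = len(tokens) / len(utterances)   # words per utterance
ttr = len(set(tokens)) / len(tokens)  # unique words / total words

print(f"MLU: {mlu:.2f} words/utterance")
print(f"Type-token ratio: {ttr:.2f}")
```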
  3. Automatic pronunciation assessment (APA) plays an important role in providing feedback for self-directed language learners in computer-assisted pronunciation training (CAPT). Several mispronunciation detection and diagnosis (MDD) systems have achieved promising performance based on end-to-end phoneme recognition. However, assessing the intelligibility of second language (L2) speech remains a challenging problem. One issue is the lack of large-scale labeled speech data from non-native speakers. Additionally, relying on only one aspect (e.g., accuracy) at a phonetic level may not provide a sufficient assessment of pronunciation quality and L2 intelligibility. It is possible to leverage segmental/phonetic-level features such as goodness of pronunciation (GOP); however, feature granularity may cause a discrepancy in prosodic-level (suprasegmental) pronunciation assessment. In this study, a Wav2vec 2.0-based MDD system and a GOP-feature-based Transformer are employed to characterize L2 intelligibility. Here, an L2 speech dataset with human-annotated prosodic (suprasegmental) labels is used for multi-granular and multi-aspect pronunciation assessment and identification of factors important for intelligibility in L2 English speech. The study provides a transformative comparative assessment of automated pronunciation scores versus the relationship between suprasegmental features and listener perceptions, which taken collectively can help support the development of instantaneous assessment tools and solutions for L2 learners.
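As background on the GOP features this abstract mentions, one common posterior-based formulation scores a phoneme by averaging the log posterior of the canonical phoneme over its aligned frames. The sketch below is a simplified illustration of that idea with fabricated posteriors and alignment, not the authors' implementation.

```python
# Hedged sketch of a posterior-based goodness of pronunciation (GOP) score:
# average log posterior of the canonical phoneme over its aligned frames.
# The posterior matrix and phoneme index below are fabricated.
import numpy as np

# Frame-by-phoneme posteriors from an acoustic model (4 frames, 3 phonemes).
posteriors = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.8, 0.1, 0.1],
])
canonical_phoneme = 0  # index of the phoneme the speaker was supposed to produce

gop = np.mean(np.log(posteriors[:, canonical_phoneme]))
print(f"GOP: {gop:.3f}")  # closer to 0 means higher model confidence
```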
  4. Speech is a fundamental aspect of human life, crucial not only for communication but also for cognitive, social, and academic development. Children with speech disorders (SD) face significant challenges that, if unaddressed, can result in lasting negative impacts. Traditionally, speech and language assessments (SLA) have been conducted by skilled speech-language pathologists (SLPs), but there is a growing need for efficient and scalable SLA methods powered by artificial intelligence. This position paper presents a survey of existing techniques suitable for automating SLA pipelines, with an emphasis on adapting automatic speech recognition (ASR) models for children’s speech, an overview of current SLAs and their automated counterparts to demonstrate the feasibility of AI-enhanced SLA pipelines, and a discussion of practical considerations, including accessibility and privacy concerns, associated with the deployment of AI-powered SLAs. 
  5. Psycholinguistic research on children's early language environments has revealed many potential challenges for language acquisition. One is that in many cases, referents of linguistic expressions are hard to identify without prior knowledge of the language. Likewise, the speech signal itself varies substantially in clarity, with some productions being very clear and others being phonetically reduced, even to the point of uninterpretability. In this study, we sought to better characterize the language-learning environment of American English-learning toddlers by testing how well phonetic clarity and referential clarity align in infant-directed speech. Using an existing Human Simulation Paradigm (HSP) corpus with referential transparency measurements and adding new measures of phonetic clarity, we found that the phonetic clarity of words' first mentions significantly predicted referential clarity (how easy it was to guess the intended referent from visual information alone) at that moment. Thus, when parents' speech was especially clear, the referential semantics were also clearer. This suggests that young children could use the phonetics of speech to identify globally valuable instances that support better referential hypotheses, by homing in on clearer instances and filtering out less-clear ones. Such multimodal "gems" offer special opportunities for early word learning.
Research Highlights:
- In parent-infant interaction, parents' referential intentions are sometimes clear and sometimes unclear; likewise, parents' pronunciation is sometimes clear and sometimes quite difficult to understand.
- We find that clearer referential instances go along with clearer phonetic instances, more so than expected by chance.
- Thus, there are globally valuable instances ("gems") from which children could learn about words' pronunciations and words' meanings at the same time.
- Homing in on clear phonetic instances and filtering out less-clear ones would help children identify these multimodal "gems" during word learning.
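A minimal sketch of the kind of analysis this abstract reports, regressing referential clarity (e.g., HSP guessing accuracy) on phonetic clarity at words' first mentions. The scalar clarity scores and the simple bivariate regression are assumptions for illustration; the authors' actual model and variables may differ.

```python
# Hypothetical sketch: does phonetic clarity predict referential clarity?
# All data values here are fabricated for illustration.
import numpy as np
from scipy import stats

phonetic_clarity = np.array([0.2, 0.5, 0.7, 0.3, 0.9, 0.6, 0.4, 0.8])
referential_clarity = np.array([0.1, 0.4, 0.6, 0.2, 0.8, 0.5, 0.5, 0.7])

result = stats.linregress(phonetic_clarity, referential_clarity)
print(f"slope={result.slope:.2f}, p={result.pvalue:.3f}")
```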