skip to main content


This content will become publicly available on April 14, 2025

Title: Automatic Speech Recognition Tuned for Child Speech in the Classroom
Award ID(s):
2046505 2326170
PAR ID:
10520979
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-4485-1
Page Range / eLocation ID:
12291 to 12295
Format(s):
Medium: X
Location:
Seoul, Korea, Republic of
Sponsoring Org:
National Science Foundation
More Like this
  1. When listening to speech, our brain responses time lock to acoustic events in the stimulus. Recent studies have also reported that cortical responses track linguistic representations of speech. However, tracking of these representations is often described without controlling for acoustic properties. Therefore, the response to these linguistic representations might reflect unaccounted acoustic processing rather than language processing. Here, we evaluated the potential of several recently proposed linguistic representations as neural markers of speech comprehension. To do so, we investigated EEG responses to audiobook speech of 29 participants (22 females). We examined whether these representations contribute unique information over and beyond acoustic neural tracking and each other. Indeed, not all of these linguistic representations were significantly tracked after controlling for acoustic properties. However, phoneme surprisal, cohort entropy, word surprisal, and word frequency were all significantly tracked over and beyond acoustic properties. We also tested the generality of the associated responses by training on one story and testing on another. In general, the linguistic representations are tracked similarly across different stories spoken by different readers. These results suggests that these representations characterize the processing of the linguistic content of speech. SIGNIFICANCE STATEMENT For clinical applications, it would be desirable to develop a neural marker of speech comprehension derived from neural responses to continuous speech. Such a measure would allow for behavior-free evaluation of speech understanding; this would open doors toward better quantification of speech understanding in populations from whom obtaining behavioral measures may be difficult, such as young children or people with cognitive impairments, to allow better targeted interventions and better fitting of hearing devices. 
    more » « less
  2. An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system based on an alignment module that outputs pseudo-text and another synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS. 
    more » « less
  3. null (Ed.)