An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language, without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology for languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system consisting of an alignment module that outputs pseudo-text and a synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system achieves performance comparable to a supervised system in seven languages with about 10-20 hours of speech each. A careful study of the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS.
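To make the idea of "pseudo-text" concrete, the sketch below derives discrete unit sequences from untranscribed speech by clustering frame-level features with k-means, a common recipe for unsupervised speech units. This is only an illustrative assumption, not the alignment module used in the paper, and the features here are random placeholders.

```python
# Hedged sketch: obtain discrete "pseudo-text" units from untranscribed speech
# by clustering frame-level features. The paper's actual alignment module may
# differ; this only illustrates the data flow into a downstream TTS model.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for acoustic features (e.g., 100 frames x 39-dim MFCCs) from 20 utterances.
features = [rng.normal(size=(100, 39)) for _ in range(20)]

# Fit a small unit inventory over all frames.
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(np.vstack(features))

# Each utterance becomes a sequence of unit IDs -- the "pseudo-text" on which
# an ordinary supervised TTS model could then be trained.
pseudo_text = [kmeans.predict(f) for f in features]
print(pseudo_text[0][:10])
```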
Automatic Screening for Children with Speech Disorder Using Automatic Speech Recognition: Opportunities and Challenges

This content will become publicly available on November 8, 2025.
            Speech is a fundamental aspect of human life, crucial not only for communication but also for cognitive, social, and academic development. Children with speech disorders (SD) face significant challenges that, if unaddressed, can result in lasting negative impacts. Traditionally, speech and language assessments (SLA) have been conducted by skilled speech-language pathologists (SLPs), but there is a growing need for efficient and scalable SLA methods powered by artificial intelligence. This position paper presents a survey of existing techniques suitable for automating SLA pipelines, with an emphasis on adapting automatic speech recognition (ASR) models for children’s speech, an overview of current SLAs and their automated counterparts to demonstrate the feasibility of AI-enhanced SLA pipelines, and a discussion of practical considerations, including accessibility and privacy concerns, associated with the deployment of AI-powered SLAs. 
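As a rough illustration of one building block such a pipeline might use, the sketch below transcribes a child's recording with an off-the-shelf ASR model and scores it against the expected prompt using word error rate. The model choice, file name, and scoring rule are assumptions for illustration, not the pipeline proposed in the paper.

```python
# Hedged sketch of one automated-SLA building block: transcribe a recording
# with a general-purpose ASR model and compare it to the expected prompt.
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

expected_prompt = "the cat sat on the mat"          # hypothetical elicitation prompt
hypothesis = asr("child_recording.wav")["text"].lower().strip()  # hypothetical file

# Word error rate is a crude proxy for how far the child's production deviates
# from the target; a clinical tool would use validated measures instead.
print("WER:", jiwer.wer(expected_prompt, hypothesis))
```

In practice, such a tool would also need an ASR model adapted to children's speech, since general-purpose models degrade on child voices, which is exactly the adaptation challenge the paper emphasizes.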
- Award ID(s): 2229873
- PAR ID: 10573563
- Publisher / Repository: PKP|PS
- Date Published:
- Journal Name: Proceedings of the AAAI Symposium Series
- Volume: 4
- Issue: 1
- ISSN: 2994-4317
- Page Range / eLocation ID: 308 to 313
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- In this study, the Indian Ocean upper-ocean variability associated with the subtropical Indian Ocean dipole (SIOD) is investigated. We find that the SIOD is associated with a prominent southwest–northeast sea level anomaly (SLA) dipole over the western-central south Indian Ocean, with the north pole located in the Seychelles–Chagos thermocline ridge (SCTR) and the south pole southeast of Madagascar, which differs from the distribution of the sea surface temperature anomaly (SSTA). While the thermocline depth and upper-ocean heat content anomalies mirror SLAs, the air–sea CO2 flux anomalies associated with SIOD are controlled by SSTA. In the SCTR region, the westward propagation of oceanic Rossby waves generated by anomalous winds over the eastern tropical Indian Ocean is the major cause of the SLAs, with cyclonic wind causing negative SLAs during positive SIOD (pSIOD). Local wind forcing is the primary driver of the SLAs southeast of Madagascar, with anticyclonic winds causing positive SLAs. Since the SIOD is correlated with ENSO, the relative roles of the SIOD and ENSO are examined. We find that while ENSO can induce significant SLAs in the SCTR region through an atmospheric bridge, it has negligible impact on the SLA southeast of Madagascar. By contrast, the SIOD with the ENSO influence removed is associated with opposite SLAs in the SCTR and southeast of Madagascar, corresponding to the SLA dipole identified above. A new subtropical dipole mode index (SDMI) is proposed, which is uncorrelated with ENSO and thus better represents the pure SIOD effect.
- Speech is a natural channel for human-computer interaction in robotics and consumer applications. Natural language understanding pipelines that start with speech can have trouble recovering from speech recognition errors. Black-box automatic speech recognition (ASR) systems, built for general-purpose use, are unable to take advantage of in-domain language models that could otherwise ameliorate these errors. In this work, we present a method for re-ranking black-box ASR hypotheses using an in-domain language model and a semantic parser trained for a particular task. Our re-ranking method significantly improves both transcription accuracy and semantic understanding over a state-of-the-art ASR's vanilla output.
- Measuring how well human listeners recognize speech under varying environmental conditions (speech intelligibility) is a challenge for theoretical, technological, and clinical approaches to speech communication. The current gold standard, human transcription, is time- and resource-intensive. Recent advances in automatic speech recognition (ASR) systems raise the possibility of automating intelligibility measurement. This study tested four state-of-the-art ASR systems with second-language speech in noise and found that one, Whisper, performed at or above human listener accuracy. However, the content of Whisper's responses diverged substantially from human responses, especially at lower signal-to-noise ratios, suggesting both opportunities and limitations for ASR-based speech intelligibility modeling.
- Introduction: Identification of individuals with mild cognitive impairment (MCI) who are at risk of developing Alzheimer's disease (AD) is crucial for early intervention and selection of clinical trials. Methods: We applied natural language processing techniques along with machine learning methods to develop a method for automated prediction of progression to AD within 6 years using speech. The study design was evaluated on the neuropsychological test interviews of n = 166 participants from the Framingham Heart Study, comprising 90 progressive MCI and 76 stable MCI cases. Results: Our best models, which used features generated from speech data as well as age, sex, and education level, achieved an accuracy of 78.5% and a sensitivity of 81.1% in predicting MCI-to-AD progression within 6 years. Discussion: The proposed method offers a fully automated procedure, providing an opportunity to develop an inexpensive, broadly accessible, and easy-to-administer screening tool for MCI-to-AD progression prediction, facilitating the development of remote assessment. Highlights: Voice recordings from neuropsychological exams coupled with basic demographics can lead to strong predictive models of progression to dementia from mild cognitive impairment. The study leveraged AI methods for speech recognition and processed the resulting text using language models. The developed AI-powered pipeline can lead to fully automated assessment that could enable remote and cost-effective screening and prognosis for Alzheimer's disease.
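A minimal sketch of the kind of classifier described in the last study above, combining speech-derived features with age, sex, and education. All data below are synthetic placeholders rather than Framingham Heart Study variables, and the feature set is an assumption for illustration only.

```python
# Hedged sketch: speech-derived features plus basic demographics feeding a
# classifier that predicts MCI-to-AD progression. Synthetic placeholder data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 166
X_speech = rng.normal(size=(n, 20))   # stand-in for language-model features of transcripts
X_demo = np.column_stack([
    rng.integers(60, 90, n),          # age
    rng.integers(0, 2, n),            # sex
    rng.integers(8, 21, n),           # years of education
])
X = np.hstack([X_speech, X_demo])
y = rng.integers(0, 2, n)             # 1 = progressed to AD within 6 years

clf = LogisticRegression(max_iter=1000)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```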