The lack of authentic stuttered speech data has significantly limited the development of stuttering-friendly automatic speech recognition (ASR) models. In previous work, we collaborated with StammerTalk, a grassroots community of Chinese-speaking people who stutter (PWS), to collect the first stuttered speech dataset in Mandarin Chinese, containing 50 hours of conversational and command-recitation speech from 72 PWS. This work examines both the technical and social dimensions of the dataset. Through quantitative and qualitative analysis, as well as benchmarking and fine-tuning ASR models on the dataset, we demonstrate its technical value in capturing stuttered speech at an unprecedented scale and diversity, enabling better understanding and mitigation of fluency bias in ASR, and its social value in promoting self-advocacy and structural change for PWS in China. By foregrounding the lived experiences of PWS in their own voices, we also see the potential of this dataset to normalize speech disfluencies and cultivate deeper empathy for stuttering within the AI research community.
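The benchmarking step described here reduces to comparing recognition error on stuttered speech against a fluent control set. A minimal sketch of such a comparison, assuming an off-the-shelf Whisper checkpoint and hypothetical file lists (the dataset is not distributed in this layout); character error rate (CER) is used since the speech is Mandarin:

```python
# Sketch: measuring fluency bias as the CER gap between stuttered and
# fluent test sets. Model choice and file layout are assumptions.
from transformers import pipeline
import jiwer

asr = pipeline("automatic-speech-recognition",
               model="openai/whisper-small")  # any Mandarin-capable ASR model

def corpus_cer(wav_paths, references):
    """Transcribe each clip and compute corpus-level CER."""
    hypotheses = [asr(p)["text"] for p in wav_paths]
    return jiwer.cer(references, hypotheses)

# Hypothetical splits: one stuttered set, one fluent control set.
stuttered_wavs = ["clips/stuttered_001.wav"]
stuttered_refs = ["今天天气很好"]
fluent_wavs    = ["clips/fluent_001.wav"]
fluent_refs    = ["今天天气很好"]

gap = corpus_cer(stuttered_wavs, stuttered_refs) - corpus_cer(fluent_wavs, fluent_refs)
print(f"Fluency bias (CER gap): {gap:.3f}")
```

A positive gap indicates the model degrades on stuttered speech; fine-tuning on the dataset would aim to shrink it.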
Self-supervised Speech Models for Word-Level Stuttered Speech Detection
Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, far too few to serve the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal, automated screening for stuttering, allowing speech-language pathologists to identify and follow up with the patients most likely to be diagnosed with a stuttering disorder. Previous research in this area has predominantly focused on utterance-level detection, which is insufficient for clinical settings, where word-level annotation of stuttering is the norm. In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttered speech detection model leveraging self-supervised speech models. Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttered speech detection. Additionally, we conducted an extensive ablation analysis of our method, providing insight into the most important aspects of adapting self-supervised speech models for stuttered speech detection.
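Conceptually, word-level detection pairs frame-level features from a self-supervised encoder with word time boundaries from a forced aligner. A minimal sketch of that idea, using WavLM as the encoder; the pooling head and frame-rate arithmetic are illustrative assumptions, not the paper's exact architecture:

```python
# Sketch: word-level stuttering detection by pooling self-supervised
# frame features inside forced-aligned word boundaries.
import torch
import torch.nn as nn
from transformers import WavLMModel

class WordLevelStutterDetector(nn.Module):
    def __init__(self, encoder_name="microsoft/wavlm-base-plus", n_classes=2):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, n_classes)

    def forward(self, waveform, word_spans):
        # waveform: (1, n_samples) at 16 kHz; word_spans: [(start_s, end_s), ...]
        frames = self.encoder(waveform).last_hidden_state[0]  # (T, D), ~50 frames/s
        logits = []
        for start, end in word_spans:
            a = int(start * 50)
            b = max(int(end * 50), a + 1)
            logits.append(self.head(frames[a:b].mean(dim=0)))  # mean-pool per word
        return torch.stack(logits)  # (n_words, n_classes)

model = WordLevelStutterDetector()
wav = torch.randn(1, 16000 * 3)               # 3 s of dummy audio
spans = [(0.2, 0.6), (0.7, 1.5), (1.6, 2.4)]  # hypothetical word alignments
print(model(wav, spans).shape)                # torch.Size([3, 2])
```

Each word thus gets its own stuttered/fluent decision, matching the word-level annotation scheme clinicians use.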
- Award ID(s): 2505865
- PAR ID: 10631905
- Publisher / Repository: IEEE SLT 2024
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Purpose: Stuttering-like disfluencies (SLDs) and typical disfluencies (TDs) are both more likely to occur as utterance length increases. However, longer and shorter utterances differ by more than the number of morphemes: They may also serve different communicative functions or describe different ideas. Decontextualized language, or language that describes events and concepts outside of the "here and now," is associated with longer utterances. Prior work has shown that language samples taken in decontextualized contexts contain more disfluencies, but averaging across an entire language sample creates a confound between utterance length and decontextualization as contributors to stuttering. We coded individual utterances from naturalistic play samples to test the hypothesis that decontextualized language leads to increased disfluencies above and beyond the effects of utterance length.
Method: We used archival transcripts of language samples from 15 preschool children who stutter (CWS) and 15 age- and sex-matched children who do not stutter (CWNS). Utterances were coded as either contextualized or decontextualized, and we used mixed-effects logistic regression to investigate the impact of utterance length and decontextualization on SLDs and TDs.
Results: CWS were more likely to stutter when producing decontextualized utterances, even when controlling for utterance length. An interaction between decontextualization and utterance length indicated that the effect of decontextualization was greatest for shorter utterances. TDs increased in decontextualized utterances when controlling for utterance length for both CWS and CWNS. The effect of decontextualization on TDs did not differ statistically between the two groups.
Conclusions: The increased working memory demands associated with decontextualized language contribute to increased language planning effort. This leads to increased TDs in both CWS and CWNS. Under a multifactorial dynamic model of stuttering, the increased language demands may also contribute to increased stuttering in CWS due to instabilities in their speech motor systems.
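The central statistical move in this study is modeling disfluency per utterance rather than averaging over whole language samples. A minimal sketch of such a mixed-effects logistic regression in Python's statsmodels, with all column names hypothetical (the study does not specify its software; analyses like this are often run in R):

```python
# Sketch: per-utterance mixed-effects logistic regression of stuttering
# on utterance length and decontextualization, with a random intercept
# per child. Column names and file are hypothetical.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("utterances.csv")  # one row per utterance

model = BinomialBayesMixedGLM.from_formula(
    "stuttered ~ length_morphemes * decontextualized",  # fixed effects + interaction
    {"child": "0 + C(child_id)"},                       # random intercept per child
    df,
)
result = model.fit_vb()  # variational Bayes fit
print(result.summary())
```

The interaction term is what lets the effect of decontextualization vary with utterance length, mirroring the reported finding that it was strongest for shorter utterances.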
The presented first-of-its-kind study effectively identifies and visualizes second-by-second differences in the physiological arousal patterns of preschool-age children who do stutter (CWS) and who do not stutter (CWNS) while speaking perceptually fluently in two challenging conditions: speaking in stressful situations and narration. The first condition may affect children's speech due to high arousal; the latter introduces linguistic, cognitive, and communicative demands on speakers. We collected physiological data from 70 children in the two target conditions. First, we adopt a novel modality-wise multiple-instance-learning (MI-MIL) approach to effectively classify CWS vs. CWNS in different conditions. The evaluation of this classifier addresses four critical research questions that align with the interests of state-of-the-art speech science studies. Later, we leverage SHAP classifier interpretations to visualize the salient, fine-grained, and temporal physiological parameters unique to CWS at both the group level and the personalized level. While group-level identification of distinct patterns would enhance our understanding of stuttering etiology and development, personalized-level identification would enable remote, continuous, and real-time assessment of stuttering children's physiological arousal, which may lead to personalized, just-in-time interventions and, in turn, improved speech fluency. The presented MI-MIL approach is novel, generalizable to different domains, and real-time executable. Finally, comprehensive evaluations of the presented framework against several baselines on multiple datasets yield notable insights into the physiological arousal of CWS during speech production.
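Multiple-instance learning (MIL) fits this setting naturally: each child contributes a bag of short physiological segments, only the bag carries a CWS/CWNS label, and the model must learn which segments are informative. A minimal sketch of one common MIL building block, attention-based pooling, for a single modality; the dimensions and attention design are illustrative assumptions, not the paper's MI-MIL architecture:

```python
# Sketch: attention-based multiple-instance learning for one physiological
# modality. A bag = all segments from one session; the attention weights
# indicate which segments mattered, analogous to SHAP-style inspection.
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, in_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.attn = nn.Sequential(nn.Linear(hid_dim, hid_dim), nn.Tanh(),
                                  nn.Linear(hid_dim, 1))
        self.clf = nn.Linear(hid_dim, 1)  # CWS vs. CWNS logit

    def forward(self, bag):                    # bag: (n_segments, in_dim)
        h = self.embed(bag)                    # (n, hid)
        w = torch.softmax(self.attn(h), dim=0) # per-segment attention weights
        z = (w * h).sum(dim=0)                 # weighted bag representation
        return self.clf(z), w.squeeze(-1)      # bag logit + segment weights

bag = torch.randn(120, 32)  # hypothetical: 120 one-second segments, 32 features
logit, weights = AttentionMIL()(bag)
```

A modality-wise variant would run one such branch per physiological signal and fuse the bag representations before classification.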
Muresan, Smaranda; Nakov, Preslav; Villavicencio, Aline (Eds.)
Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge, with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definitions of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of a phoneme inventory from raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on the TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms.
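A common building block for this kind of neural discrete representation learning is vector quantization with a learned codebook, where each speech frame is snapped to one of a small set of discrete units. A minimal sketch of a Gumbel-softmax quantizer; this is a generic stand-in, not the paper's model, which additionally exploits word labels:

```python
# Sketch: Gumbel-softmax vector quantization, a standard building block
# for learning a discrete phoneme-like inventory from speech frames.
# Codebook size and feature dimension are arbitrary choices here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelVectorQuantizer(nn.Module):
    def __init__(self, in_dim=256, n_codes=50, tau=1.0):
        super().__init__()
        self.logits = nn.Linear(in_dim, n_codes)       # frame -> code scores
        self.codebook = nn.Embedding(n_codes, in_dim)  # one vector per "phoneme"
        self.tau = tau

    def forward(self, frames):  # frames: (T, in_dim)
        # Differentiable one-hot code assignment during training.
        onehot = F.gumbel_softmax(self.logits(frames), tau=self.tau, hard=True)
        quantized = onehot @ self.codebook.weight      # (T, in_dim)
        ids = onehot.argmax(dim=-1)                    # discrete unit sequence
        return quantized, ids

vq = GumbelVectorQuantizer()
frames = torch.randn(100, 256)    # hypothetical encoder output, 100 frames
quantized, unit_ids = vq(frames)  # unit_ids: (100,) phoneme-like labels
```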
Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed-size convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples per second or more) presents a significant challenge: speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser, syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5 Hz and 60 bps and achieves state-of-the-art (SotA) performance in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency, with a 30x reduction in training compute and a 4x wall-clock inference speedup.
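Once candidate syllable boundaries are available, the merging step amounts to pooling encoder frames between consecutive boundaries into single coarse tokens. A minimal sketch of that pooling, with boundaries supplied by hand (discovering them from encoder-loss correlations and refining them by distillation is the paper's actual contribution):

```python
# Sketch: merging ~50 Hz encoder frames into syllable-like units by
# mean-pooling between candidate boundaries. Boundaries here are given;
# discovering them is the hard part addressed by the paper.
import torch

def pool_to_units(frames: torch.Tensor, boundaries: list) -> torch.Tensor:
    """frames: (T, D); boundaries: ascending frame indices including 0 and T."""
    units = [frames[a:b].mean(dim=0) for a, b in zip(boundaries, boundaries[1:])]
    return torch.stack(units)  # (n_units, D), a far coarser token sequence

frames = torch.randn(250, 768)         # 5 s of 50 Hz features (hypothetical)
boundaries = [0, 12, 30, 55, 90, 250]  # hypothetical syllable boundaries
units = pool_to_units(frames, boundaries)
print(units.shape)  # torch.Size([5, 768]): 5 units for 5 s in this toy example
```

Shortening the token sequence this way is exactly what drives the reported training-compute and inference-speed gains: the language model sees far fewer tokens per second of audio.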