

Title: A framework for evaluating speech representations
Listeners track distributions of speech sounds along perceptual dimensions. We introduce a method for evaluating hypotheses about what those dimensions are, using a cognitive model whose prior distribution is estimated directly from speech recordings. We use this method to evaluate two speaker normalization algorithms against human data. Simulations show that representations that are normalized across speakers predict human discrimination data better than unnormalized representations, consistent with previous research. Results further reveal differences across normalization methods in how well each predicts human data. This work provides a framework for evaluating hypothesized representations of speech and lays the groundwork for testing models of speech perception on natural speech recordings from ecologically valid settings.
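The abstract does not specify which two speaker normalization algorithms were compared. As a minimal illustration of the general idea, the sketch below applies Lobanov-style normalization (z-scoring each acoustic dimension within each speaker), a standard technique from the vowel-normalization literature; the formant values and speaker labels are invented for the example.

```python
import numpy as np

def lobanov_normalize(formants, speaker_ids):
    """Z-score each acoustic dimension within each speaker
    (Lobanov normalization), removing speaker-specific shifts
    and scales while preserving relative vowel contrasts."""
    formants = np.asarray(formants, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    normalized = np.empty_like(formants)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        mu = formants[mask].mean(axis=0)
        sd = formants[mask].std(axis=0)
        normalized[mask] = (formants[mask] - mu) / sd
    return normalized

# Hypothetical F1/F2 values: two speakers producing the "same"
# two vowels, offset by a speaker-specific shift.
tokens = np.array([
    [300.0, 2200.0], [700.0, 1200.0],   # speaker A
    [400.0, 2500.0], [800.0, 1500.0],   # speaker B (shifted)
])
speakers = ["A", "A", "B", "B"]
norm = lobanov_normalize(tokens, speakers)
```

After normalization, the two speakers' corresponding vowel tokens coincide even though their raw formant values differ, which is the property that lets a normalized representation generalize across talkers.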
Award ID(s): 1320410
PAR ID: 10057883
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the Annual Conference of the Cognitive Science Society
ISSN: 1069-7977
Page Range / eLocation ID: 1919-1924
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We introduce a method for measuring the correspondence between low-level speech features and human perception, using a cognitive model of speech perception implemented directly on speech recordings. We evaluate two speaker normalization techniques using this method and find that in both cases, speech features that are normalized across speakers predict human data better than unnormalized speech features, consistent with previous research. Results further reveal differences across normalization methods in how well each predicts human data. This work provides a new framework for evaluating low-level representations of speech on their match to human perception, and lays the groundwork for creating more ecologically valid models of speech perception. 
  2. When we vocalize, our brain distinguishes self-generated sounds from external ones. A corollary discharge signal supports this function in animals; however, in humans, its exact origin and temporal dynamics remain unknown. We report electrocorticographic (ECoG) recordings in neurosurgical patients and a novel connectivity approach based on Granger causality that reveals major neural communications. We find a reproducible source for corollary discharge across multiple speech production paradigms, localized to the ventral speech motor cortex before speech articulation. The uncovered discharge predicts the degree of auditory cortex suppression during speech, its well-documented consequence. These results reveal the source and timing of the human corollary discharge, with far-reaching implications for speech motor control as well as auditory hallucinations in human psychosis.
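The published ECoG connectivity analysis is far richer than anything reproducible here; as a minimal illustration of the idea behind Granger causality, the sketch below compares a restricted autoregressive model (past of y only) against a full model (past of y and x) on synthetic signals where y is driven by lagged x. All signals and coefficients are invented for the example.

```python
import numpy as np

def ar_rss(y, lagged_predictors):
    """Residual sum of squares from a least-squares fit of y
    onto the given lagged predictors (plus an intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(lagged_predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(0)
n, lag = 500, 1
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    # y is driven by the previous value of x, plus noise
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

y_t, y_lag, x_lag = y[lag:], y[:-lag], x[:-lag]
rss_restricted = ar_rss(y_t, [y_lag])         # past of y only
rss_full = ar_rss(y_t, [y_lag, x_lag])        # past of y and x
# A large drop in RSS when x's past is added is the Granger-causal
# signature; a formal analysis would assess it with an F-test.
```

In the actual study this kind of directed-influence measure is computed between cortical recording sites, not between toy scalar signals.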
  4. Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise direction-of-arrival (DoA) information using a monaural microphone. To address the lack of existing datasets for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing.
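The abstract does not detail how the DoA signal and the speech embedding are combined before projection into the LLM's input space. The sketch below shows one generic pattern such a fusion could take: encode the angle periodically (so 359° and 1° stay close), concatenate it with the speech embedding, and apply a learned linear projection. The dimensions, weight matrix, and function names here are all hypothetical stand-ins, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(42)

def doa_encoding(angle_deg):
    """Encode a direction-of-arrival angle as a periodic sin/cos
    pair, so angles near 0 and 360 degrees map close together."""
    theta = np.deg2rad(angle_deg)
    return np.array([np.sin(theta), np.cos(theta)])

def fuse(speech_emb, angle_deg, W):
    """Concatenate a speech embedding with the DoA encoding and
    project the joint vector into a (hypothetical) LLM input space."""
    joint = np.concatenate([speech_emb, doa_encoding(angle_deg)])
    return W @ joint

d_speech, d_llm = 16, 32            # toy sizes, not the real models'
W = rng.normal(scale=0.1, size=(d_llm, d_speech + 2))
emb = rng.normal(size=d_speech)     # stands in for a Whisper embedding
fused = fuse(emb, angle_deg=45.0, W=W)
```

In a trained system, W would be learned jointly with the LoRA adapters rather than drawn at random.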
  5. The advancement of Speech Emotion Recognition (SER) depends significantly on the quality of the emotional speech corpora used for model training. Researchers in the field of SER have developed various corpora by adjusting design parameters to enhance the reliability of the training source. In this study, we focus on the communication mode of collection, specifically analyzing spontaneous emotional speech gathered during conversation versus monologue. While conversations are acknowledged as effective for eliciting authentic emotional expressions, systematic analyses are necessary to confirm their reliability as a better source of emotional speech data. We investigate this question in terms of the perceptual differences and acoustic variability present in both types of emotional speech. Our analyses of multilingual corpora show that, first, raters exhibit higher consistency for conversation recordings when evaluating categorical emotions, and second, perceptions and acoustic patterns observed in conversational samples align more closely with expected trends discussed in the relevant emotion literature. We further examine the impact of these differences on SER modeling and show that a more robust and stable SER model can be trained using conversation data. This work provides comprehensive evidence that conversation may offer a better source than monologue for developing an SER model.
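The abstract reports higher rater consistency for conversation recordings but does not name the agreement statistic used. Fleiss' kappa is one standard chance-corrected measure for categorical ratings with multiple raters; the sketch below implements it and compares two invented rating matrices, one with perfect agreement and one where raters split.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating
    counts, assuming the same number of raters rated every item."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                            # raters per item
    p_cat = counts.sum(axis=0) / counts.sum()      # category shares
    p_item = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    p_bar, p_exp = p_item.mean(), (p_cat ** 2).sum()
    return (p_bar - p_exp) / (1.0 - p_exp)

# Toy count matrices: 4 items, 3 raters, 2 emotion categories.
agree = [[3, 0], [3, 0], [0, 3], [0, 3]]   # all raters agree
split = [[2, 1], [2, 1], [1, 2], [1, 2]]   # raters split 2-1
k_agree, k_split = fleiss_kappa(agree), fleiss_kappa(split)
```

A finding like the abstract's would correspond to the conversation condition yielding a kappa pattern closer to the first matrix than the second.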