Title: Decoding speech sounds from neurophysiological data: Practical considerations and theoretical implications
Machine learning techniques have proven to be a useful tool in cognitive neuroscience. However, their implementation in scalp‐recorded electroencephalography (EEG) is relatively limited. To address this, we present three analyses using data from a previous study that examined event‐related potential (ERP) responses to a wide range of naturally‐produced speech sounds. First, we explore which features of the EEG signal best maximize machine learning accuracy for a voicing distinction, using a support vector machine (SVM). We manipulate three dimensions of the EEG signal as input to the SVM: number of trials averaged, number of time points averaged, and polynomial fit. We discuss the trade‐offs in using different feature sets and offer some recommendations for researchers using machine learning. Next, we use SVMs to classify specific pairs of phonemes, finding that we can detect differences in the EEG signal that are not otherwise detectable using conventional ERP analyses. Finally, we characterize the timecourse of phonetic feature decoding across three phonological dimensions (voicing, manner of articulation, and place of articulation), and find that voicing and manner are decodable from neural activity, whereas place of articulation is not. This set of analyses addresses both practical considerations in the application of machine learning to EEG, particularly for speech studies, and also sheds light on current issues regarding the nature of perceptual representations of speech.
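As a rough illustration of the kind of feature pipeline described in the abstract, the sketch below averages groups of trials, averages consecutive time points, and (optionally) summarizes each channel's waveform with low-order polynomial coefficients before classifying with a linear SVM. It is a minimal sketch, not the study's code: the array names (X, y), group and bin sizes, and polynomial degree are all assumptions.

```python
# Minimal sketch (not the study's code): decoding a binary voicing contrast from
# epoched EEG with a linear SVM, after trial averaging and temporal averaging.
# Assumed inputs: X with shape (n_trials, n_channels, n_times), y with 0/1 labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def average_trials(X, y, group_size=5, seed=0):
    """Average randomly grouped trials within each class to raise the SNR."""
    rng = np.random.default_rng(seed)
    Xa, ya = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.where(y == label)[0])
        for i in range(0, len(idx) - group_size + 1, group_size):
            Xa.append(X[idx[i:i + group_size]].mean(axis=0))
            ya.append(label)
    return np.stack(Xa), np.array(ya)

def average_time_points(X, bin_size=10):
    """Average consecutive time points (coarse temporal downsampling)."""
    n_trials, n_channels, n_times = X.shape
    n_bins = n_times // bin_size
    X = X[:, :, :n_bins * bin_size].reshape(n_trials, n_channels, n_bins, bin_size)
    return X.mean(axis=-1)

def poly_features(X, degree=3):
    """Alternative feature set: low-order polynomial coefficients per channel."""
    n_trials, n_channels, n_times = X.shape
    t = np.linspace(-1, 1, n_times)
    coefs = np.polynomial.polynomial.polyfit(t, X.reshape(-1, n_times).T, degree)
    return coefs.T.reshape(n_trials, n_channels * (degree + 1))

# Example pipeline: trial averaging -> time binning -> flatten -> linear SVM.
X_avg, y_avg = average_trials(X, y, group_size=5)
X_feat = average_time_points(X_avg, bin_size=10).reshape(len(y_avg), -1)
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print(cross_val_score(clf, X_feat, y_avg, cv=5).mean())
```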
Award ID(s):
1945069
PAR ID:
10509978
Author(s) / Creator(s):
;
Publisher / Repository:
Wiley
Date Published:
Journal Name:
Psychophysiology
Volume:
61
Issue:
4
ISSN:
0048-5772
Subject(s) / Keyword(s):
Analysis/Statistical Methods; Auditory Processes; EEG; ERPs; Language/Speech; Machine Learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The evolution of Web Speech has increased the ease of development and public availability of auditory description without the use of screen-reader software, broadening its exposure to users who may benefit from spoken descriptions. Building on an existing design framework for auditory description of interactive web media, we designed an optional Voicing feature instantiated in two PhET Interactive Simulations regularly used by students and educators globally. We surveyed over 2,000 educators to investigate their perceptions and preferences regarding the Web Speech-based Voicing feature and its broad appeal and effectiveness for teaching and learning. We find general approval of the Voicing feature among educators, and more moderate ratings than expected for the different preset speech levels we presented. Educators perceive the feature as beneficial both broadly and for specific populations, while some identify particular populations for whom it remains ineffective. Lastly, we identify some variance in perceptions of the feature based on different aspects of the simulation experience.
  2. Articulation, emotion, and personality play strong roles in orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important to carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion, lexical content, and lip movements in a principled manner. The model uses a set of spectral and emotional speech features extracted directly from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in objective and subjective evaluations. When the target emotion is known, we propose to create emotionally dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted) or adding emotional conditions as inputs to the model (CSG-Emo-Aware). Objective evaluations show improvements for CSG-Emo-Adapted over the CSG model, as the generated trajectory sequences are closer to the original sequences. Subjective evaluations show significantly better results for this model compared with the CSG model when the target emotion is happiness.
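A schematic, non-authoritative sketch of how a speech-conditioned sequential GAN of this general kind can be organized is shown below; the layer sizes, feature dimensions, and class names are illustrative assumptions, not the paper's CSG implementation.

```python
# Schematic sketch (assumptions, not the paper's CSG code): a speech-conditioned
# sequential generator/discriminator pair for lip-movement trajectories.
# Assumed shapes: speech (batch, T, n_speech), lip landmarks (batch, T, n_lip).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, n_speech=40, n_noise=16, n_lip=20, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_speech + n_noise, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_lip)

    def forward(self, speech, noise):
        h, _ = self.rnn(torch.cat([speech, noise], dim=-1))
        return self.out(h)  # one lip-landmark frame per speech frame

class Discriminator(nn.Module):
    def __init__(self, n_speech=40, n_lip=20, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_speech + n_lip, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, speech, lips):
        h, _ = self.rnn(torch.cat([speech, lips], dim=-1))
        return self.score(h[:, -1])  # real/fake logit conditioned on the speech

# Training would alternate discriminator and generator updates on real vs.
# generated (speech, lips) pairs, e.g. with nn.BCEWithLogitsLoss.
```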
  3. Objective. Neurological disorders affecting speech production adversely impact quality of life for over 7 million individuals in the US. Traditional speech interfaces like eye-tracking devices and P300 spellers are slow and unnatural for these patients. An alternative solution, speech brain-computer interfaces (BCIs), directly decodes speech characteristics, offering a more natural communication mechanism. This research explores the feasibility of decoding speech features using non-invasive EEG. Approach. Nine neurologically intact participants were equipped with a 63-channel EEG system with additional sensors to eliminate eye artifacts. Participants read aloud sentences selected for phonetic similarity to the English language. Deep learning models, including Convolutional Neural Networks and Recurrent Neural Networks with and without attention modules, were optimized with a focus on minimizing trainable parameters and utilizing small input window sizes for real-time application. These models were employed for discrete and continuous speech decoding tasks. Main results. Statistically significant participant-independent decoding performance was achieved for discrete classes and continuous characteristics of the produced audio signal. A frequency sub-band analysis highlighted the significance of certain frequency bands (delta, theta, gamma) for decoding performance, and a perturbation analysis was used to identify crucial channels. Assessed channel selection methods did not significantly improve performance, suggesting a distributed representation of speech information encoded in the EEG signals. Leave-One-Out training demonstrated the feasibility of utilizing common speech neural correlates, reducing data collection requirements from individual participants. Significance. These findings contribute significantly to the development of EEG-enabled speech synthesis by demonstrating the feasibility of decoding both discrete and continuous speech features from EEG signals, even in the presence of EMG artifacts. By addressing the challenges of EMG interference and optimizing deep learning models for speech decoding, this study lays a strong foundation for EEG-based speech BCIs.
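As a minimal sketch under assumed shapes and layer choices (not the authors' optimized architectures), the model below maps a short window of 63-channel EEG to a small number of outputs while keeping the trainable-parameter count low for real-time use.

```python
# Minimal sketch (assumed architecture, not the paper's models): a compact CNN
# that decodes a short EEG window into discrete class logits or continuous
# speech features. Assumed input: (batch, 63 channels, 128 samples).
import torch
import torch.nn as nn

class SmallEEGDecoder(nn.Module):
    def __init__(self, n_channels=63, n_outputs=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),  # temporal filtering
            nn.BatchNorm1d(32),
            nn.ELU(),
            nn.AvgPool1d(4),
            nn.Conv1d(32, 16, kernel_size=5, padding=2),
            nn.BatchNorm1d(16),
            nn.ELU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(16, n_outputs)

    def forward(self, x):
        return self.head(self.features(x).squeeze(-1))

model = SmallEEGDecoder()
out = model(torch.randn(4, 63, 128))  # example forward pass on a dummy window
print(out.shape, sum(p.numel() for p in model.parameters()))  # few parameters
```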
  4. Predictions of gradient degree of lenition of voiceless and voiced stops in a corpus of Argentine Spanish are evaluated using three acoustic measures (minimum and maximum intensity velocity and duration) and two recurrent neural network (Phonet) measures (posterior probabilities of the sonorant and continuant phonological features). While the acoustic metrics yielded mixed and inconsistent predictions, the sonorant and continuant probability values were consistently in the direction predicted by known conditioning factors of a stop's lenition: its voicing, place of articulation, and surrounding context. The results suggest that Phonet is effective as an additional or alternative method of lenition measurement. Furthermore, this study enhances the accessibility of Phonet by releasing the trained Spanish Phonet model used here, along with a pipeline and step-by-step instructions for training new models and running inference.
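For the acoustic measures only (not the Phonet posterior probabilities), a rough sketch of how duration and minimum/maximum intensity velocity might be computed over a stop interval is given below; the file name, sampling settings, and segment boundaries are hypothetical.

```python
# Rough sketch (hypothetical file and segment boundaries, not the study's scripts):
# duration and minimum/maximum intensity velocity over one stop interval.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical recording
hop = 160                                         # 10 ms hop at 16 kHz
rms = librosa.feature.rms(y=y, frame_length=400, hop_length=hop)[0]
intensity_db = librosa.amplitude_to_db(rms, ref=np.max)
velocity = np.gradient(intensity_db, hop / sr)    # intensity velocity in dB/s

start, end = 0.52, 0.58                           # hypothetical stop interval (s), e.g. from a forced alignment
frames = slice(int(start * sr / hop), int(end * sr / hop))
print({
    "duration_ms": (end - start) * 1000,
    "min_intensity_velocity": velocity[frames].min(),
    "max_intensity_velocity": velocity[frames].max(),
})
```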
  5. Brain Computer Interfaces (BCIs) traditionally deploy visual or auditory stimuli to elicit brain signals. However, these stimuli are not very useful in situations where the visual or auditory senses are involved in other decision-making processes. In this paper, we explore the use of vibrotactile stimuli on the fingers as a viable replacement. Using a five-level Wavelet Packet feature extraction on the obtained EEG signals, along with a kernel Support Vector Machine (SVM) algorithm, we were able to achieve 83% classification accuracy for binary user choices. This new BCI paradigm shows potential for use in situations where visual and auditory stimuli are not feasible.
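A minimal sketch of this kind of pipeline under assumptions (epoch shapes, wavelet family, and SVM hyperparameters are illustrative, not the study's exact settings): five-level wavelet packet energies per EEG channel, classified with an RBF-kernel SVM.

```python
# Minimal sketch (assumed shapes and settings, not the study's exact pipeline):
# five-level wavelet packet energies per channel, fed to an RBF-kernel SVM.
# Assumed inputs: epochs with shape (n_trials, n_channels, n_samples), labels y in {0, 1}.
import numpy as np
import pywt
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def wavelet_packet_energies(signal, wavelet="db4", level=5):
    """Energy of each terminal node of a level-5 wavelet packet decomposition."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    return np.array([np.sum(np.square(node.data))
                     for node in wp.get_level(level, order="freq")])

def epoch_features(epochs):
    """Concatenate per-channel wavelet packet energies for each trial."""
    return np.array([
        np.concatenate([wavelet_packet_energies(channel) for channel in trial])
        for trial in epochs
    ])

X_feat = epoch_features(epochs)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(clf, X_feat, y, cv=5).mean())
```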