
This content will become publicly available on December 5, 2024

Title: Neural tracking measures of speech intelligibility: Manipulating intelligibility while keeping acoustics unchanged

Neural speech tracking has advanced our understanding of how our brains rapidly map an acoustic speech signal onto linguistic representations and ultimately meaning. It remains unclear, however, how speech intelligibility is related to the corresponding neural responses. Many studies addressing this question vary the level of intelligibility by manipulating the acoustic waveform, but this makes it difficult to cleanly disentangle the effects of intelligibility from underlying acoustical confounds. Here, using magnetoencephalography recordings, we study neural measures of speech intelligibility by manipulating intelligibility while keeping the acoustics strictly unchanged. Acoustically identical degraded speech stimuli (three-band noise-vocoded, ~20 s duration) are presented twice, but the second presentation is preceded by the original (nondegraded) version of the speech. This intermediate priming, which generates a “pop-out” percept, substantially improves the intelligibility of the second degraded speech passage. We investigate how intelligibility and acoustical structure affect acoustic and linguistic neural representations using multivariate temporal response functions (mTRFs). As expected, behavioral results confirm that perceived speech clarity is improved by priming. mTRF analysis reveals that auditory (speech envelope and envelope onset) neural representations are not affected by priming but only by the acoustics of the stimuli (bottom–up driven). Critically, our findings suggest that segmentation of sounds into words emerges with better speech intelligibility, and most strongly at the later (~400 ms latency) word processing stage, in prefrontal cortex, in line with engagement of top–down mechanisms associated with priming. Taken together, our results show that word representations may provide some objective measures of speech comprehension.

Publisher / Repository:
National Academy of Sciences
Journal Name:
Proceedings of the National Academy of Sciences
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. When listening to speech, our brain responses time lock to acoustic events in the stimulus. Recent studies have also reported that cortical responses track linguistic representations of speech. However, tracking of these representations is often described without controlling for acoustic properties. Therefore, the response to these linguistic representations might reflect unaccounted acoustic processing rather than language processing. Here, we evaluated the potential of several recently proposed linguistic representations as neural markers of speech comprehension. To do so, we investigated EEG responses to audiobook speech of 29 participants (22 females). We examined whether these representations contribute unique information over and beyond acoustic neural tracking and each other. Indeed, not all of these linguistic representations were significantly tracked after controlling for acoustic properties. However, phoneme surprisal, cohort entropy, word surprisal, and word frequency were all significantly tracked over and beyond acoustic properties. We also tested the generality of the associated responses by training on one story and testing on another. In general, the linguistic representations are tracked similarly across different stories spoken by different readers. These results suggest that these representations characterize the processing of the linguistic content of speech. SIGNIFICANCE STATEMENT For clinical applications, it would be desirable to develop a neural marker of speech comprehension derived from neural responses to continuous speech. Such a measure would allow for behavior-free evaluation of speech understanding; this would open doors toward better quantification of speech understanding in populations from whom obtaining behavioral measures may be difficult, such as young children or people with cognitive impairments, to allow better targeted interventions and better fitting of hearing devices.
  2. Aging is associated with an exaggerated representation of the speech envelope in auditory cortex. The relationship between this age-related exaggerated response and a listener’s ability to understand speech in noise remains an open question. Here, information-theory-based analysis methods are applied to magnetoencephalography recordings of human listeners, investigating their cortical responses to continuous speech, using the novel nonlinear measure of phase-locked mutual information between the speech stimuli and cortical responses. The cortex of older listeners shows an exaggerated level of mutual information, compared with younger listeners, for both attended and unattended speakers. The mutual information peaks for several distinct latencies: early (∼50 ms), middle (∼100 ms), and late (∼200 ms). For the late component, the neural enhancement of attended over unattended speech is affected by stimulus signal-to-noise ratio, but the direction of this dependency is reversed by aging. Critically, in older listeners and for the same late component, greater cortical exaggeration is correlated with decreased behavioral inhibitory control. This negative correlation also carries over to speech intelligibility in noise, where greater cortical exaggeration in older listeners is correlated with worse speech intelligibility scores. Finally, an age-related lateralization difference is also seen for the ∼100 ms latency peaks, where older listeners show a bilateral response compared with younger listeners’ right lateralization. Thus, this information-theory-based analysis provides new, and less coarse-grained, results regarding age-related change in auditory cortical speech processing, and its correlation with cognitive measures, compared with related linear measures. NEW & NOTEWORTHY Cortical representations of natural speech are investigated using a novel nonlinear approach based on mutual information. Cortical responses, phase-locked to the speech envelope, show an exaggerated level of mutual information associated with aging, appearing at several distinct latencies (∼50, ∼100, and ∼200 ms). Critically, for older listeners only, the ∼200 ms latency response components are correlated with specific behavioral measures, including behavioral inhibition and speech comprehension.
  3. Speech processing is highly incremental. It is widely accepted that human listeners continuously use the linguistic context to anticipate upcoming concepts, words, and phonemes. However, previous evidence supports two seemingly contradictory models of how a predictive context is integrated with the bottom-up sensory input: Classic psycholinguistic paradigms suggest a two-stage process, in which acoustic input initially leads to local, context-independent representations, which are then quickly integrated with contextual constraints. This contrasts with the view that the brain constructs a single coherent, unified interpretation of the input, which fully integrates available information across representational hierarchies, and thus uses contextual constraints to modulate even the earliest sensory representations. To distinguish these hypotheses, we tested magnetoencephalography responses to continuous narrative speech for signatures of local and unified predictive models. Results provide evidence that listeners employ both types of models in parallel. Two local context models uniquely predict some part of early neural responses, one based on sublexical phoneme sequences, and one based on the phonemes in the current word alone; at the same time, even early responses to phonemes also reflect a unified model that incorporates sentence-level constraints to predict upcoming phonemes. Neural source localization places the anatomical origins of the different predictive models in nonidentical parts of the superior temporal lobes bilaterally, with the right hemisphere showing a relative preference for more local models. These results suggest that speech processing recruits both local and unified predictive models in parallel, reconciling previous disparate findings. Parallel models might make the perceptual system more robust, facilitate processing of unexpected inputs, and serve a function in language acquisition. 
    MEG Data: MEG data is in FIFF format and can be opened with MNE-Python. Data has been directly converted from the acquisition device native format without any preprocessing. Events contained in the data indicate the stimuli in numerical order. Subjects R2650 and R2652 heard stimulus 11b instead of 11.
    Predictor Variables: The original audio files are copyrighted and cannot be shared, but the make_audio folder contains which can be used to extract the exact clips from the commercially available audiobook (ISBN 978-1480555280). The predictors directory contains all the predictors used in the original study as pickled eelbrain objects. They can be loaded in Python with the eelbrain.load.unpickle function. The TextGrids directory contains the TextGrids aligned to the audio files.
    Source Localization: The file contains files needed for source localization. Structural brain models used in the published analysis are reconstructed by scaling the FreeSurfer fsaverage brain (distributed with FreeSurfer) based on each subject's `MRI scaling parameters.cfg` file. This can be done using the `mne.scale_mri` function. Each subject's MEG folder contains a `subject-trans.fif` file which contains the coregistration between MEG sensor space and (scaled) MRI space, which is used to compute the forward solution.
  4. While a range of measures based on speech production, language, and perception are possible (Manun et al., 2020) for the prediction and estimation of speech intelligibility, what constitutes second language (L2) intelligibility remains under-defined. Prosodic and temporal features (i.e., stress, speech rate, rhythm, and pause placement) have been shown to impact listener perception (Kang et al., 2020). Still, their relationship with highly intelligible speech is yet unclear. This study aimed to characterize L2 speech intelligibility. Acoustic analyses, including PRAAT and Python scripts, were conducted on 405 speech samples (30 s) from 102 L2 English speakers with a wide variety of backgrounds, proficiency levels, and intelligibility levels. The results indicate that highly intelligible speakers of English employ between 2 and 4 syllables per second and that higher or lower speeds are less intelligible. Silent pauses between 0.3 and 0.8 s were associated with the highest levels of intelligibility. Rhythm, measured by Δ syllable length of all content syllables, was marginally associated with intelligibility. Finally, lexical stress accuracy did not interfere substantially with intelligibility until less than 70% of the polysyllabic words were incorrect. These findings inform the fields of first and second language research as well as language education and pathology.
  5. Speech processing often occurs amid competing inputs from other modalities, for example, listening to the radio while driving. We examined the extent to which dividing attention between auditory and visual modalities (bimodal divided attention) impacts neural processing of natural continuous speech from acoustic to linguistic levels of representation. We recorded electroencephalographic (EEG) responses when human participants performed a challenging primary visual task, imposing low or high cognitive load while listening to audiobook stories as a secondary task. The two dual-task conditions were contrasted with an auditory single-task condition in which participants attended to stories while ignoring visual stimuli. Behaviorally, the high load dual-task condition was associated with lower speech comprehension accuracy relative to the other two conditions. We fitted multivariate temporal response function encoding models to predict EEG responses from acoustic and linguistic speech features at different representation levels, including auditory spectrograms and information-theoretic models of sublexical-, word-form-, and sentence-level representations. Neural tracking of most acoustic and linguistic features remained unchanged with increasing dual-task load, despite unambiguous behavioral and neural evidence of the high load dual-task condition being more demanding. Compared to the auditory single-task condition, dual-task conditions selectively reduced neural tracking of only some acoustic and linguistic features, mainly at latencies >200 ms, while earlier latencies were surprisingly unaffected. These findings indicate that behavioral effects of bimodal divided attention on continuous speech processing occur not because of impaired early sensory representations but likely at later cognitive processing stages. Crossmodal attention-related mechanisms may not be uniform across different speech processing levels.
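
The dataset notes in item 3 say that the MEG files are FIFF recordings readable with MNE-Python, that events index the stimuli in numerical order, and that subjects R2650 and R2652 heard stimulus 11b instead of 11. A minimal sketch of how a reader might handle that exception is given below. The helper names `stimulus_label` and `load_raw` are hypothetical, and the sketch assumes that each event code equals its stimulus number, which the notes suggest but do not state outright.

```python
# Hypothetical helpers; only the two subject IDs and the 11 -> 11b
# substitution come from the dataset description.
SUBSTITUTED_SUBJECTS = {"R2650", "R2652"}


def stimulus_label(subject: str, event_code: int) -> str:
    """Return the stimulus label for one event of one subject.

    Assumes the event code is the stimulus number (an assumption).
    """
    if event_code == 11 and subject in SUBSTITUTED_SUBJECTS:
        return "11b"
    return str(event_code)


def load_raw(path):
    """Open a FIFF recording without loading the data into memory.

    Requires MNE-Python; it is imported here so that stimulus_label
    can be used without MNE installed.
    """
    import mne

    return mne.io.read_raw_fif(path, preload=False)
```

The label returned by `stimulus_label` could then be used to pick the matching predictor file for each trial, so that the two affected subjects are paired with the clip they actually heard.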