- Award ID(s): 2120834
- PAR ID: 10425827
- Date Published:
- Journal Name: Proceedings of Interspeech
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Listeners typically rely more on one aspect of the speech signal than another when categorizing speech sounds. This is known as feature weighting. We present a rate distortion theory model of feature weighting and use it to ask whether human listeners select feature weights simply by mirroring the feature reliabilities that are present in their input. We show that there is an additional component (selective attention) listeners appear to use that is not reflected by the input statistics. This suggests that an internal mechanism is at play in governing listeners' weighting of different aspects of the speech signal, in addition to tracking statistics.
- Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR, e.g., connectionist temporal classification (CTC), can be initialized from a speech foundation model (SFM) but does not account for any dependencies among intermediate tokens. Encoder-decoder-based NASR, like the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by the success of recent work on speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder through two forward passes. The first pass of the encoder accepts the speech signal as input, while the concatenation of the speech signal and the token-level acoustic embedding is used as the input for the second pass. Examined on the Librispeech 100 h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better than or comparable to CASS-NAT with only an encoder and hence fewer model parameters.
- Listening to speech in noise can require substantial mental effort, even among younger normal-hearing adults. The task-evoked pupil response (TEPR) has been shown to track the increased effort exerted to recognize words or sentences in increasing noise. However, few studies have examined the trajectory of listening effort across longer, more natural, stretches of speech, or the extent to which expectations about upcoming listening difficulty modulate the TEPR. Seventeen younger normal-hearing adults listened to 60-s-long audiobook passages, repeated three times in a row, at two different signal-to-noise ratios (SNRs) while pupil size was recorded. There was a significant interaction between SNR, repetition, and baseline pupil size on sustained listening effort. At lower baseline pupil sizes, potentially reflecting lower attention mobilization, TEPRs were more sustained in the harder SNR condition, particularly when attention mobilization remained low by the third presentation. At intermediate baseline pupil sizes, differences between conditions were largely absent, suggesting these listeners had optimally mobilized their attention for both SNRs. Lastly, at higher baseline pupil sizes, potentially reflecting over-mobilization of attention, the effect of SNR was initially reversed for the second and third presentations: participants initially appeared to disengage in the harder SNR condition, resulting in reduced TEPRs that recovered in the second half of the story. Together, these findings suggest that the unfolding of listening effort over time depends critically on the extent to which individuals have successfully mobilized their attention in anticipation of difficult listening conditions.
- Parental responsiveness to infant behaviors is a strong predictor of infants' language and cognitive outcomes. The mechanisms underlying this effect, however, are relatively unknown. We examined the effects of parent speech on infants' visual attention, manual actions, hand‐eye coordination, and dyadic joint attention during parent‐infant free play. We report on two studies that used head‐mounted eye trackers in increasingly naturalistic laboratory environments. In Study 1, 12‐to‐24‐month‐old infants and their parents played on the floor of a seminaturalistic environment with 24 toys. In Study 2, a different sample of dyads played in a home‐like laboratory with 10 toys and no restrictions on their movement. In both studies, we present evidence that responsive parent speech extends the duration of infants' multimodal attention. This social “boost” of parent speech impacts multiple behaviors that have been linked to later outcomes—visual attention, manual actions, hand‐eye coordination, and joint attention. Further, the amount that parents talked during the interaction was negatively related to the effects of parent speech on infant attention. Together, these results provide evidence of a trade‐off between quantity of speech and its effects, suggesting multiple pathways through which parents impact infants' multimodal attention to shape the moment‐by‐moment dynamics of an interaction.