Title: Comparing human and machine's use of coarticulatory vowel nasalization for linguistic classification
Anticipatory coarticulation is a highly informative cue to upcoming linguistic information: listeners can identify that a word is ben and not bed from the vowel alone. The present study compares the performance of human listeners and a self-supervised pre-trained speech model (wav2vec 2.0) in using nasal coarticulation to classify vowels. Stimuli consisted of nasalized (from CVN words) and non-nasalized (from CVC words) American English vowels produced by 60 human talkers and generated in 36 TTS voices. In aggregate, wav2vec 2.0 performance is similar to human listener performance. Broken down by vowel type, both wav2vec 2.0 and listeners classify non-nasalized vowels produced naturally by humans more accurately. For TTS voices, however, wav2vec 2.0 classifies nasalized vowels more accurately than non-nasalized vowels. Speaker-level patterns reveal that listeners' use of coarticulation is highly variable across talkers; wav2vec 2.0 also shows cross-talker variability in performance. Analyses further reveal differences between listeners and wav2vec 2.0 in the use of multiple acoustic cues for nasalized vowel classification. Findings have implications for understanding how coarticulatory variation is used in speech perception. Results also can provide insight into how neural systems learn to attend to the unique acoustic features of coarticulation.
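The abstract does not describe the classification pipeline in detail. One common way to test what a self-supervised model "knows" is to pool its frame-level embeddings per utterance and fit a lightweight probe on top. A minimal sketch of that idea, assuming hypothetical mean-pooled embeddings and a nearest-centroid probe (the actual wav2vec 2.0 feature extraction and classifier are not shown here, and the toy data are illustrative):

```python
import numpy as np

def nearest_centroid_probe(train_X, train_y, test_X):
    """Classify pooled embeddings by cosine similarity to class centroids.

    train_X: (n, d) mean-pooled per-utterance embeddings (hypothetical
    stand-ins for wav2vec 2.0 hidden states); train_y: class labels.
    """
    classes = sorted(set(train_y))
    labels = np.array(train_y)
    # One L2-normalized centroid per class, for cosine comparison
    centroids = {}
    for c in classes:
        m = train_X[labels == c].mean(axis=0)
        centroids[c] = m / np.linalg.norm(m)
    preds = []
    for x in test_X:
        x = x / np.linalg.norm(x)
        preds.append(max(classes, key=lambda c: float(x @ centroids[c])))
    return preds

# Toy data: two well-separated "vowel context" classes in embedding space
rng = np.random.default_rng(0)
X_cvn = rng.normal(loc=+1.0, scale=0.1, size=(20, 8))
X_cvc = rng.normal(loc=-1.0, scale=0.1, size=(20, 8))
X = np.vstack([X_cvn, X_cvc])
y = ["CVN"] * 20 + ["CVC"] * 20
preds = nearest_centroid_probe(X, y, X)
acc = np.mean([p == t for p, t in zip(preds, y)])
```

The probe is deliberately simple: any above-chance accuracy then reflects structure already present in the embeddings rather than the classifier's capacity.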
Award ID(s):
2140183
PAR ID:
10603924
Author(s) / Creator(s):
Publisher / Repository:
JASA
Date Published:
Journal Name:
The Journal of the Acoustical Society of America
Volume:
156
Issue:
1
ISSN:
0001-4966
Page Range / eLocation ID:
489 to 502
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This study examines apparent-time variation in the use of multiple acoustic cues present on coarticulatorily nasalized vowels in California English. Eighty-nine listeners ranging in age from 18 to 58 (grouped into three apparent-time categories by year of birth) performed lexical identifications on syllables excised from words with oral and nasal codas, produced by six speakers with either minimal (n=3) or extensive (n=3) anticipatory nasal coarticulation (realized as greater vowel nasalization, F1 bandwidth, and diphthongization on vowels in CVN contexts). Results showed no differences across listeners in identification of extensively coarticulated vowels, or of oral vowels from either speaker type (all at ceiling). Yet performance on the minimal coarticulators' nasalized vowels was lowest for the oldest listener group and increased over apparent time. Perceptual cue-weighting analyses revealed that older listeners rely more on F1 bandwidth, while younger listeners rely more on acoustic nasality, as coarticulatory cues to lexical identity. Thus, there is evidence for apparent-time variation in the use of the different coarticulatory cues present on vowels. Younger listeners' cue weighting gives them flexibility to identify lexical items across a range of coarticulatory variation among (here, younger) speakers, while older listeners' cue weighting leads to reduced performance for talkers producing innovative phonetic forms. This study contributes to our understanding of the relationship between multidimensional acoustic features resulting from coarticulation and the perceptual re-weighting of cues that can lead to sound change over time.
  2. This study tests speech-in-noise perception and social ratings of speech produced by different text-to-speech (TTS) synthesis methods. We used identical speaker training datasets for a set of four voices (using AWS Polly TTS), generated with neural and concatenative TTS. In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences in concatenative and neural TTS at two noise levels (-3 dB and -6 dB SNR). Correct word identification was lower for neural than for concatenative TTS, at the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on four social attributes: neural TTS was rated as more human-like, natural, likeable, and familiar than concatenative TTS. Furthermore, how natural listeners rated the neural TTS voice was positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these patterns are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.
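Noise levels like -3 dB and -6 dB SNR are typically set by scaling the noise so that the speech-to-noise power ratio hits the target before mixing. A minimal sketch of that operation (the signals here are an illustrative tone and white noise, not the study's stimuli):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power: p_speech / p_noise_scaled = 10 ** (snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s, 220 Hz tone
noise = rng.normal(size=16000)
mixed = mix_at_snr(speech, noise, -3.0)

# Recover the scaled noise and verify the realized SNR in dB
scaled_noise = mixed - speech
snr_out = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```

At -3 dB the noise carries roughly twice the power of the speech, which is why intelligibility drops sharply between the two levels tested above.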
  3. This study investigates how California English speakers adjust nasal coarticulation and hyperarticulation on vowels across three speech styles: speaking slowly and clearly (imagining a hard-of-hearing addressee), casually (imagining a friend/family member addressee), and speaking quickly and clearly (imagining being an auctioneer). Results show covariation in speaking rate and vowel hyperarticulation across the styles. Additionally, results reveal that speakers produce more extensive anticipatory nasal coarticulation in the slow-clear speech style, in addition to a slower speech rate. These findings are interpreted in terms of accounts of coarticulation in which speakers selectively tune their production of nasal coarticulation based on the speaking style. 
  4. This study investigates apparent-time variation in the production of anticipatory nasal coarticulation in California English. Productions of consonant-vowel-nasal words in clear vs casual speech by 58 speakers aged 18–58 (grouped into three generations) were analyzed for degree of coarticulatory vowel nasality. Results reveal an interaction between age and style: the two younger speaker groups produce greater coarticulation (measured as A1-P0) in clear speech, whereas older speakers produce less variable coarticulation across styles. Yet, duration lengthening in clear speech is stable across ages. Thus, age- and style-conditioned changes in produced coarticulation interact as part of change in coarticulation grammars over time. 
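A1-P0, the coarticulation measure named above, compares the amplitude of the harmonic nearest the first formant (A1) with that of a low-frequency nasal peak harmonic (P0); greater nasalization lowers A1-P0. A minimal sketch on synthetic harmonic spectra (the peak frequencies and amplitudes are illustrative, not values from the study):

```python
import numpy as np

def a1_p0(harmonic_freqs, harmonic_amps_db, f1_hz, p0_hz):
    """A1-P0 nasality measure: amplitude (dB) of the harmonic nearest F1
    minus amplitude of the harmonic nearest the nasal peak frequency."""
    freqs = np.asarray(harmonic_freqs, dtype=float)
    amps = np.asarray(harmonic_amps_db, dtype=float)
    a1 = amps[np.argmin(np.abs(freqs - f1_hz))]
    p0 = amps[np.argmin(np.abs(freqs - p0_hz))]
    return a1 - p0

# Illustrative harmonic amplitudes (dB) at a 200 Hz f0, 200-3000 Hz
freqs = np.arange(200, 3001, 200)
oral = np.array([60, 55, 68, 62, 50, 45, 40, 38, 35, 30, 28, 25, 22, 20, 18.0])
# Nasalization boosts the low nasal peak and damps the harmonic near F1
nasal = oral.copy()
nasal[1] += 8   # stronger nasal peak near 400 Hz
nasal[2] -= 6   # damped harmonic near F1 (~600 Hz)

oral_val = a1_p0(freqs, oral, f1_hz=600, p0_hz=400)
nasal_val = a1_p0(freqs, nasal, f1_hz=600, p0_hz=400)
```

Because nasalization moves A1 down and P0 up, the nasalized spectrum yields a smaller A1-P0 than the oral one, which is the direction of the effect the study measures.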
  5. Purpose: This study examined the race identification of Southern American English speakers from two geographically distant regions in North Carolina. The purpose of this work is to explore how talkers' self-identified race, talker dialect region, and acoustic speech variables contribute to listener categorization of talker races. Method: Two groups of listeners heard a series of /h/–vowel–/d/ (/hVd/) words produced by Black and White talkers from East and West North Carolina, respectively. Results: Both Southern (North Carolina) and Midland (Indiana) listeners categorized the race of all speakers with greater-than-chance accuracy; however, Western North Carolina Black talkers were categorized with the lowest accuracy, just above chance. Conclusions: The results suggest that similarities in the speech production patterns of West North Carolina Black and White talkers affect the racial categorization of Black, but not White, talkers. The results are discussed with respect to the acoustic spectral features of the voices present in the sample population.