The manner in which acoustic features contribute to perceiving speaker identity remains unclear. In an attempt to better understand speaker perception, we investigated human and machine speaker discrimination with utterances shorter than 2 seconds. Sixty-five listeners performed a same vs. different task. Machine performance was estimated with i-vector/PLDA-based automatic speaker verification systems, one using mel-frequency cepstral coefficients (MFCCs) and the other using voice quality features (VQual2) inspired by a psychoacoustic model of voice quality. Machine performance was measured in terms of the detection and log-likelihood-ratio cost functions. Humans showed higher confidence for correct target decisions compared to correct non-target decisions, suggesting that they rely on different features and/or decision making strategies when identifying a single speaker compared to when distinguishing between speakers. For non-target trials, responses were highly correlated between humans and the VQual2-based system, especially when speakers were perceptually marked. Fusing human responses with an MFCC-based system improved performance over human-only or MFCC-only results, while fusing with the VQual2-based system did not. The study is a step towards understanding human speaker discrimination strategies and suggests that automatic systems might be able to supplement human decisions especially when speakers are marked.
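The abstract above reports machine performance using the detection and log-likelihood-ratio cost functions. A minimal sketch of both metrics is given below, assuming synthetic log-likelihood-ratio scores; the function names and default parameters are illustrative, not taken from the paper's evaluation protocol.

```python
import numpy as np

def detection_cost(p_miss, p_fa, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Detection cost: weighted sum of miss and false-alarm rates
    at a given operating point (costs and prior are assumptions)."""
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa

def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost. A value of 1.0 means the scores
    carry no speaker-discriminating information; lower is better."""
    tar = np.log2(1 + np.exp(-np.asarray(target_llrs, dtype=float)))
    non = np.log2(1 + np.exp(np.asarray(nontarget_llrs, dtype=float)))
    return 0.5 * (tar.mean() + non.mean())
```

Unlike the detection cost, which evaluates a single hard-decision threshold, Cllr evaluates the calibration of the soft scores themselves, which is why both are often reported together.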
Speaker discrimination performance for “easy” versus “hard” voices in style-matched and -mismatched speech
This study compares human speaker discrimination performance for read speech versus casual conversations and explores differences between unfamiliar voices that are “easy” versus “hard” to “tell together” versus “tell apart.” Thirty listeners were asked whether pairs of short style-matched or -mismatched, text-independent utterances represented the same or different speakers. Listeners performed better when stimuli were style-matched, particularly in read speech–read speech trials (equal error rate, EER, of 6.96% versus 15.12% in conversation–conversation trials). In contrast, the EER was 20.68% for the style-mismatched condition. When styles were matched, listeners' confidence was higher when speakers were the same versus different; however, style variation caused decreases in listeners' confidence for the “same speaker” trials, suggesting a higher dependency of this task on within-speaker variability. The speakers who were “easy” or “hard” to “tell together” were not the same as those who were “easy” or “hard” to “tell apart.” Analysis of speaker acoustic spaces suggested that the difference observed in human approaches to “same speaker” and “different speaker” tasks depends primarily on listeners' different perceptual strategies when dealing with within- versus between-speaker acoustic variability.
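The abstract above summarizes listener performance as an equal error rate (EER): the operating point where the miss rate (same-speaker pairs judged different) equals the false-alarm rate (different-speaker pairs judged the same). A minimal sketch of computing EER from raw trial scores, assuming hypothetical same-speaker and different-speaker score arrays:

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep candidate thresholds and return the error rate at the
    point where miss and false-alarm rates are closest to equal."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tar, non]))
    best_gap, eer = np.inf, 0.0
    for t in thresholds:
        miss = np.mean(tar < t)    # same-speaker pairs wrongly rejected
        fa = np.mean(non >= t)     # different-speaker pairs wrongly accepted
        gap = abs(miss - fa)
        if gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer
```

With perfectly separated scores the EER is 0; with fully overlapping score distributions it approaches 0.5 (chance performance on this two-alternative task).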
- Award ID(s): 1704167
- PAR ID: 10363529
- Publisher / Repository: Acoustical Society of America (ASA)
- Date Published:
- Journal Name: The Journal of the Acoustical Society of America
- Volume: 151
- Issue: 2
- ISSN: 0001-4966
- Page Range / eLocation ID: p. 1393-1403
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Previous research in speech perception has shown that perception is influenced by social factors that can result in behavioral consequences such as reduced intelligibility (i.e., a listener's ability to transcribe the speech they hear). However, little is known about these effects regarding Spanish speakers' perception of heritage Spanish, the Spanish spoken by individuals who have an ancestral and cultural connection to the Spanish language. Given that ideologies within the U.S. Latino community often equate Latino identity with speaking Spanish “correctly” and proficiently, there is a clear need to understand the potential influence these ideologies have on speech perception. Using a matched-guise methodology, we analyzed the influence of speaker social background information and listener social background information on speech perception. Participants completed a transcription task in which four different Spanish heritage speakers were paired with different social guises to determine whether the speakers were perceived as equally intelligible under each guise condition. The results showed that social guise and listener social variables did not significantly predict intelligibility scores. We argue that the unique socio-political culture within the U.S. Latino community may lead to different effects of language ideology and social expectation on speech perception than what has been documented in previous work.
In the language development literature, studies often make inferences about infants’ speech perception abilities based on their responses to a single speaker. However, there can be significant natural variability across speakers in how speech is produced (i.e., inter-speaker differences). The current study examined whether inter-speaker differences can affect infants’ ability to detect a mismatch between the auditory and visual components of vowels. Using an eye-tracker, 4.5-month-old infants were tested on auditory-visual (AV) matching for two vowels (/i/ and /u/). Critically, infants were tested with two speakers who naturally differed in how distinctively they articulated the two vowels within and across the categories. Only infants who watched and listened to the speaker whose visual articulations of the two vowels were most distinct from one another were sensitive to AV mismatch. This speaker also produced a visually more distinct /i/ as compared to the other speaker. This finding suggests that infants are sensitive to the distinctiveness of AV information across speakers, and that when making inferences about infants’ perceptual abilities, characteristics of the speaker should be taken into account.
Abstract: Communicating with a speaker with a different accent can affect one’s own speech. Despite the strength of evidence for perception-production transfer in speech, the nature of transfer has remained elusive, with variable results regarding the acoustic properties that transfer between speakers and the characteristics of the speakers who exhibit transfer. The current study investigates perception-production transfer through the lens of statistical learning across passive exposure to speech. Participants experienced a short sequence of acoustically variable minimal pair (beer/pier) utterances conveying either an accent or typical American English acoustics, categorized a perceptually ambiguous test stimulus, and then repeated the test stimulus aloud. In the canonical condition, /b/–/p/ fundamental frequency (F0) and voice onset time (VOT) covaried according to typical English patterns. In the reverse condition, the F0 × VOT relationship reversed to create an “accent” with speech input regularities atypical of American English. Replicating prior studies, F0 played less of a role in perceptual speech categorization in reverse compared with canonical statistical contexts. Critically, this down-weighting transferred to production, with systematic down-weighting of F0 in listeners’ own speech productions in reverse compared with canonical contexts that was robust across male and female participants. Thus, the mapping of acoustics to speech categories is rapidly adjusted by short-term statistical learning across passive listening, and these adjustments transfer to influence listeners’ own speech productions.
This study replicates and extends the recent findings of Lee, Keating, and Kreiman [J. Acoust. Soc. Am. 146(3), 1568–1579 (2019)] on acoustic voice variation in read speech, which showed remarkably similar acoustic voice spaces for groups of female and male talkers and for the individual talkers within these groups. Principal component analysis was applied to acoustic indices of voice quality measured from phone conversations for 99 of the same 100 talkers studied previously. The acoustic voice spaces derived from spontaneous speech are highly similar to those based on read speech, except that, unlike in read speech, variability in fundamental frequency accounted for significant acoustic variability. Implications of these findings for prototype models of speaker recognition and discrimination are considered.
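The study above derives talker "acoustic voice spaces" by applying principal component analysis to voice-quality measures. A minimal numpy sketch of the technique is shown below, using random placeholder data in place of the study's actual measurements (the talker and feature counts are illustrative):

```python
import numpy as np

def pca(features, n_components=2):
    """Project rows of `features` onto their top principal components
    and report the variance ratio each component explains."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return centered @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(99, 6))  # e.g., 99 talkers x 6 voice-quality indices
scores, var_ratio = pca(X, n_components=2)
```

Each talker then occupies a point in the low-dimensional space spanned by the retained components, and the explained-variance ratios indicate how much acoustic variability each dimension captures.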
