Creators/Authors contains: "Afshan, Amber"

  2. This study compares human speaker discrimination performance for read speech versus casual conversations and explores differences between unfamiliar voices that are “easy” versus “hard” to “tell together” versus “tell apart.” Thirty listeners were asked whether pairs of short style-matched or -mismatched, text-independent utterances represented the same or different speakers. Listeners performed better when stimuli were style-matched, particularly in read speech–read speech trials (equal error rate, EER, of 6.96% versus 15.12% in conversation–conversation trials). In contrast, the EER was 20.68% for the style-mismatched condition. When styles were matched, listeners' confidence was higher when speakers were the same versus different; however, style variation caused decreases in listeners' confidence for the “same speaker” trials, suggesting a higher dependency of this task on within-speaker variability. The speakers who were “easy” or “hard” to “tell together” were not the same as those who were “easy” or “hard” to “tell apart.” Analysis of speaker acoustic spaces suggested that the difference observed in human approaches to “same speaker” and “different speaker” tasks depends primarily on listeners' different perceptual strategies when dealing with within- versus between-speaker acoustic variability.

  3. The manner in which acoustic features contribute to perceiving speaker identity remains unclear. In an attempt to better understand speaker perception, we investigated human and machine speaker discrimination with utterances shorter than 2 seconds. Sixty-five listeners performed a same vs. different task. Machine performance was estimated with i-vector/PLDA-based automatic speaker verification systems, one using mel-frequency cepstral coefficients (MFCCs) and the other using voice quality features (VQual2) inspired by a psychoacoustic model of voice quality. Machine performance was measured in terms of the detection and log-likelihood-ratio cost functions. Humans showed higher confidence for correct target decisions compared to correct non-target decisions, suggesting that they rely on different features and/or decision-making strategies when identifying a single speaker compared to when distinguishing between speakers. For non-target trials, responses were highly correlated between humans and the VQual2-based system, especially when speakers were perceptually marked. Fusing human responses with an MFCC-based system improved performance over human-only or MFCC-only results, while fusing with the VQual2-based system did not. The study is a step towards understanding human speaker discrimination strategies and suggests that automatic systems might be able to supplement human decisions, especially when speakers are marked.

  4. This pilot study investigated the feasibility of implementing child-friendly robots for administering clinical and educational assessments with young children. JIBO, a social robot, was used as a new interface to administer a letter and number naming task and the Goldman-Fristoe Test of Articulation, Third Edition (GFTA-3). These assessment materials were used in order to develop robust automatic speech recognition (ASR) and automated social interaction systems that can aid in administering such assessments more efficiently. The voice of JIBO simulates interaction with a peer, and images and playful transitions are displayed on JIBO’s face/screen. Preliminary observations with 15 prekindergarten and 18 kindergarten students covered the rate of task completion and strategies to increase student participation. Changes to the length and prompt delivery of the assessment protocol were considered based on these observations, and further observations are planned for future work with an additional cohort of 43 prekindergarten and 50 kindergarten students. Recommendations are given to inform future implementations and analyses.
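
Entries 2 and 3 report results as an equal error rate (EER) and a log-likelihood-ratio cost. As a rough illustration of what these metrics measure, the sketch below computes both from lists of trial scores. It is not the authors' evaluation code: the function names, the threshold sweep used to approximate the EER, and the toy scores are all assumptions made for this example, and the scores passed to `cllr` are assumed to be natural-log likelihood ratios.

```python
import math


def error_rates(target_scores, nontarget_scores, threshold):
    """Return (miss rate, false-alarm rate) when a trial is accepted as
    'same speaker' whenever its score is >= threshold."""
    miss = sum(s < threshold for s in target_scores) / len(target_scores)
    false_alarm = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return miss, false_alarm


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over the observed scores and
    averaging the two error rates at the point where they are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    # Add one threshold above every score so that "reject everything" is covered.
    thresholds.append(thresholds[-1] + 1.0)
    rates = (error_rates(target_scores, nontarget_scores, th) for th in thresholds)
    miss, false_alarm = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (miss + false_alarm) / 2


def cllr(target_llrs, nontarget_llrs):
    """Log-likelihood-ratio cost for scores given as natural-log likelihood ratios.
    A system that always outputs 0 (no information) scores 1.0; lower is better."""
    target_term = sum(math.log2(1 + math.exp(-s)) for s in target_llrs) / len(target_llrs)
    nontarget_term = sum(math.log2(1 + math.exp(s)) for s in nontarget_llrs) / len(nontarget_llrs)
    return 0.5 * (target_term + nontarget_term)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    same_speaker = [2.0, 3.0, 4.0, 5.0]
    different_speaker = [0.0, 1.0, 2.5, 1.5]
    print("EER:", equal_error_rate(same_speaker, different_speaker))
    print("Cllr:", cllr(same_speaker, different_speaker))
```

With human listeners, the scores would typically be derived from confidence ratings, which take few distinct values; in that case the miss and false-alarm rates may never be exactly equal, which is why the sketch averages them at the closest point rather than interpolating.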
