NSF PAR Search | NSF Public Access Repository

The JIBO Kids Corpus: A speech dataset of child-robot interactions in a classroom environment

https://doi.org/10.1121/10.0034195

Shankar, Natarajan Balaji; Afshan, Amber; Johnson, Alexander; Mahapatra, Aurosweta; Martin, Alejandra; Ni, Haolun; Park, Hae Won; Perez, Marlen Quintero; Yeung, Gary; Bailey, Alison; et al. (November 2024, JASA Express Letters)

This paper describes an original dataset of children's speech, collected through the use of JIBO, a social robot. The dataset encompasses recordings from 110 children, aged 4–7 years old, who participated in a letter and digit identification task and extended oral discourse tasks requiring explanation skills, totaling 21 h of session data. Spanning a 2-year collection period, this dataset contains a longitudinal component with a subset of participants returning for repeat recordings. The dataset, with session recordings and transcriptions, is publicly available, providing researchers with a valuable resource to advance investigations into child language development.
Attention-based conditioning methods using variable frame rate for style-robust speaker verification

https://doi.org/10.21437/Interspeech.2022-882

Afshan, Amber; Alwan, Abeer (September 2022, Interspeech Proceedings, a publication of the International Speech Communication Association (ISCA))

Full Text Available
Learning from human perception to improve automatic speaker verification in style-mismatched conditions

https://doi.org/10.21437/Interspeech.2022-883

Afshan, Amber; Alwan, Abeer (September 2022, Interspeech Proceedings)

Full Text Available
Bi-APC: Bidirectional Autoregressive Predictive Coding for Unsupervised Pre-Training and its Application to Children’s ASR

https://doi.org/10.1109/ICASSP39728.2021.9414970

Fan, Ruchao; Afshan, Amber; Alwan, Abeer (June 2021, IEEE ICASSP 2021)
Full Text Available
Speaker Discrimination in Humans and Machines: Effects of Speaking Style Variability

https://doi.org/10.21437/Interspeech.2020-3004

Afshan, Amber; Kreiman, Jody; Alwan, Abeer (October 2020, Interspeech 2020)
Full Text Available
Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification

https://doi.org/10.21437/Interspeech.2020-2957

Ravi, Vijay; Fan, Ruchao; Afshan, Amber; Lu, Huanhua; Alwan, Abeer (October 2020, Interspeech 2020)
Full Text Available
Variable Frame Rate-Based Data Augmentation to Handle Speaking-Style Variability for Automatic Speaker Verification

https://doi.org/10.21437/Interspeech.2020-3006

Afshan, Amber; Guo, Jinxi; Park, Soo Jin; Ravi, Vijay; McCree, Alan; Alwan, Abeer (October 2020, Interspeech 2020)
Full Text Available
Voice Quality and Between-Frame Entropy for Sleepiness Estimation

https://doi.org/10.21437/Interspeech.2019-2988

Ravi, Vijay; Park, Soo Jin; Afshan, Amber; Alwan, Abeer (September 2019, Interspeech 2019)

Full Text Available
Speaker discrimination performance for “easy” versus “hard” voices in style-matched and -mismatched speech

https://doi.org/10.1121/10.0009585

Afshan, Amber; Kreiman, Jody; Alwan, Abeer (February 2022, The Journal of the Acoustical Society of America)

This study compares human speaker discrimination performance for read speech versus casual conversations and explores differences between unfamiliar voices that are “easy” versus “hard” to “tell together” versus “tell apart.” Thirty listeners were asked whether pairs of short style-matched or -mismatched, text-independent utterances represented the same or different speakers. Listeners performed better when stimuli were style-matched, particularly in read speech–read speech trials (equal error rate, EER, of 6.96% versus 15.12% in conversation–conversation trials). In contrast, the EER was 20.68% for the style-mismatched condition. When styles were matched, listeners' confidence was higher when speakers were the same versus different; however, style variation caused decreases in listeners' confidence for the “same speaker” trials, suggesting a higher dependency of this task on within-speaker variability. The speakers who were “easy” or “hard” to “tell together” were not the same as those who were “easy” or “hard” to “tell apart.” Analysis of speaker acoustic spaces suggested that the difference observed in human approaches to “same speaker” and “different speaker” tasks depends primarily on listeners' different perceptual strategies when dealing with within- versus between-speaker acoustic variability.
Target and Non-target Speaker Discrimination by Humans and Machines

https://doi.org/10.1109/ICASSP.2019.8683362

Park, Soo Jin; Afshan, Amber; Kreiman, Jody; Yeung, Gary; Alwan, Abeer (May 2019, IEEE ICASSP 2019)

The manner in which acoustic features contribute to perceiving speaker identity remains unclear. In an attempt to better understand speaker perception, we investigated human and machine speaker discrimination with utterances shorter than 2 seconds. Sixty-five listeners performed a same vs. different task. Machine performance was estimated with i-vector/PLDA-based automatic speaker verification systems, one using mel-frequency cepstral coefficients (MFCCs) and the other using voice quality features (VQual2) inspired by a psychoacoustic model of voice quality. Machine performance was measured in terms of the detection and log-likelihood-ratio cost functions. Humans showed higher confidence for correct target decisions compared to correct non-target decisions, suggesting that they rely on different features and/or decision making strategies when identifying a single speaker compared to when distinguishing between speakers. For non-target trials, responses were highly correlated between humans and the VQual2-based system, especially when speakers were perceptually marked. Fusing human responses with an MFCC-based system improved performance over human-only or MFCC-only results, while fusing with the VQual2-based system did not. The study is a step towards understanding human speaker discrimination strategies and suggests that automatic systems might be able to supplement human decisions especially when speakers are marked.
Full Text Available
