Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models

Fan, Ruchao; Balaji_Shankar, Natarajan; Alwan, Abeer

doi:10.21437/Interspeech.2024-1353

Speech foundation models (SFMs) have achieved state-of-the-art results for various speech tasks in supervised (e.g., Whisper) or self-supervised systems (e.g., WavLM). However, the performance of SFMs for child ASR has not been systematically studied. In addition, there is no benchmark for child ASR with standard evaluations, which makes it difficult to compare novel ideas. In this paper, we initiate and present a comprehensive benchmark on several child speech databases based on various SFMs (Whisper, Wav2vec2.0, HuBERT, and WavLM). Moreover, we investigate finetuning strategies by comparing various data augmentation and parameter-efficient finetuning (PEFT) methods. We observe that these methods behave differently as the model size increases. For example, PEFT matches the performance of full finetuning for large models but performs worse for small models. To stabilize finetuning using augmented data, we propose a perturbation invariant finetuning (PIF) loss as a regularizer.
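The abstract names the PIF loss but does not define it, so the following is only a minimal sketch of the general idea of an invariance regularizer, not the authors' formulation. It assumes a task loss (cross-entropy) on the perturbed input plus a weighted mean-squared-error penalty between the model outputs for the clean and perturbed versions of the same utterance. The function names, the MSE distance, and the `weight` value are all hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def mean_squared_error(a, b):
    """Average squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def regularized_loss(clean_logits, perturbed_logits, target, weight=0.5):
    """Task loss on the perturbed input plus an invariance penalty.

    ASSUMPTION: the MSE penalty and the default weight are illustrative
    only; the source does not specify how the PIF loss is computed.
    """
    task = cross_entropy(perturbed_logits, target)
    invariance = mean_squared_error(clean_logits, perturbed_logits)
    return task + weight * invariance
```

When the clean and perturbed outputs are identical, the penalty is zero and the total reduces to the task loss; the penalty grows as augmentation pushes the two outputs apart, which is the sense in which such a term could stabilize finetuning on augmented data.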

Award ID(s): 2202585

PAR ID: 10582852

Author(s) / Creator(s): Fan, Ruchao; Balaji_Shankar, Natarajan; Alwan, Abeer

Publisher / Repository: ISCA Interspeech Proceeding

Date Published: 2024-09-01

Page Range / eLocation ID: 5173 to 5177

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.21437/Interspeech.2024-1353
