NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Selective Attention Merging for low resource tasks: A case study of Child ASR

https://doi.org/10.1109/ICASSP49660.2025.10887889

Shankar, Natarajan Balaji; Wang, Zilai; Eren, Eray; Alwan, Abeer (April 2025, IEEE Signal Processing Society)

While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. This paper also introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors from attention matrices to enhance SFM performance on low-resource tasks. Experiments on the MyST database show significant reductions in relative word error rate of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation techniques with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.
more » « less
Free, publicly-accessible full text available April 6, 2026
The Ohio Child Speech Corpus

https://doi.org/10.1016/j.specom.2025.103206

Wagner, Laura; Alghowinhem, Sharifa; Alwan, Abeer; Bowdrie, Kristina; Breazeal, Cynthia; Clopper, Cynthia G; Fosler-Lussier, Eric; Jamsek, Izabela A; Lander, Devan; Ramnath, Rajiv; et al (January 2025, Speech Communication)

This paper reports on the creation and composition of a new corpus of children’s speech, the Ohio Child Speech Corpus, which is publicly available on the Talkbank-CHILDES website. The audio corpus contains speech samples from 303 children ranging in age from 4 – 9 years old, all of whom participated in a seven-task elicitation protocol conducted in a science museum lab. In addition, an interactive social robot controlled by the researchers joined the sessions for approximately 60% of the children, and the corpus itself was collected in the peri‑pandemic period. Two analyses are reported that highlighted these last two features. One set of analyses found that the children spoke significantly more in the presence of the robot relative to its absence, but no effects of speech complexity (as measured by MLU) were found for the robot’s presence. Another set of analyses compared children tested immediately post-pandemic to children tested a year later on two school-readiness tasks, an Alphabet task and a Reading Passages task. This analysis showed no negative impact on these tasks for our highly-educated sample of children just coming off of the pandemic relative to those tested later. These analyses demonstrate just two possible types of questions that this corpus could be used to investigate.
more » « less
Free, publicly-accessible full text available January 5, 2026
The JIBO Kids Corpus: A speech dataset of child-robot interactions in a classroom environment

https://doi.org/10.1121/10.0034195

Shankar, Natarajan_Balaji; Afshan, Amber; Johnson, Alexander; Mahapatra, Aurosweta; Martin, Alejandra; Ni, Haolun; Park, Hae_Won; Perez, Marlen_Quintero; Yeung, Gary; Bailey, Alison; et al (November 2024, JASA Express Letters)

This paper describes an original dataset of children's speech, collected through the use of JIBO, a social robot. The dataset encompasses recordings from 110 children, aged 4–7 years old, who participated in a letter and digit identification task and extended oral discourse tasks requiring explanation skills, totaling 21 h of session data. Spanning a 2-year collection period, this dataset contains a longitudinal component with a subset of participants returning for repeat recordings. The dataset, with session recordings and transcriptions, is publicly available, providing researchers with a valuable resource to advance investigations into child language development.
more » « less
Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models

https://doi.org/10.21437/Interspeech.2024-1353

Fan, Ruchao; Balaji_Shankar, Natarajan; Alwan, Abeer (September 2024, ISCA Interspeech Proceeding)

Speech foundation models (SFMs) have achieved state-of- the-art results for various speech tasks in supervised (e.g. Whis- per) or self-supervised systems (e.g. WavLM). However, the performance of SFMs for child ASR has not been systemati- cally studied. In addition, there is no benchmark for child ASR with standard evaluations, making the comparisons of novel ideas difficult. In this paper, we initiate and present a compre- hensive benchmark on several child speech databases based on various SFMs (Whisper, Wav2vec2.0, HuBERT, and WavLM). Moreover, we investigate finetuning strategies by comparing various data augmentation and parameter-efficient finetuning (PEFT) methods. We observe that the behaviors of these meth- ods are different when the model size increases. For example, PEFT matches the performance of full finetuning for large mod- els but worse for small models. To stabilize finetuning using augmented data, we propose a perturbation invariant finetuning (PIF) loss as a regularization.
more » « less
Full Text Available
CORAAL QA: A Dataset and Framework for Open Domain Spontaneous Speech Question Answering from Long Audio Files

https://doi.org/10.1109/ICASSP48485.2024.10447109

Shankar, Natarajan Balaji; Johnson, Alexander; Chance, Christina; Veeramani, Hariram; Alwan, Abeer (April 2024, Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing)

This paper presents a novel dataset (CORAAL QA) and framework for audio question-answering from long audio recordings contain- ing spontaneous speech. The dataset introduced here provides sets of questions that can be factually answered from short spans of a long audio files (typically 30min to 1hr) from the Corpus of Re- gional African American Language. Using this dataset, we divide the audio recordings into 60 second segments, automatically tran- scribe each segment, and use PLDA scoring of BERT-based seman- tic embeddings to rank the relevance of ASR transcript segments in answering the target question. In order to improve this framework through data augmentation, we use large language models including ChatGPT and Llama 2 to automatically generate further training ex- amples and show how prompt engineering can be optimized for this process. By creatively leveraging knowledge from large-language models, we achieve state-of-the-art question-answering performance in this information retrieval task.
more » « less
SOA: Reducing Domain Mismatch in SSL Pipeline by Speech Only Adaptation for Low Resource ASR

https://doi.org/10.1109/ICASSPW62465.2024.10625884

Shankar, Natarajan Balaji; Fan, Ruchao; Alwan, Abeer (April 2024, IEEE Signal Processing Society)

Recently, speech foundation models have gained popularity due to their superiority in finetuning downstream ASR tasks. However, models finetuned on certain domains, such as LibriSpeech (adult read speech), behave poorly on other domains (child or noisy speech). One solution could be collecting as much labeled and diverse data as possible for joint finetuning on various domains. However, collecting target domain speech-text paired data and retraining the model is often costly and computationally expensive. In this paper, we introduce a simple yet effective method, speech only adaptation (SOA), based on speech foundation models (Wav2vec 2.0), which requires only speech input data from the target domain. Specifically, the Wav2vec 2.0 feature encoder is continually pretrained with the Wav2vec 2.0 loss on both the source and target domain data for domain adaptation, while the contextual encoder is frozen. Compared to a source domain finetuned model with the feature encoder being frozen during training, we find that replacing the frozen feature encoder with the adapted one provides significant WER improvements to the target domain while preserving the performance of the source domain. The effectiveness of SOA is examined on various low resource or domain mismatched ASR settings, including adult-child and clean-noisy speech.
more » « less
Full Text Available
SOA: Reducing domain mismatch in SSL Pipeline by Speech Only Adaptation for low resource ASR

Natarajan Balaji Shankar, Ruchao Fan (April 2024, IEEE ICASSP Workdhop)

Full Text Available
An exploratory study on dialect density estimation for children and adult's African American English

https://doi.org/10.1121/10.0025771

Johnson, Alexander; Shankar, Natarajan Balaji; Ostendorf, Mari; Alwan, Abeer (April 2024, The Journal of the Acoustical Society of America)

This paper evaluates an innovative framework for spoken dialect density prediction on children's and adults' African American English. A speaker's dialect density is defined as the frequency with which dialect-specific language characteristics occur in their speech. Rather than treating the presence or absence of a target dialect in a user's speech as a binary decision, instead, a classifier is trained to predict the level of dialect density to provide a higher degree of specificity in downstream tasks. For this, self-supervised learning representations from HuBERT, handcrafted grammar-based features extracted from ASR transcripts, prosodic features, and other feature sets are experimented with as the input to an XGBoost classifier. Then, the classifier is trained to assign dialect density labels to short recorded utterances. High dialect density level classification accuracy is achieved for child and adult speech and demonstrated robust performance across age and regional varieties of dialect. Additionally, this work is used as a basis for analyzing which acoustic and grammatical cues affect machine perception of dialect.
more » « less
Full Text Available
UniEnc-CASSNAT: An Encoder-Only Non-Autoregressive ASR for Speech SSL Models

https://doi.org/10.1109/LSP.2024.3365036

Fan, Ruchao; Shankar, Natarajan Balaji; Alwan, Abeer (January 2024, IEEE Signal Processing Letters)

Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. The encoder-based NASR, e.g. connectionist temporal classification (CTC), can be initialized from the speech foundation models (SFM) but does not account for any dependencies among intermediate tokens. The encoder-decoder-based NASR, like CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but is not able to efficiently integrate SFM. Inspired by the success of recent work of speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder by two forward passes. The first pass of the encoder accepts the speech signal as input, while the concatenation of the speech signal and the token-level acoustic embedding is used as the input for the second pass. Examined on the Librispeech 100 h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and is better or comparable to CASS-NAT with only an encoder and hence, fewer model parameters.
more » « less
Full Text Available
An Equitable Framework for Automatically Assessing Children's Oral Narrative Language Abilities

https://doi.org/10.21437/Interspeech.2023-1257

Johnson, Alexander; Veeramani, Hariram; Balaji Shankar, Natarajan; Alwan, Abeer (August 2023, Prodeedings of Interspeech 2023)

This work proposes a novel framework for automatically scor- ing children’s oral narrative language abilities. We use audio recordings from 3rd-8th graders of the Atlanta, Georgia area as they take a portion of the Test of Narrative Language. We de- sign a system which extracts linguistic features and fine-tuned BERT-based self-supervised learning representation from state- of-the-art ASR transcripts. We predict manual test scores from the extracted features. This framework significantly outper- forms a deterministic method based on the assessment’s scoring rubric. Last, we evaluate the system performance across stu- dent’s reading level, dialect, and diagnosed learning/language disabilities to establish fairness across diverse demographics of students. Using this system, we achieve approximately 98% classification accuracy of student scores. We are also able to identify key areas of improvement for this type of system across demographic areas and reading ability.
more » « less
Full Text Available

« Prev Next »

Search for: All records