skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Award ID contains: 2008043

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. In this work, we explore how a real time reading tracker can be built efficiently for children’s voices. While previously proposed reading trackers focused on ASR-based cascaded approaches, we propose a fully end-to-end model making it less prone to lags in voice tracking. We employ a pointer network that directly learns to predict positions in the ground truth text conditioned on the streaming speech. To train this pointer network, we generate ground truth training signals by using forced alignment between the read speech and the text being read on the training set. Exploring different forced alignment models, we find a neural attention based model is at least as close in alignment accuracy to the Montreal Forced Aligner, but surprisingly is a better training signal for the pointer network. Our results are reported on one adult speech data (TIMIT) and two children’s speech datasets (CMU Kids and Reading Races). Our best model can accurately track adult speech with 87.8% accuracy and the much harder and disfluent children’s speech with 77.1% accuracy on CMU Kids data and a 65.3% accuracy on the Reading Races dataset. 
    more » « less
  2. This corpus was collected in the Language Sciences Research Lab, a working lab embedded inside of a science museum: the Center of Science and Industry in Columbus, Ohio, USA. Participants were recruited from the floor of the museum and run in a semi-public space. Three distinctive features of the corpus are: (1) an interactive social robot (specifically, a Jibo robot) was present and participated in the sessions for roughly half the children; (2) all children were recorded with a lapel mic generating high quality audio (available through CHILDES), as well as a distal table mic generating low quality audio (available on request) to facilitate strong tests of automated speech processing on the data; and (3) the data were collected in the peri-pandemic period, beginning in the summer of 2021 just after COVID-19 restrictions were being eased and ending in the summer of 2022 – thus providing a snapshot of language development in a distinctive time of the world. A YouTube video on the Jibo robot is available here . 
    more » « less
  3. Dialog history enhances downstream classification performance in both speech and text based dialog systems. However, there still exists a gap in dialog history integration in a fully end-to-end (E2E) spoken dialog system (SDS) versus a textual dia- log system. Text-based dialog systems use large language models (LLMs) to encode long-range dependencies by attending to the entire conversation as a contiguous token sequence. This is not possible in an E2E SDS, as speech sequences can be intractably long. We propose a convolution subsampling approach to make the speech sequence of a conversation tractable and use a conformer to attend to the speech-based conversation in a fine-grained manner. This model is further enhanced via a conversation-level knowledge transfer from a LLM using a token-level alignment strategy. Finetuning the E2E model pretrained this way gives significant gains, of up to 8%, over strong non-contextual baselines in the E2E dialog act classification task on two datasets. 
    more » « less
  4. RNN Tranducer (RNN-T) technology is very popular for building deployable models for end-to-end (E2E) automatic speech recognition (ASR) and spoken language understanding (SLU). Since these are E2E models operating on speech directly, there remains a potential to improve their performance using purely text based models like BERT, which have strong language understanding capabilities. In this paper, we propose a new training criteria for RNN-T based E2E ASR and SLU to transfer BERT’s knowledge into these systems. In the first stage of our proposed mechanism, we improve ASR performance by using a fine-grained, tokenwise knowledge transfer from BERT. In the second stage, we fine-tune the ASR model for SLU such that the above knowledge is explicitly utilized by the RNN-T model for improved performance. Our techniques improve ASR performance on the Switchboard and CallHome test sets of the NIST Hub5 2000 evaluation and on the recently released SLURP dataset on which we achieve a new state-of-the-art performance. For SLU, we show significant improvements on the SLURP slot filling task, outperforming HuBERT-base and reaching a performance close to HuBERTlarge. Compared to large transformer based speech models like HuBERT, our model is significantly more compact and uses only 300 hours of speech pretraining data. 
    more » « less
  5. Disfluency detection and classification on children’s speech has a great potential for teaching reading skills. Word-level assessment of children’s speech can help teachers to effectively gauge their students’ progress. Hence, we propose a novel attention-based model to perform word-level disfluency detection and classification in a fully end-to-end (E2E) manner making it fast and easy to use. We develop a word-level disfluency annotation scheme using which we annotate a dataset of children read speech, the reading races dataset (READR). We also annotate disfluencies in the existing CMU Kids corpus. The proposed model significantly outperforms traditional cascaded baselines, which use forced alignments, on both datasets. To deal with the inevitable class-imbalance in the datasets, we propose a novel technique called HiDeC (Hierarchical Detection and Classification) which yields a detection improvement of 23% and 16% and a classification improvement of 3.8% and 19.3% relative F1-score on the READR and CMU Kids datasets respectively. 
    more » « less