Training Spoken Language Understanding Systems with Non-Parallel Speech and Text

Sari, Leda; Thomas, Samuel; Hasegawa-Johnson, Mark

doi:10.1109/ICASSP40776.2020.9054664

Citation Details

End-to-end spoken language understanding (SLU) systems are typically trained on large amounts of data. In many practical scenarios, however, labeled speech is scarce compared with text. In this study, we investigate the use of non-parallel speech and text to improve the performance of dialog act recognition as an example SLU task. We propose a multiview architecture that can handle each modality separately. To train effectively on such data, this model constrains the internal speech and text encodings to be similar by using a shared classifier. On the Switchboard Dialog Act corpus, we show that pretraining the classifier using large amounts of text helps learn better speech encodings, resulting in up to 40% relative improvement in classification accuracy. We also show that when the speech embeddings from an automatic speech recognition (ASR) system are used in this framework, the speech-only accuracy exceeds the performance of ASR-text based tests by up to 15% relative and approaches the performance of using true transcripts.
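The multiview idea in the abstract (one encoder per modality, one classifier shared by both) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the class names `Linear` and `MultiviewSLU`, the feature dimensions, the number of classes, and the use of a single fixed-size feature vector per utterance are all assumptions made for illustration, and no training loop is shown.

```python
import math
import random


class Linear:
    """Dense layer y = W x + b with small random weights (illustrative only)."""

    def __init__(self, n_in, n_out, rng):
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        self.b = [0.0] * n_out

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(self.w, self.b)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class MultiviewSLU:
    """Hypothetical sketch: separate speech/text encoders, one shared classifier."""

    def __init__(self, speech_dim, text_dim, shared_dim, n_classes, seed=0):
        rng = random.Random(seed)
        # Each modality has its own encoder into a common embedding space.
        self.encoders = {
            "speech": Linear(speech_dim, shared_dim, rng),
            "text": Linear(text_dim, shared_dim, rng),
        }
        # The same classifier consumes encodings from either modality, so a
        # speech-only or text-only (non-parallel) example can be scored.
        self.classifier = Linear(shared_dim, n_classes, rng)

    def encode(self, x, view):
        return [math.tanh(v) for v in self.encoders[view](x)]

    def predict(self, x, view):
        return softmax(self.classifier(self.encode(x, view)))


# Assumed sizes: 40-dim speech features, 16-dim text features, 4 dialog acts.
model = MultiviewSLU(speech_dim=40, text_dim=16, shared_dim=8, n_classes=4)
speech_probs = model.predict([0.5] * 40, "speech")
text_probs = model.predict([1.0] * 16, "text")
```

Because both views end in the same classifier, a classifier first fitted on text encodings could then be reused while the speech encoder is trained, which is one plausible reading of the text pretraining step described in the abstract.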

Award ID(s): 1910319

PAR ID: 10147858

Author(s) / Creator(s): Sari, Leda; Thomas, Samuel; Hasegawa-Johnson, Mark

Date Published: 2020-05-04

Journal Name: IEEE ICASSP 2020

Page Range / eLocation ID: 8109 to 8113

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1109/ICASSP40776.2020.9054664
