A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions

Wang, Liming; Hasegawa-Johnson, Mark

doi:10.21437/Interspeech.2020-1148

Citation Details

A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions

Discovering word-like units without textual transcriptions is an important step in low-resource speech technology. In this work,we demonstrate a model inspired by statistical machine translation and hidden Markov model/deep neural network (HMM-DNN) hybrid systems. Our learning algorithm is capable of discovering the visual and acoustic correlates of distinct words in an unknown language by simultaneously learning the map-ping from image regions to concepts (the first DNN), the map-ping from acoustic feature vectors to phones (the second DNN),and the optimum alignment between the two (the HMM). In the simulated low-resource setting using MSCOCO and Speech-COCO datasets, our model achieves 62.4 % alignment accuracy and outperforms the audio-only segmental embedded GMM approach on standard word discovery evaluation metrics. more »

Award ID(s):: 1910319

PAR ID:: 10273579

Author(s) / Creator(s):: Wang, Liming; Hasegawa-Johnson, Mark

Date Published:: 2020-10-25

Journal Name:: Interspeech

Page Range / eLocation ID:: 1456 to 1460

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
https://doi.org/10.21437/Interspeech.2020-1148

More Like this