Discovering word-like units without textual transcriptions is an important step in low-resource speech technology. In this work, we demonstrate a model inspired by statistical machine translation and hidden Markov model/deep neural network (HMM-DNN) hybrid systems. Our learning algorithm is capable of discovering the visual and acoustic correlates of distinct words in an unknown language by simultaneously learning the mapping from image regions to concepts (the first DNN), the mapping from acoustic feature vectors to phones (the second DNN), and the optimum alignment between the two (the HMM). In a simulated low-resource setting using the MSCOCO and Speech-COCO datasets, our model achieves 62.4% alignment accuracy and outperforms the audio-only segmental embedded GMM approach on standard word discovery evaluation metrics.
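The alignment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the two DNNs have already been reduced to a table of log-scores `log_emit[t, k]` (how well acoustic unit `t` matches image concept `k`), and it uses plain Viterbi decoding over an HMM whose states are the image concepts. The function names `viterbi_align` and `alignment_accuracy`, the transition values, and the toy scores are all hypothetical, and the accuracy definition (fraction of units assigned to the reference concept) is an assumption rather than the paper's stated metric.

```python
import numpy as np


def viterbi_align(log_emit, log_trans, log_init):
    """Return the most likely concept index for each acoustic unit.

    log_emit:  (T, K) log-score of unit t under concept k (assumed given).
    log_trans: (K, K) log-probability of moving from concept i to concept j.
    log_init:  (K,)   log-probability of starting in each concept.
    """
    T, K = log_emit.shape
    delta = np.empty((T, K))            # best log-score ending in state k at time t
    back = np.zeros((T, K), dtype=int)  # best predecessor of state k at time t
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        # scores[i, j]: best path ending in i at t-1, then moving to j
        scores = delta[t - 1][:, None] + log_trans
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    # Trace back from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


def alignment_accuracy(predicted, reference):
    """Fraction of units aligned to the reference concept (assumed definition)."""
    predicted = np.asarray(predicted)
    reference = np.asarray(reference)
    return float((predicted == reference).mean())


# Toy example: 4 acoustic units, 2 image concepts (made-up scores)
log_emit = np.log(np.array([[0.9, 0.1],
                            [0.8, 0.2],
                            [0.1, 0.9],
                            [0.2, 0.8]]))
log_trans = np.log(np.array([[0.7, 0.3],
                             [0.3, 0.7]]))
log_init = np.log(np.array([0.5, 0.5]))

path = viterbi_align(log_emit, log_trans, log_init)
accuracy = alignment_accuracy(path, [0, 0, 1, 1])
```

In the model summarized above, the emission scores would come from the jointly trained networks rather than a fixed table; the sketch only shows how an alignment could be decoded once such scores exist.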
A DNN-Ensemble Method for Error Reduction and Training Data Selection in DNN Based Modeling
- Award ID(s): 1916535
- Publication Date:
- NSF-PAR ID: 10393482
- Journal Name: 2022 IEEE International Symposium on Electromagnetic Compatibility & Signal/Power Integrity (EMCSI)
- Page Range or eLocation-ID: 175 to 180
- Sponsoring Org: National Science Foundation