Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Wang, Liming; Wang, Xinsheng; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim

doi:10.1109/ICASSP39728.2021.9414418

Citation Details

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment or attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural machine translation (MT) with self-attention and statistical MT achieve word discovery scores superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% in alignment F1 score, respectively.

Award ID(s): 1910319

PAR ID: 10273611

Author(s) / Creator(s): Wang, Liming; Wang, Xinsheng; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim

Date Published: 2021-06-06

Journal Name: ICASSP

Page Range / eLocation ID: 7603 to 7607

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/ICASSP39728.2021.9414418
