Title: Integrating Self-Supervised Speech Model with Pseudo Word-Level Targets from Visually-Grounded Speech Model
Award ID(s):
2238605
PAR ID:
10567308
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-7451-3
Page Range / eLocation ID:
645 to 649
Format(s):
Medium: X
Location:
Seoul, Korea, Republic of
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Entrainment, the phenomenon of conversational partners’ speech becoming more similar to each other, is generally accepted to be an important aspect of human-human and human-machine communication. However, there is a gap between accepted psycholinguistic models of entrainment and the body of empirical findings, which includes a large number of unexplained negative results. Existing research does not provide insights specific enough to guide the implementation of entraining spoken dialogue systems or the interpretation of entrainment as a measure of quality. A more integrated model of entrainment is proposed, which looks for consistent explanations of entrainment behavior on specific features and how they interact with speaker, session, and utterance characteristics. 
  2. Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), we experience phonetic constancy and typically perceive what a speaker intends. Models of human speech recognition have side-stepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, automatic speech recognition systems powered by deep learning networks have enabled robust, real-world speech recognition. However, the complexities of deep learning architectures and training regimens make it difficult to use them to provide direct insights into mechanisms that may support human speech recognition. We developed a simple network that borrows one element from automatic speech recognition (long short-term memory nodes, which provide dynamic memory over short and long spans). This allows the network to learn to map real speech from multiple talkers to semantic targets with high accuracy. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of human speech recognition.
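The abstract above names the single element the model borrows (long short-term memory nodes) but gives no implementation details. As a rough illustration of that dynamic-memory mechanism only, here is a minimal single-timestep LSTM cell in plain NumPy; the gate layout and dimensions are illustrative assumptions, not the authors' architecture:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM timestep.

    x: input (e.g. one frame of acoustic features); h, c: previous
    hidden and cell state. W, U, b stack the input, forget, candidate,
    and output gate parameters along the first axis (4*H rows for
    hidden size H).
    """
    H = h.size
    z = W @ x + U @ h + b
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    i = sigmoid(z[0 * H:1 * H])    # input gate
    f = sigmoid(z[1 * H:2 * H])    # forget gate
    g = np.tanh(z[2 * H:3 * H])    # candidate cell update
    o = sigmoid(z[3 * H:4 * H])    # output gate
    c_new = f * c + i * g          # gated cell state: the "dynamic memory"
    h_new = o * np.tanh(c_new)     # hidden state passed to the next step
    return h_new, c_new
```

A full model would unroll this step over a sequence of acoustic frames and feed the resulting hidden states into a layer trained against semantic targets.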
  3. Large language models (LLMs) are fast becoming ubiquitous and have shown impressive performance in various natural language processing (NLP) tasks. Annotating data for downstream applications is a resource-intensive task in NLP. Recently, the use of LLMs as cost-effective data annotators, for labeling data used to train other models or as an assistive tool, has been explored. Yet little is known about the societal implications of using LLMs for data annotation. In this work, focusing on hate speech detection, we investigate how using LLMs such as GPT-4 and Llama-3 can produce disparate performance across text dialects and introduce racial bias into online hate detection classifiers. We used LLMs to predict hate speech in seven hate speech datasets and trained classifiers on the LLM annotations of each dataset. Using tweets written in African-American English (AAE) and Standard American English (SAE), we show that classifiers trained on LLM annotations assign tweets written in AAE to negative classes (e.g., hate, offensive, abuse, racism) at a higher rate than tweets written in SAE, and that the classifiers have a higher false positive rate on AAE tweets. We explore the effect of incorporating dialect priming in the prompts used for prediction, showing that introducing dialect increases the rate at which AAE tweets are assigned to negative classes.
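The disparity reported above comes down to comparing false positive rates between dialect groups. A minimal sketch of that computation follows; the labels and predictions here are invented toy values for illustration, not the paper's data:

```python
# Per-dialect false positive rate for a binary hate speech classifier.
# Convention assumed here: 1 = negative class (hate), 0 = not hate.

def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN) over the true-negative examples."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return fp / (fp + tn) if (fp + tn) else 0.0

# Toy gold labels and classifier predictions, split by dialect group.
groups = {
    "AAE": ([0, 0, 0, 0, 1], [1, 0, 1, 0, 1]),  # 2 FPs among 4 negatives
    "SAE": ([0, 0, 0, 0, 1], [0, 0, 1, 0, 1]),  # 1 FP among 4 negatives
}
fpr = {g: false_positive_rate(t, p) for g, (t, p) in groups.items()}
```

A gap such as `fpr["AAE"] > fpr["SAE"]` is the kind of disparity the study measures, here driven entirely by the made-up labels above.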
  4. This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers. 
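The speaker-grouping and gating steps described in the abstract above can be sketched as follows. This is a simplified stand-in, not the authors' code: the Siamese network is abstracted away, `X` stands for the averaged per-speaker embedding vectors it would produce, and gating is reduced to nearest-centroid selection:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means over per-speaker embedding vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assign each speaker embedding to its nearest centroid
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep old centroid if a cluster empties
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

def select_specialist(test_embedding, centroids):
    """Gating: route a test utterance to the nearest speaker group."""
    return int(np.argmin(((centroids - test_embedding) ** 2).sum(-1)))
```

Each cluster index would correspond to one specialist denoising module trained on that partition of training-set speakers; at test time, only the module returned by `select_specialist` runs, which is what keeps the ensemble sparsely active.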