Textless Speech-to-Speech Translation With Limited Parallel Data

Diwan, Anuj; Srinivasan, Anirudh; Harwath, David; Choi, Eunsol

doi:10.18653/v1/2024.findings-emnlp.951

Speech conveys both linguistic messages and a wealth of social and identity information about a talker. This information arrives as complex variations across many acoustic dimensions. Ultimately, speech communication depends on experience within a language community to develop shared long-term knowledge of the mapping from acoustic patterns to the category distinctions that support word recognition, emotion evaluation, and talker identification. A great deal of research has focused on the learning involved in acquiring long-term knowledge to support speech categorization. Inadvertently, this focus may give the impression of a mature learning endpoint. Instead, there seems to be no firm line between perception and learning in speech. The contributions of acoustic dimensions are malleably reweighted continuously as a function of regularities evolving in short-term input. In this way, continuous learning across speech impacts the very nature of the mapping from sensory input to perceived category. This article presents a case study in understanding how incoming sensory input—and the learning that takes place across it—interacts with existing knowledge to drive predictions that tune the system to support future behavior.

More Like this