Interface Design for Self-Supervised Speech Models

Shih, Yi-Jen; Harwath, David


Self-supervised learning (SSL) speech models have recently been widely adopted for many downstream speech processing tasks. Typically, an SSL model is used as a feature extractor, and a downstream prediction head is trained on its features for a specific task. However, different layers of an SSL model capture different types of information, and methods for combining these layers remain underexplored. To address this, the authors propose a general framework for SSL model utilization built around the concept of an interface that connects the upstream model to the downstream prediction head. Under this view, the common technique of combining features via a layerwise weighted sum is one specific interface. The authors propose several alternative interface designs and show that the weighted sum interface is suboptimal for many tasks. In particular, they demonstrate that a convolutional interface whose depth scales logarithmically with the depth of the upstream model consistently outperforms other designs.
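As an illustration of the baseline named in the abstract, the layerwise weighted sum can be sketched in plain Python as follows. This is a minimal sketch, not the authors' implementation: the softmax normalization of the layer weights, the nested-list feature layout (layers, frames, dimensions), and all function names are assumptions made for the example. The helper `log_interface_depth` is hypothetical and only shows what "depth scaling logarithmically" could look like if the base were 2; the abstract does not state the base or the rounding.

```python
import math


def softmax(weights):
    """Normalize raw layer weights into positive values that sum to 1."""
    peak = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum_interface(layer_features, raw_weights):
    """Combine per-layer features into a single feature sequence.

    layer_features: list of L layers, each a list of T frames,
                    each frame a list of D floats (assumed layout).
    raw_weights:    list of L unnormalized scalars; in practice these
                    would be learned jointly with the prediction head.
    """
    if len(layer_features) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    norm = softmax(raw_weights)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(norm, layer_features):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += weight * layer[t][d]
    return combined


def log_interface_depth(num_upstream_layers):
    """Hypothetical helper: interface depth growing as log2 of upstream depth.

    Base 2 and ceiling rounding are assumptions for illustration only.
    """
    return max(1, math.ceil(math.log2(num_upstream_layers)))


# Toy input: 2 layers, 2 frames, 2 feature dimensions.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
# Equal raw weights give each layer a normalized weight of 0.5.
combined = weighted_sum_interface(features, [0.0, 0.0])
```

With equal weights the result is the per-element average of the two layers. The weighted sum mixes layers with one scalar per layer, which is the restriction the alternative interface designs in the paper are meant to relax.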

Award ID(s): 2505865

PAR ID: 10631929

Author(s) / Creator(s): Shih, Yi-Jen; Harwath, David

Publisher / Repository: https://doi.org/10.48550/arXiv.2406.12209

Date Published: 2024-06-18

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
The DOI is not currently available.
