Multi-Channel Conversational Speaker Separation via Neural Diarization

Taherian, Hassan; Wang, DeLiang

doi:10.1109/TASLP.2024.3393726

Citation Details

Multi-Channel Conversational Speaker Separation via Neural Diarization

When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades as they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid many speakers inside the window and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called “speaker separation via neural diarization” (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments—a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin. more »

Award ID(s):: 2125074

PAR ID:: 10552805

Author(s) / Creator(s):: Taherian, Hassan; Wang, DeLiang

Publisher / Repository:: IEEE

Date Published:: 2024-01-01

Journal Name:: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Volume:: 32

ISSN:: 2329-9290

Page Range / eLocation ID:: 2467 to 2476

Subject(s) / Keyword(s):: Multi-channel speaker diarization conversational speaker separation location-based training multi-speaker speech recognition

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Journal Article:
https://doi.org/10.1109/TASLP.2024.3393726

More Like this