NSF PAR Search | NSF Public Access Repository

Elevating Robust ASR By Decoupling Multi-Channel Speaker Separation and Speech Recognition

https://doi.org/10.1109/ICASSP49660.2025.10888074

Yang, Yufeng; Taherian, Hassan; Kalkhorani, Vahid Ahmadi; Wang, DeLiang (April 2025, IEEE)

Free, publicly-accessible full text available April 6, 2026
Towards Explainable Monaural Speaker Separation with Auditory-based Training

Taherian, Hassan; Kalkhorani, Vahid Ahmadi; Pandey, Ashutosh; Wong, Daniel; Xu, Buye; Wang, DeLiang (September 2024, International Speech Communication Association)

Full Text Available
Leveraging Sound Localization to Improve Continuous Speaker Separation

https://doi.org/10.1109/ICASSP48485.2024.10446934

Taherian, Hassan; Pandey, Ashutosh; Wong, Daniel; Xu, Buye; Wang, DeLiang (April 2024, IEEE)

Continuous speaker separation aims to separate overlapping speakers in real-world environments like meetings, but it often falls short in isolating speech segments of a single speaker. This leads to split signals that adversely affect downstream applications such as automatic speech recognition and speaker diarization. Existing solutions like speaker counting have limitations. This paper presents a novel multi-channel approach for continuous speaker separation based on multi-input multi-output (MIMO) complex spectral mapping. This MIMO approach enables robust speaker localization by preserving inter-channel phase relations. Speaker localization as a byproduct of the MIMO separation model is then used to identify single-talker frames and reduce speaker splitting. We demonstrate that this approach achieves superior frame-level sound localization. Systematic experiments on the LibriCSS dataset further show that the proposed approach outperforms other methods, advancing state-of-the-art speaker separation performance.
Full Text Available
Multi-Channel Conversational Speaker Separation via Neural Diarization

https://doi.org/10.1109/TASLP.2024.3393726

Taherian, Hassan; Wang, DeLiang (January 2024, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades as they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid many speakers inside the window and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called “speaker separation via neural diarization” (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments—a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin.
Full Text Available
TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

https://doi.org/10.1109/TASLP.2024.3492803

Kalkhorani, Vahid Ahmadi; Wang, DeLiang (January 2024, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
Multi-input Multi-output Complex Spectral Mapping for Speaker Separation

https://doi.org/10.21437/Interspeech.2023-318

Taherian, Hassan; Pandey, Ashutosh; Wong, Daniel; Xu, Buye; Wang, DeLiang (August 2023, ISCA)

Current deep learning based multi-channel speaker separation methods produce a monaural estimate of speaker signals captured by a reference microphone. This work presents a new multi-channel complex spectral mapping approach that simultaneously estimates the real and imaginary spectrograms of all speakers at all microphones. The proposed multi-input multi-output (MIMO) separation model uses a location-based training (LBT) criterion to resolve the permutation ambiguity in talker-independent speaker separation across microphones. Experimental results show that the proposed MIMO separation model outperforms a multi-input single-output (MISO) speaker separation model with monaural estimates. We also combine the MIMO separation model with a beamformer and a MISO speech enhancement model to further improve separation performance. The proposed approach achieves the state-of-the-art speaker separation on the open LibriCSS dataset.
Full Text Available
Multi-Resolution Location-Based Training for Multi-Channel Continuous Speech Separation

https://doi.org/10.1109/ICASSP49357.2023.10096684

Taherian, Hassan; Wang, DeLiang (June 2023, Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing)

The performance of automatic speech recognition (ASR) systems severely degrades when multi-talker speech overlap occurs. In meeting environments, speech separation is typically performed to improve the robustness of ASR systems. Recently, location-based training (LBT) was proposed as a new training criterion for multi-channel talker-independent speaker separation. Assuming fixed array geometry, LBT outperforms widely-used permutation-invariant training in fully overlapped utterances and matched reverberant conditions. This paper extends LBT to conversational multi-channel speaker separation. We introduce multi-resolution LBT to estimate the complex spectrograms from low to high time and frequency resolutions. With multi-resolution LBT, convolutional kernels are assigned consistently based on speaker locations in physical space. Evaluation results show that multi-resolution LBT consistently outperforms other competitive methods on the recorded LibriCSS corpus.
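The contrast between permutation-invariant training and a location-based criterion can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of mean squared error, and the choice of speaker azimuth as the ordering cue are all assumptions made for illustration. The abstract states only that outputs are assigned consistently based on speaker locations.

```python
from itertools import permutations

import numpy as np


def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((a - b) ** 2))


def pit_loss(estimates, targets):
    """Permutation-invariant loss: the lowest average error over every
    possible assignment of model outputs to target speakers."""
    n = len(targets)
    return min(
        sum(mse(estimates[i], targets[p[i]]) for i in range(n)) / n
        for p in permutations(range(n))
    )


def location_ordered_loss(estimates, targets, azimuths_deg):
    """Fixed-assignment loss (hypothetical stand-in for a location-based
    criterion): output i is always compared with the speaker that has
    the i-th smallest azimuth, so no search over permutations is needed."""
    order = np.argsort(azimuths_deg, kind="stable")
    n = len(targets)
    return sum(mse(estimates[i], targets[order[i]]) for i in range(n)) / n


# Two toy "speakers" and a model whose outputs come out in swapped order.
targets = [np.zeros(4), np.ones(4)]
estimates = [np.ones(4), np.zeros(4)]

loss_pit = pit_loss(estimates, targets)
loss_loc_mismatch = location_ordered_loss(estimates, targets, [10.0, 50.0])
loss_loc_match = location_ordered_loss(estimates, targets, [50.0, 10.0])
```

In this toy case the permutation-invariant loss accepts the swapped outputs, while the fixed assignment penalizes them unless the output order matches the speakers' location order. That is the sense in which a location cue removes the permutation ambiguity.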
Full Text Available
