Title: Multi-Resolution Location-Based Training for Multi-Channel Continuous Speech Separation
The performance of automatic speech recognition (ASR) systems severely degrades when multi-talker speech overlap occurs. In meeting environments, speech separation is typically performed to improve the robustness of ASR systems. Recently, location-based training (LBT) was proposed as a new training criterion for multi-channel talker-independent speaker separation. Assuming fixed array geometry, LBT outperforms widely-used permutation-invariant training in fully overlapped utterances and matched reverberant conditions. This paper extends LBT to conversational multi-channel speaker separation. We introduce multi-resolution LBT to estimate the complex spectrograms from low to high time and frequency resolutions. With multi-resolution LBT, convolutional kernels are assigned consistently based on speaker locations in physical space. Evaluation results show that multi-resolution LBT consistently outperforms other competitive methods on the recorded LibriCSS corpus.
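As background for the two training criteria contrasted above, the difference between permutation-invariant training (PIT) and location-based training (LBT) can be sketched as follows. The mean-squared loss and the left-to-right azimuth ordering are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
import numpy as np
from itertools import permutations

def pit_loss(est, ref):
    """Permutation-invariant training: take the minimum loss over all
    orderings of the reference speakers (cost grows factorially)."""
    losses = [np.mean((est - ref[list(p)]) ** 2)
              for p in permutations(range(ref.shape[0]))]
    return float(min(losses))

def lbt_loss(est, ref, azimuths):
    """Location-based training: with a fixed array geometry, order the
    references by speaker azimuth so each output channel always covers
    the same spatial region -- no permutation search needed."""
    order = np.argsort(azimuths)       # e.g., assign outputs left-to-right
    return float(np.mean((est - ref[order]) ** 2))
```

Because the assignment is fixed by geometry, LBT keeps its cost linear in the number of speakers, whereas PIT's permutation search grows factorially.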
Award ID(s):
2125074
PAR ID:
10439577
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing
ISSN:
2379-190X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Current deep learning based multi-channel speaker separation methods produce a monaural estimate of speaker signals captured by a reference microphone. This work presents a new multi-channel complex spectral mapping approach that simultaneously estimates the real and imaginary spectrograms of all speakers at all microphones. The proposed multi-input multi-output (MIMO) separation model uses a location-based training (LBT) criterion to resolve the permutation ambiguity in talker-independent speaker separation across microphones. Experimental results show that the proposed MIMO separation model outperforms a multi-input single-output (MISO) speaker separation model with monaural estimates. We also combine the MIMO separation model with a beamformer and a MISO speech enhancement model to further improve separation performance. The proposed approach achieves state-of-the-art speaker separation on the open LibriCSS dataset.
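The input/output layout of such a MIMO complex spectral mapping can be sketched at the tensor-shape level: the real and imaginary spectrograms of all microphones are stacked as input, and the model predicts real/imaginary spectrograms for every speaker at every microphone rather than at one reference channel. The window size, hop, microphone count, and stacking order below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def stft(x, win=256, hop=128):
    """Naive complex STFT with a Hann window (illustrative, not optimized)."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)    # shape (T, F)

M = 7   # microphones (e.g., a hypothetical 7-channel array)
rng = np.random.default_rng(0)
mics = [rng.standard_normal(4000) for _ in range(M)]

specs = np.stack([stft(x) for x in mics])            # (M, T, F), complex
# MISO: the network maps this input to S speakers at ONE reference mic.
# MIMO: the same input maps to S speakers at ALL M mics, preserving
#       inter-channel information in the outputs.
net_in = np.concatenate([specs.real, specs.imag])    # (2M, T, F), real-valued
```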
  2. When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades as they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid many speakers inside the window and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called “speaker separation via neural diarization” (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments—a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin. 
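The idea of letting multiple non-overlapped speakers share one output stream can be illustrated with a toy packing routine over diarized segments. This greedy helper is a hypothetical sketch of the stream-assignment idea, not the SSND algorithm itself.

```python
def assign_streams(segments, n_streams=2):
    """Pack diarized (start, end, speaker) segments into output streams so
    that temporally overlapping segments never share a stream, while
    non-overlapped speakers may reuse the same stream (greedy sketch)."""
    streams = [[] for _ in range(n_streams)]
    for seg in sorted(segments, key=lambda s: s[0]):
        start, _end, _spk = seg
        for st in streams:
            # a stream is free if its last segment ended before this one starts
            if not st or st[-1][1] <= start:
                st.append(seg)
                break
        else:
            raise ValueError("more simultaneous speakers than streams")
    return streams
```

With segments (0, 5, 'A'), (3, 8, 'B'), (9, 12, 'C'), speakers A and C land on the same stream because they never overlap, while B goes to the other; this is what allows long segments to be processed without CSS-style short windows.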
  3. Continuous speaker separation aims to separate overlapping speakers in real-world environments like meetings, but it often falls short in isolating speech segments of a single speaker. This leads to split signals that adversely affect downstream applications such as automatic speech recognition and speaker diarization. Existing solutions like speaker counting have limitations. This paper presents a novel multi-channel approach for continuous speaker separation based on multi-input multi-output (MIMO) complex spectral mapping. This MIMO approach enables robust speaker localization by preserving inter-channel phase relations. Speaker localization as a byproduct of the MIMO separation model is then used to identify single-talker frames and reduce speaker splitting. We demonstrate that this approach achieves superior frame-level sound localization. Systematic experiments on the LibriCSS dataset further show that the proposed approach outperforms other methods, advancing state-of-the-art speaker separation performance. 
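Why preserved inter-channel phase enables localization can be seen from a minimal two-microphone sketch: under a far-field plane-wave assumption, the cross-spectrum phase grows linearly with frequency at a slope set by the inter-microphone delay, which in turn fixes the arrival angle. The microphone spacing, speed of sound, and least-squares slope fit below are illustrative assumptions, not the paper's localization method.

```python
import numpy as np

def phase_doa(spec_ref, spec_other, freqs, mic_dist=0.04, c=343.0):
    """Per-frame azimuth from the inter-channel phase of one mic pair.
    Far-field sketch: real systems use all pairs and handle phase wrapping."""
    cross = spec_ref * np.conj(spec_other)        # (T, F) cross-spectrum
    phase = np.angle(cross)                       # ~ 2*pi*f*tau per bin
    # least-squares slope fit: tau = sum(phase * f) / (2*pi * sum(f^2))
    tau = phase @ freqs / (2 * np.pi * np.sum(freqs ** 2))
    cos_theta = np.clip(tau * c / mic_dist, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))       # azimuth per frame
```

A separation model that outputs consistent complex spectrograms at all microphones keeps these phase relations intact, so a frame-level estimate like this can flag single-talker frames as a byproduct.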
  4. Abstract Modern automatic speech recognition (ASR) systems are capable of impressive performance recognizing clean speech but struggle in noisy, multi-talker environments, commonly referred to as the “cocktail party problem.” In contrast, many human listeners can solve this problem, suggesting the existence of a solution in the brain. Here we present a novel approach that uses a brain inspired sound segregation algorithm (BOSSA) as a preprocessing step for a state-of-the-art ASR system (Whisper). We evaluated BOSSA’s impact on ASR accuracy in a spatialized multi-talker scene with one target speaker and two competing maskers, varying the difficulty of the task by changing the target-to-masker ratio. We found that median word error rate improved by up to 54% when the target-to-masker ratio was low. Our results indicate that brain-inspired algorithms have the potential to considerably enhance ASR accuracy in challenging multi-talker scenarios without the need for retraining or fine-tuning existing state-of-the-art ASR systems. 
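The metric behind the reported improvement, word error rate, is word-level edit distance normalized by reference length. A minimal reference implementation:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance between word sequences,
    divided by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                    # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                    # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(r)][len(h)] / len(r)
```

A 54% relative improvement in median WER means the median of this quantity over test utterances dropped to roughly half its baseline value at low target-to-masker ratios.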
  5. Recent work on perceptual learning for speech has suggested that while high-variability training typically results in generalization, low-variability exposure can sometimes be sufficient for cross-talker generalization. We tested predictions of a similarity-based account, according to which, generalization depends on training-test talker similarity rather than on exposure to variability. We compared perceptual adaptation to second-language (L2) speech following single- or multiple-talker training with a round-robin design in which four L2 English talkers from four different first-language (L1) backgrounds served as both training and test talkers. After exposure to 60 L2 English sentences in one training session, cross-talker/cross-accent generalization was possible (but not guaranteed) following either multiple- or single-talker training with variation across training-test talker pairings. Contrary to predictions of the similarity-based account, adaptation was not consistently better for identical than for mismatched training-test talker pairings, and generalization patterns were asymmetrical across training-test talker pairs. Acoustic analyses also revealed a dissociation between phonetic similarity and cross-talker/cross-accent generalization. Notably, variation in adaptation and generalization related to variation in training phase intelligibility. Together with prior evidence, these data suggest that perceptual learning for speech may benefit from some combination of exposure to talker variability, training-test similarity, and high training phase intelligibility. 