Title: Speech Enhancement Using Forked Generative Adversarial Networks with Spectral Subtraction
Speech enhancement techniques that use a generative adversarial network (GAN) can effectively suppress noise while allowing models to be trained end-to-end. However, such techniques operate directly on time-domain waveforms, which are high-dimensional and require extensive computation. This paper proposes a novel GAN-based speech enhancement method, referred to as S-ForkGAN, that operates on log-power spectra rather than on time-domain speech waveforms and uses a forked GAN structure to extract both speech and noise information. Operating on log-power spectra allows conventional spectral subtraction techniques to be incorporated seamlessly, and the parameter space typically has a lower dimension. The performance of S-ForkGAN is assessed for automatic speech recognition (ASR) using the TIMIT data set and a wide range of noise conditions. It is shown that S-ForkGAN outperforms existing GAN-based techniques and has lower complexity.
Award ID(s): 1725573
PAR ID: 10203595
Journal Name: Proceedings of Interspeech 2019
Page Range / eLocation ID: 3163 to 3167
Sponsoring Org: National Science Foundation
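The abstract names two ingredients that are easy to prototype independently of the model: the log-power-spectrum front end and the conventional spectral subtraction it can absorb. The paper's S-ForkGAN code is not reproduced here; the sketch below is a minimal NumPy/SciPy illustration, and the frame length, hop, over-subtraction factor, and spectral floor are assumed values.

```python
# Minimal sketch of a log-power-spectrum front end with classic spectral
# subtraction. Illustrative only: S-ForkGAN itself is not public, so the
# parameters (frame_len, hop, over_subtraction, floor) are assumptions.
import numpy as np
from scipy.signal import stft, istft

def log_power_spectrum(wave, fs=16000, frame_len=512, hop=256):
    """STFT -> log-power spectra, the domain the abstract says the GAN operates in."""
    _, _, Y = stft(wave, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    power = np.abs(Y) ** 2
    return np.log(power + 1e-10), np.angle(Y)  # keep phase for resynthesis

def spectral_subtraction(log_pow_noisy, log_pow_noise,
                         over_subtraction=1.0, floor=1e-3):
    """Conventional power spectral subtraction, applied in the linear domain."""
    noisy = np.exp(log_pow_noisy)
    noise = np.exp(log_pow_noise)   # e.g. supplied by a noise-estimating fork
    clean = np.maximum(noisy - over_subtraction * noise, floor * noisy)
    return np.log(clean)

def resynthesize(log_pow, phase, fs=16000, frame_len=512, hop=256):
    """Invert the enhanced log-power spectrum using the noisy phase."""
    mag = np.sqrt(np.exp(log_pow))
    _, wave = istft(mag * np.exp(1j * phase), fs=fs,
                    nperseg=frame_len, noverlap=frame_len - hop)
    return wave
```

In a forked setup such as the one described, the noise branch's log-power estimate would supply `log_pow_noise`, and the enhanced log-power spectrum is resynthesized with the noisy phase.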
More Like this
  1.
    Speech enhancement is an essential component in robust automatic speech recognition (ASR) systems. Most speech enhancement methods today are based on neural networks that use feature mapping or mask learning. This paper proposes a novel speech enhancement method that integrates time-domain feature mapping and mask learning into a unified framework using a Generative Adversarial Network (GAN). The proposed framework processes the received waveform and decouples the speech and noise signals, which are fed into two short-time Fourier transform (STFT) convolution 1-D layers that map the waveforms to spectrograms in the complex domain. These speech and noise spectrograms are then used to compute the speech mask loss. The proposed method is evaluated using the TIMIT data set for seen and unseen signal-to-noise ratio conditions. It is shown that the proposed method outperforms speech enhancement methods based on a Deep Neural Network (DNN) or on a Speech Enhancement Generative Adversarial Network (SEGAN).
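    The STFT convolution layers mentioned above can be realized as a 1-D convolution with fixed DFT-basis kernels, so the transform sits inside an end-to-end trainable graph. The PyTorch sketch below shows this general construction; the `n_fft`, `hop`, and Hann-window choices are assumptions, since the paper's exact configuration is not given here.

```python
# Sketch of an STFT implemented as a Conv1d with fixed DFT-basis kernels.
import math
import torch
import torch.nn as nn

class STFTConv1d(nn.Module):
    """Complex STFT as a Conv1d: kernels are windowed cos/sin basis functions."""
    def __init__(self, n_fft=512, hop=256):
        super().__init__()
        n_bins = n_fft // 2 + 1
        window = torch.hann_window(n_fft)
        n = torch.arange(n_fft, dtype=torch.float32)
        k = torch.arange(n_bins, dtype=torch.float32).unsqueeze(1)
        real = torch.cos(2 * math.pi * k * n / n_fft) * window   # (n_bins, n_fft)
        imag = -torch.sin(2 * math.pi * k * n / n_fft) * window
        kernels = torch.cat([real, imag], dim=0).unsqueeze(1)    # (2*n_bins, 1, n_fft)
        self.conv = nn.Conv1d(1, 2 * n_bins, n_fft, stride=hop, bias=False)
        self.conv.weight.data.copy_(kernels)
        self.conv.weight.requires_grad = False   # fixed analysis transform

    def forward(self, wave):                     # wave: (batch, 1, samples)
        spec = self.conv(wave)                   # (batch, 2*n_bins, frames)
        real, imag = spec.chunk(2, dim=1)
        return real, imag                        # complex spectrogram parts

# Usage: map decoupled speech and noise waveforms to complex spectrograms.
stft_layer = STFTConv1d()
speech_real, speech_imag = stft_layer(torch.randn(4, 1, 16000))
```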
  2. In this paper, we present a blockwise optimization method for masking-based networks (BLOOM-Net) for training scalable speech enhancement networks. We design our network with a residual learning scheme and train the internal separator blocks sequentially to obtain a scalable masking-based deep neural network for speech enhancement. Its scalability lets it dynamically adjust the run-time complexity depending on the test-time environment. To this end, we modularize our models so that they can flexibly accommodate varying needs for enhancement performance and constraints on resources, incurring minimal memory or training overhead from the added scalability. Our experiments on speech enhancement demonstrate that the proposed blockwise optimization method achieves the desired scalability with only a slight performance degradation compared to the corresponding models trained end-to-end.
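    A minimal sketch of the residual, blockwise-trainable structure described above follows. The feature sizes and block design are assumptions (BLOOM-Net's actual separator blocks are more elaborate); the toy blocks here only illustrate the scalability mechanism, where inference runs as many blocks as the device budget allows.

```python
# Sketch of a residual, scalable masking network in the spirit of BLOOM-Net.
import torch
import torch.nn as nn

class SeparatorBlock(nn.Module):
    def __init__(self, n_feats=257, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feats, hidden), nn.ReLU(),
            nn.Linear(hidden, n_feats),
        )

    def forward(self, feats):
        return self.net(feats)   # residual correction to the mask logits

class BloomLikeNet(nn.Module):
    def __init__(self, n_blocks=4, n_feats=257):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SeparatorBlock(n_feats) for _ in range(n_blocks)])

    def forward(self, feats, n_active=None):
        """feats: (batch, frames, n_feats); n_active sets run-time complexity."""
        n_active = n_active or len(self.blocks)
        logits = torch.zeros_like(feats)
        for block in self.blocks[:n_active]:
            logits = logits + block(feats)       # residual refinement
        return torch.sigmoid(logits)             # magnitude mask in [0, 1]

# Blockwise training: optimize block i with blocks 0..i-1 frozen, which is
# what keeps the memory and training overhead of the added scalability small.
```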
  3. The high computational complexity associated with training deep neural networks limits online and real-time training on edge devices. This paper proposes an end-to-end training and inference scheme that eliminates multiplications through approximate operations in the log domain, which has the potential to significantly reduce implementation complexity. We implement the entire training procedure in the log domain with fixed-point data representations. This training procedure is inspired by hardware-friendly approximations of log-domain addition, which are based on look-up tables and bit-shifts. We show that our 16-bit log-based training can achieve classification accuracy within approximately 1% of the equivalent floating-point baselines on a number of commonly used datasets.
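    The arithmetic trick behind this line of work is that in the log domain a multiply becomes an integer add, while an add needs the Jacobian-logarithm correction log2(1 + 2^-|a-b|), which is what look-up tables and bit-shifts approximate. The sketch below shows the idea in isolation; the fixed-point format and table size are assumptions, not the paper's 16-bit format, and signs are ignored for brevity.

```python
# Sketch of multiply-free, log-domain arithmetic with a LUT-based adder.
import numpy as np

FRAC_BITS = 4        # log2 magnitudes stored as integer codes: L = round(log2(x) * 16)
LUT_SIZE = 256       # beyond this code difference the correction rounds to 0
# Correction table for log2(1 + 2^-d), indexed by the quantized difference.
_diffs = np.arange(LUT_SIZE) / 2.0**FRAC_BITS
LOG_ADD_LUT = np.round(np.log2(1.0 + 2.0**(-_diffs)) * 2**FRAC_BITS).astype(np.int32)

def to_log(x):
    """Encode a positive value; real systems carry a separate sign bit."""
    return int(round(np.log2(x) * 2**FRAC_BITS))

def from_log(code):
    return 2.0 ** (code / 2.0**FRAC_BITS)

def log_mul(a, b):
    """Multiplication of positive values is a single integer addition."""
    return a + b

def log_add(a, b):
    """log2(2^a + 2^b) = max(a, b) + log2(1 + 2^-|a-b|), correction via LUT."""
    diff = abs(a - b)
    corr = int(LOG_ADD_LUT[diff]) if diff < LUT_SIZE else 0
    return max(a, b) + corr

# Multiply-free MAC: 1.5*2.0 + 0.25*4.0, done with adds, compares, and a LUT.
acc = log_add(log_mul(to_log(1.5), to_log(2.0)),
              log_mul(to_log(0.25), to_log(4.0)))
print(from_log(acc))   # ~4.0, up to quantization error
```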
  4. Speech activity detection (SAD) serves as a crucial front-end system to several downstream Speech and Language Technology (SLT) tasks such as speaker diarization, speaker identification, and speech recognition. Recent years have seen deep learning (DL)-based SAD systems designed to improve robustness against static background noise and interfering speakers. However, SAD performance can be severely limited for conversations recorded in naturalistic environments due to dynamic acoustic scenarios and previously unseen non-speech artifacts. In this letter, we propose an end-to-end deep learning framework designed to be robust to time-varying noise profiles observed in naturalistic audio. We develop a novel SAD solution for the UTDallas Fearless Steps Apollo corpus based on NASA’s Apollo missions. The proposed system leverages spectro-temporal correlations with a threshold optimization mechanism to adjust to acoustic variabilities across multiple channels and missions. This system is trained and evaluated on the Fearless Steps Challenge (FSC) corpus (a subset of the Apollo corpus). Experimental results indicate a high degree of adaptability to out-of-domain data, achieving a relative Detection Cost Function (DCF) performance improvement of over 50% compared to the previous FSC baselines and state-of-the-art (SOTA) SAD systems. The proposed model also outperforms the most recent DL-based SOTA systems from FSC Phase-4. Ablation analysis is conducted to confirm the efficacy of the proposed spectro-temporal features. 
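    The threshold-optimization mechanism highlighted in the letter can be illustrated independently of the model: given frame-level speech posteriors from any SAD system, sweep the decision threshold per channel to minimize a detection cost function. In the sketch below, the 0.75/0.25 miss/false-alarm weights follow the usual Fearless Steps Challenge DCF convention; treat them, and the grid resolution, as assumptions.

```python
# Sketch of per-channel decision-threshold optimization for a SAD system.
import numpy as np

def dcf(posteriors, labels, threshold, w_miss=0.75, w_fa=0.25):
    """Detection cost at one operating point; labels are 1 for speech frames."""
    decisions = posteriors >= threshold
    speech, nonspeech = labels == 1, labels == 0
    p_miss = np.mean(~decisions[speech]) if speech.any() else 0.0
    p_fa = np.mean(decisions[nonspeech]) if nonspeech.any() else 0.0
    return w_miss * p_miss + w_fa * p_fa

def optimize_threshold(posteriors, labels, grid=np.linspace(0.05, 0.95, 91)):
    """Pick the threshold minimizing DCF on held-out data for one channel,
    so the operating point adapts to that channel's noise profile."""
    costs = [dcf(posteriors, labels, t) for t in grid]
    return grid[int(np.argmin(costs))]
```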
  5. In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to recur in the specific acoustic environment. We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity. Our goal in this test-time adaptation is to use no clean speech target from the test speaker, thus fulfilling the requirement for zero-shot learning. To compensate for the lack of clean speech, we employ the knowledge distillation framework: we distill the more advanced denoising results from an overly large teacher model and use them as the pseudo target to train the small student model. This zero-shot learning procedure circumvents the collection of users' clean speech, a process users are reluctant to comply with due to privacy concerns and the technical difficulty of recording clean voice. Experiments on various test-time conditions show that the proposed personalization method can significantly improve the compact models' performance at test time. Furthermore, since the personalized models outperform larger non-personalized baseline models, we claim that personalization achieves model compression with no loss of denoising performance. As expected, the student models underperform the state-of-the-art teacher models.
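    The zero-shot distillation loop described above is simple to sketch: a frozen teacher denoises the user's noisy recordings, and its outputs become pseudo-clean targets for the compact student. In the PyTorch sketch below, `teacher`, `student`, and `noisy_loader` are placeholders for models and data the paper does not specify here, and the L1 loss and optimizer settings are assumptions.

```python
# Sketch of zero-shot personalization via knowledge distillation.
import torch
import torch.nn.functional as F

def personalize(student, teacher, noisy_loader, epochs=10, lr=1e-4):
    teacher.eval()                                 # large model, inference only
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for noisy in noisy_loader:                 # user's noisy speech only
            with torch.no_grad():
                pseudo_clean = teacher(noisy)      # pseudo target: no clean speech needed
            loss = F.l1_loss(student(noisy), pseudo_clean)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```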