Title: Improved Speech Enhancement Using a Time-Domain GAN with Mask Learning
Speech enhancement is an essential component in robust automatic speech recognition (ASR) systems. Most current speech enhancement methods are based on neural networks that use feature mapping or mask learning. This paper proposes a novel speech enhancement method that integrates time-domain feature mapping and mask learning into a unified framework using a Generative Adversarial Network (GAN). The proposed framework processes the received waveform and decouples the speech and noise signals, which are fed into two short-time Fourier transform (STFT) 1-D convolution layers that map the waveforms to spectrograms in the complex domain. These speech and noise spectrograms are then used to compute the speech mask loss. The proposed method is evaluated using the TIMIT data set for seen and unseen signal-to-noise ratio conditions. It is shown that the proposed method outperforms both Deep Neural Network (DNN) based speech enhancement and the Speech Enhancement Generative Adversarial Network (SEGAN).
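As a rough illustration of the pipeline described in the abstract, the sketch below expresses an STFT as a strided 1-D convolution with fixed windowed cosine and negative-sine kernels, then derives a mask from the speech and noise spectrograms. The abstract does not give the mask definition, the loss, the window, or the frame sizes, so the magnitude ratio mask, the mean squared error loss, the Hann window, and the names `stft_conv1d`, `ratio_mask`, and `mask_loss` are all assumptions made here for illustration, not the authors' implementation.

```python
import math


def stft_conv1d(wave, n_fft=8, hop=4):
    """STFT written as a strided 1-D convolution with fixed Fourier kernels.

    Returns frames[t][k], a complex value per frame t and frequency bin k.
    Window type and sizes are illustrative assumptions.
    """
    # Periodic Hann window applied to every kernel.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    n_bins = n_fft // 2 + 1
    # One cosine (real part) and one negative-sine (imaginary part) kernel per bin.
    real_k = [[window[n] * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
              for k in range(n_bins)]
    imag_k = [[-window[n] * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
              for k in range(n_bins)]
    frames = []
    # The stride of the convolution is the hop size.
    for start in range(0, len(wave) - n_fft + 1, hop):
        seg = wave[start:start + n_fft]
        frames.append([
            complex(sum(s * r for s, r in zip(seg, real_k[k])),
                    sum(s * i for s, i in zip(seg, imag_k[k])))
            for k in range(n_bins)
        ])
    return frames


def ratio_mask(speech_spec, noise_spec, eps=1e-8):
    """Magnitude ratio mask |S| / (|S| + |N|); an assumed mask definition."""
    return [[abs(s) / (abs(s) + abs(n) + eps) for s, n in zip(sf, nf)]
            for sf, nf in zip(speech_spec, noise_spec)]


def mask_loss(est_mask, ref_mask):
    """Mean squared error between two masks; an assumed loss."""
    diffs = [(e - r) ** 2
             for ef, rf in zip(est_mask, ref_mask)
             for e, r in zip(ef, rf)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy signals: a "speech" tone at bin 2 and a weaker "noise" tone at bin 3.
    speech = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
    noise = [0.5 * math.sin(2 * math.pi * 3 * n / 8) for n in range(32)]
    # Stand-ins for a generator's decoupled outputs (slightly mis-scaled).
    est_speech = [0.9 * x for x in speech]
    est_noise = [1.2 * x for x in noise]

    ref = ratio_mask(stft_conv1d(speech), stft_conv1d(noise))
    est = ratio_mask(stft_conv1d(est_speech), stft_conv1d(est_noise))
    print("mask loss:", mask_loss(est, ref))
```

Because the kernels are fixed linear filters, the same operation can be placed inside a network as a non-trainable convolution layer, which is what lets a spectrogram-domain loss be computed on time-domain outputs.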
Journal Name:
Proceedings of Interspeech 2020
Page Range / eLocation ID:
3286 to 3290
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Speech enhancement techniques that use a generative adversarial network (GAN) can effectively suppress noise while allowing models to be trained end-to-end. However, such techniques operate directly on time-domain waveforms, which are often high-dimensional and require extensive computation. This paper proposes a novel GAN-based speech enhancement method, referred to as S-ForkGAN, that operates on log-power spectra rather than on time-domain speech waveforms, and uses a forked GAN structure to extract both speech and noise information. By operating on log-power spectra, one can seamlessly include conventional spectral subtraction techniques, and the parameter space typically has a lower dimension. The performance of S-ForkGAN is assessed for automatic speech recognition (ASR) using the TIMIT data set and a wide range of noise conditions. It is shown that S-ForkGAN outperforms existing GAN-based techniques and that it has a lower complexity.
  2. Image synthesis from corrupted contrasts increases the diversity of diagnostic information available for many neurological diseases. Recently, image-to-image translation has attracted significant interest within medical research, beginning with the successful use of the Generative Adversarial Network (GAN) and continuing to the introduction of cyclic constraints extended to multiple domains. However, in current approaches, there is no guarantee that the mapping between the two image domains is unique or one-to-one. In this paper, we introduce a novel approach to unpaired image-to-image translation based on an invertible architecture. The invertible property of the flow-based architecture assures cycle-consistency of image-to-image translation without additional loss functions. We utilize the temporal information between consecutive slices to provide more constraints to the optimization for transforming one domain to another in unpaired volumetric medical images. To capture temporal structures in the medical images, we explore the displacement between consecutive slices using a deformation field. In our approach, the deformation field is used as guidance to keep the translated slices realistic and consistent across the translation. The experimental results show that the images synthesized using our proposed approach achieve competitive performance in terms of mean squared error, peak signal-to-noise ratio, and structural similarity index when compared with existing deep learning-based methods on three standard datasets, i.e., HCP, MRBrainS13 and Brats2019.
  3. Electromigration (EM) is a major failure effect for on-chip power grid networks of deep submicron VLSI circuits. EM degradation of metal grid lines can lead to excessive voltage drops (IR drops) before the target lifetime. In this paper, we propose a fast data-driven EM-induced IR drop analysis framework for power grid networks, named GridNet, based on the conditional generative adversarial networks (CGAN). It aims to accelerate the incremental full-chip EM-induced IR drop analysis, as well as IR drop violation fixing during the power grid design and optimization. More importantly, GridNet can naturally leverage the differentiable feature of deep neural networks (DNN) to obtain the sensitivity information of node voltage with respect to the wire resistance (or width) with marginal cost. GridNet treats continuous time and the given electrical features as input conditions, and the EM-induced time-varying voltage of power grid networks as the conditional outputs, which are represented as data series images. We show that GridNet is able to learn the temporal dynamics of the aging process in continuous time domain. Besides, we can take advantage of the sensitivity information provided by GridNet to perform efficient localized IR drop violation fixing in the late stage design and optimization. Numerical results on 36000 synthesized power grid network samples demonstrate that the new method can lead to $10^5\times$ speedup over the recently proposed full-chip coupled EM and IR drop analysis tool. We further show that localized IR drop violation fix for the same set of power grid networks can be performed remarkably efficiently using the cheap sensitivity computation from GridNet.
  4. Abstract Purpose

    Synthetic digital mammogram (SDM) is a 2D image generated from digital breast tomosynthesis (DBT) and used as a substitute for a full‐field digital mammogram (FFDM) to reduce the radiation dose for breast cancer screening. The previous deep learning‐based method used FFDM images as the ground truth, and trained a single neural network to directly generate SDM images with similar appearances (e.g., intensity distribution, textures) to the FFDM images. However, the FFDM image has a different texture pattern from DBT. The difference in texture pattern might make the training of the neural network unstable and result in high‐intensity distortion, which makes it hard to decrease intensity distortion and increase perceptual similarity (e.g., generate similar textures) at the same time. Clinically, radiologists want a 2D synthesized image that visually resembles an FFDM image and preserves local structures in DBT, such as masses and microcalcifications (MCs), because radiologists have long been trained to read FFDM images and local structures are important for diagnosis. In this study, we proposed to use a deep convolutional neural network to learn the transformation to generate SDM from DBT.


    To decrease intensity distortion and increase perceptual similarity, a multi‐scale cascaded network (MSCN) is proposed to generate low‐frequency structures (e.g., intensity distribution) and high‐frequency structures (e.g., textures) separately. The MSCN consists of two cascaded sub‐networks: the first sub‐network is used to predict the low‐frequency part of the FFDM image; the second sub‐network is used to generate a full SDM image with textures similar to the FFDM image based on the prediction of the first sub‐network. The mean‐squared error (MSE) objective function is used to train the first sub‐network, termed the low‐frequency network, to generate a low‐frequency SDM image. The gradient‐guided generative adversarial network's objective function is used to train the second sub‐network, termed the high‐frequency network, to generate a full SDM image with textures similar to the FFDM image.


    A total of 1646 cases with FFDM and DBT were retrospectively collected from the Hologic Selenia system for the training and validation datasets, and 145 cases with masses or MC clusters were independently collected from the Hologic Selenia system for the testing dataset. For comparison, the baseline network has the same architecture as the high‐frequency network and directly generates a full SDM image. Compared to the baseline method, the proposed MSCN improves the peak signal‐to‐noise ratio from 25.3 to 27.9 dB, improves the structural similarity from 0.703 to 0.724, and significantly increases the perceptual similarity.


    The proposed method can stabilize the training and generate SDM images with lower intensity distortion and higher perceptual similarity.

  5. Head movement is an integral part of face-to-face communications. It is important to investigate methodologies to generate naturalistic movements for conversational agents (CAs). The predominant method for head movement generation is using rules based on the meaning of the message. However, the variations of head movements by these methods are bounded by the predefined dictionary of gestures. Speech-driven methods offer an alternative approach, learning the relationship between speech and head movements from real recordings. However, previous studies do not generate novel realizations for a repeated speech signal. A conditional generative adversarial network (GAN) provides a framework to generate multiple realizations of head movements for each speech segment by sampling from a conditioned distribution. We build a conditional GAN with bidirectional long-short term memory (BLSTM), which is suitable for capturing the long-short term dependencies of time-continuous signals. This model learns the distribution of head movements conditioned on speech prosodic features. We compare this model with a dynamic Bayesian network (DBN) and BLSTM models optimized to reduce mean squared error (MSE) or to increase concordance correlation. The objective and subjective evaluations of the results showed better performance for the conditional GAN model compared with these baseline systems.