Title: Harmonicity Plays a Critical Role in DNN Based Versus in Biologically-Inspired Monaural Speech Segregation Systems
Recent advancements in deep learning have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles that these networks learn to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art Deep Neural Network (DNN)-based models, Conv-TasNet and DPT-Net [1], [2]. We evaluate their performance on mixtures of natural speech versus slightly manipulated inharmonic speech, in which the harmonics are slightly jittered in frequency. We find that performance deteriorates significantly if one source is even slightly harmonically jittered; for example, an imperceptible 3% harmonic jitter degrades the performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity and instead results in worse performance on natural speech mixtures, making inharmonicity a powerful adversarial factor in DNN models. Furthermore, additional analyses reveal that DNN algorithms deviate markedly from biologically inspired algorithms [3], which rely primarily on timing cues rather than harmonicity to segregate speech.
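To make the manipulation concrete, here is a minimal sketch of harmonic jittering applied to a synthetic harmonic complex. The paper applies the manipulation to natural speech; the tone parameters, 1/k amplitude rolloff, and uniform jitter distribution below are illustrative assumptions.

```python
import numpy as np

def jitter_harmonics(f0, n_harmonics, jitter_frac, dur=1.0, sr=16000, seed=0):
    """Synthesize a harmonic complex whose partials are frequency-jittered.

    Each harmonic k*f0 is shifted by a random amount drawn uniformly from
    [-jitter_frac*f0, +jitter_frac*f0] -- an illustrative stand-in for the
    paper's inharmonic-speech manipulation, not its exact procedure.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(int(dur * sr)) / sr
    x = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        fk = k * f0 + rng.uniform(-jitter_frac, jitter_frac) * f0
        x += np.sin(2 * np.pi * fk * t) / k  # 1/k rolloff for a speech-like spectrum
    return x / np.max(np.abs(x))

harmonic = jitter_harmonics(120.0, 20, 0.00)    # natural (harmonic) complex
inharmonic = jitter_harmonics(120.0, 20, 0.03)  # 3% jitter, barely perceptible
```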
Award ID(s): 1764010, 2020624
PAR ID: 10355813
Journal Name: International Conference on Acoustics, Speech and Signal Processing
Page Range / eLocation ID: 536 to 540
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The robustness and vulnerability of Deep Neural Networks (DNNs) are quickly becoming a critical area of interest, since these models are in widespread use across real-world applications (e.g., image and audio analysis, recommender systems, natural language analysis). A DNN's vulnerabilities can be exploited by an adversary to generate data that attacks the model; however, the majority of adversarial data generators have focused on the image domain, with far less work on audio. More recently, audio analysis models were shown to be vulnerable to adversarial audio examples (e.g., in speech command classification and automatic speech recognition). Thus, one urgent open problem is to detect adversarial audio reliably. In this contribution, we leverage a separate yet related DNN technique, model quantization, and propose an algorithm that detects adversarial audio using a DNN's quantization error. Specifically, we demonstrate that adversarial audio typically exhibits a larger activation quantization error than benign audio, where the quantization error is measured using character error rates; we use the difference in errors to discriminate adversarial audio. Experiments with three state-of-the-art audio attack algorithms against the DeepSpeech model show that our detection algorithm achieves high accuracy on the Mozilla dataset.
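As a rough illustration of the detection idea in (1), the toy sketch below flags inputs whose outputs under a float versus an activation-quantized forward pass diverge beyond a threshold. The tiny ReLU network, L2 divergence, and threshold are stand-in assumptions; the paper measures divergence as a character error rate between DeepSpeech transcriptions.

```python
import numpy as np

def quantize(a, n_bits=8):
    """Uniform symmetric per-tensor quantization of activations."""
    scale = np.max(np.abs(a)) / (2 ** (n_bits - 1) - 1) + 1e-12
    return np.round(a / scale) * scale

def forward(x, weights, quantized=False):
    """Toy ReLU MLP; stands in for the full ASR network in the paper."""
    a = x
    for W in weights:
        a = np.maximum(a @ W, 0.0)
        if quantized:
            a = quantize(a)  # inject activation quantization error
    return a

def is_adversarial(x, weights, threshold):
    """Flag inputs whose float vs. quantized outputs diverge too much.

    L2 distance is used here for brevity; the paper compares character
    error rates of the two transcriptions instead.
    """
    err = np.linalg.norm(forward(x, weights) - forward(x, weights, quantized=True))
    return err > threshold
```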
  2. With the development of deep neural networks (DNNs), many DNN-based speech dereverberation approaches have been proposed that achieve significant improvements over traditional methods. However, most deep learning-based dereverberation methods focus solely on suppressing reverberation in the time-frequency domain, without exploiting cepstral-domain features that are potentially useful for dereverberation. In this paper, we propose a dual-path neural network structure to separately process the minimum-phase and all-pass components of single-channel speech. First, we decompose the speech signal into its minimum-phase and all-pass components in the cepstral domain; a Conformer-embedded U-Net then removes the reverberation from each component. Finally, we combine the two processed components to synthesize the enhanced output. The proposed method is evaluated on the REVERB Challenge evaluation dataset in terms of commonly used objective metrics, and experimental results demonstrate that it outperforms the compared methods.
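The minimum-phase/all-pass split in (2) can be done with the classic homomorphic (cepstral folding) recipe; below is a minimal single-frame sketch of that standard technique, not the paper's exact pipeline.

```python
import numpy as np

def minphase_allpass_split(x, n_fft=1024):
    """Split a frame into minimum-phase and all-pass components.

    Classic homomorphic recipe: fold the real cepstrum of the
    log-magnitude spectrum to build the minimum-phase spectrum, then
    divide it out, leaving a unit-magnitude all-pass residue.
    n_fft is assumed even and >= len(x).
    """
    X = np.fft.fft(x, n_fft)
    cep = np.fft.ifft(np.log(np.maximum(np.abs(X), 1e-10))).real
    fold = np.zeros(n_fft)
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2.0 * cep[1:n_fft // 2]
    fold[n_fft // 2] = cep[n_fft // 2]
    X_min = np.exp(np.fft.fft(fold))  # minimum-phase spectrum
    X_ap = X / X_min                  # |X_ap| ~= 1 everywhere (all-pass)
    return np.fft.ifft(X_min).real, np.fft.ifft(X_ap).real
```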
  3. Since emerging edge applications such as Internet of Things (IoT) analytics and augmented reality have tight latency constraints, hardware AI accelerators have recently been proposed to speed up deep neural network (DNN) inference run by these applications. Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads. In this article, we design analytic models to capture the performance of DNN inference workloads on shared edge accelerators, such as GPUs and edge TPUs, under different multiplexing and concurrency behaviors. After validating our models using extensive experiments, we use them to design various cluster resource management algorithms that intelligently manage multiple applications on edge accelerators while respecting their latency constraints. We implement a prototype of our system in Kubernetes and show that it can host 2.3× more DNN applications in heterogeneous multi-tenant edge clusters with no latency violations, compared to traditional knapsack hosting algorithms.
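To illustrate the kind of latency-aware admission described in (3), here is a toy greedy placement that uses an M/M/1 queueing estimate for latency on a shared accelerator. Both the queueing model and the admission rule are illustrative assumptions, not the article's actual analytic models.

```python
from dataclasses import dataclass

@dataclass
class App:
    name: str
    rate: float     # request arrival rate (requests/sec)
    service: float  # mean inference time on this accelerator (sec)
    slo: float      # latency constraint (sec)

def mean_latency(apps):
    """M/M/1 estimate of mean latency on one shared accelerator."""
    lam = sum(a.rate for a in apps)
    mu = lam / sum(a.rate * a.service for a in apps)  # 1 / weighted mean service time
    return float("inf") if lam >= mu else 1.0 / (mu - lam)

def admit(accelerators, app):
    """Greedily place app on the first accelerator where every co-located
    app's SLO still holds under the shared-latency estimate."""
    for apps in accelerators:
        trial = apps + [app]
        if all(mean_latency(trial) <= a.slo for a in trial):
            apps.append(app)
            return True
    return False  # no placement keeps all SLOs; reject
```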
  4. We evaluated training deep neural network (DNN) beamformers for the task of high-contrast imaging in the presence of reverberation clutter. Training data were generated using simulated hypoechoic cysts and a pseudo-nonlinear method for generating reverberation clutter. Performance was compared to standard delay-and-sum (DAS) beamforming on simulated hypoechoic cysts of a different size. For a hypoechoic cyst in the presence of reverberation clutter, when the intrinsic contrast ratio (CR) was -10 dB and -20 dB, the measured CR for DAS beamforming was -9.2±0.8 dB and -14.3±0.5 dB, respectively, while the measured CR for DNNs was -10.7±1.4 dB and -20.0±1.0 dB, respectively. For a hypoechoic cyst with -20 dB intrinsic CR, the contrast-to-noise ratio (CNR) was 3.4±0.3 dB for DAS and 4.3±0.3 dB for DNN beamforming. These results show that DNN beamforming extended the contrast ratio dynamic range (CRDR) by about 10 dB while also improving CNR.
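For reference, contrast metrics like those reported in (4) are commonly computed from an envelope-detected image roughly as below; the exact definitions used in the paper may differ.

```python
import numpy as np

def contrast_metrics(env, inside, outside):
    """Contrast ratio (dB) and CNR (dB) from an envelope-detected image.

    env: envelope-detected image; inside/outside: boolean masks for the
    cyst interior and the background speckle region. These are common
    textbook definitions, assumed here rather than taken from the paper.
    """
    mu_i, mu_o = env[inside].mean(), env[outside].mean()
    var_i, var_o = env[inside].var(), env[outside].var()
    cr = 20.0 * np.log10(mu_i / mu_o)  # negative for a hypoechoic cyst
    cnr = 20.0 * np.log10(abs(mu_i - mu_o) / np.sqrt(var_i + var_o))
    return cr, cnr
```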
  5. This study tests speech-in-noise perception and social ratings of speech produced by different text-to-speech (TTS) synthesis methods. We used identical speaker training datasets for a set of 4 voices (generated with AWS Polly TTS), using neural and concatenative TTS. In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences, in concatenative and neural TTS, at two noise levels (-3 dB and -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, at the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on 4 social attributes. Neural TTS was rated as more human-like, natural, likeable, and familiar than concatenative TTS. Furthermore, how natural listeners rated the neural TTS voice was positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these patterns are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.
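The noise-mixing step in Experiment 1 of (5) is a standard operation; a minimal sketch that scales the noise to hit a target SNR relative to the speech might look like this (the signal names in the usage comment are hypothetical).

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix speech and noise at a target SNR in dB.

    Scales the noise so that 10*log10(P_speech / P_noise) == snr_db;
    noise is assumed to be at least as long as speech.
    """
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

# e.g., the two conditions from Experiment 1 (hypothetical signal names):
# mixed_m3 = mix_at_snr(tts_sentence, babble, -3.0)
# mixed_m6 = mix_at_snr(tts_sentence, babble, -6.0)
```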