Title: SpoTNet: A spoofing-aware Transformer Network for Effective Synthetic Speech Detection
The prevalence of voice spoofing attacks in today's digital world has become a critical security concern. Attackers employ various techniques, such as voice conversion (VC) and text-to-speech (TTS), to generate synthetic speech that imitates a victim's voice and gain access to sensitive information. Recent advances in synthetic speech generation pose a significant threat to modern security systems, and traditional voice authentication methods cannot detect these attacks effectively. To address this issue, this paper proposes a novel solution for logical access (LA)-based synthetic speech detection. SpoTNet is an attention-based spoofing transformer network that combines crafted front-end spoofing features with deep attentive features extracted by the developed logical spoofing transformer encoder (LSTE). The derived attentive features are then processed by the proposed multi-layer spoofing classifier to label speech samples as bona fide or synthetic. In synthetic speech produced by TTS algorithms, the spectral characteristics are altered to match the target speaker's formant frequencies, while in VC attacks, the temporal alignment of speech segments is manipulated to preserve the target speaker's prosodic features. Motivated by these observations, this paper targets prosodic and phonetic crafted features, i.e., the Mel-spectrogram, spectral contrast, and spectral envelope, and presents a preprocessing pipeline shown to be effective for synthetic speech detection. The proposed solution achieved state-of-the-art performance against eight recent feature-fusion methods, with a lower equal error rate (EER) of 0.95% on the ASVspoof-LA dataset, demonstrating its potential to advance the field of speaker identification and improve speaker recognition systems.
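
The crafted front-end features named above (Mel-spectrogram, spectral contrast, and spectral envelope) can all be computed with standard signal-processing tooling. The sketch below is a minimal illustration of that kind of preprocessing using librosa and SciPy; the sample rate, FFT size, hop length, and LPC order are assumed values for illustration, not the settings used in the paper, and the LPC-based envelope is only one common approximation.

    # Illustrative front-end feature extraction (not the paper's exact pipeline).
    # sr, n_fft, hop_length, and lpc_order are assumed values.
    import numpy as np
    import librosa
    import scipy.signal

    def crafted_features(path, sr=16000, n_fft=512, hop_length=160, lpc_order=16):
        y, sr = librosa.load(path, sr=sr)

        # 1) Log Mel-spectrogram: spectral energy distribution on a Mel scale.
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=80)
        log_mel = librosa.power_to_db(mel, ref=np.max)

        # 2) Spectral contrast: peak-to-valley energy difference per sub-band.
        contrast = librosa.feature.spectral_contrast(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)

        # 3) Spectral envelope, approximated frame-by-frame from LPC coefficients.
        frames = librosa.util.frame(y, frame_length=n_fft, hop_length=hop_length)
        envelopes = []
        for frame in frames.T:
            a = librosa.lpc(frame.astype(np.float64), order=lpc_order)
            _, h = scipy.signal.freqz(1.0, a, worN=n_fft // 2 + 1)
            envelopes.append(20.0 * np.log10(np.abs(h) + 1e-10))
        envelope = np.stack(envelopes, axis=1)  # (freq_bins, n_frames)

        return log_mel, contrast, envelope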
Award ID(s):
1815724
NSF-PAR ID:
10427031
Journal Name:
MAD '23: Proceedings of the 2nd ACM International Workshop on Multimedia AI against Disinformation
Page Range / eLocation ID:
10 to 18
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Jean-Jacques Rousseau ; Bill Kapralos ; Henrik I. Christensen ; Michael Jenkin ; Cheng-Lin (Ed.)
    Exponential growth in the use of smart speakers (SS) for the automation of homes, offices, and vehicles has brought a revolution of convenience to our lives. However, these SSs are susceptible to a variety of spoofing attacks, known/seen and unknown/unseen, created using cutting-edge AI generative algorithms. The realistic nature of these powerful attacks is capable of deceiving the automatic speaker verification (ASV) engines of these SSs, resulting in a huge potential for fraud using these devices. This vulnerability highlights the need for effective countermeasures capable of reliably detecting known and unknown spoofing attacks. This paper presents a novel end-to-end deep learning model, AEXANet, to effectively detect multiple types of physical- and logical-access attacks, both known and unknown. The proposed countermeasure learns low-level cues by analyzing raw audio and utilizes a dense convolutional network to propagate and strengthen diversified raw-waveform features. This system employs a maximum feature map activation function, which improves performance against unseen spoofing attacks while making the model more efficient, enabling its use in real-time applications. An extensive evaluation of our model was performed on the ASVspoof 2019 PA and LA datasets, along with TTS and VC samples, each containing both seen and unseen attacks. Moreover, a cross-corpus evaluation using the ASVspoof 2019 and ASVspoof 2015 datasets was also performed. Experimental results show the reliability of our method for voice spoofing detection.
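    The maximum feature map activation mentioned above has a compact standard form: split the channel dimension in half and keep the elementwise maximum. The PyTorch sketch below shows that standard operation with made-up layer sizes; it is not AEXANet's actual implementation.

        # Minimal Max Feature Map (MFM) sketch in PyTorch; sizes are illustrative.
        import torch
        import torch.nn as nn

        class MaxFeatureMap(nn.Module):
            """Split the channel dimension in half and keep the elementwise
            maximum, so the network performs competitive feature selection
            instead of applying a fixed nonlinearity such as ReLU."""
            def forward(self, x: torch.Tensor) -> torch.Tensor:
                a, b = torch.chunk(x, 2, dim=1)
                return torch.max(a, b)

        # Example: a raw-waveform conv block whose channels are halved by MFM.
        block = nn.Sequential(
            nn.Conv1d(in_channels=1, out_channels=64, kernel_size=128, stride=4),
            MaxFeatureMap(),                 # 64 -> 32 channels
            nn.BatchNorm1d(32),
        )
        waveform = torch.randn(2, 1, 16000)  # batch of two 1-second 16 kHz clips
        features = block(waveform)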
    With the advent of automated speaker verification (ASV) systems comes an equal and opposite development: malicious actors may seek to use voice spoofing attacks to fool those same systems. Various countermeasures have been proposed to detect these spoofing attacks, but current offerings in this arena fall short of a unified and generalized approach applicable in real-world scenarios. For this reason, defensive measures for ASV systems produced in the last 6-7 years need to be classified, and qualitative and quantitative comparisons of state-of-the-art (SOTA) countermeasures should be performed to assess the effectiveness of these systems against real-world attacks. Hence, in this work, we conduct a review of the literature on spoofing detection using hand-crafted features, deep learning, and end-to-end spoofing countermeasure solutions to detect logical access attacks, such as speech synthesis and voice conversion, and physical access attacks, i.e., replay attacks. Additionally, we review integrated and unified solutions to voice spoofing evaluation and speaker verification, and adversarial and anti-forensic attacks on both voice countermeasures and ASV systems. In an extensive experimental analysis, the limitations and challenges of existing spoofing countermeasures are presented, the performance of these countermeasures on several datasets is reported, and cross-corpus evaluations are performed, something that is nearly absent in the existing literature, in order to assess the generalizability of existing solutions. For the experiments, we employ the ASVspoof2019, ASVspoof2021, and VSDC datasets along with GMM, SVM, CNN, and CNN-GRU classifiers. For reproducibility of the results, the code of the testbed can be found at our GitHub repository (https://github.com/smileslab/Comparative-Analysis-Voice-Spoofing).
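    As a concrete illustration of the experimental setup described above, the sketch below shows a conventional GMM-based countermeasure baseline scored as a log-likelihood ratio, plus an equal error rate (EER) computation from the ROC curve. The mixture size and feature shapes are assumptions, not the configuration used in the testbed.

        # Illustrative GMM countermeasure baseline and EER computation.
        # Mixture size and feature dimensions are assumptions, not the testbed's setup.
        import numpy as np
        from sklearn.mixture import GaussianMixture
        from sklearn.metrics import roc_curve

        def train_gmms(bonafide_feats, spoof_feats, n_components=64):
            """Fit one diagonal-covariance GMM per class on frame-level
            features of shape (n_frames, feat_dim)."""
            gmm_bona = GaussianMixture(n_components, covariance_type="diag").fit(bonafide_feats)
            gmm_spoof = GaussianMixture(n_components, covariance_type="diag").fit(spoof_feats)
            return gmm_bona, gmm_spoof

        def utterance_score(feats, gmm_bona, gmm_spoof):
            """Average per-frame log-likelihood ratio; higher means more bona fide."""
            return np.mean(gmm_bona.score_samples(feats) - gmm_spoof.score_samples(feats))

        def equal_error_rate(labels, scores):
            """EER: the operating point where the false-accept and false-reject
            rates are (approximately) equal."""
            fpr, tpr, _ = roc_curve(labels, scores, pos_label=1)
            fnr = 1.0 - tpr
            idx = np.nanargmin(np.abs(fnr - fpr))
            return (fpr[idx] + fnr[idx]) / 2.0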
    Non-parallel many-to-many voice conversion remains an interesting but challenging speech processing task. Many style-transfer-inspired methods, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), have been proposed. Recently, AUTOVC, a conditional autoencoder (CAE)-based method, achieved state-of-the-art results by disentangling speaker identity and speech content using information-constraining bottlenecks, and it achieves zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice. However, we found that while speaker identity is disentangled from speech content, a significant amount of prosodic information, such as source F0, leaks through the bottleneck, causing the target F0 to fluctuate unnaturally. Furthermore, AUTOVC has no control over the converted F0 and is thus unsuitable for many applications. In this paper, we modify and improve autoencoder-based voice conversion to disentangle content, F0, and speaker identity at the same time. Therefore, we can control the F0 contour, generate speech with F0 consistent with the target speaker, and significantly improve quality and similarity. We support our improvement through quantitative and qualitative analysis.
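    One common preprocessing step behind the F0 disentanglement described above is to extract the F0 contour, remove speaker-dependent pitch statistics in the log domain, and quantize the result into one-hot bins for the decoder to consume. The sketch below illustrates that idea only; the normalization scheme, bin count, and function names are assumptions, not the paper's code.

        # Hypothetical F0-contour preprocessing for F0-conditioned conversion.
        # The normalization scheme and bin count are illustrative assumptions.
        import numpy as np
        import librosa

        def quantized_f0(y, sr=16000, hop_length=256, n_bins=256):
            # Extract F0 with probabilistic YIN; unvoiced frames come back as NaN.
            f0, voiced, _ = librosa.pyin(
                y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
                sr=sr, hop_length=hop_length)

            # Log-F0 z-scoring removes average pitch and range (per-speaker
            # statistics would be used in practice), leaving a
            # speaker-independent contour.
            log_f0 = np.log(np.where(voiced, f0, np.nan))
            mu, sigma = np.nanmean(log_f0), np.nanstd(log_f0) + 1e-8
            norm = np.nan_to_num((log_f0 - mu) / sigma)  # unvoiced -> 0 before binning

            # Quantize the contour into one-hot bins; bin 0 marks unvoiced frames.
            bins = np.clip(((norm + 3.0) / 6.0 * (n_bins - 2)).astype(int), 0, n_bins - 2) + 1
            bins = np.where(voiced, bins, 0)
            return np.eye(n_bins, dtype=np.float32)[bins]  # (n_frames, n_bins)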
  4. Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically-unpredictable sentences. This constitutes a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion. We found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful method to evaluate or pre-select voices in future work. 
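    The mel-cepstral distortion (MCD) measure mentioned above has a simple closed form over aligned mel-cepstral frames. The sketch below computes it with NumPy, assuming the two sequences are already time-aligned (in practice a dynamic time warping step usually precedes this, which is omitted here).

        # Minimal mel-cepstral distortion (MCD) sketch over pre-aligned frames.
        # Real evaluations usually DTW-align the two sequences first (omitted here).
        import numpy as np

        def mel_cepstral_distortion(mc_ref, mc_syn):
            """MCD in dB between two aligned mel-cepstral matrices of shape
            (n_frames, n_coeffs), excluding the 0th (energy) coefficient:
            MCD = (10 / ln 10) * sqrt(2 * sum_d (mc_ref_d - mc_syn_d)^2),
            averaged over frames."""
            diff = mc_ref[:, 1:] - mc_syn[:, 1:]            # drop c0
            per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
            return (10.0 / np.log(10.0)) * np.mean(per_frame)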
  5.
    This study tests speech-in-noise perception and social ratings of speech produced by different text-to-speech (TTS) synthesis methods. We used identical speaker training datasets for a set of 4 voices (using AWS Polly TTS), generated using neural and concatenative TTS. In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences in concatenative and neural TTS at two noise levels (-3 dB, -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, in the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on 4 social attributes. Neural TTS was rated as more human-like, natural, likeable, and familiar than concatenative TTS. Furthermore, how natural listeners rated the neural TTS voice was positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these patterns are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception.
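    Experiment 1 above presents sentences mixed with noise at -3 dB and -6 dB SNR. The helper below is a minimal sketch of how a masker can be scaled to reach a requested SNR before mixing; it assumes equal-length mono signals and is not tied to the study's actual stimulus pipeline.

        # Illustrative helper: mix speech with a masker at a requested SNR (dB).
        # Assumes equal-length mono signals; not the study's actual stimulus code.
        import numpy as np

        def mix_at_snr(speech, noise, snr_db):
            """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add."""
            p_speech = np.mean(speech ** 2)
            p_noise = np.mean(noise ** 2) + 1e-12
            gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
            return speech + gain * noise

        # The two noise levels used in Experiment 1:
        # mixed_minus3 = mix_at_snr(speech, noise, -3.0)
        # mixed_minus6 = mix_at_snr(speech, noise, -6.0)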