Title: SoK: The Faults in our ASRs: An Overview of Attacks against Automatic Speech Recognition and Speaker Identification Systems
Speech and speaker recognition systems are employed in a variety of applications, from personal assistants to telephony surveillance and biometric authentication. The wide deployment of these systems has been made possible by improvements in the accuracy of neural networks. Like other neural-network-based systems, speech and speaker recognition systems have recently been shown to be vulnerable to attacks using manipulated inputs. However, as we demonstrate in this paper, the end-to-end architecture of speech and speaker recognition systems and the nature of their inputs make attacks and defenses against them substantially different from those in the image space. We demonstrate this first by systematizing existing research in this space and providing a taxonomy through which the community can evaluate future work. We then demonstrate experimentally that attacks against these models almost universally fail to transfer. In so doing, we argue that substantial additional work is required to provide adequate mitigations in this space.
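The transferability experiment mentioned above is not reproduced here, but its basic shape can be sketched: craft an adversarial example against one model and check whether the same audio also fools a second, independently trained model. The following minimal PyTorch sketch illustrates that check; the toy speaker-ID architecture, the single-step FGSM perturbation on the raw waveform, and the random stand-in utterance are illustrative assumptions rather than the paper's actual setup.

# Minimal sketch of a cross-model transferability check for adversarial audio.
# The models, shapes, and attack (one FGSM step on the waveform) are illustrative
# assumptions, not the paper's experimental configuration.
import torch
import torch.nn as nn

class TinySpeakerNet(nn.Module):
    """Toy speaker-ID model operating directly on a 1-second, 16 kHz waveform."""
    def __init__(self, n_speakers=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(16, n_speakers),
        )
    def forward(self, x):
        return self.net(x)

def fgsm_attack(model, wav, label, eps=0.005):
    """Craft an adversarial waveform with one fast-gradient-sign step."""
    wav = wav.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(wav), label)
    loss.backward()
    return (wav + eps * wav.grad.sign()).detach().clamp(-1.0, 1.0)

source, target = TinySpeakerNet(), TinySpeakerNet()  # two independently initialized models
wav = torch.randn(1, 1, 16000) * 0.1                 # stand-in for a real utterance
label = torch.tensor([3])

adv = fgsm_attack(source, wav, label)
fools_source = source(adv).argmax(1) != label   # does the attack succeed on the source model?
transfers    = target(adv).argmax(1) != label   # does the same audio fool the target model?
print(f"fools source: {fools_source.item()}, transfers to target: {transfers.item()}")

The paper's finding that attacks almost universally fail to transfer corresponds to the second check coming back negative for independently trained targets, even when the first succeeds.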
Award ID(s):
1933208
NSF-PAR ID:
10287474
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the IEEE Symposium on Security and Privacy
ISSN:
1063-9578
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The prevalence of voice spoofing attacks in today’s digital world has become a critical security concern. Attackers employ various techniques, such as voice conversion (VC) and text-to-speech (TTS), to generate synthetic speech that imitates the victim’s voice and gain access to sensitive information. Recent advances in synthetic speech generation pose a significant threat to modern security systems, and traditional voice authentication methods are incapable of detecting them effectively. To address this issue, this paper proposes a novel solution for logical access (LA)-based synthetic speech detection. SpoTNet is an attention-based spoofing transformer network that combines crafted front-end spoofing features with deep attentive features extracted by the developed logical spoofing transformer encoder (LSTE). The derived attentive features are then processed by the proposed multi-layer spoofing classifier to classify speech samples as bona fide or synthetic. In synthetic speech produced by TTS algorithms, the spectral characteristics of the signal are altered to match the target speaker’s formant frequencies, while in VC attacks the temporal alignment of the speech segments is manipulated to preserve the target speaker’s prosodic features. Motivated by these observations, this paper targets prosodic and phonetic crafted features, i.e., the Mel-spectrogram, spectral contrast, and spectral envelope, presenting a preprocessing pipeline shown to be effective for synthetic speech detection. The proposed solution outperformed eight recent feature fusion methods, achieving a state-of-the-art EER of 0.95% on the ASVspoof-LA dataset and demonstrating its potential to advance the field of speaker identification and improve speaker recognition systems.
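The exact SpoTNet front end is not specified in the abstract, but the three crafted features it names can be approximated with standard tools. Below is a rough sketch using librosa; the frame sizes, mel resolution, and the cepstral-smoothing estimate of the spectral envelope are assumptions for illustration, not the authors' configuration.

# Rough sketch of the crafted front-end features named in the abstract
# (Mel-spectrogram, spectral contrast, spectral envelope), computed with librosa.
# Frame sizes and the cepstral-liftering envelope estimate are assumptions.
import numpy as np
import librosa

def crafted_features(wav_path, sr=16000, n_fft=512, hop=160):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=80)
    log_mel = librosa.power_to_db(mel)                                               # (80, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr, n_fft=n_fft, hop_length=hop)  # (7, T)

    # Spectral envelope via cepstral smoothing: keep only the low-quefrency cepstrum.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    log_spec = np.log(spec + 1e-8)
    ceps = np.fft.irfft(log_spec, axis=0)
    lifter = 30
    ceps[lifter:-lifter] = 0.0                            # drop fine harmonic structure
    envelope = np.fft.rfft(ceps, axis=0).real             # smoothed log envelope, (257, T)

    # Stack all three crafted features frame-by-frame for the downstream classifier.
    return np.concatenate([log_mel, contrast, envelope], axis=0)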
  2. With the advent of automated speaker verification (ASV) systems comes an equal and opposite development: malicious actors may seek to use voice spoofing attacks to fool those same systems. Various countermeasures have been proposed to detect these spoofing attacks, but current offerings in this arena fall short of a unified and generalized approach applicable in real-world scenarios. For this reason, defensive measures for ASV systems produced in the last 6-7 years need to be classified, and qualitative and quantitative comparisons of state-of-the-art (SOTA) countermeasures should be performed to assess the effectiveness of these systems against real-world attacks. Hence, in this work, we conduct a review of the literature on spoofing detection using hand-crafted features, deep learning, and end-to-end spoofing countermeasure solutions to detect logical access attacks, such as speech synthesis and voice conversion, and physical access attacks, i.e., replay attacks. Additionally, we review integrated and unified solutions to voice spoofing evaluation and speaker verification, and adversarial and anti-forensic attacks on both voice countermeasures and ASV systems. In an extensive experimental analysis, the limitations and challenges of existing spoofing countermeasures are presented, the performance of these countermeasures on several datasets is reported, and cross-corpus evaluations are performed, something that is nearly absent in the existing literature, in order to assess the generalizability of existing solutions. For the experiments, we employ the ASVspoof2019, ASVspoof2021, and VSDC datasets along with GMM, SVM, CNN, and CNN-GRU classifiers. For reproducibility of the results, the code of the testbed can be found at our GitHub Repository (https://github.com/smileslab/Comparative-Analysis-Voice-Spoofing).
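The cross-corpus protocol described above can be illustrated with a small sketch: fit bona fide and spoof models on one corpus, score a different corpus, and report the equal error rate. The sketch below uses scikit-learn GMMs over random placeholder features; the stubbed feature loader, component count, and the way the corpora are wired in are assumptions for illustration only, not the paper's testbed.

# Minimal sketch of a cross-corpus spoofing-countermeasure evaluation with GMMs.
# Feature extraction is stubbed out with random data; a real run would replace
# `load_features` with actual front-end features for each corpus.
import numpy as np
from sklearn.mixture import GaussianMixture

def load_features(corpus):
    """Placeholder: returns (features, labels) with label 1 = bona fide, 0 = spoof."""
    rng = np.random.default_rng(0 if corpus == "ASVspoof2019" else 1)
    return rng.normal(size=(200, 20)), rng.integers(0, 2, size=200)

def eer(scores, labels):
    """Equal error rate from per-utterance scores (higher score = more bona fide)."""
    order = np.argsort(scores)
    labels = labels[order]
    far = 1.0 - np.cumsum(labels == 0) / max((labels == 0).sum(), 1)   # spoofs accepted
    frr = np.cumsum(labels == 1) / max((labels == 1).sum(), 1)         # bona fide rejected
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

# Train bona fide / spoof GMMs on one corpus...
X_tr, y_tr = load_features("ASVspoof2019")
gmm_bona = GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(X_tr[y_tr == 1])
gmm_spoof = GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(X_tr[y_tr == 0])

# ...and evaluate on a different corpus (cross-corpus generalization).
X_te, y_te = load_features("VSDC")
scores = gmm_bona.score_samples(X_te) - gmm_spoof.score_samples(X_te)  # log-likelihood ratio
print(f"cross-corpus EER: {eer(scores, y_te):.3f}")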
  3. Learning to understand grounded language, which connects natural language to percepts, is a critical research area. Prior work in grounded language acquisition has focused primarily on textual inputs. In this work, we demonstrate the feasibility of performing grounded language acquisition on paired visual percepts and raw speech inputs. This will allow interactions in which language about novel tasks and environments is learned from end-users, reducing dependence on textual inputs and potentially mitigating the effects of demographic bias found in widely available speech recognition systems. We leverage recent work in self-supervised speech representation models and show that learned representations of speech can make language grounding systems more inclusive towards specific groups while maintaining or even increasing general performance. 
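The abstract does not name the self-supervised speech model used; as one plausible stand-in, the sketch below extracts wav2vec 2.0 representations with torchaudio's pretrained bundle and mean-pools them into an utterance embedding that could replace text embeddings in a grounding model. The audio path is illustrative.

# Sketch of extracting self-supervised speech representations for language grounding.
# wav2vec 2.0 (torchaudio's pretrained bundle) is used here purely as an assumed
# stand-in for whatever representation model the paper employs.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")                       # path is illustrative
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    features, _ = model.extract_features(waveform)                    # list of per-layer tensors
utterance_embedding = features[-1].mean(dim=1)                        # (1, 768) pooled representation

# `utterance_embedding` can then be paired with visual percept features in place
# of a text embedding when training the grounding model.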
  4. Security of machine learning is increasingly becoming a major concern due to the ubiquitous deployment of deep learning in many security-sensitive domains. Many prior studies have shown external attacks, such as adversarial examples, that tamper with the integrity of DNNs using maliciously crafted inputs. However, the security implications of internal threats (i.e., hardware vulnerabilities) to DNN models are not yet well understood. In this paper, we demonstrate DeepHammer, the first hardware-based attack on quantized deep neural networks, which deterministically induces bit flips in model weights to compromise DNN inference by exploiting the rowhammer vulnerability. DeepHammer performs an aggressive bit search in the DNN model to identify the most vulnerable weight bits that are flippable under system constraints. To trigger deterministic bit flips across multiple pages within a reasonable amount of time, we develop novel system-level techniques that enable fast deployment of victim pages, memory-efficient rowhammering, and precise flipping of targeted bits. DeepHammer can deliberately degrade the inference accuracy of the victim DNN system to a level that is no better than random guessing, thus completely depleting the intelligence of the targeted DNN systems. We systematically demonstrate our attacks on real systems against 11 DNN architectures with 4 datasets corresponding to different application domains. Our evaluation shows that DeepHammer is able to successfully tamper with DNN inference behavior at run-time within a few minutes. We further discuss several mitigation techniques at both the algorithm and system levels to protect DNNs against such attacks. Our work highlights the need to incorporate security mechanisms in future deep learning systems to enhance robustness against hardware-based deterministic fault injections.
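DeepHammer itself requires an actual rowhammer exploit (victim page placement, double-sided hammering), none of which is reproduced here. The sketch below only simulates the effect the attack relies on: flipping a single bit, the sign bit, of one int8-quantized weight and observing how the model output shifts. The toy linear layer and the simple per-tensor quantization are illustrative assumptions.

# Toy simulation of the effect DeepHammer exploits: one bit flip in an int8 weight.
# No rowhammer exploitation is performed; this only shows why a single flipped
# sign bit in a quantized weight can shift a model's outputs substantially.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(32, 10)                          # stand-in for one layer of a quantized DNN
x = torch.randn(1, 32)

# Quantize weights to int8 with a simple per-tensor scale.
with torch.no_grad():
    scale = model.weight.abs().max() / 127.0
    q_weights = torch.clamp((model.weight / scale).round(), -128, 127).to(torch.int8)

def flip_bit(q, row, col, bit):
    """Flip one bit of an int8 weight (bit 7 toggles the sign in two's complement)."""
    q = q.clone()
    mask_val = 1 << bit
    if mask_val > 127:
        mask_val -= 256                            # express the MSB mask as a signed int8
    q[row, col] = torch.bitwise_xor(q[row, col], torch.tensor(mask_val, dtype=torch.int8))
    return q

before = (q_weights.float() * scale) @ x.T
after = (flip_bit(q_weights, row=0, col=0, bit=7).float() * scale) @ x.T
print("logit shift on class 0 from one bit flip:", (after - before)[0, 0].item())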