skip to main content

Title: Hear "No Evil", See "Kenansville": Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems
Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state of the art systems, with minimal impact on human comprehension. Processing pipelines for modern systems are comprised of signal preprocessing and feature extraction steps, whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box, transferable, can be tuned to require zero queries to the target, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the IEEE Symposium on Security and Privacy
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Automatic Speech Recognition (ASR) systems are widely used in various online transcription services and personal digital assistants. Emerging lines of research have demonstrated that ASR systems are vulnerable to hidden voice commands, i.e., audio that can be recognized by ASRs but not by humans. Such attacks, however, often either highly depend on white-box knowledge of a specific machine learning model or require special hardware to construct the adversarial audio. This paper proposes a new model-agnostic and easily-constructed attack, called CommanderGabble, which uses fast speech to camouflage voice commands. Both humans and ASR systems often misinterpret fast speech, and such misinterpretation can be exploited to launch hidden voice command attacks. Specifically, by carefully manipulating the phonetic structure of a target voice command, ASRs can be caused to derive a hidden meaning from the manipulated, high-speed version. We implement the discovered attacks both over-the-wire and over-the-air, and conduct a suite of experiments to demonstrate their efficacy against 7 practical ASR systems. Our experimental results show that the over-the-wire attacks can disguise as many as 96 out of 100 tested voice commands into adversarial ones, and that the over-the-air attacks are consistently successful for all 18 chosen commands in multiple real-world scenarios. 
    more » « less
  2. Automatic Speech Recognition (ASR) systems convert speech into text and can be placed into two broad categories: traditional and fully end-to-end. Both types have been shown to be vulnerable to adversarial audio examples that sound benign to the human ear but force the ASR to produce malicious transcriptions. Of these attacks, only the "psychoacoustic" attacks can create examples with relatively imperceptible perturbations, as they leverage the knowledge of the human auditory system. Unfortunately, existing psychoacoustic attacks can only be applied against traditional models, and are obsolete against the newer, fully end-to-end ASRs. In this paper, we propose an equalization-based psychoacoustic attack that can exploit both traditional and fully end-to-end ASRs. We successfully demonstrate our attack against real-world ASRs that include DeepSpeech and Wav2Letter. Moreover, we employ a user study to verify that our method creates low audible distortion. Specifically, 80 of the 100 participants voted in favor of all our attack audio samples as less noisier than the existing state-of-the-art attack. Through this, we demonstrate both types of existing ASR pipelines can be exploited with minimum degradation to attack audio quality. 
    more » « less
  3. INTRODUCTION: CRSS-UTDallas initiated and oversaw the efforts to recover APOLLO mission communications by re-engineering the NASA SoundScriber playback system, and digitizing 30-channel analog audio tapes – with the entire Apollo-11, Apollo-13, and Gemini-8 missions during 2011-17 [1,6]. This vast data resource was made publicly available along with supplemental speech & language technologies meta-data based on CRSS pipeline diarization transcripts and conversational speaker time-stamps for Apollo team at NASA Mission Control Center, [2,4]. In 2021, renewed efforts over the past year have resulted in the digitization of an additional +50,000hrs of audio from Apollo 7,8,9,10,12 missions, and remaining A-13 tapes. Cumulative digitization efforts have enabled the development of the largest publicly available speech data resource with unprompted, real conversations recorded in naturalistic environments. Deployment of this massive corpus has inspired multiple collaborative initiatives such as Web resources ExploreApollo ( LanguageARC ( [3]. serves as the visualization and play-back tool, and LanguageARC the crowd source subject content tagging resource developed by UG/Grad. Students, intended as an educational resource for k-12 students, and STEM/Apollo enthusiasts. Significant algorithmic advancements have included advanced deep learning models that are now able to improve automatic transcript generation quality, and even extract high level knowledge such as ID labels of topics being spoken across different mission stages. Efficient transcript generation and topic extraction tools for this naturalistic audio have wide applications including content archival and retrieval, speaker indexing, education, group dynamics and team cohesion analysis. Some of these applications have been deployed in our online portals to provide a more immersive experience for students and researchers. Continued worldwide outreach in the form of the Fearless Steps Challenges has proven successful with the most recent Phase-4 of the Challenge series. This challenge has motivated research in low level tasks such as speaker diarization and high level tasks like topic identification. IMPACT: Distribution and visualization of the Apollo audio corpus through the above mentioned online portals and Fearless Steps Challenges have produced significant impact as a STEM education resource for K-12 students as well as a SLT development resource with real-world applications for research organizations globally. The speech technologies developed by CRSS-UTDallas using the Fearless Steps Apollo corpus have improved previous benchmarks on multiple tasks [1, 5]. The continued initiative will extend the current digitization efforts to include over 150,000 hours of audio recorded during all Apollo missions. ILLUSTRATION: We will demonstrate WebExploreApollo and LanguageARC online portals with newly digitized audio playback in addition to improved SLT baseline systems, the results from ASR and Topic Identification systems which will include research performed on the corpus conversational. Performance analysis visualizations will also be illustrated. We will also display results from the past challenges and their state-of-the-art system improvements. 
    more » « less
  4. Speech is a natural channel for human-computer interaction in robotics and consumer applications. Natural language understanding pipelines that start with speech can have trouble recovering from speech recognition errors. Black-box automatic speech recognition (ASR) systems, built for general purpose use, are unable to take advantage of in-domain language models that could otherwise ameliorate these errors. In this work, we present a method for re-ranking black-box ASR hypotheses using an in-domain language model and semantic parser trained for a particular task. Our re-ranking method significantly improves both transcription accuracy and semantic understanding over a state-of-the-art ASR’s vanilla output. 
    more » « less
  5. Abstract

    Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co‐speech gestures is a long‐standing problem in computer animation and is considered an enabling technology for creating believable characters in film, games, and virtual social spaces, as well as for interaction with social robots. The problem is made challenging by the idiosyncratic and non‐periodic nature of human co‐speech gesture motion, and by the great diversity of communicative functions that gestures encompass. The field of gesture generation has seen surging interest in the last few years, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep‐learning‐based generative models that benefit from the growing availability of data. This review article summarizes co‐speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule‐based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text and non‐linguistic input. Concurrent with the exposition of deep learning approaches, we chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method (e.g., optical motion capture or pose estimation from video). Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human‐like motion; grounding the gesture in the co‐occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.

    more » « less