Title: Hello, Is It Me You're Looking For?: Differentiating Between Human and Electronic Speakers for Voice Interface Security
Voice interfaces are increasingly becoming integrated into a variety of Internet of Things (IoT) devices. Such systems can dramatically simplify interactions between users and devices with limited displays. Unfortunately, voice interfaces also create new opportunities for exploitation. Specifically, any sound-emitting device within range of the system implementing the voice interface (e.g., a smart television, an Internet-connected appliance, etc.) can potentially cause these systems to perform operations against the desires of their owners (e.g., unlock doors, make unauthorized purchases, etc.). We address this problem by developing a technique to recognize fundamental differences in audio created by humans and electronic speakers. We identify sub-bass over-excitation, or the presence of significant low-frequency signals that are outside the range of human voices but inherent to the design of modern speakers, as a strong differentiator between these two sources. After identifying this phenomenon, we demonstrate its use in preventing adversarial requests, replayed audio, and hidden commands with a 100%/1.72% TPR/FPR in quiet environments. In so doing, we demonstrate that commands injected via nearby audio devices can be effectively removed by voice interfaces.
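A minimal sketch of what a sub-bass over-excitation check could look like, assuming a sub-bass band of roughly 20-80 Hz and an illustrative energy-ratio threshold; the paper's exact band edges, framing, and decision rule are not given in this abstract:

```python
# Minimal sketch of a sub-bass energy check (illustrative only).
# The band edges and threshold below are assumptions, not the paper's values.
import numpy as np

SUB_BASS_LOW_HZ = 20.0    # assumed lower edge of the sub-bass band
SUB_BASS_HIGH_HZ = 80.0   # assumed upper edge, below typical human voice energy
RATIO_THRESHOLD = 0.05    # assumed decision threshold on the energy ratio

def sub_bass_ratio(samples: np.ndarray, sample_rate: int) -> float:
    """Fraction of spectral energy falling in the sub-bass band."""
    spectrum = np.abs(np.fft.rfft(samples)) ** 2
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    band = (freqs >= SUB_BASS_LOW_HZ) & (freqs < SUB_BASS_HIGH_HZ)
    total = spectrum.sum()
    return float(spectrum[band].sum() / total) if total > 0 else 0.0

def looks_electronic(samples: np.ndarray, sample_rate: int) -> bool:
    """Flag audio whose sub-bass energy exceeds the assumed threshold."""
    return sub_bass_ratio(samples, sample_rate) > RATIO_THRESHOLD
```

The intuition is that a human vocal tract produces negligible energy this far below the voice band, while a loudspeaker's enclosure and driver excite it as a side effect of playback.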
Award ID(s):
1702879
NSF-PAR ID:
10091982
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the 11th ACM Conference on Security & Privacy in Wireless and Mobile Networks
Page Range / eLocation ID:
123 to 133
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Voice-controlled interfaces have vastly improved the usability of many devices (e.g., headless IoT systems). Unfortunately, the lack of authentication for these interfaces has also introduced command injection vulnerabilities: whether via compromised IoT devices, television ads, or simply malicious nearby neighbors, causing such devices to perform unauthenticated sensitive commands is relatively easy. We address these weaknesses with Two Microphone Authentication (2MA), which takes advantage of the presence of multiple ambient and personal devices operating in the same area. We develop an embodiment of 2MA that combines approximate localization through Direction of Arrival (DOA) techniques with Robust Audio Hashes (RSHs). Our results show that our 2MA system can localize a source to within a narrow physical cone (< 30°) with zero false positives, eliminate replay attacks, and prevent the injection of inaudible/hidden commands. As such, we dramatically increase the difficulty for an adversary to carry out such attacks and demonstrate that 2MA is an effective means of authenticating and localizing voice commands.
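The DOA step of a two-microphone system like 2MA could, for example, be approximated with a standard GCC-PHAT delay estimate between the two channels. The sketch below assumes a 10 cm microphone spacing and the usual speed of sound, and omits the robust audio hash comparison entirely; it is not the paper's implementation:

```python
# Illustrative two-microphone direction-of-arrival estimate via GCC-PHAT.
# Mic spacing and sound speed are assumptions; the robust audio hash step
# used by 2MA is not shown here.
import numpy as np

SPEED_OF_SOUND_MS = 343.0   # m/s, assumed
MIC_SPACING_M = 0.10        # 10 cm between microphones, assumed

def gcc_phat_delay(sig_a: np.ndarray, sig_b: np.ndarray, sample_rate: int) -> float:
    """Estimate the delay (seconds) of sig_b relative to sig_a via GCC-PHAT."""
    n = len(sig_a) + len(sig_b)
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    corr = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    corr = np.concatenate((corr[-max_shift:], corr[:max_shift + 1]))
    shift = np.argmax(np.abs(corr)) - max_shift
    return shift / float(sample_rate)

def direction_of_arrival_deg(sig_a: np.ndarray, sig_b: np.ndarray, sample_rate: int) -> float:
    """Convert the inter-microphone delay into an arrival angle (degrees)."""
    tau = gcc_phat_delay(sig_a, sig_b, sample_rate)
    # Clamp to the physically possible range before taking arcsin.
    ratio = np.clip(tau * SPEED_OF_SOUND_MS / MIC_SPACING_M, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```

Commands whose estimated angle falls outside the expected cone around the user, or whose audio hash disagrees across devices, would then be rejected.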
  2. Smart speakers come with always-on microphones to facilitate voice-based interaction. To address user privacy concerns, existing devices come with a number of privacy features: e.g., mute buttons and local trigger-word detection modules. But it is difficult for users to trust that these manufacturer-provided privacy features actually work given that there is a misalignment of incentives: Google, Meta, and Amazon benefit from collecting personal data and users know it. What’s needed is perceptible assurance — privacy features that users can, through physical perception, verify actually work. To that end, we introduce, implement, and evaluate the idea of “intentionally-powered” microphones to provide users with perceptible assurance of privacy with smart speakers. We employed an iterative-design process to develop Candid Mic, a battery-free, wireless microphone that can only be powered by harvesting energy from intentional user interactions. Moreover, users can visually inspect the (dis)connection between the energy harvesting module and the microphone. Through a within-subjects experiment, we found that Candid Mic provides users with perceptible assurance about whether the microphone is capturing audio or not, and improves user trust in using smart speakers relative to mute button interfaces. 
  3. Voice-controlled interactive smart speakers, such as Google Home, Amazon Echo, and Apple HomePod, are becoming commonplace in today's homes. These devices listen continually for user commands, which are triggered by special keywords such as "Alexa" and "Hey Siri". Recent research has shown that these devices are vulnerable to attacks through malicious voice commands from nearby devices. The commands can easily be sent during unoccupied periods, so the user may be unaware of such attacks. We present EchoSafe, a user-friendly sonar-based defense against these attacks. When the user sends a critical command to the smart speaker, EchoSafe emits an audio pulse and applies post-processing to determine whether the user is present in the room. We can detect the user's presence during critical commands with 93.13% accuracy, and our solution can be extended to defend against other attack scenarios as well.
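As a rough illustration of the sonar idea (not EchoSafe's actual post-processing), one could emit a chirp, matched-filter the recording, and compare the resulting echo profile against an empty-room baseline; all parameters and the comparison rule below are assumptions:

```python
# Sketch of a sonar-style presence check in the spirit of EchoSafe:
# emit a chirp, matched-filter the recording, and compare the echo profile
# against an empty-room baseline. Parameters are illustrative assumptions.
import numpy as np

def linear_chirp(f0: float, f1: float, duration_s: float, sample_rate: int) -> np.ndarray:
    """Generate a linear frequency sweep used as the probe pulse."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    k = (f1 - f0) / duration_s
    return np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))

def echo_profile(recording: np.ndarray, pulse: np.ndarray) -> np.ndarray:
    """Matched-filter the recording with the emitted pulse."""
    return np.abs(np.correlate(recording, pulse, mode="valid"))

def presence_detected(recording: np.ndarray,
                      baseline_recording: np.ndarray,
                      pulse: np.ndarray,
                      change_threshold: float = 0.2) -> bool:
    """Flag presence if the echo profile deviates from the empty-room baseline."""
    current = echo_profile(recording, pulse)
    baseline = echo_profile(baseline_recording, pulse)
    m = min(len(current), len(baseline))
    diff = np.linalg.norm(current[:m] - baseline[:m])
    return diff > change_threshold * (np.linalg.norm(baseline[:m]) + 1e-12)
```

A person in the room changes how the pulse reflects, so the measured echo profile drifts away from the baseline captured when the room was known to be empty.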
  4. The proliferation of the Internet of Things has increased reliance on voice-controlled devices to perform everyday tasks. Although these devices rely on accurate speech recognition for correct functionality, many users experience frequent misinterpretations in normal use. In this work, we conduct an empirical analysis of interpretation errors made by Amazon Alexa, the speech-recognition engine that powers the Amazon Echo family of devices. We leverage a dataset of 11,460 speech samples containing English words spoken by American speakers and identify where Alexa misinterprets the audio inputs, how often, and why. We find that certain misinterpretations appear consistently in repeated trials and are systematic. Next, we present and validate a new attack, called skill squatting. In skill squatting, an attacker leverages systematic errors to route a user to a malicious application without their knowledge. In a variant of the attack we call spear skill squatting, we further demonstrate that this attack can be targeted at specific demographic groups. We conclude with a discussion of the security implications of speech interpretation errors, countermeasures, and future work.
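A hypothetical sketch of how systematic misinterpretations might be surfaced from (intended word, transcribed word) pairs; the data, the 50% "systematic" cutoff, and the helper names below are illustrative and not the paper's methodology:

```python
# Illustrative analysis of systematic speech-recognition confusions.
# An attacker could register a skill under a consistently misheard name.
from collections import Counter, defaultdict

def systematic_confusions(pairs, min_rate=0.5, min_count=5):
    """Return words that are consistently transcribed as one specific wrong word."""
    by_word = defaultdict(Counter)
    for intended, transcribed in pairs:
        by_word[intended][transcribed] += 1
    confusions = {}
    for intended, counts in by_word.items():
        total = sum(counts.values())
        wrong_word, wrong_count = max(
            ((w, c) for w, c in counts.items() if w != intended),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if wrong_word and total >= min_count and wrong_count / total >= min_rate:
            confusions[intended] = (wrong_word, wrong_count / total)
    return confusions

# Hypothetical data: if "boil" is often heard as "boyle", registering a
# "boyle" skill would capture requests meant for a "boil" skill.
pairs = [("boil", "boyle")] * 4 + [("boil", "boil")] * 2 + [("coal", "coal")] * 6
print(systematic_confusions(pairs))
```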
  5. Encrypted voice-over-IP (VoIP) communication often uses variable bit rate (VBR) codecs to achieve good audio quality while minimizing bandwidth costs. Prior work has shown that encrypted VBR-based VoIP streams are vulnerable to re-identification attacks in which an attacker can infer attributes of the underlying audio (e.g., the language being spoken, the identities of the speakers, and key phrases) by analyzing the distribution of packet sizes. Existing defenses require the participation of both the sender and receiver to secure their VoIP communications. This paper presents Whisper, the first unilateral defense against re-identification attacks on encrypted VoIP streams. Whisper works by modifying the audio signal before it is encoded by the VBR codec, adding inaudible audio that either falls outside the fixed range of human hearing or lies within the audible range but is nearly imperceptible due to its low amplitude. By carefully inserting such noise, Whisper modifies the audio stream's distribution of packet sizes, significantly decreasing the accuracy of re-identification attacks. Its use is imperceptible to the (human) receiver. Whisper can be instrumented as an audio driver and requires no changes to existing (potentially closed-source) VoIP software. Since it is a unilateral defense, it can be applied at will by a user to enhance the privacy of their voice communications. We demonstrate that Whisper significantly reduces the accuracy of re-identification attacks and incurs only a small degradation in audio quality.
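The general idea of perturbing the signal before it reaches the VBR codec can be sketched as follows. The band edge, amplitude, and crude spectral masking are assumptions and do not reproduce Whisper's actual noise-shaping strategy; the sketch also assumes a 44.1 or 48 kHz capture rate so that energy above 16 kHz exists:

```python
# Minimal sketch of adding near-inaudible noise ahead of a VBR codec.
# The 16 kHz cutoff, amplitude, and FFT-based high-pass are assumptions,
# not Whisper's noise-shaping strategy.
import numpy as np

def add_near_inaudible_noise(samples: np.ndarray,
                             sample_rate: int,
                             cutoff_hz: float = 16000.0,
                             amplitude: float = 0.002,
                             rng=None) -> np.ndarray:
    """Add low-amplitude noise restricted to frequencies above cutoff_hz."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(len(samples))
    # Zero out the audible portion of the noise spectrum (crude high-pass).
    spectrum = np.fft.rfft(noise)
    freqs = np.fft.rfftfreq(len(noise), d=1.0 / sample_rate)
    spectrum[freqs < cutoff_hz] = 0.0
    shaped = np.fft.irfft(spectrum, n=len(noise))
    peak = np.max(np.abs(shaped)) + 1e-12
    return samples + amplitude * shaped / peak
```

Because the codec still has to encode this extra content, the sizes of the packets it emits no longer track the speech alone, which is what degrades the re-identification attack.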