

Title: Towards Vulnerability Analysis of Voice-Driven Interfaces and Countermeasures for Replay Attacks
Fake audio detection is expected to become an important research area in the field of smart speakers such as Google Home, Amazon Echo, and the chatbots developed for these platforms. This paper examines the replay-attack vulnerability of voice-driven interfaces and proposes a countermeasure to detect replay attacks on these platforms. It introduces a novel framework to model replay-attack distortion and then uses a non-learning-based method for replay-attack detection on smart speakers. The replay-attack distortion is modeled as a higher-order nonlinearity in the replayed audio, and higher-order spectral analysis (HOSA) is used to capture the characteristic distortions it introduces. The replay recordings are successfully injected into a Google Home device via Amazon Alexa using the drop-in conferencing feature. The effectiveness of the proposed HOSA-based scheme is evaluated using originally recorded speech as well as the corresponding recordings played back to the Google Home via Amazon Alexa's drop-in conferencing feature.
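As a rough illustration of the HOSA idea, the bicoherence (a normalized bispectrum) rises when a signal contains quadratic phase coupling of the kind a replay chain's loudspeaker can introduce. The sketch below is not the paper's method: the signal parameters, noise level, and decision margin are illustrative assumptions.

```python
import numpy as np

def bicoherence(x, nfft=256, step=128):
    """Segment-averaged bicoherence estimate b(f1, f2) in [0, 1].

    Quadratic (higher-order) nonlinearity, such as loudspeaker
    distortion in a replayed recording, produces phase coupling
    between f1, f2, and f1 + f2, which raises the bicoherence.
    """
    win = np.hanning(nfft)
    half = nfft // 2
    f = np.arange(half)
    num = np.zeros((half, half), dtype=complex)
    d1 = np.zeros((half, half))
    d2 = np.zeros((half, half))
    for i in range(0, len(x) - nfft + 1, step):
        seg = x[i:i + nfft]
        X = np.fft.fft(win * (seg - seg.mean()))
        F1 = X[f][:, None]                 # X(f1)
        F2 = X[f][None, :]                 # X(f2)
        F12 = X[f[:, None] + f[None, :]]   # X(f1 + f2)
        num += F1 * F2 * np.conj(F12)
        d1 += np.abs(F1 * F2) ** 2
        d2 += np.abs(F12) ** 2
    return np.abs(num) ** 2 / (d1 * d2 + 1e-12)

# Toy check: a 500 Hz tone through a quadratic nonlinearity,
# a crude stand-in for replay-chain distortion.
rng = np.random.default_rng(0)
fs, n = 8000, 16384
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 500 * t)
clean = tone + 0.05 * rng.standard_normal(n)
replayed = tone + 0.3 * tone ** 2 + 0.05 * rng.standard_normal(n)

k = round(500 * 256 / fs)  # FFT bin of the 500 Hz tone (k = 16)
b_clean, b_replay = bicoherence(clean), bicoherence(replayed)
# Phase coupling at (500 Hz, 500 Hz) is strong only in the replayed signal.
```

In this toy setup the replayed signal's bicoherence at the coupled frequency pair approaches 1, while the clean signal's stays near the noise floor, which is the separation a HOSA-based detector exploits.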
Award ID(s):
1815724 1816019
NSF-PAR ID:
10097312
Journal Name:
IEEE Conference on Multimedia Information Processing and Retrieval (MIPR)
Page Range / eLocation ID:
523 to 528
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Voice-controlled interactive smart speakers, such as Google Home, Amazon Echo, and Apple HomePod, are becoming commonplace in today's homes. These devices listen continually for user commands, which are triggered by special keywords such as "Alexa" and "Hey Siri". Recent research has shown that these devices are vulnerable to attacks through malicious voice commands from nearby devices. The commands can easily be sent during unoccupied periods, so the user may be unaware of such attacks. We present EchoSafe, a user-friendly sonar-based defense against these attacks. When the user sends a critical command to the smart speaker, EchoSafe emits an audio pulse and post-processes the returned echoes to determine whether the user is present in the room. We can detect the user's presence during critical commands with 93.13% accuracy, and our solution can be extended to defend against other attack scenarios as well.
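The core sonar step in the abstract above can be sketched as emitting a probe pulse and cross-correlating the microphone recording against it to locate echoes. The pulse design, sampling rate, and simulated room below are invented for illustration; EchoSafe's actual pipeline is not specified in this summary.

```python
import numpy as np

fs = 44100
rng = np.random.default_rng(0)

# Probe: a short near-ultrasonic tone burst, tapered to limit clicks.
t = np.arange(int(0.005 * fs)) / fs
pulse = np.sin(2 * np.pi * 19000 * t) * np.hanning(t.size)

def first_echo_delay(pulse, recording, fs):
    """Return the lag (seconds) of the strongest pulse echo,
    found by cross-correlating the recording with the probe."""
    corr = np.correlate(recording, pulse, mode="valid")
    return int(np.argmax(np.abs(corr))) / fs

# Simulated room: an attenuated copy of the pulse returns after 10 ms
# (about a 1.7 m round trip at 343 m/s), e.g. off a person or wall.
delay = int(0.010 * fs)
recording = 0.01 * rng.standard_normal(fs // 10)  # mic noise floor
recording[delay:delay + pulse.size] += 0.4 * pulse
```

A presence detector would compare the resulting echo profile against a baseline of the empty room rather than read off a single delay, but the correlation step is the same.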
  2. Smart speaker voice assistants (VAs) such as Amazon Echo and Google Home have been widely adopted due to their seamless integration with smart home devices and Internet of Things (IoT) technologies. These VA services raise privacy concerns, especially due to their access to our speech. This work considers one such use case: the unaccountable and unauthorized surveillance of a user's emotion via speech emotion recognition (SER). This paper presents DARE-GP, a solution that creates additive noise to mask users' emotional information while preserving the transcription-relevant portions of their speech. DARE-GP does this by using a constrained genetic programming approach to learn the spectral frequency traits that depict target users' emotional content, and then generating a universal adversarial audio perturbation that provides this privacy protection. Unlike existing works, DARE-GP provides: a) real-time protection of previously unheard utterances, b) against previously unseen black-box SER classifiers, c) while protecting speech transcription, and d) does so in a realistic acoustic environment. Further, this evasion is robust against defenses employed by a knowledgeable adversary. The evaluations in this work culminate in acoustic tests against two off-the-shelf commercial smart speakers, using a small-form-factor computer (Raspberry Pi) integrated with a wake-word system, to evaluate the efficacy of real-world, real-time deployment.
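The shape of such an evolutionary search for a masking perturbation can be illustrated with a toy: everything below is a hypothetical stand-in (a band-energy "emotion cue", made-up tone frequencies, a minimal mutate-and-select loop), not DARE-GP's genetic programming or its black-box SER targets.

```python
import numpy as np

rng = np.random.default_rng(2)
fs, n = 8000, 2048
t = np.arange(n) / fs

# Toy stand-in for an SER cue: relative spectral energy in 300-600 Hz.
freqs = np.fft.rfftfreq(n, 1 / fs)
band = (freqs >= 300) & (freqs < 600)

def emotion_score(x):
    mag = np.abs(np.fft.rfft(x))
    return mag[band].sum() / (mag.sum() + 1e-12)

# A "genome" holds amplitudes for four fixed masking tones placed
# outside the cue band (frequencies are invented for this demo).
tone_freqs = np.array([150.0, 900.0, 1500.0, 2300.0])
tones = np.sin(2 * np.pi * tone_freqs[:, None] * t[None, :])

def perturb(x, genome):
    return x + genome @ tones   # additive universal perturbation

speech = np.sin(2 * np.pi * 440 * t)  # toy "utterance" with in-band energy
base = emotion_score(speech)

def fitness(genome):
    # Lower the emotion cue while penalizing loud perturbations.
    return emotion_score(perturb(speech, genome)) + 0.01 * np.abs(genome).sum()

# Minimal mutate-and-select evolutionary loop.
pop = 0.05 * rng.random((16, 4))
for _ in range(30):
    children = np.clip(pop + 0.02 * rng.standard_normal(pop.shape), 0, 0.5)
    everyone = np.vstack([pop, children])
    pop = everyone[np.argsort([fitness(g) for g in everyone])][:16]

best = pop[0]
masked_score = emotion_score(perturb(speech, best))
```

The evolved perturbation lowers the toy emotion cue relative to the unperturbed utterance; the real system additionally constrains the search so transcription survives, which this sketch does not model.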

  3. The platformization of households is increasingly possible with the introduction of “intelligent personal assistants” (IPAs) embedded in smart, always-listening speakers and screens, such as Google Home and the Amazon Echo. These devices exemplify Zuboff’s “surveillance capitalism” by commodifying familial and social spaces and funneling data into corporate networks. However, the motivations driving the development of these platforms—and the dataveillance they afford—vary: Amazon appears focused on collecting user data to drive personalized sales across its shopping platform, while Google relies on its vast dataveillance infrastructure to build its AI-driven targeted advertising platform. This paper draws on cross-cultural focus groups regarding IPAs in the Netherlands and the United States. It reveals how respondents in these two countries articulate divergent ways of negotiating the dataveillance affordances and privacy concerns of these IPA platforms. These findings suggest the need for a nuanced approach to combating and limiting the potential harms of these home devices, which may otherwise be seen as equivalents. 
  4. Smart speakers come with always-on microphones to facilitate voice-based interaction. To address user privacy concerns, existing devices come with a number of privacy features: e.g., mute buttons and local trigger-word detection modules. But it is difficult for users to trust that these manufacturer-provided privacy features actually work given that there is a misalignment of incentives: Google, Meta, and Amazon benefit from collecting personal data and users know it. What’s needed is perceptible assurance — privacy features that users can, through physical perception, verify actually work. To that end, we introduce, implement, and evaluate the idea of “intentionally-powered” microphones to provide users with perceptible assurance of privacy with smart speakers. We employed an iterative-design process to develop Candid Mic, a battery-free, wireless microphone that can only be powered by harvesting energy from intentional user interactions. Moreover, users can visually inspect the (dis)connection between the energy harvesting module and the microphone. Through a within-subjects experiment, we found that Candid Mic provides users with perceptible assurance about whether the microphone is capturing audio or not, and improves user trust in using smart speakers relative to mute button interfaces. 
  5. The proliferation of the Internet of Things has increased reliance on voice-controlled devices to perform everyday tasks. Although these devices rely on accurate speech recognition for correct functionality, many users experience frequent misinterpretations in normal use. In this work, we conduct an empirical analysis of interpretation errors made by Amazon Alexa, the speech-recognition engine that powers the Amazon Echo family of devices. We leverage a dataset of 11,460 speech samples containing English words spoken by American speakers and identify where Alexa misinterprets the audio inputs, how often, and why. We find that certain misinterpretations appear consistently in repeated trials and are systematic. Next, we present and validate a new attack, called skill squatting. In skill squatting, an attacker leverages systematic errors to route a user to a malicious application without their knowledge. In a variant of the attack we call spear skill squatting, we further demonstrate that this attack can be targeted at specific demographic groups. We conclude with a discussion of the security implications of speech interpretation errors, countermeasures, and future work.
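The routing logic that skill squatting abuses can be reduced to a few lines: if a recognizer systematically maps an utterance to a wrong transcript, a skill registered under that wrong transcript captures the request. The word pair and skill names below are invented for illustration, not examples from the paper's dataset.

```python
# Hypothetical skill registry: the attacker registers a skill under
# the transcript the recognizer systematically produces by mistake.
legit_skills = {"fish facts": "LegitFishFacts"}
squatted = {"phish facts": "MaliciousSkill"}

def route(transcript):
    """Dispatch a transcript to whichever skill claims that name."""
    return legit_skills.get(transcript) or squatted.get(transcript, "no skill")

def asr(utterance):
    # Toy systematic error: the recognizer always hears "fish" as "phish".
    return utterance.replace("fish", "phish")
```

Because the error is systematic rather than random, the attacker captures the request on every attempt, which is what makes the attack practical.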