Voice-controlled interfaces are essential in modern smart devices, but they remain vulnerable to replay attacks that compromise voice authentication systems. Existing voice liveness detection methods often struggle to distinguish human speech from replayed audio. This paper introduces a novel approach, LiveGuard, which combines the wavelet scattering transform (WST) and Mel spectrogram scaling with a lightweight ResNet architecture to enhance voice liveness detection. WST captures robust hierarchical features, while Mel spectrogram scaling extracts fine-grained acoustic details, which the lightweight ResNet efficiently processes to identify live voice. Experimental results demonstrate accuracy improvements of 6% with WST and Mel spectrogram scaling, achieving a top accuracy of 97.17% on the POCO dataset. LiveGuard also demonstrates superior performance on the ASVspoof2019 and ASVspoof2021 benchmarks: it achieves the lowest equal error rate (EER) of 0.13% and a min t-DCF of 0.00126 on ASVspoof2019, and an EER of 0.42% on ASVspoof2021, surpassing state-of-the-art methods.
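The equal error rate (EER) reported above is the operating point at which the false acceptance rate and the false rejection rate coincide. The following is a minimal sketch of how an EER can be estimated from detector scores. It is not the LiveGuard evaluation code; the function name `equal_error_rate`, the convention that higher scores mean "more likely live", and the sample scores are all assumptions made for illustration.

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Estimate (eer, threshold) from two lists of detector scores.

    Assumption: a higher score means the detector considers the
    sample more likely to be live (genuine) speech.
    """
    genuine = list(genuine_scores)
    spoof = list(spoof_scores)
    best = None
    # Sweep every observed score as a candidate decision threshold.
    for t in sorted(set(genuine + spoof)):
        # False rejection rate: genuine samples scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance rate: replayed samples scored at or above it.
        far = sum(s >= t for s in spoof) / len(spoof)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical scores, not taken from any dataset named above.
    live = [0.9, 0.8, 0.7, 0.6]
    replayed = [0.1, 0.2, 0.3, 0.65]
    eer, threshold = equal_error_rate(live, replayed)
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of scores the two error curves rarely cross exactly, so the sketch reports the average of the two rates at the closest point. Benchmark toolkits may interpolate instead.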
A Continuous Articulatory Gesture Based Liveness Detection for Voice Authentication on Smart Devices
Voice biometrics is drawing increasing attention for user authentication on smart devices. However, voice biometrics is vulnerable to replay attacks, where adversaries try to spoof voice authentication systems using pre-recorded voice samples collected from genuine users. To this end, we propose VoiceGesture, a liveness detection solution for voice authentication on smart devices such as smartphones and smart speakers. With audio hardware advances on smart devices, VoiceGesture leverages the built-in speaker and microphone pairs on smart devices as a Doppler radar to sense articulatory gestures for liveness detection during voice authentication. Experiments with 21 participants and different smart devices show that VoiceGesture achieves over 99% and around 98% detection accuracy for text-dependent and text-independent liveness detection, respectively. Moreover, VoiceGesture is robust to different device placements and low audio sampling frequencies, and it supports medium-range liveness detection on smart speakers in various use scenarios, including smart homes and smart vehicles.
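The Doppler radar idea above rests on a standard relation: when a tone of frequency f0 reflects off a surface moving at velocity v (with v much smaller than the speed of sound c), the echo received at a co-located microphone is shifted by roughly 2 * v * f0 / c. The sketch below works through that arithmetic. The abstract does not state the probe tone frequency or the articulator speeds, so the 20 kHz carrier and the 0.1 m/s velocity are purely illustrative assumptions, and the function names are hypothetical.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value for dry air near 20 degrees C


def doppler_shift_hz(carrier_hz, velocity_m_s, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Approximate Doppler shift of a tone reflected off a moving surface.

    Positive velocity means the surface moves toward the device.
    The factor of 2 accounts for the round trip (emit, reflect, receive).
    Valid only when the velocity is far below the speed of sound.
    """
    return 2.0 * velocity_m_s * carrier_hz / speed_of_sound


def velocity_from_shift(carrier_hz, shift_hz, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Invert the relation above to recover the reflector velocity."""
    return shift_hz * speed_of_sound / (2.0 * carrier_hz)


if __name__ == "__main__":
    carrier = 20_000.0   # assumed near-inaudible probe tone, in Hz
    lip_speed = 0.1      # assumed articulator speed, in m/s
    shift = doppler_shift_hz(carrier, lip_speed)
    print(f"Doppler shift: {shift:.2f} Hz")
    print(f"Recovered velocity: {velocity_from_shift(carrier, shift):.3f} m/s")
```

Under these assumed values the shift is on the order of ten hertz, which suggests why a liveness detector of this kind would need fine frequency resolution around the probe tone.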
- PAR ID: 10360837
- Date Published:
- Journal Name: IEEE Internet of Things Journal
- ISSN: 2372-2541
- Page Range / eLocation ID: 1 to 14
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- In recent years, biometrics (e.g., fingerprint or face recognition) has replaced traditional passwords and PINs as a widely used method for user authentication, particularly in personal or mobile devices. Differing from state-of-the-art biometrics, heart biometrics offer the advantage of liveness detection, which provides strong tolerance to spoofing attacks. To date, several authentication methods primarily focusing on electrocardiogram (ECG) have demonstrated remarkable success; however, the degree of exploration with other cardiac signals is still limited. To this end, we discuss the challenges in various cardiac domains and propose future prospects for developing effective heart biometrics systems in real-world applications.
- The popularity of smart home devices has led to an increase in security incidents happening in smart homes. A key measure to avoid such incidents is to authenticate users before they can interact with smart devices. However, current methods often require additional hardware. This article proposes STATION, an effective gesture-based authentication system built on top of the voice interfaces already available in these smart home devices, without adding new hardware. STATION uses a gesture processing pipeline that identifies Doppler-existing frames and detects the direction of arrival of the reflection to authenticate users in low-SNR environments and at longer distances. Furthermore, owing to the nature of gesture-based authentication, the system also supports detecting user liveness, preventing replay and synthesis attacks from remote attackers. The evaluation of STATION shows high accuracy, with a false acceptance rate (FAR) of 0.08% and a false rejection rate (FRR) of 3.10% for users within 1.5 m of the device.
- Internet-connected voice-controlled speakers, also known as smart speakers, are increasingly popular due to their convenience for everyday tasks such as asking about the weather forecast or playing music. However, such convenience comes with privacy risks: smart speakers need to constantly listen in order to activate when the “wake word” is spoken, and are known to transmit audio from their environment and record it on cloud servers. In particular, this paper focuses on the privacy risk from smart speaker misactivations, i.e., when they activate, transmit, and/or record audio from their environment when the wake word is not spoken. To enable repeatable, scalable experiments for exposing smart speakers to conversations that do not contain wake words, we turn to playing audio from popular TV shows from diverse genres. After playing two rounds of 134 hours of content from 12 TV shows near popular smart speakers in both the US and the UK, we observed cases of 0.95 misactivations per hour, or 1.43 times for every 10,000 words spoken, with some devices having 10% of their misactivation durations lasting at least 10 seconds. We characterize the sources of such misactivations and their implications for consumers, and discuss potential mitigations.
- Fake audio detection is expected to become an important research area in the field of smart speakers such as Google Home, Amazon Echo, and chatbots developed for these platforms. This paper presents the replay attack vulnerability of voice-driven interfaces and proposes a countermeasure to detect replay attacks on these platforms. It introduces a novel framework to model replay attack distortion and then uses a non-learning-based method for replay attack detection on smart speakers. The replay attack distortion is modeled as a higher-order nonlinearity in the replay attack audio. Higher-order spectral analysis (HOSA) is used to capture characteristic distortions in the replay audio. The replay attack recordings are successfully injected into the Google Home device via Amazon Alexa using the drop-in conferencing feature. The effectiveness of the proposed HOSA-based scheme is evaluated using original recorded speech as well as the corresponding recordings played back to the Google Home via Amazon Alexa using the drop-in conferencing feature.