Title: I Spy You: Eavesdropping Continuous Speech on Smartphones via Motion Sensors.
This paper presents iSpyU, a system that demonstrates the feasibility of recognizing natural speech content played on a phone during conference calls (Skype, Zoom, etc.) using a fusion of motion sensors such as the accelerometer and gyroscope. While microphones require user permission before an app developer can access them, motion sensors are zero-permission sensors, accessible by a developer without alerting the user. This allows a malicious app to potentially eavesdrop on sensitive speech content played by the user's phone. In designing the attack, iSpyU tackles a number of technical challenges, including (i) the low sampling rate of motion sensors (500 Hz, compared to 44 kHz for a microphone), and (ii) the lack of large-scale training datasets for training Automatic Speech Recognition (ASR) models on motion-sensor data. iSpyU systematically addresses these challenges through a combination of techniques in synthetic training data generation, ASR modeling, and domain adaptation. Extensive measurement studies on modern smartphones show a word-level accuracy of 53.3 - 59.9% over a dictionary of 2000-10000 words, and a character-level accuracy of 70.0 - 74.8%. We believe such levels of accuracy pose a significant threat when viewed from a privacy perspective.
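A minimal sketch (not the paper's code) of the sampling-rate gap the attack must overcome: resampling a microphone-rate signal down to the 500 Hz motion-sensor rate shows that, by Nyquist, only content below 250 Hz survives, so the ASR model must infer most phonetic detail rather than observe it directly.

```python
# Illustrative sketch, not from the paper: view "speech" through the
# bandwidth of a 500 Hz motion sensor instead of a 44.1 kHz microphone.
import numpy as np
from scipy.signal import resample_poly

MIC_RATE = 44_100   # typical microphone sampling rate (Hz)
IMU_RATE = 500      # motion-sensor rate cited in the abstract (Hz)

rng = np.random.default_rng(0)
speech = rng.standard_normal(MIC_RATE)  # 1 second of stand-in "speech"

# Anti-alias filter and decimate to the IMU rate.
imu_view = resample_poly(speech, up=IMU_RATE, down=MIC_RATE)

print(f"samples/s at mic: {MIC_RATE}, at IMU: {len(imu_view)}")
print(f"Nyquist bandwidth at IMU rate: {IMU_RATE / 2:.0f} Hz")  # 250 Hz
```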
Award ID(s): 2008384
PAR ID: 10395301
Author(s) / Creator(s): ; ;
Publisher / Repository: ACM
Date Published:
Journal Name: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Volume: 6
Issue: 4
ISSN: 2474-9567
Page Range / eLocation ID: 1 to 31
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Audio-based human activity recognition (HAR) is widely used because many human activities have unique sound signatures that can be detected using machine learning (ML) approaches. These audio-based ML HAR pipelines often use common featurization techniques, such as extracting various statistical and spectral features by converting time-domain signals to the frequency domain (using an FFT) and using them to train ML models. Some of these approaches also claim privacy benefits by preventing the identification of human speech. However, recent deep-learning-based automatic speech recognition (ASR) models pose new privacy challenges to these featurization techniques. In this paper, we systematically evaluate various featurization approaches for audio data, assessing their privacy risks through speech-intelligibility metrics (PER and WER) while considering the utility tradeoff in terms of ML-based activity recognition accuracy. Our findings reveal the susceptibility of these approaches to speech content recovery when exposed to recent ASR models, especially under re-tuning or retraining conditions. Notably, fine-tuned ASR models achieved an average Phoneme Error Rate (PER) of 39.99% and Word Error Rate (WER) of 44.43% in speech recognition for these approaches. To address these privacy concerns, we propose Kirigami, a lightweight machine-learning-based audio speech filter that removes human speech segments, reducing the efficacy of ASR models (70.48% PER and 101.40% WER) while maintaining HAR accuracy (76.0%). We show that Kirigami can be implemented on common edge microcontrollers with limited computational capabilities and memory, providing a path to deployment on a variety of IoT devices. Finally, we conducted a real-world user study and showed the robustness of Kirigami on a laptop and an ARM Cortex-M4F microcontroller under three different background noises.
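A minimal sketch (illustrative assumptions, not the paper's code) of the kind of FFT-based featurization such HAR pipelines use: a time-domain frame is windowed, transformed to a magnitude spectrum, and summarized with statistical features. The paper's point is that even summaries like these can leak recoverable speech content to a fine-tuned ASR model.

```python
# Sketch of a common audio-HAR featurization step (assumed, for illustration).
import numpy as np

def spectral_features(frame: np.ndarray, rate: int = 16_000) -> np.ndarray:
    """Return simple statistical/spectral features for one audio frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
    centroid = np.sum(freqs * mag) / (np.sum(mag) + 1e-9)  # spectral centroid
    return np.array([
        frame.mean(), frame.std(),        # time-domain statistics
        mag.mean(), mag.std(), centroid,  # frequency-domain statistics
    ])

frame = np.random.default_rng(1).standard_normal(1024)
print(spectral_features(frame))
```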
  2. Measuring how well human listeners recognize speech under varying environmental conditions (speech intelligibility) is a challenge for theoretical, technological, and clinical approaches to speech communication. The current gold standard, human transcription, is time- and resource-intensive. Recent advances in automatic speech recognition (ASR) systems raise the possibility of automating intelligibility measurement. This study tested four state-of-the-art ASR systems with second-language speech-in-noise and found that one, Whisper, performed at or above human listener accuracy. However, the content of Whisper's responses diverged substantially from human responses, especially at lower signal-to-noise ratios, suggesting both opportunities and limitations for ASR-based speech intelligibility modeling.
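A minimal sketch (illustrative, not the study's code) of Word Error Rate, the standard metric for scoring ASR output against a reference transcript in intelligibility work: the word-level Levenshtein distance (substitutions, insertions, deletions) normalized by the reference length. Because insertions count, WER can exceed 100%, as in the 101.40% figure in the first abstract above.

```python
# Word Error Rate via dynamic programming over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sad"))  # 0.333... (one substitution)
```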
  3. Fearless Steps (FS) APOLLO is a 50,000+ hr audio resource established by CRSS-UTDallas capturing all communications between NASA-MCC personnel, backroom staff, and astronauts across the manned Apollo missions. Such a massive audio resource without metadata (an unlabeled corpus) provides limited benefit for communities outside Speech and Language Technology (SLT). Supplementing this audio with rich metadata, developed using robust automated mechanisms to transcribe and highlight naturalistic communications, can open research opportunities for SLT, speech sciences, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125 h FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the availability of word-level annotations. Forced-alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores the challenges of extracting accurate keyword segments from existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams.
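A minimal sketch (a deliberately naive baseline, not the CRSS-UTDallas pipeline) of why sentence-level transcriptions make keyword segmentation hard: with only a start and end time per sentence, a keyword's span can only be guessed, for example by dividing the sentence duration uniformly across words. Real speech rates vary widely, which is the core problem the study's domain-specific KWS models address.

```python
# Naive keyword-segment guess from a sentence-level transcription (assumed).
def keyword_segment(sentence: str, start: float, end: float, keyword: str):
    """Guess (start, end) seconds for `keyword` inside a timed sentence."""
    words = sentence.lower().split()
    if keyword.lower() not in words:
        return None
    per_word = (end - start) / len(words)   # assume uniform word durations
    i = words.index(keyword.lower())
    return (start + i * per_word, start + (i + 1) * per_word)

print(keyword_segment("go for launch flight", 12.0, 14.0, "launch"))
# (13.0, 13.5) -- crude; real word durations vary with speaking rate.
```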
  4. Automatic speech recognition (ASR) systems for children have lagged behind adult ASR in performance. The exact problems and evaluation methods for child ASR have not yet been fully investigated. Recent work from the robotics community suggests that ASR for kindergarten speech is especially difficult, even though this age group may benefit most from voice-based educational and diagnostic tools. Our study focused on ASR performance for specific grade levels (K-10) using a word identification task. Grade-specific ASR systems were evaluated, with particular attention to kindergarten-aged children (5-6 years old). Experiments investigated grade-specific interactions with triphone models using feature-space maximum likelihood linear regression (fMLLR), vocal tract length normalization (VTLN), and subglottal resonance (SGR) normalization. Our results indicate that kindergarten ASR performs dramatically worse than even 1st-grade ASR, likely due to large speech variability at that age. As such, ASR systems may require targeted evaluations on kindergarten speech rather than being evaluated under the guise of “child ASR.” Additionally, results show that systems trained in matched conditions on kindergarten speech may be less suitable than mismatched-grade training with 1st-grade speech. Finally, we analyzed the phonetic errors made by the kindergarten ASR.
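A minimal sketch (illustrative, not the study's code) of the VTLN idea named above: children's shorter vocal tracts shift formants upward, so a per-speaker warp factor alpha rescales the frequency axis before feature extraction, with alpha typically searched over roughly 0.88-1.12. The piecewise-linear form below, pinned at the Nyquist frequency, is one common formulation; the knee at 0.7 x Nyquist is an assumption for illustration.

```python
# Piecewise-linear VTLN frequency warp (one common formulation, assumed here).
import numpy as np

def vtln_warp(freqs: np.ndarray, alpha: float, f_nyquist: float) -> np.ndarray:
    """Warp frequencies by alpha, keeping the Nyquist endpoint fixed."""
    f0 = 0.7 * f_nyquist  # knee below which the warp is purely linear
    return np.where(
        freqs <= f0,
        alpha * freqs,
        alpha * f0 + (f_nyquist - alpha * f0) * (freqs - f0) / (f_nyquist - f0),
    )

freqs = np.linspace(0, 8000, 5)
print(vtln_warp(freqs, alpha=0.9, f_nyquist=8000.0))  # endpoint stays 8000 Hz
```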
  5. Despite the lack of invariance problem (the many-to-many mapping between acoustics and percepts), human listeners experience phonetic constancy and typically perceive what a speaker intends. Most models of human speech recognition (HSR) have side-stepped this problem, working with abstract, idealized inputs and deferring the challenge of working with real speech. In contrast, carefully engineered deep learning networks allow robust, real-world automatic speech recognition (ASR). However, the complexities of deep learning architectures and training regimens make it difficult to use them to provide direct insights into mechanisms that may support HSR. In this brief article, we report preliminary results from a two-layer network that borrows one element from ASR, long short-term memory (LSTM) nodes, which provide dynamic memory over a range of temporal spans. This allows the model to learn to map real speech from multiple talkers to semantic targets with high accuracy, with a human-like timecourse of lexical access and phonological competition. Internal representations emerge that resemble phonetically organized responses in human superior temporal gyrus, suggesting that the model develops a distributed phonological code despite no explicit training on phonetic or phonemic targets. The ability to work with real speech is a major advance for cognitive models of HSR.
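A minimal sketch (placeholder dimensions, not the authors' model) of the architecture the article describes: an LSTM layer providing dynamic memory over acoustic feature frames, with a linear readout to semantic targets, so the network's output at each frame traces a timecourse of semantic activation.

```python
# Sketch of an LSTM speech-to-semantics mapper; sizes are assumptions.
import torch
import torch.nn as nn

class SpeechToSemantics(nn.Module):
    def __init__(self, n_features=40, hidden=256, n_semantic=300):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.readout = nn.Linear(hidden, n_semantic)  # semantic target layer

    def forward(self, frames):            # frames: (batch, time, features)
        states, _ = self.lstm(frames)     # hidden state at every frame
        return self.readout(states)       # semantic activation over time

model = SpeechToSemantics()
x = torch.randn(1, 100, 40)               # one utterance, 100 feature frames
print(model(x).shape)                      # torch.Size([1, 100, 300])
```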