Title: I Spy You: Eavesdropping Continuous Speech on Smartphones via Motion Sensors.
This paper presents iSpyU, a system that demonstrates the feasibility of recognizing natural speech content played on a phone during conference calls (Skype, Zoom, etc.) using a fusion of motion sensors such as the accelerometer and gyroscope. While an app needs the user's permission to access the microphone, motion sensors are zero-permission sensors and are therefore accessible to a developer without alerting the user. This allows a malicious app to potentially eavesdrop on sensitive speech content played by the user's phone. In designing the attack, iSpyU tackles a number of technical challenges, including: (i) the low sampling rate of motion sensors (500 Hz, compared with 44 kHz for a microphone), and (ii) the lack of large-scale training datasets for training Automatic Speech Recognition (ASR) models on motion sensor data. iSpyU systematically addresses these challenges through a combination of techniques in synthetic training data generation, ASR modeling, and domain adaptation. Extensive measurement studies on modern smartphones show a word-level accuracy of 53.3 - 59.9% over a dictionary of 2000-10000 words, and a character-level accuracy of 70.0 - 74.8%. We believe such levels of accuracy pose a significant threat when viewed from a privacy perspective.
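To put the sampling-rate challenge in perspective, the short sketch below applies the Nyquist criterion (a signal sampled at rate f_s can only represent frequencies up to f_s / 2) to the two rates quoted in the abstract. The function and constant names are illustrative assumptions, not part of iSpyU, and the calculation ignores sensor noise and any filtering the hardware applies.

```python
# Illustrative only: names below are assumptions, not taken from the iSpyU paper.

def nyquist_limit_hz(sampling_rate_hz: float) -> float:
    """Highest frequency representable without aliasing at a given sampling rate."""
    return sampling_rate_hz / 2.0


# Sampling rates as quoted in the abstract.
MOTION_SENSOR_RATE_HZ = 500.0
MICROPHONE_RATE_HZ = 44_000.0

motion_limit = nyquist_limit_hz(MOTION_SENSOR_RATE_HZ)
mic_limit = nyquist_limit_hz(MICROPHONE_RATE_HZ)

print(f"Motion sensor bandwidth: up to {motion_limit:.0f} Hz")
print(f"Microphone bandwidth:    up to {mic_limit:.0f} Hz")
print(f"Bandwidth ratio:         {mic_limit / motion_limit:.0f}x")
```

Under these assumptions a motion sensor captures only the lowest few hundred hertz of the signal, which suggests why recognition from such data is substantially harder than from microphone audio.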
Award ID(s):
2008384
PAR ID:
10395301
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Volume:
6
Issue:
4
ISSN:
2474-9567
Page Range / eLocation ID:
1 to 31
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Measuring how well human listeners recognize speech under varying environmental conditions (speech intelligibility) is a challenge for theoretical, technological, and clinical approaches to speech communication. The current gold standard—human transcription—is time- and resource-intensive. Recent advances in automatic speech recognition (ASR) systems raise the possibility of automating intelligibility measurement. This study tested 4 state-of-the-art ASR systems with second language speech-in-noise and found that one, whisper, performed at or above human listener accuracy. However, the content of whisper's responses diverged substantially from human responses, especially at lower signal-to-noise ratios, suggesting both opportunities and limitations for ASR-based speech intelligibility modeling.
  2. Fearless Steps (FS) APOLLO is a 50,000+ hr audio resource established by CRSS-UTDallas capturing all communications between NASA-MCC personnel, backroom staff, and Astronauts across manned Apollo Missions. Such a massive audio resource, when unlabeled and lacking metadata, provides limited benefit for communities outside Speech-and-Language Technology (SLT). Supplementing this audio with rich metadata developed using robust automated mechanisms to transcribe and highlight naturalistic communications can facilitate open research opportunities for SLT, speech sciences, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125 h FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the limited availability of word-level annotations. Forced alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores challenges in extracting accurate keyword segments using existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams.
  3. Automatic speech recognition (ASR) systems for children have lagged behind in performance when compared to adult ASR. The exact problems and evaluation methods for child ASR have not yet been fully investigated. Recent work from the robotics community suggests that ASR for kindergarten speech is especially difficult, even though this age group may benefit most from voice-based educational and diagnostic tools. Our study focused on ASR performance for specific grade levels (K-10) using a word identification task. Grade-specific ASR systems were evaluated, with particular attention placed on the evaluation of kindergarten-aged children (5-6 years old). Experiments included investigation of grade-specific interactions with triphone models using feature space maximum likelihood linear regression (fMLLR), vocal tract length normalization (VTLN), and subglottal resonance (SGR) normalization. Our results indicate that kindergarten ASR performs dramatically worse than even 1st grade ASR, likely due to large speech variability at that age. As such, ASR systems may require targeted evaluations on kindergarten speech rather than being evaluated under the guise of “child ASR.” Additionally, results show that systems trained in matched conditions on kindergarten speech may be less suitable than mismatched-grade training with 1st grade speech. Finally, we analyzed the phonetic errors made by the kindergarten ASR. 
  4. The application of deep learning to automatic speech recognition (ASR) has yielded dramatic accuracy increases for languages with abundant training data, but languages with limited training resources have yet to see accuracy improvements on this scale. In this paper, we compare a fully convolutional approach for acoustic modeling in ASR with a variety of established acoustic modeling approaches. We evaluate our method on Seneca, a low-resource endangered language spoken in North America. Our method yields word error rates up to 40% lower than those reported using both standard GMM-HMM approaches and established deep neural methods, with a substantial reduction in training time. These results show particular promise for languages like Seneca that are both endangered and lack extensive documentation.
  5. Children’s automatic speech recognition (ASR) is difficult due in part to data scarcity, especially for kindergarten-aged children. When data are scarce, the model might overfit to the training data, and hence good starting points for training are essential. Recently, meta-learning was proposed to learn model initialization (MI) for ASR tasks of different languages. This method leads to good performance when the model is adapted to an unseen language. However, MI is vulnerable to overfitting on training tasks (learner overfitting). It is also unknown whether MI generalizes to other low-resource tasks. In this paper, we validate the effectiveness of MI in children’s ASR and attempt to alleviate the problem of learner overfitting. To achieve model-agnostic meta-learning (MAML), we regard children’s speech at each age as a different task. To address learner overfitting, we propose a task-level augmentation method that simulates new ages using frequency warping techniques. Detailed experiments are conducted to show the impact of task augmentation on each age for kindergarten-aged speech. As a result, our approach achieves a relative word error rate (WER) improvement of 51% over the baseline system with no augmentation or initialization.
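Several of the abstracts above report results as word error rates or relative WER improvements. As a reading aid, the sketch below shows one common way these quantities are computed: WER as the word-level edit distance divided by the reference length, and relative improvement as the fractional reduction from a baseline. The function names and the example numbers are hypothetical and are not taken from any of the listed papers, whose scoring tools may differ in detail.

```python
# Illustrative only: function names and example values are assumptions.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fractional WER reduction relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: one substituted word out of three.
wer = word_error_rate("the cat sat", "the dog sat")
print(f"WER: {wer:.3f}")

# Hypothetical WER values chosen only to illustrate the arithmetic.
print(f"Relative improvement: {relative_improvement(0.40, 0.20):.0%}")
```

Note that a relative improvement describes the proportional drop from the baseline, not the absolute difference in WER percentage points.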