skip to main content


Title: Kirigami: Lightweight Speech Filtering for Privacy-Preserving Activity Recognition using Audio
Audio-based human activity recognition (HAR) is very popular because many human activities have unique sound signatures that can be detected using machine learning (ML) approaches. These audio-based ML HAR pipelines often use common featurization techniques, such as extracting various statistical and spectral features by converting time domain signals to the frequency domain (using an FFT) and using them to train ML models. Some of these approaches also claim privacy benefits by preventing the identification of human speech. However, recent deep learning-based automatic speech recognition (ASR) models pose new privacy challenges to these featurization techniques. In this paper, we systematically evaluate various featurization approaches for audio data, assessing their privacy risks through metrics like speech intelligibility (PER and WER) while considering the utility tradeoff in terms of ML-based activity recognition accuracy. Our findings reveal the susceptibility of these approaches to speech content recovery when exposed to recent ASR models, especially under re-tuning or retraining conditions. Notably, fine-tuned ASR models achieved an average Phoneme Error Rate (PER) of 39.99% and Word Error Rate (WER) of 44.43% in speech recognition for these approaches. To overcome these privacy concerns, we propose Kirigami, a lightweight machine learning-based audio speech filter that removes human speech segments reducing the efficacy of ASR models (70.48% PER and 101.40% WER) while also maintaining HAR accuracy (76.0% accuracy). We show that Kirigami can be implemented on common edge microcontrollers with limited computational capabilities and memory, providing a path to deployment on a variety of IoT devices. Finally, we conducted a real-world user study and showed the robustness of Kirigami on a laptop and an ARM Cortex-M4F microcontroller under three different background noises.  more » « less
Award ID(s):
1801472
PAR ID:
10571899
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM IMWUT
Date Published:
Journal Name:
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Volume:
8
Issue:
1
ISSN:
2474-9567
Page Range / eLocation ID:
1 to 28
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The use of audio and video modalities for Human Activity Recognition (HAR) is common, given the richness of the data and the availability of pre-trained ML models using a large corpus of labeled training data. However, audio and video sensors also lead to significant consumer privacy concerns. Researchers have thus explored alternate modalities that are less privacy-invasive such as mmWave doppler radars, IMUs, motion sensors. However, the key limitation of these approaches is that most of them do not readily generalize across environments and require significant in-situ training data. Recent work has proposed cross-modality transfer learning approaches to alleviate the lack of trained labeled data with some success. In this paper, we generalize this concept to create a novel system called VAX (Video/Audio to 'X'), where training labels acquired from existing Video/Audio ML models are used to train ML models for a wide range of 'X' privacy-sensitive sensors. Notably, in VAX, once the ML models for the privacy-sensitive sensors are trained, with little to no user involvement, the Audio/Video sensors can be removed altogether to protect the user's privacy better. We built and deployed VAX in ten participants' homes while they performed 17 common activities of daily living. Our evaluation results show that after training, VAX can use its onboard camera and microphone to detect approximately 15 out of 17 activities with an average accuracy of 90%. For these activities that can be detected using a camera and a microphone, VAX trains a per-home model for the privacy-preserving sensors. These models (average accuracy = 84%) require no in-situ user input. In addition, when VAX is augmented with just one labeled instance for the activities not detected by the VAX A/V pipeline (~2 out of 17), it can detect all 17 activities with an average accuracy of 84%. Our results show that VAX is significantly better than a baseline supervised-learning approach of using one labeled instance per activity in each home (average accuracy of 79%) since VAX reduces the user burden of providing activity labels by 8x (~2 labels vs. 17 labels).

     
    more » « less
  2. Mitrovic, Antonija ; Bosch, Nigel (Ed.)
    Classroom environments are challenging for artificially intelligent agents primarily because classroom noise dilutes the interpretability and usefulness of gathered data. This problem is exacerbated when groups of students participate in collaborative problem solving (CPS). Here, we examine how well six popular microphones capture audio from individual groups. A primary usage of audio data is automatic speech recognition (ASR), therefore we evaluate our recordings by examining the accuracy of downstream ASR using the Google Cloud Platform. We simultaneously captured the audio of all microphones for 11 unique groups of three participants first reading a prepared script, and then participating in a collaborative problem solving exercise. We vary participants, noise conditions, and speech contexts. Transcribed speech was evaluated using word error rate (WER). We find that scripted speech is transcribed with a surprisingly high degree of accuracy across groups (average WER = 0.114, SD = 0.044). However, the CPS task was much more difficult (average WER = 0.570, SD = 0.143). We found most microphones were robust to background noise below a certain threshold, but the AT-Cardioid and ProCon microphones were more robust to higher noise levels. Finally, an analysis of errors revealed that most errors were due to the ASR missing words/phrases, rather than mistranscribing them. We conclude with recommendations based on our observations. 
    more » « less
  3. Speech and language development in children are crucial for ensuring effective skills in their long-term learning ability. A child’s vocabulary size at the time of entry into kindergarten is an early indicator of their learning ability to read and potential long-term success in school. The preschool classroom is thus a promising venue for assessing growth in young children by measuring their interactions with teachers as well as classmates. However, to date limited studies have explored such naturalistic audio communications. Automatic Speech Recognition (ASR) technologies provide an opportunity for ’Early Childhood’ researchers to obtain knowledge through automatic analysis of naturalistic classroom recordings in measuring such interactions. For this purpose, 208 hours of audio recordings across 48 daylong sessions are collected in a childcare learning center in the United States using Language Environment Analysis (LENA) devices worn by the preschool children. Approximately 29 hours of adult speech and 26 hours of child speech is segmented using manual transcriptions provided by CRSS transcription team. Traditional as well as End-to-End ASR models are trained on adult/child speech data subset. Factorized Time Delay Neural Network provides a best Word-Error-Rate (WER) of 35.05% on the adult subset of the test set. End-to-End transformer models achieve 63.5% WER on the child subset of the test data. Next, bar plots demonstrating the frequency of WH-question words in Science vs. Reading activity areas of the preschool are presented for sessions in the test set. It is suggested that learning spaces could be configured to encourage greater adult-child conversational engagement given such speech/audio assessment strategies. 
    more » « less
  4. Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question of how we can build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy remained relatively constant in the video modality at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings. 
    more » « less
  5. Driven by the development of machine learning and the development of wireless techniques, lots of research efforts have been spent on the human activity recognition (HAR). Although various deep learning algorithms can achieve high accuracy for recognizing human activities, existing works lack of a theoretical performance upper bound which is the best accuracy that is only limited by the influencing factors in wireless networks such as indoor physical environments and settings of wireless sensing devices regardless of any HAR algorithm. Without the understanding of performance upper bound, mistakenly configuring the influencing factors can reduce the HAR accuracy drastically no matter what deep learning algorithms are utilized. In this paper, we propose the HAR performance upper bound which is the minimum classification error probability that doesn't depend on any HAR algorithms and can be considered as a function of influencing factors in wireless sensing networks for CSI based human activity recognition. Since the performance upper bound can capture the impacts of influencing factors on HAR accuracy, we further analyze the influences of those factors with varying situations such as through the wall HAR and different human activities by MATLAB simulations. 
    more » « less