
This content will become publicly available on September 27, 2024

Title: VAX: Using Existing Video and Audio-based Activity Recognition Models to Bootstrap Privacy-Sensitive Sensors

The use of audio and video modalities for Human Activity Recognition (HAR) is common, given the richness of the data and the availability of pre-trained ML models built on large corpora of labeled training data. However, audio and video sensors also raise significant consumer privacy concerns. Researchers have thus explored less privacy-invasive alternatives such as mmWave Doppler radars, IMUs, and motion sensors. The key limitation of these approaches is that most do not readily generalize across environments and require significant in-situ training data. Recent work has proposed cross-modality transfer learning to alleviate the lack of labeled training data, with some success. In this paper, we generalize this concept to create a novel system called VAX (Video/Audio to 'X'), in which training labels acquired from existing video/audio ML models are used to train ML models for a wide range of 'X' privacy-sensitive sensors. Notably, once VAX's ML models for the privacy-sensitive sensors are trained, with little to no user involvement, the audio/video sensors can be removed altogether to better protect the user's privacy. We built and deployed VAX in ten participants' homes while they performed 17 common activities of daily living. Our evaluation shows that VAX can use its onboard camera and microphone to detect approximately 15 of the 17 activities with an average accuracy of 90%. For the activities that can be detected using a camera and a microphone, VAX trains a per-home model for the privacy-preserving sensors. These models (average accuracy = 84%) require no in-situ user input. In addition, when VAX is augmented with just one labeled instance for the activities its A/V pipeline misses (~2 out of 17), it can detect all 17 activities with an average accuracy of 84%.
Our results show that VAX is significantly better than a baseline supervised-learning approach of using one labeled instance per activity in each home (average accuracy of 79%) since VAX reduces the user burden of providing activity labels by 8x (~2 labels vs. 17 labels).
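As a toy illustration of the bootstrapping idea (synthetic data only; the activity signatures, the 10% noise rate, and the classifier choice are all assumptions, not the paper's actual pipeline), pseudo-labels produced by an A/V model can train a classifier that afterwards operates on the privacy-sensitive sensor alone:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for a privacy-sensitive "X" sensor (e.g. Doppler/IMU
# feature vectors): 3 activities with distinct signatures. Entirely invented.
n_per_class, n_features = 100, 8
y_true = np.repeat([0, 1, 2], n_per_class)
x_sensor = rng.normal(loc=y_true[:, None] * 2.0, scale=0.5,
                      size=(3 * n_per_class, n_features))

# Pseudo-labels from a (hypothetical) audio/video pipeline: mostly right,
# with roughly 10% label noise to mimic imperfect A/V predictions.
pseudo = y_true.copy()
flip = rng.random(pseudo.size) < 0.10
pseudo[flip] = rng.integers(0, 3, flip.sum())

# Train the sensor-only model on pseudo-labels; no human annotation needed.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(x_sensor, pseudo)

# Evaluate on fresh sensor data against ground truth the user never provided.
y_test = np.repeat([0, 1, 2], 30)
x_test = rng.normal(loc=y_test[:, None] * 2.0, scale=0.5, size=(90, n_features))
acc = (clf.predict(x_test) == y_test).mean()
print(f"sensor-only accuracy vs. ground truth: {acc:.2f}")
```

Once the sensor-only model has been fit to the noisy A/V pseudo-labels, the camera and microphone can be removed; the trained model keeps working from the 'X' sensor's features alone.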

Journal Name: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Page Range / eLocation ID: 1 to 24
Sponsoring Org: National Science Foundation
More Like this
  1. Video conferencing apps (VCAs) turn previously private spaces, such as bedrooms, living rooms, and kitchens, into semi-public extensions of the office. For the most part, users have accepted these apps in their personal space without much thought about the permission models that govern the use of their private data during meetings. While access to a device's video camera is carefully controlled, little has been done to ensure the same level of privacy for the microphone. In this work, we ask: what happens to the microphone data when a user clicks the mute button in a VCA? We first conduct a user study to analyze users' understanding of the mute button's permission model. Then, using runtime binary analysis tools, we trace raw audio flow in many popular VCAs as it traverses the app from the audio driver to the network. We find fragmented policies for handling microphone data among VCAs: some continuously monitor the microphone input during mute, and others do so periodically. One app transmits statistics of the audio to its telemetry servers while the app is muted. Using network traffic intercepted en route to the telemetry server, we implement a proof-of-concept background activity classifier and demonstrate the feasibility of inferring the ongoing background activity during a meeting (cooking, cleaning, typing, etc.). We achieve 81.9% macro accuracy in identifying six common background activities from intercepted outgoing telemetry packets while a user is muted.
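Macro accuracy, the metric reported above, is the unweighted mean of per-class recall. A minimal sketch with simulated labels (the six activity classes and the 18% error rate are illustrative assumptions, not the paper's data):

```python
import numpy as np
from sklearn.metrics import recall_score

rng = np.random.default_rng(1)

# Simulated ground truth for six background activities (e.g. cooking,
# cleaning, typing, music, talking, silence) inferred while muted.
y_true = rng.integers(0, 6, 600)

# Simulate a classifier that is right ~82% of the time; wrong predictions
# are shifted to a different class so they are guaranteed incorrect.
y_pred = y_true.copy()
wrong = rng.random(600) < 0.18
y_pred[wrong] = (y_true[wrong] + rng.integers(1, 6, wrong.sum())) % 6

# Macro accuracy = unweighted mean of per-class recall, so rare classes
# count as much as frequent ones.
macro_acc = recall_score(y_true, y_pred, average="macro")
print(f"macro accuracy: {macro_acc:.3f}")
```

Averaging recall per class (rather than overall hit rate) prevents a classifier that only gets the most common activity right from scoring well.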
  2. Abstract

    Objective. Recent advances in neural decoding have accelerated the development of brain–computer interfaces aimed at assisting users with everyday tasks such as speaking, walking, and manipulating objects. However, current approaches for training neural decoders commonly require large quantities of labeled data, which can be laborious or infeasible to obtain in real-world settings. Alternatively, self-supervised models that share self-generated pseudo-labels between two data streams have shown exceptional performance on unlabeled audio and video data, but it remains unclear how well they extend to neural decoding. Approach. We learn neural decoders without labels by leveraging multiple simultaneously recorded data streams, including neural, kinematic, and physiological signals. Specifically, we apply cross-modal, self-supervised deep clustering to train decoders that can classify movements from brain recordings. After training, we then isolate the decoders for each input data stream and compare the accuracy of decoders trained using cross-modal deep clustering against supervised and unimodal, self-supervised models. Main results. We find that sharing pseudo-labels between two data streams during training substantially increases decoding performance compared to unimodal, self-supervised models, with accuracies approaching those of supervised decoders trained on labeled data. Next, we extend cross-modal decoder training to three or more modalities, achieving state-of-the-art neural decoding accuracy that matches or slightly exceeds the performance of supervised models. Significance. We demonstrate that cross-modal, self-supervised decoding can be applied to train neural decoders when few or no labels are available, and we extend the cross-modal framework to share information among three or more data streams, further improving self-supervised training.
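A heavily simplified sketch of cross-modal pseudo-label sharing, using k-means in place of deep clustering and synthetic Gaussian streams (all data and model choices here are assumptions, not the paper's architecture):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two synthetic, simultaneously recorded streams driven by 3 movements:
# "kinematic" features and "neural" features (both invented Gaussians).
n, k = 300, 3
movement = np.repeat(np.arange(k), n // k)
kinematic = rng.normal(movement[:, None] * 3.0, 0.5, size=(n, 4))
neural = rng.normal(movement[:, None] * 3.0, 0.8, size=(n, 16))

# Step 1: cluster one stream to get pseudo-labels, with no human labels.
pseudo = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(kinematic)

# Step 2: share those pseudo-labels across modalities and train a decoder
# that classifies movements from the neural features alone.
decoder = LogisticRegression(max_iter=1000).fit(neural, pseudo)

# Cluster IDs are an arbitrary permutation of the true movement labels, so
# score with a permutation-invariant metric.
ari = adjusted_rand_score(movement, decoder.predict(neural))
print(f"decoder vs. true movements (ARI): {ari:.2f}")
```

After training, the kinematic stream can be discarded and the decoder evaluated on brain recordings alone, mirroring the isolation step described above.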

  3.
    Transitioning from one activity to another is one of the key challenges of prosthetic control. Vision sensors provide a glance into the environment's desired and future movements, unlike body sensors (EMG, mechanical). This could be employed to anticipate and trigger transitions in a prosthesis to provide a smooth user experience. A significant bottleneck in using vision sensors has been the acquisition of large labeled training data. Labeling the terrain in thousands of images is labor-intensive; it would be ideal to simply collect visual data for long periods without needing to label each frame. Toward that goal, we apply an unsupervised learning method to generate mode labels for kinematic gait cycles in training data. We use these labels with images from the same training data to train a vision classifier. The classifier predicts the target mode an average of 2.2 seconds before the kinematic changes. We report 96.6% overall and 99.5% steady-state mode classification accuracy. These results are comparable to studies using manually labeled data. This method, however, has the potential to dramatically scale without requiring additional labeling.
  4. Most attention in K-12 artificial intelligence and machine learning (AI/ML) education has been given to having youths train models, with much less attention to the equally important testing of models when creating machine learning applications. Testing ML applications allows creators to evaluate models against predictions and to identify and address failure and edge cases that could negatively impact user experiences. We investigate how testing each other's projects helped youths take perspective on functionality, performance, and potential issues in their own projects. We analyzed testing worksheets and audio and video recordings collected during a two-week workshop in which 11 high school youths created physical computing projects that included (audio, pose, and image) ML classifiers. We found that through peer-testing, youths reflected on the size of their training datasets, the diversity of their training data, the design of their classes, and the contexts in which they produced training data. We discuss future directions for research on peer-testing in AI/ML education and current limitations of these kinds of activities.
  5. Precise and expressive label information is fundamental for interpreting underlying data distributions distinctively and for adequately training supervised and semi-supervised learning models. But obtaining large amounts of labeled data demands substantial manual effort. This burden can be mitigated by acquiring labels for only the most informative data instances using Active Learning. However, labels received from humans are not always reliable, which poses the risk of introducing noisy class labels that degrade a model's efficacy instead of improving it. In this paper, we address the problem of annotating sensor data instances of various Activities of Daily Living (ADLs) in a smart home context. We exploit the interactions between users and annotators in terms of relationships spanning spatial and temporal space, which also account for an activity. We propose a novel annotator selection model, SocialAnnotator, which exploits the interactions between users and annotators and ranks the annotators based on their level of correspondence. We also introduce a novel approach to measure this correspondence distance using the spatial and temporal information of interactions, the type of the relationships, and the activities. We validate our proposed SocialAnnotator framework in smart environments, achieving ≈84% statistical confidence in data annotation.
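A hypothetical sketch of ranking annotators by a correspondence distance built from spatial, temporal, and relationship terms; the weights, the log normalization, and the annotator entries are all invented for illustration and are not the paper's actual measure:

```python
import math

# Invented annotator profiles: spatial distance (m) and temporal offset (min)
# of their interactions with the user, plus a relationship weight where
# smaller means a closer relationship. All values are illustrative.
annotators = {
    "family_member": {"spatial": 2.0, "temporal": 5.0, "rel_weight": 0.2},
    "neighbor": {"spatial": 15.0, "temporal": 60.0, "rel_weight": 0.5},
    "remote_worker": {"spatial": 500.0, "temporal": 480.0, "rel_weight": 0.9},
}

def correspondence_distance(a, alpha=0.5, beta=0.3, gamma=0.2):
    """Smaller = closer correspondence with the user. The weighted sum and
    the log1p normalization are illustrative choices, not the paper's."""
    return (alpha * math.log1p(a["spatial"])
            + beta * math.log1p(a["temporal"])
            + gamma * a["rel_weight"])

# Rank annotators: the best candidates for labeling come first.
ranked = sorted(annotators, key=lambda name: correspondence_distance(annotators[name]))
print(ranked)
```

Ranking by such a distance lets an active-learning loop request labels from the annotators most likely to be reliable for a given user and activity.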