The ubiquity of smart and wearable devices with integrated acoustic sensors presents tremendous opportunities for recognizing human activities in our living spaces through ML-driven applications. However, their adoption is often hindered by the large amounts of labeled data required during model training. Integrating contextual metadata has the potential to alleviate this, since such metadata is often less dynamic (e.g., both cleaning dishes and cooking happen in the kitchen context) and can be annotated far less tediously (a sensor is always placed in the kitchen). However, most models lack good provisions for integrating such metadata; often it is leveraged through multi-task learning, with sub-optimal outcomes. Reliably distinguishing in-home activities with similar acoustic patterns (e.g., chopping, hammering, knife sharpening) poses another set of challenges. To mitigate these challenges, we first show in a preliminary study that room acoustic properties such as reverberation, room materials, and background noise leave a discernible fingerprint in audio samples that can identify the room context, and we propose AcouDL, a unified framework that exploits room-context information to improve activity recognition. Our self-supervised approach first learns the context features of the activities from a large amount of unlabeled data using a contrastive learning mechanism and then fuses these features into the activity classification pipeline with a novel attention mechanism. Extensive evaluation of AcouDL on three datasets covering a wide range of activities shows that this feature-fusion mechanism incorporates metadata effectively, improving recognition under challenging classification scenarios by 0.7-3.5% macro F1 score over the baselines.
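The abstract above describes two ingredients: a self-supervised context encoder trained with a contrastive objective on unlabeled audio, and an attention mechanism that fuses the resulting room-context feature into the activity classifier. The PyTorch sketch below illustrates that general pattern under our own assumptions; the module names, network sizes, and loss form (NT-Xent) are hypothetical and not taken from the AcouDL paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextEncoder(nn.Module):
    """Maps a log-mel spectrogram (B, 1, mels, frames) to a room-context embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def nt_xent(z1, z2, tau=0.1):
    """Contrastive (NT-Xent) loss: two augmented views of the same clip attract."""
    z = torch.cat([z1, z2], dim=0)                 # (2B, D), unit-normalized
    sim = z @ z.t() / tau                          # pairwise similarities
    sim.fill_diagonal_(float('-inf'))              # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

class AttentionFusionClassifier(nn.Module):
    """Activity classifier that attends over the (frozen) context embedding."""
    def __init__(self, dim=128, n_classes=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, act_feat, ctx_feat):
        # act_feat: (B, T, D) activity frames; ctx_feat: (B, D) room context.
        ctx = ctx_feat.unsqueeze(1)
        fused, _ = self.attn(act_feat, ctx, ctx)
        return self.head((act_feat + fused).mean(dim=1))
```

In this sketch, pre-training pushes two augmented views of each unlabeled clip through ContextEncoder and minimizes nt_xent; the encoder is then frozen, and its embedding is attended over by AttentionFusionClassifier during supervised activity training.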
SoHAM: A Sound-Based Human Activity Monitoring Framework for Home Service Robots
Monitoring daily activities is essential for home service robots that care for older adults living alone in their homes. In this article, we propose a sound-based human activity monitoring (SoHAM) framework that recognizes sound events in a home environment. First, a method for context-aware sound event recognition (CoSER) is developed, which uses contextual information to disambiguate sound events. The locational context of sound events is estimated by fusing data from the distributed passive infrared (PIR) sensors deployed in the home. A two-level dynamic Bayesian network (DBN) models the intratemporal and intertemporal constraints between the context and the sound events. Second, dynamic sliding time window-based human action recognition (DTW-HaR) is developed to estimate active sound event segments with their labels and durations, and then infer actions and their durations. Finally, a conditional random field (CRF) model is proposed to predict human activities based on the recognized action, location, and time. We conducted experiments in our robot-integrated smart home (RiSH) testbed to evaluate the proposed framework. The results show the effectiveness and accuracy of CoSER, action recognition, and human activity monitoring.
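The final stage above fuses recognized action, location, and time into an activity label with a chain-structured model (a CRF in the paper). As an illustration only, the sketch below decodes the most likely activity sequence with Viterbi over per-step scores; the scores and transition matrix are toy values, not the paper's learned CRF parameters.

```python
import numpy as np

def viterbi(emission, transition):
    """Chain-structured decoding: emission[t, s] scores activity s at step t,
    transition[s, s'] scores moving from s to s' (both in the log domain)."""
    T, S = emission.shape
    score = emission[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition + emission[t][None, :]  # (prev, cur)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: 3 activities scored from (action, location, time-of-day) features.
emission = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.6, 0.3, 0.1],
                            [0.1, 0.2, 0.7]]))
transition = np.log(np.array([[0.8, 0.1, 0.1],
                              [0.1, 0.8, 0.1],
                              [0.1, 0.1, 0.8]]))
print(viterbi(emission, transition))  # most likely activity sequence
```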
- PAR ID: 10293409
- Journal Name: IEEE Transactions on Automation Science and Engineering
- ISSN: 1545-5955
- Page Range / eLocation ID: 1 to 15
- Sponsoring Org: National Science Foundation
More Like this
-
Emerging wearable sensors have enabled the unprecedented ability to continuously monitor human activities for healthcare purposes. However, with many ambient sensors collecting different measurements, it becomes important to maintain not only good monitoring accuracy but also low power consumption to ensure sustainable monitoring. Such a power-efficient sensing scheme can be achieved by deciding which group of sensors to use at a given time, which requires an accurate characterization of the trade-off between sensor energy usage and the uncertainty introduced by ignoring certain sensor signals while monitoring. To address this challenge in the context of activity monitoring, we have designed an adaptive activity monitoring framework. We first propose a switching Gaussian process to model the observed sensor signals emitted by the underlying activity states. To efficiently compute the Gaussian process model likelihood and quantify the context-prediction uncertainty, we propose a block circulant embedding technique and use Fast Fourier Transforms (FFT) for inference. By computing a Bayesian loss function tailored to switching Gaussian processes, an adaptive monitoring procedure is developed to select features from available sensors that optimize the trade-off between sensor power consumption and prediction performance, quantified by state prediction entropy. We demonstrate the effectiveness of our framework on the popular UCI Human Activity Recognition Using Smartphones benchmark.
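The key computational trick mentioned above is that, for stationary covariances on a regular time grid, a circulant embedding makes the covariance eigenvalues available from an FFT, so the Gaussian-process likelihood can be evaluated in O(n log n). The NumPy sketch below shows that idea for a single 1-D signal, approximating the Toeplitz covariance by its circulant extension; it is a simplified stand-in for the block circulant embedding in the paper, and the kernel and noise values are illustrative.

```python
import numpy as np

def circulant_loglik(y, kernel_fn, noise_var=1e-2):
    """Approximate GP log-likelihood on a regular 1-D grid by replacing the
    Toeplitz covariance with its circulant extension, whose eigenvalues are
    the FFT of its first column."""
    n = len(y)
    lags = np.minimum(np.arange(n), n - np.arange(n))   # wrapped-around lags
    c = kernel_fn(lags) + noise_var * (lags == 0)       # first column of circulant C
    eigvals = np.real(np.fft.fft(c))                    # eigenvalues of C
    eigvals = np.maximum(eigvals, 1e-10)                # guard against round-off
    y_hat = np.fft.fft(y)
    quad = np.sum(np.abs(y_hat) ** 2 / eigvals) / n     # y^T C^{-1} y via FFT
    logdet = np.sum(np.log(eigvals))
    return -0.5 * (quad + logdet + n * np.log(2 * np.pi))

rbf = lambda d: np.exp(-0.5 * (d / 5.0) ** 2)           # illustrative kernel
signal = np.sin(np.linspace(0, 6 * np.pi, 256)) + 0.1 * np.random.randn(256)
print(circulant_loglik(signal, rbf))
```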
-
In clinical settings, most automatic recognition systems use visual or sensory data to recognize activities. These systems cannot recognize activities that rely on verbal assessment, lack visual cues, or do not use medical devices. We examined speech-based activity and activity-stage recognition in a clinical domain, making the following contributions. (1) We collected a high-quality dataset representing common activities and activity stages during actual trauma resuscitation events: the initial evaluation and treatment of critically injured patients. (2) We introduced a novel multimodal network based on the audio signal and a set of keywords that does not require a high-performing automatic speech recognition (ASR) engine. (3) We designed novel contextual modules to capture dynamic dependencies in team conversations about activities and stages during a complex workflow. (4) We introduced a data augmentation method that simulates team communication by combining selected utterances and their audio clips, and showed that this method improved performance in our data-limited scenario. In offline experiments, our proposed context-aware multimodal model achieved F1-scores of 73.2±0.8% and 78.1±1.1% for activity and activity-stage recognition, respectively. In online experiments, performance declined by about 10% for both recognition types when using utterance-level segmentation of the ASR output, and by about 15% when utterance-level segmentation was omitted. Our experiments showed the feasibility of speech-based activity and activity-stage recognition during dynamic clinical events.
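Contribution (4) above is an augmentation scheme that simulates team communication by combining selected utterances with their audio clips. A minimal sketch of that idea, with hypothetical field names and toy data, might look like this:

```python
import random
import numpy as np

def simulate_team_utterance(utterances, k=3):
    """Build one synthetic training sample by combining k utterances from the
    same activity: concatenate their audio and merge their keyword sets."""
    picked = random.sample(utterances, k)
    audio = np.concatenate([u["audio"] for u in picked])               # 1-D waveforms
    keywords = sorted(set().union(*[u["keywords"] for u in picked]))
    return {"audio": audio, "keywords": keywords, "activity": picked[0]["activity"]}

# Toy utterances sampled at 16 kHz, all from the same (hypothetical) activity.
utts = [
    {"audio": np.random.randn(16000), "keywords": {"airway", "clear"}, "activity": "airway-assessment"},
    {"audio": np.random.randn(8000),  "keywords": {"breath", "sounds"}, "activity": "airway-assessment"},
    {"audio": np.random.randn(12000), "keywords": {"intubate"},         "activity": "airway-assessment"},
]
print(simulate_team_utterance(utts)["keywords"])
```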
-
Activity recognition has applications in a variety of human-in-the-loop settings such as smart home health monitoring, green building energy and occupancy management, intelligent transportation, and participatory sensing. While fine-grained activity recognition systems and approaches enable a multitude of novel applications, discovering activities with non-intrusive ambient sensor systems poses challenging design, data processing, mining, and recognition issues. In this paper, we develop a low-cost heterogeneous Radar based Activity Monitoring (RAM) system for recognizing fine-grained activities. We explore the feasibility of using an array of heterogeneous micro-Doppler radars to recognize low-level activities. We prototype a short-range and a long-range radar system and evaluate the feasibility of using the system for fine-grained activity recognition. In our evaluation on real data traces, we show that our system can detect fine-grained user activities with 92.84% accuracy.
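The abstract does not spell out the recognition pipeline, but a common front end for micro-Doppler radar activity recognition is a spectrogram of the radar return fed to a standard classifier. The sketch below is that generic pipeline on synthetic signals, not the RAM system's actual implementation:

```python
import numpy as np
from scipy import signal
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def micro_doppler_features(iq, fs=1000, nperseg=128):
    """Turn a radar I/Q time series into a flattened micro-Doppler spectrogram."""
    _, _, Sxx = signal.spectrogram(iq, fs=fs, nperseg=nperseg,
                                   noverlap=nperseg // 2, return_onesided=False)
    return 10 * np.log10(np.abs(Sxx) + 1e-12).ravel()

# Toy example: two synthetic "activities" with different Doppler signatures.
rng = np.random.default_rng(0)
f0s = [50] * 20 + [200] * 20
X = [micro_doppler_features(np.exp(2j * np.pi * f0 * np.arange(4096) / 1000)
                            + 0.1 * rng.standard_normal(4096)) for f0 in f0s]
y = [0] * 20 + [1] * 20

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)
print(clf.score(Xte, yte))
```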
-
Video-language models (VLMs), large models pre-trained on numerous but noisy video-text pairs from the internet, have revolutionized activity recognition through their remarkable generalization and open-vocabulary capabilities. While complex human activities are often hierarchical and compositional, most existing tasks for evaluating VLMs focus only on high-level video understanding, making it difficult to accurately assess and interpret the ability of VLMs to understand complex and fine-grained human activities. Inspired by the recently proposed MOMA framework, we define activity graphs as a single universal representation of human activities that encompasses video understanding at the activity, sub-activity, and atomic action levels. We redefine activity parsing as the overarching task of activity graph generation, requiring understanding of human activities across all three levels. To facilitate the evaluation of models on activity parsing, we introduce MOMA-LRG (Multi-Object Multi-Actor Language-Refined Graphs), a large dataset of complex human activities with activity graph annotations that can be readily transformed into natural language sentences. Lastly, we present a model-agnostic and lightweight approach to adapting and evaluating VLMs by incorporating structured knowledge from activity graphs into VLMs, addressing the individual limitations of language and graphical models. We demonstrate strong performance on activity parsing and few-shot video classification, and our framework is intended to foster future research in the joint modeling of videos, graphs, and language.
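An activity graph, as described above, spans three levels: activity, sub-activity, and atomic action. The sketch below is one plausible in-memory representation with a helper that flattens a graph into a natural-language sentence; the field names are hypothetical and do not reflect the actual MOMA-LRG schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AtomicAction:
    verb: str                      # e.g. "pick up"
    actor: str                     # e.g. "person_1"
    objects: List[str] = field(default_factory=list)

@dataclass
class SubActivity:
    label: str
    start: float                   # seconds
    end: float
    actions: List[AtomicAction] = field(default_factory=list)

@dataclass
class Activity:
    label: str
    sub_activities: List[SubActivity] = field(default_factory=list)

    def to_sentence(self) -> str:
        """Flatten the three-level graph into a natural-language description."""
        parts = [f"{a.actor} {a.verb} {' and '.join(a.objects)}"
                 for s in self.sub_activities for a in s.actions]
        return f"{self.label}: " + "; ".join(parts)

cooking = Activity("cooking", [
    SubActivity("prepare ingredients", 0.0, 12.5, [
        AtomicAction("pick up", "person_1", ["knife"]),
        AtomicAction("chop", "person_1", ["onion"]),
    ]),
])
print(cooking.to_sentence())
```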