

Search for: All records

Creators/Authors contains: "Roy, Nirmalya"


  1. Intelligent human motion analysis is essential for developing next-generation IoT and AR/VR systems that enable automated, interpretable, and fine-grained performance assessment. Motivated by the need for real-time, explainable, and transferable skill evaluation, we propose a wearable sensing framework to assess human performance by tracking skill progression and minimizing injury risk. We use live badminton gameplay and workout exercises as representative use cases, where motion dynamics, postural stability, and limb coordination are critical to success. Both activities demand optimal posture and synchronized limb movements, while improper actions or suboptimal technique can lead to decreased performance and higher injury susceptibility. We introduce SkillNet, a multi-task learning framework that extracts shared representations across all limbs while preserving limb-specific motion signatures. The architecture employs task-specific regressors to detect subtle inter-limb dissimilarities and distinctive traits, enabling collective inference in a body sensor network (BSN) environment. To measure performance holistically, we formulate a weighted performance indicator (PI) that fuses AI-driven scoring with domain-expert evaluations, providing a robust metric for both qualitative and quantitative assessment. We evaluate SkillNet on three diverse datasets, Badminton Activity Recognition (BAR), Multi-Modalities Dataset of Sports (MMDOS), and Daily and Sports Activities (DSADS), which capture a broad spectrum of motion types and skill intensities. Results show that SkillNet achieves an R² score of 86% and a mean squared error of 0.0093 in performance prediction. The integrated AI–expert scoring mechanism improves baseline performance estimation by 14.95%, demonstrating the advantage of combining human expertise with automated analysis. We further benchmark the inference time, memory usage, and power consumption of SkillNet, validating its efficiency and feasibility for real-time, end-to-end task inference on resource-constrained embedded edge devices, namely the Jetson Nano and Jetson Xavier NX platforms.
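The abstract above does not give the formula for the weighted performance indicator, so the following is only a minimal sketch of one plausible form: a convex combination of a model score and an expert score. The function name `performance_indicator`, the `ai_weight` parameter, and the assumption that both scores are normalized to [0, 1] are hypothetical and not taken from the paper.

```python
def performance_indicator(ai_score, expert_score, ai_weight=0.5):
    """Fuse a model score and an expert score into one indicator.

    Assumption: both scores are already normalized to [0, 1], and the
    fusion is a simple convex combination. The paper's actual weighting
    scheme may differ.
    """
    if not 0.0 <= ai_weight <= 1.0:
        raise ValueError("ai_weight must lie in [0, 1]")
    return ai_weight * ai_score + (1.0 - ai_weight) * expert_score


if __name__ == "__main__":
    # Equal weighting of a model score of 0.8 and an expert score of 0.6.
    print(performance_indicator(0.8, 0.6, ai_weight=0.5))
```

With `ai_weight=1.0` the indicator reduces to the model score alone, and with `ai_weight=0.0` to the expert score alone, which makes the weight easy to interpret.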
  2. Developing scalable wearable human activity recognition (wHAR) models is challenging due to domain shifts that substantially degrade performance across downstream tasks. Unsupervised domain adaptation (UDA) seeks to improve generalization by transferring knowledge from labeled source domains to unlabeled target domains. However, conventional UDA methods primarily align marginal feature distributions while neglecting feature-label dependencies, often leading to negative transfer and sub-optimal performance. Motivated by these limitations, we propose CoDAN, a novel optimization framework that tackles two key challenges: (i) generating reliable pseudo-labels for the unlabeled target domain and (ii) minimizing conditional discrepancies across domains. To address (i), we employ temperature-based entropy minimization (TEM), which calibrates prediction confidence by scaling logits with a temperature parameter to produce robust pseudo-labels. For (ii), we introduce a polynomial kernel-based cross-covariance (PkCC) loss, a high-order statistics-driven approach that maps features into a reproducing kernel Hilbert space (RKHS) to capture richer feature-label dependencies and reduce conditional distribution gaps between domains. In addition, we demonstrate that CoDAN readily extends to partial UDA (pUDA), where the target label space is a subset of the source label space. Extensive evaluations on public wHAR datasets with diverse label spaces validate its superior performance over state-of-the-art methods in both UDA and pUDA scenarios.
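To illustrate the temperature scaling described in the abstract above, here is a minimal sketch of a temperature-scaled softmax, the entropy of the resulting distribution, and a confidence-thresholded pseudo-label. It is not the authors' implementation: the function names, the `min_confidence` threshold, and the example logits are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T < 1 sharpens, T > 1 flattens)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; this is the quantity entropy minimization drives down."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pseudo_label(logits, temperature=1.0, min_confidence=0.9):
    """Return the predicted class index, or None if confidence is below the threshold.

    The threshold rule is an assumption for this sketch, not a detail from the paper.
    """
    probs = softmax_with_temperature(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= min_confidence else None


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]
    for t in (0.25, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, round(entropy(probs), 4), pseudo_label(logits, t))
```

Lowering the temperature concentrates probability mass on the top class, so the same logits can pass the confidence threshold at a low temperature and fail it at a higher one.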
  3. Facial video recordings simultaneously encode respiratory rate (RR) and heart rate (HR) signals in temporal pixel intensity variations, yet their concurrent representation remains underexplored. This study introduces a physics-inspired framework to model the interplay between pixel intensity shifts, driven by respiratory motion, and intensity variations, induced by cardiac diffusion signals. We present a simple yet effective mathematical model characterizing the coexistence of RR and HR signals in temporal pixel dynamics, offering robust criteria for signal extraction and artifact identification. Additionally, we develop a toolbox for estimating spatially localized HR and RR signals, enabling the identification of regions with the strongest physiological information. Validated on three real-world facial video datasets with diverse modalities, our framework quantifies signal presence, strength, and spatiotemporal distribution, enhancing the interpretability of physiological signal extraction. This work advances contactless healthcare applications by optimizing simultaneous RR and HR estimation while providing insights into artifact sources and signal quality.
  4. In network-constrained environments, distributed multi-agent systems, such as UGVs and UAVs, must communicate effectively to support computationally demanding scene perception tasks like semantic and instance segmentation. These tasks are challenging because they require high accuracy even when using low-quality images, and network limitations restrict the amount of data that can be transmitted between agents. To overcome these challenges, we propose TAVIC-DAS, which performs task- and channel-aware variable-rate image compression to enable distributed task execution and minimize communication latency by transmitting compressed images. TAVIC-DAS introduces a novel image compression and decompression framework (distributed across agents) that integrates channel parameters such as RSSI and data rate into a task-specific "semantic segmentation" DNN. The DNN generates masks representing the objects of interest in the scene (ROI maps), assigning a high pixel density to objects of interest and a low density to the surrounding pixels within an image. Additionally, to accommodate agents with limited computational resources, TAVIC-DAS incorporates resource-aware model quantization. We evaluated TAVIC-DAS on platforms such as ROSMaster X3 and Jetson Xavier, which communicated using a low-frequency proprietary Doodle radio operating at 915 MHz. The experimental results show that TAVIC-DAS achieves approximately 7.62% higher PSNR and is about 6.39% more resource efficient compared to state-of-the-art techniques.
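Since the abstract above reports its image-quality gain in PSNR, a short sketch of the standard peak signal-to-noise ratio may help: PSNR = 10 · log10(MAX² / MSE). The flat pixel lists and the default peak value of 255 (8-bit images) are assumptions for this example; the paper's evaluation pipeline is not described in the abstract.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if not reference or len(reference) != len(reconstructed):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [100, 150, 200, 250]
    decoded = [101, 149, 201, 249]  # every pixel is off by one, so MSE = 1
    print(round(psnr(original, decoded), 2))
```

Because PSNR is logarithmic, a percentage improvement in PSNR (as quoted in the abstract) corresponds to a multiplicative reduction in mean squared error rather than a linear one.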
  5. Integrating multimodal data such as RGB and LiDAR from multiple views significantly increases computational and communication demands, which can be challenging for resource-constrained autonomous agents that must meet the time-critical deadlines of mission-critical applications. To address this challenge, we propose CoOpTex, a collaborative task execution framework designed for cooperative perception in distributed autonomous systems (DAS). CoOpTex's contribution is twofold: (a) CoOpTex fuses multiview RGB images to create a panoramic camera view for 2D object detection and utilizes 360° LiDAR for 3D object detection, improving accuracy with a lightweight Graph Neural Network (GNN) that integrates object coordinates from both perspectives, and (b) to optimize task execution and meet the deadline, CoOpTex dynamically offloads computationally intensive image stitching tasks to auxiliary devices when available and adjusts capture rates for RGB frames based on device mobility and processing capabilities. We implement CoOpTex in real time on static and mobile heterogeneous autonomous agents, where it reduces deadline violations by 100% and improves frame rates for 2D detection by 2.2 times in stationary and 2 times in mobile conditions, demonstrating its effectiveness in enabling real-time cooperative perception.
  6. Liang, Xuefeng (Ed.)
    Deep learning has achieved state-of-the-art video action recognition (VAR) performance by comprehending action-related features from raw video. However, these models often learn to jointly encode auxiliary view information (viewpoints and sensor properties) with primary action features, leading to performance degradation under novel views and to security concerns, because the embeddings reveal sensor types and locations. Here, we systematically study these shortcomings of VAR models and develop a novel approach, VIVAR, to learn view-invariant spatiotemporal action features with view information removed. In particular, we leverage contrastive learning to separate actions and jointly optimize an adversarial loss that aligns view distributions, removing auxiliary view information in the deep embedding space, and we use unlabeled synchronous multiview (MV) video to learn a view-invariant VAR system. We evaluate VIVAR using our in-house large-scale time-synchronous MV video dataset containing 10 actions with three angular viewpoints and sensors in diverse environments. VIVAR successfully captures view-invariant action features, improves the quality of inter- and intra-action clusters, and consistently outperforms SoTA models with 8% higher accuracy. We additionally perform extensive studies with our datasets, model architectures, multiple contrastive learning methods, and view distribution alignments to provide insights into VIVAR. We open-source our code and dataset to facilitate further research in view-invariant systems.
  7. Monitoring respiratory rate (RR) is essential for early identification of respiratory and metabolic abnormalities. However, the limitations of contact-based sensors and the lack of reliability in many contactless methods make continuous and accurate monitoring difficult in non-clinical settings. To address these challenges, we introduce RespFormer, an edge-optimized, motion-guided temporal-frequency multimodal fusion transformer framework for real-time, contactless RR estimation and breathing pattern classification. RespFormer integrates dense optical flow analysis with temporal, statistical, and frequency-domain features derived from video sequences and enhances them through a multi-stage signal processing pipeline. These features are modeled using an ensemble of three time series transformer architectures (ETSformer, Temporal Fusion Transformer, and Informer) to capture distinct aspects of temporal dynamics. A shared attention-based refinement module enhances the feature representations, and final predictions are fused using a stacking-based meta-learner. We validate RespFormer on a multimodal dataset comprising synchronized RGB, NIR, and IR video data, including a custom in-house dataset captured under various conditions. Experimental results demonstrate that RespFormer achieves a mean absolute error (MAE) ≈ 0.98 bpm, improving prediction accuracy by ≈ 11% and reducing memory usage by ≈ 26%, while maintaining real-time inference (≈ 1.22 seconds) on resource-constrained devices. Furthermore, RespFormer accurately classifies breathing patterns (normal, bradypnea, tachypnea, and apnea) with 95% accuracy, underscoring its potential for practical application in telemedicine, clinical screening, and low-resource healthcare settings.
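As a rough illustration of the kind of frequency-domain feature mentioned in the abstract above, the sketch below estimates a breathing rate from the dominant spectral peak of a one-dimensional motion signal. It is a classical baseline, not RespFormer: the function name, the 0.1 to 0.7 Hz search band, and the synthetic test signal are all assumptions made for this example.

```python
import math


def dominant_rate_bpm(signal, fs, low_hz=0.1, high_hz=0.7):
    """Estimate a rate in breaths per minute from the strongest DFT bin in a band.

    signal: evenly sampled values (for example, mean vertical optical flow per frame)
    fs: sampling rate in Hz
    low_hz, high_hz: assumed search band for plausible breathing frequencies
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC component
    best_freq, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        if not low_hz <= freq <= high_hz:
            continue
        # Naive DFT of bin k; adequate for short windows in a sketch.
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    if best_freq is None:
        raise ValueError("no DFT bin falls inside the search band")
    return 60.0 * best_freq


if __name__ == "__main__":
    fs = 10.0  # Hz, assumed frame rate of the motion signal
    # 60 seconds of a synthetic 0.25 Hz (15 breaths per minute) oscillation.
    samples = [math.sin(2 * math.pi * 0.25 * i / fs) for i in range(600)]
    print(dominant_rate_bpm(samples, fs))
```

The frequency resolution is `fs / n`, so a 60-second window resolves rates in steps of 1 breath per minute; shorter windows trade resolution for latency.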
  8. Recent advancements in deep learning-based wearable human action recognition (wHAR) have improved the capture and classification of complex motions, but adoption remains limited due to the lack of expert annotations and domain discrepancies from user variations. Limited annotations hinder the model's ability to generalize to out-of-distribution samples. While data augmentation can improve generalizability, unsupervised augmentation techniques must be applied carefully to avoid introducing noise. Unsupervised domain adaptation (UDA) addresses domain discrepancies by aligning conditional distributions with labeled target samples, but vanilla pseudo-labeling can lead to error propagation. To address these challenges, we propose μDAR, a novel joint optimization architecture comprising three functions: (i) a consistency regularizer between augmented samples to improve model classification generalizability, (ii) a temporal ensemble for robust pseudo-label generation, and (iii) conditional distribution alignment to improve domain generalizability. The temporal ensemble aggregates predictions from past epochs to smooth out noisy pseudo-label predictions. These pseudo-labels are then used in the conditional distribution alignment module to minimize the kernel-based class-wise conditional maximum mean discrepancy (kCMMD) between the source and target feature spaces and thereby learn a domain-invariant embedding. The consistency-regularized augmentations ensure that multiple augmentations of the same sample share the same labels; this results in (a) strong generalization with limited source domain samples and (b) consistent pseudo-label generation in target samples. The novel integration of these three modules in μDAR yields an average macro-F1 score improvement of approximately 4-12% over six state-of-the-art UDA methods on four benchmark wHAR datasets.
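The temporal ensemble described in the abstract above aggregates predictions from past epochs to smooth noisy pseudo-labels. The abstract does not specify the aggregation rule, so the sketch below assumes a bias-corrected exponential moving average of per-sample class probabilities, which is one common way to do this. The class name `TemporalEnsemble`, the `momentum` value, and the example predictions are hypothetical.

```python
class TemporalEnsemble:
    """Running average of one sample's class probabilities across epochs.

    Assumption: an exponential moving average with startup bias correction.
    The aggregation used in the paper may differ.
    """

    def __init__(self, num_classes, momentum=0.6):
        self.momentum = momentum
        self.accumulated = [0.0] * num_classes
        self.epochs_seen = 0

    def update(self, probs):
        """Fold in this epoch's prediction and return the smoothed distribution."""
        self.epochs_seen += 1
        m = self.momentum
        self.accumulated = [m * a + (1.0 - m) * p for a, p in zip(self.accumulated, probs)]
        # Correct the bias caused by initializing the accumulator at zero.
        correction = 1.0 - m ** self.epochs_seen
        return [a / correction for a in self.accumulated]


def argmax(values):
    """Index of the largest value; used here to read off the pseudo-label."""
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    ensemble = TemporalEnsemble(num_classes=2, momentum=0.6)
    # Two confident epochs for class 0, then one noisy epoch favoring class 1.
    for epoch_probs in ([0.9, 0.1], [0.8, 0.2], [0.3, 0.7]):
        smoothed = ensemble.update(epoch_probs)
    print(argmax(smoothed), [round(p, 4) for p in smoothed])
```

In this example the final epoch alone would flip the label to class 1, while the smoothed distribution still favors class 0, which is the error-damping behavior the abstract attributes to the temporal ensemble.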
  9. Non-contact monitoring videos capture subtle respiratory-induced motions, yet existing methods primarily focus on estimating respiratory rate (RR), neglecting the extraction of respiratory waveforms, a vital parameter that provides critical health information. We formulate video-based RR estimation as a Tracking All Points (TAP) problem and propose a coarse-to-fine, multi-frame Persistent Independent Particle (RRPIPs) framework for robust, multi-modal (RGB, NIR, IR) RR waveform estimation. Addressing the challenge of tracking minute, non-rigid pixel displacements caused by respiratory motions, our top-down approach magnifies respiratory motion using phase-based video magnification tuned to the respiratory frequency range and employs a pretrained RAFT optical flow model for initial region identification via a two-frame analysis. Coarse-scale tracking is performed using the RRPIPs model, while a Signal Quality Index (SQI) block evaluates the SNR of trajectories to refine high-respiratory-activity regions. These regions are upsampled, and fine-scale tracking is applied to extract precise waveforms. We curated a large-scale multimodal dataset for respiratory point tracking, combining in-house collected (MPSC-RR) and public datasets, with dense annotations of non-rigid pixel movements across multiple scales in key respiratory regions. Experimental results demonstrate that our framework achieves state-of-the-art accuracy (∼1 MAE) and interpretability in respiratory waveform extraction across RGB, NIR, and IR modalities, effectively addressing multi-scale tracking and low-SNR challenges. Thorough ablation studies validate the contributions of each framework component, and we open-source our code and dataset to support further research.