Existing multimodal-based human action recognition approaches are computationally intensive, limiting their deployment in real-time applications. In this work, we present a novel and efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we propose eXpand temporal Shift (X-ShiftNet) convolutional architectures for RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. The X-ShiftNet tackles the high computational cost of the 3D CNNs by integrating the Temporal Shift Module (TSM) into an efficient 2D CNN, enabling efficient spatiotemporal learning. Then skeleton features are utilized to guide the visual network stream, focusing on keyframes and their salient spatial regions using the proposed spatial–temporal attention block. Finally, the predictions of the two streams are fused for final classification. The experimental results show that our method, with a significant reduction in floating-point operations (FLOPs), outperforms and competes with the state-of-the-art methods on NTU RGB-D 60, NTU RGB-D 120, PKU-MMD, and Toyota SmartHome datasets. The proposed EPAM-Net provides up to a 72.8x reduction in FLOPs and up to a 48.6x reduction in the number of network parameters. The code will be available at https://github.com/ahmed-nady/Multimodal-ActionRecognition.
more »
« less
FACS3D-Net: 3D Convolution based Spatiotemporal Representation for Action Unit Detection
Most approaches to automatic facial action unit (AU) detection consider only spatial information and ignore AU dynamics. For humans, dynamics improves AU perception. Is same true for algorithms? To make use of AU dynamics, recent work in automated AU detection has proposed a sequential spatiotemporal approach: Model spatial information using a 2D CNN and then model temporal information using LSTM (Long-Short-Term Memory). Inspired by the experience of human FACS coders, we hypothesized that combining spatial and temporal information simultaneously would yield more powerful AU detection. To achieve this, we propose FACS3D-Net that simultaneously integrates 3D and 2D CNN. Evaluation was on the Expanded BP4D+ database of 200 participants. FACS3D-Net outperformed both 2D CNN and 2D CNN-LSTM approaches. Visualizations of learnt representations suggest that FACS3D-Net is consistent with the spatiotemporal dynamics attended to by human FACS coders. To the best of our knowledge, this is the first work to apply 3D CNN to the problem of AU detection.
more »
« less
- Award ID(s):
- 1721667
- PAR ID:
- 10168237
- Date Published:
- Journal Name:
- 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)
- Page Range / eLocation ID:
- 538 to 544
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Advancements in robotics and AI have increased the demand for interactive robots in healthcare and assistive applications. However, ensuring safe and effective physical human-robot interactions (pHRIs) remains challenging due to the complexities of human motor communication and intent recognition. Traditional physics-based models struggle to capture the dynamic nature of human force interactions, limiting robotic adaptability. To address these limitations, neural networks (NNs) have been explored for force-movement intention prediction. While multi-layer perceptron (MLP) networks show potential, they struggle with temporal dependencies and generalization. Long Short-Term Memory (LSTM) networks effectively model sequential dependencies, while Convolutional Neural Networks (CNNs) enhance spatial feature extraction from human force data. Building on these strengths, this study introduces a hybrid LSTM-CNN framework to improve force-movement intention prediction, increasing accuracy from 69% to 86% through effective denoising and advanced architectures. The combined CNN-LSTM network proved particularly effective in handling individualized force-velocity relationships and presents a generalizable model paving the way for more adaptive strategies in robot guidance. These findings highlight the importance of integrating spatial and temporal modeling to enhance robot precision, responsiveness, and human-robot collaboration. Index Terms —- Physical Human-Robot Interaction, Intention Detection, Machine Learning, Long-Short Term Memory (LSTM)more » « less
-
While machine learning approaches are rapidly being applied to hydrologic problems, physics-informed approaches are still relatively rare. Many successful deep-learning applications have focused on point estimates of streamflow trained on stream gauge observations over time. While these approaches show promise for some applications, there is a need for distributed approaches that can produce accurate two-dimensional results of model states, such as ponded water depth. Here, we demonstrate a 2D emulator of the Tilted V catchment benchmark problem with solutions provided by the integrated hydrology model ParFlow. This emulator model can use 2D Convolution Neural Network (CNN), 3D CNN, and U-Net machine learning architectures and produces time-dependent spatial maps of ponded water depth from which hydrographs and other hydrologic quantities of interest may be derived. A comparison of different deep learning architectures and hyperparameters is presented with particular focus on approaches such as 3D CNN (that have a time-dependent learning component) and 2D CNN and U-Net approaches (that use only the current model state to predict the next state in time). In addition to testing model performance, we also use a simplified simulation based inference approach to evaluate the ability to calibrate the emulator to randomly selected simulations and the match between ML calibrated input parameters and underlying physics-based simulation.more » « less
-
Abstract Objective . Neural decoding is an important tool in neural engineering and neural data analysis. Of various machine learning algorithms adopted for neural decoding, the recently introduced deep learning is promising to excel. Therefore, we sought to apply deep learning to decode movement trajectories from the activity of motor cortical neurons. Approach . In this paper, we assessed the performance of deep learning methods in three different decoding schemes, concurrent, time-delay, and spatiotemporal. In the concurrent decoding scheme where the input to the network is the neural activity coincidental to the movement, deep learning networks including artificial neural network (ANN) and long-short term memory (LSTM) were applied to decode movement and compared with traditional machine learning algorithms. Both ANN and LSTM were further evaluated in the time-delay decoding scheme in which temporal delays are allowed between neural signals and movements. Lastly, in the spatiotemporal decoding scheme, we trained convolutional neural network (CNN) to extract movement information from images representing the spatial arrangement of neurons, their activity, and connectomes (i.e. the relative strengths of connectivity between neurons) and combined CNN and ANN to develop a hybrid spatiotemporal network. To reveal the input features of the CNN in the hybrid network that deep learning discovered for movement decoding, we performed a sensitivity analysis and identified specific regions in the spatial domain. Main results . Deep learning networks (ANN and LSTM) outperformed traditional machine learning algorithms in the concurrent decoding scheme. The results of ANN and LSTM in the time-delay decoding scheme showed that including neural data from time points preceding movement enabled decoders to perform more robustly when the temporal relationship between the neural activity and movement dynamically changes over time. In the spatiotemporal decoding scheme, the hybrid spatiotemporal network containing the concurrent ANN decoder outperformed single-network concurrent decoders. Significance . Taken together, our study demonstrates that deep learning could become a robust and effective method for the neural decoding of behavior.more » « less
-
Deep learning (DL) models have been used for rapid assessments of environmental phenomena like mapping compound flood hazards from cyclones. However, predicting compound flood dynamics (e.g., flood extent and inundation depth over time) is often done with physically-based models because they capture physical drivers, nonlinear interactions, and hysteresis in system behavior. Here, we show that a customized DL model can efficiently learn spatiotemporal dependencies of multiple flood events in Galveston, TX. The proposed model combines the spatial feature extraction of CNN, temporal regression of LSTM, and a novel cluster-based temporal attention approach to assimilate multimodal inputs; thus, accurately replicating compound flood dynamics of physically-based models. The DL model achieves satisfactory flood timing (±1 h), critical success index above 60 %, RMSE below 0.10 m, and nearly perfect error bias of 1. These results demonstrate the model's potential to assist in flood preparation and response efforts in vulnerable coastal regions.more » « less
An official website of the United States government

