Title: A multi-modal transformer network for action detection.
This paper proposes a multi-modal transformer network for detecting actions in untrimmed videos. To enrich the action features, our transformer network utilizes a novel multi-modal attention mechanism that captures the correlations between different combinations of spatial and motion modalities. Such correlations have not previously been explored for action detection. We also suggest an algorithm to correct the motion distortion caused by camera movements. Such motion distortion severely reduces the expressive power of motion features represented by optical flow vectors. We also introduce a new instructional activity dataset that includes classroom videos from K-12 schools. We conduct comprehensive experiments to evaluate the performance of different approaches on our dataset. Our proposed algorithm outperforms the state-of-the-art methods on two public benchmarks, THUMOS14 and ActivityNet, and our instructional activity dataset.
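As a rough illustration of what attention between spatial and motion modalities can look like, the sketch below applies plain scaled dot-product cross-attention in both directions and concatenates the results. It is not the paper's mechanism, which the abstract does not specify. The function names, feature sizes, and random inputs are all assumptions made for this example.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys_values):
    """Scaled dot-product attention where one modality attends to another.

    queries:     list of T vectors (length d) from the attending modality
    keys_values: list of T' vectors (length d) from the attended modality
    Returns (attended, weights): T vectors of length d, and a T x T' weight matrix.
    """
    d = len(queries[0])
    attended, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(wj * kv[i] for wj, kv in zip(w, keys_values)) for i in range(d)]
        )
    return attended, weights


# Hypothetical per-frame features: 8 frames, 16 dimensions per modality.
random.seed(0)
T, D = 8, 16
spatial = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]  # e.g. RGB features
motion = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]   # e.g. optical-flow features

# Each modality queries the other; the two results are concatenated per frame.
spatial_from_motion, w_sm = cross_modal_attention(spatial, motion)
motion_from_spatial, w_ms = cross_modal_attention(motion, spatial)
fused = [s + m for s, m in zip(spatial_from_motion, motion_from_spatial)]
```

Concatenation is only one way to fuse the two attended streams; a learned projection or gating step would be an equally plausible choice in a real model.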
Award ID(s):
2000487
PAR ID:
10448473
Author(s) / Creator(s):
; ;
Editor(s):
Hancock, E.
Date Published:
Journal Name:
Pattern recognition
Volume:
142
ISSN:
0031-3203
Page Range / eLocation ID:
1-28
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Lee, Kyoung Mu (Ed.)
    This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model the spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms the state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens. 
  2. Korban, Matthew; Acton, Scott T; Youngs, Peter; Foster, Jonathan (Ed.)
    Instructional activity recognition is an analytical tool for the observation of classroom education. One of the primary challenges in this domain is dealing with the intricate and heterogeneous interactions between teachers, students, and instructional objects. To address these complex dynamics, we present an innovative activity recognition pipeline designed explicitly for instructional videos, leveraging a multi-semantic attention mechanism. Our novel pipeline uses a transformer network that incorporates several types of instructional semantic attention, including teacher-to-students, students-to-students, teacher-to-object, and students-to-object relationships. This comprehensive approach allows us to classify various interactive activity labels effectively. The effectiveness of our proposed algorithm is demonstrated through its evaluation on our annotated instructional activity dataset.
  3. First-person-view videos of hands interacting with tools are widely used in the computer vision industry. However, creating a dataset with pixel-wise segmentation of hands is challenging since most videos are captured with fingertips occluded by the hand dorsum and grasped tools. Current methods often rely on manually segmenting hands to create annotations, which is inefficient and costly. To address this challenge, we create a method that utilizes thermal information of hands for efficient pixel-wise hand segmentation to create a multi-modal activity video dataset. Our method is not affected by fingertip and joint occlusions and does not require hand pose ground truth. We show our method to be 24 times faster than the traditional polygon labeling method while maintaining high quality. With the segmentation method, we propose a multi-modal hand activity video dataset with 790 sequences and 401,765 frames of "hands using tools" videos captured by thermal and RGB-D cameras with hand segmentation data. We analyze multiple models for hand segmentation performance and benchmark four segmentation networks. We show that fusing Long-Wave InfraRed (LWIR) and RGB-D frames from our multi-modal dataset achieves 5% better hand IoU performance than using RGB frames.
  4. Videos convey rich information. Dynamic spatio-temporal relationships between people/objects, and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one of the tasks which can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, and hence give the model useful extra information (in explicit textual format to allow easier matching) for answering questions. Moreover, our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier. Finally, we also cast the frame selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word, object, and frame level visualization studies.
  5. In this paper, we propose a machine learning-based multi-stream framework to recognize American Sign Language (ASL) manual signs and nonmanual gestures (face and head movements) in real time from RGB-D videos. Our approach is based on 3D Convolutional Neural Networks (3D CNNs) and fuses multi-modal features, including hand gestures, facial expressions, and body poses, from multiple channels (RGB, Depth, Motion, and Skeleton joints). To learn the overall temporal dynamics in a video, a proxy video is generated by selecting a subset of frames for each video, which are then used to train the proposed 3D CNN model. We collected a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera. Each video consists of 100 ASL manual signs, along with RGB channel, Depth maps, Skeleton joints, Face features, and HD face. The dataset is fully annotated for each semantic region (i.e. the time duration of each sign that the human signer performs). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL sign glosses in our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on a large-scale dataset, ChaLearn IsoGD, achieving state-of-the-art results.