-
This paper introduces a novel transformer network tailored to skeleton-based action detection in untrimmed long video streams. Our approach centers on three innovative mechanisms that collectively enhance the network’s temporal analysis capabilities. First, a new predictive attention mechanism incorporates future-frame data into the sequence analysis during the training phase. This mechanism addresses a key limitation of current action detection models, namely incomplete temporal modeling in long action sequences, particularly for boundary frames that lie outside the network’s immediate temporal receptive field, while maintaining computational efficiency. Second, we integrate a new adaptive weighted temporal attention system that dynamically evaluates the importance of each frame within an action sequence. In contrast to existing approaches, the proposed weighting strategy is both adaptive and interpretable, making it highly effective for long sequences with numerous non-informative frames. Third, the network incorporates an advanced regression technique that independently identifies the start and end frames based on their relevance to different frames. Unlike existing homogeneous regression methods, the proposed regression method is heterogeneous and draws on various temporal relationships, including those involving future frames of actions, making it more effective for action detection. Extensive experiments on prominent untrimmed skeleton-based action datasets, PKU-MMD, OAD, and the Charade dataset, demonstrate the effectiveness of this network.
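As a rough illustration of the adaptive frame-weighting idea in the abstract above, the sketch below scores each frame with a simple linear projection and normalizes the scores with a softmax, so uninformative frames receive near-zero weight. The scoring vector `w`, the temperature, and the pooling step are hypothetical stand-ins, not the paper's actual mechanism.

```python
import numpy as np

def adaptive_temporal_weights(frames, w, temperature=1.0):
    """Score each frame with a linear projection and softmax-normalize,
    so that non-informative frames get near-zero weight.
    `w` is a hypothetical learned scoring vector."""
    scores = frames @ w / temperature      # one scalar score per frame, (T,)
    scores -= scores.max()                 # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights

# Toy example: 5 frames with 4-dimensional skeleton features.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 4))
w = rng.normal(size=4)
weights = adaptive_temporal_weights(frames, w)
pooled = weights @ frames                  # weighted temporal pooling, (4,)
```

In a real model the scores would come from a learned attention module rather than a fixed vector, but the normalization-and-pool pattern is the same.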
-
Student attentiveness within the classroom can be assessed by observing student attention toward the teacher or whiteboard, which may be inferred from eye-gaze direction. This paper introduces a novel technique for evaluating student attentiveness by analyzing the direction of eye gaze derived from the 3D skeletal pose in a reconstructed 3D environment. Among its contributions, the paper proposes a novel 3D head pose estimation algorithm that, unlike other works, does not require frontal face information. As a result, the method is highly effective in uncontrolled environments such as classrooms, where frontal face data is often unavailable. Moreover, a new algorithm was developed to evaluate student attentiveness based on 3D eye-gaze information interpreted from the 3D head pose. The proposed method has been validated on a set of instructional videos collected at the University of Virginia.
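A gaze-based attentiveness test of the kind described above can be sketched as a cone check: convert head yaw/pitch into a unit gaze vector and compare its angle to the direction toward the whiteboard or teacher. The 30-degree threshold and the assumption that gaze follows head orientation are illustrative choices, not values from the paper.

```python
import math

def gaze_vector(yaw, pitch):
    """Unit gaze direction from head yaw/pitch (radians), assuming
    gaze follows head orientation (no separate eyeball tracking)."""
    return (math.cos(pitch) * math.sin(yaw),
            math.sin(pitch),
            math.cos(pitch) * math.cos(yaw))

def is_attentive(yaw, pitch, target, max_angle_deg=30.0):
    """Count a student as attentive if the gaze vector lies within
    `max_angle_deg` of the direction toward the target region."""
    g = gaze_vector(yaw, pitch)
    norm = math.sqrt(sum(t * t for t in target))
    cos_sim = sum(a * b / norm for a, b in zip(g, target))
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_sim))))
    return angle <= max_angle_deg

# Whiteboard straight ahead along +z (hypothetical scene layout).
facing_board = is_attentive(0.0, 0.0, target=(0.0, 0.0, 1.0))
looking_away = is_attentive(math.radians(80), 0.0, (0.0, 0.0, 1.0))
```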
-
Hwang, Gwo-Jen; Xie, Haoran; Wah, Benjamin; Gasevic, Dragan (Eds.)
Classroom videos are a common source of data for educational researchers studying classroom interactions, as well as a resource for teacher education and professional development. Over the last several decades, emerging technologies have been applied to classroom videos to record, transcribe, and analyze classroom interactions. With the rise of machine learning, we report on the development and validation of neural networks that classify instructional activities using video signals alone, without analyzing speech or audio features, from a large corpus of nearly 250 hours of classroom videos of elementary mathematics and English language arts instruction. Results indicated that the neural networks performed fairly well in detecting instructional activities, at diverse levels of complexity, as compared to human raters. For instance, one neural network achieved over 80% accuracy in detecting four common activity types: whole-class activity, small-group activity, individual activity, and transition. An issue not addressed in this study is whether the fine-grained, agnostic instructional activities detected by the neural networks could scale up to supply information about features of instructional quality. Future applications of these neural networks may enable more efficient cataloguing and analysis of classroom videos at scale, and the generation of fine-grained data about the classroom environment to inform potential implications for teaching and learning.
-
Korban, Matthew; Acton, Scott T; Youngs, Peter; Foster, Jonathan (Eds.)
Instructional activity recognition is an analytical tool for the observation of classroom education. One of the primary challenges in this domain is dealing with the intricate and heterogeneous interactions between teachers, students, and instructional objects. To address these complex dynamics, we present an innovative activity recognition pipeline designed explicitly for instructional videos, leveraging a multi-semantic attention mechanism. Our novel pipeline uses a transformer network that incorporates several types of instructional semantic attention, including teacher-to-students, students-to-students, teacher-to-object, and students-to-object relationships. This comprehensive approach allows us to classify various interactive activity labels effectively. The effectiveness of our proposed algorithm is demonstrated through its evaluation on our annotated instructional activity dataset.
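One simple way to realize a multi-semantic attention of the flavor described above is to run one masked attention map per semantic relation (e.g. teacher-to-students) and fuse the streams. The sketch below uses plain dot-product attention, boolean relation masks, and mean fusion; all of these details are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_semantic_attention(x, masks):
    """One attention map per semantic relation, restricted by a boolean
    mask over entity pairs, then averaged across relations.
    A simplified sketch, not the paper's architecture."""
    d = x.shape[-1]
    logits = x @ x.T / np.sqrt(d)              # (N, N) pairwise scores
    outputs = []
    for mask in masks:
        masked = np.where(mask, logits, -1e9)  # keep only this relation's pairs
        outputs.append(softmax(masked) @ x)
    return np.mean(outputs, axis=0)            # fuse the semantic streams

# 4 entities: indices 0-1 teachers, 2-3 students; 8-dim features (toy data).
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))
t2s = np.zeros((4, 4), bool); t2s[:2, 2:] = True   # teacher-to-students
s2s = np.zeros((4, 4), bool); s2s[2:, 2:] = True   # students-to-students
fused = multi_semantic_attention(x, [t2s, s2s])
```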
-
Lee, Kyoung Mu (Ed.)
This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to properly model the spatiotemporal interactions between different action semantics. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, which primarily aims at finding similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms the state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens.
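The motion-aware positional encoding idea can be illustrated by building a standard sinusoidal code from motion-shifted coordinates (x + dx, y + dy), so the code reflects where an action semantic is moving rather than only where it sits. The displacement source (e.g. optical flow) and the encoding dimensions here are assumed for illustration; this is not the paper's exact algorithm.

```python
import numpy as np

def sinusoidal_encoding(pos, dim):
    """Standard 1D sinusoidal positional encoding for a scalar position."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000 ** (2 * i / dim))
    return np.concatenate([np.sin(pos * freq), np.cos(pos * freq)])

def motion_aware_encoding(x, y, dx, dy, dim=16):
    """2D positional code built from motion-shifted coordinates
    (x + dx, y + dy), where (dx, dy) would come from a motion cue
    such as optical flow. A simplified sketch of the idea."""
    return np.concatenate([sinusoidal_encoding(x + dx, dim // 2),
                           sinusoidal_encoding(y + dy, dim // 2)])

# Same location, different motion, yields a different code.
static = motion_aware_encoding(3.0, 5.0, 0.0, 0.0)
moving = motion_aware_encoding(3.0, 5.0, 2.0, -1.0)
```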
-
Korban, Matthew; Youngs, Peter; Acton, Scott T (Eds.)
Analyzing instructional videos via computer vision and machine learning holds promise for several tasks, such as assessing teacher performance and classroom climate, evaluating student engagement, and identifying racial bias in instruction. The traditional way of evaluating instructional videos depends on manual observation by human raters, which is time-consuming and requires a trained labor force. Therefore, this paper tests several deep network architectures in the automation of instructional video analysis, where the networks are tailored to recognize classroom activity. Our experimental setup includes a set of 250 hours of primary and middle school videos that are annotated by expert human raters. We present several strategies to handle the varying length of instructional activities, a major challenge in the detection of instructional activity. Based on the proposed strategies, we enhance and compare different deep networks for detecting instructional activity.
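One common strategy for variable-length activities, of the kind the abstract alludes to, is to split a long video into fixed-size overlapping windows and classify each window. The window size and stride below are illustrative choices; the paper's actual strategies are not specified here.

```python
def sliding_windows(n_frames, win, stride):
    """Split a video of n_frames into fixed-size overlapping windows,
    adding a final window aligned to the end so no frames are dropped.
    A generic sketch of one length-handling strategy."""
    starts = list(range(0, max(n_frames - win, 0) + 1, stride))
    if starts and starts[-1] + win < n_frames:   # cover the remainder
        starts.append(n_frames - win)
    return [(s, min(s + win, n_frames)) for s in starts]

# A 100-frame clip with 32-frame windows and stride 16.
windows = sliding_windows(n_frames=100, win=32, stride=16)
```

Each window can then be scored independently and the per-window predictions merged (e.g. by non-maximum suppression) into activity segments.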
