skip to main content


Title: MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing
Video-language models (VLMs), large models pre-trained on numerous but noisy video-text pairs from the internet, have revolutionized activity recognition through their remarkable generalization and open-vocabulary capabilities. While complex human activities are often hierarchical and compositional, most existing tasks for evaluating VLMs focus only on high-level video understanding, making it difficult to accurately assess and interpret the ability of VLMs to understand complex and fine-grained human activities. Inspired by the recently proposed MOMA framework, we define activity graphs as a single universal representation of human activities that encompasses video understanding at the activity, sub10 activity, and atomic action level. We redefine activity parsing as the overarching task of activity graph generation, requiring understanding human activities across all three levels. To facilitate the evaluation of models on activity parsing, we introduce MOMA-LRG (Multi-Object Multi-Actor Language-Refined Graphs), a large dataset of complex human activities with activity graph annotations that can be readily transformed into natural language sentences. Lastly, we present a model-agnostic and lightweight approach to adapting and evaluating VLMs by incorporating structured knowledge from activity graphs into VLMs, addressing the individual limitations of language and graphical models. We demonstrate a strong performance on activity parsing and few-shot video classification, and our framework is intended to foster future research in the joint modeling of videos, graphs, and language.  more » « less
Award ID(s):
2026498
NSF-PAR ID:
10358719
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In computer vision, tracking humans across camera views remains challenging, especially for complex scenarios with frequent occlusions, significant lighting changes and other difficulties. Under such conditions, most existing appearance and geometric cues are not reliable enough to distinguish humans across camera views. To address these challenges, this paper presents a stochastic attribute grammar model for leveraging complementary and discriminative human attributes for enhancing cross-view tracking. The key idea of our method is to introduce a hierarchical representation, parse graph, to describe a subject and its movement trajectory in both space and time domains. This results in a hierarchical compositional representation, comprising trajectory entities of varying level, including human boxes, 3D human boxes, tracklets and trajectories. We use a set of grammar rules to decompose a graph node (e.g. tracklet) into a set of children nodes (e.g. 3D human boxes), and augment each node with a set of attributes, including geometry (e.g., moving speed, direction), accessories (e.g., bags), and/or activities (e.g., walking, running). These attributes serve as valuable cues, in addition to appearance features (e.g., colors), in determining the associations of human detection boxes across cameras. In particular, the attributes of a parent node are inherited by its children nodes, resulting in consistency constraints over the feasible parse graph. Thus, we cast cross-view human tracking as finding the most discriminative parse graph for each subject in videos. We develop a learning method to train this attribute grammar model from weakly supervised training data. To infer the optimal parse graph and its attributes, we develop an alternative parsing method that employs both top-down and bottom-up computations to search the optimal solution. We also explicitly reason the occlusion status of each entity in order to deal with significant changes of camera viewpoints. We evaluate the proposed method over public video benchmarks and demonstrate with extensive experiments that our method clearly outperforms state-of-theart tracking methods. 
    more » « less
  2. Abstract—Summarization of long sequences into a concise statement is a core problem in natural language processing, which requires a non-trivial understanding of the weakly structured text. Therefore, integrating crowdsourced multiple users’ comments into a concise summary is even harder because (1) it requires transferring the weakly structured comments to structured knowledge. Besides, (2) the users comments are informal and noisy. In order to capture the long-distance relationships in staggered long sentences, we propose a neural multi-comment summarization (MCS) system that incorporates the sentence relationships via graph heuristics that utilize relation knowledge graphs, i.e., sentence relation graphs (SRG) and approximate discourse graphs (ADG). Motivated by the promising results of gated graph neural networks (GG-NNs) on highly structured data, we develop a GG-NNs with sequence encoder that incorporates SRG or ADG in order to capture the sentence relationships. Specifically, we employ the GG-NNs on both relation knowledge graphs, with the sentence embeddings as the input node features and the graph heuristics as the edges’ weights. Through multiple layerwise propagations, the GG-NNs generate the salience for each sentence from high-level hidden sentence features. Consequently, we use a greedy heuristic to extract salient users’ comments while avoiding the noise in comments. The experimental results show that the proposed MCS improves the summarization performance both quantitatively and qualitatively. 
    more » « less
  3. Discovering the underlying structures present in large real world graphs is a fundamental scientific problem. Recent work at the intersection of formal language theory and graph theory has found that a Probabilistic Hyperedge Replacement Grammar (PHRG) can be extracted from a tree decomposition of any graph. However, because the extracted PHRG is directly dependent on the shape and contents of the tree decomposition, rather than from the dynamics of the graph, it is unlikely that informative graph-processes are actually being captured with the PHRG extraction algorithm. To address this problem, the current work adapts a related formalism called Probabilistic Synchronous HRG (PSHRG) that learns synchronous graph production rules from temporal graphs. We introduce the PSHRG model and describe a method to extract growth rules from the graph. We find that SHRG rules capture growth patterns found in temporal graphs and can be used to predict the future evolution of a temporal graph. We perform a brief evaluation on small synthetic networks that demonstrate the prediction accuracy of PSHRG versus baseline and state of the art models. Ultimately, we find that PSHRGs seem to be very good at modelling dynamics of a temporal graph; however, our prediction algorithm, which is based on string parsing and generation algorithms, does not scale to practically useful graph sizes. 
    more » « less
  4. This study aims at sensing and understanding the worker’s activity in a human-centered intelligent manufacturing system. We propose a novel multi-modal approach for worker activity recognition by leveraging information from different sensors and in different modalities. Specifically, a smart armband and a visual camera are applied to capture Inertial Measurement Unit (IMU) signals and videos, respectively. For the IMU signals, we design two novel feature transform mechanisms, in both frequency and spatial domains, to assemble the captured IMU signals as images, which allow using convolutional neural networks to learn the most discriminative features. Along with the above two modalities, we propose two other modalities for the video data, i.e., at the video frame and video clip levels. Each of the four modalities returns a probability distribution on activity prediction. Then, these probability distributions are fused to output the worker activity classification result. A worker activity dataset is established, which at present contains 6 common activities in assembly tasks, i.e., grab a tool/part, hammer a nail, use a power-screwdriver, rest arms, turn a screwdriver, and use a wrench. The developed multi-modal approach is evaluated on this dataset and achieves recognition accuracies as high as 97% and 100% in the leave-one-out and half-half experiments, respectively. 
    more » « less
  5. Knowledge tracing (KT), or modeling student knowledge state given their past activity sequence, is one of the essential tasks in online education systems. Research has demonstrated that students benefit from both assessed (e.g., solving problems, which can be graded) and non-assessed learning activities (e.g., watching video lectures, which cannot be graded), and thus, modeling student knowledge from multiple types of activities with knowledge transfer between them is crucial. However, current approaches to multi-activity knowledge tracing cannot capture coarse-grained between-type associations and are primarily evaluated by predicting student performance on upcoming assessed activities (labeled data). Therefore, they are inadequate in incorporating signals from non-assessed activities (unlabeled data). We propose Graph-enhanced Multi-activity Knowledge Tracing (GMKT) that addresses these challenges by jointly learning a fine-grained recurrent memory-augmented student knowledge model and a coarse-grained graph neural network. In GMKT, we formulate multi-activity knowledge tracing as a semi-supervised sequence learning problem and optimize for accurate student performance and activity type at each time step. We demonstrate the effectiveness of our proposed model by experimenting on three real-world datasets. 
    more » « less