Title: A Causal And-Or Graph Model for Visibility Fluent Reasoning in Tracking Interacting Objects
Tracking humans that interact with other subjects or with the environment remains an unsolved problem in visual tracking, because the visibility of the humans of interest in a video is unknown and may vary over time. In particular, it is still difficult for state-of-the-art human trackers to completely recover human trajectories in crowded scenes with frequent human interactions. In this work, we treat the visibility status of a subject as a fluent variable whose change is mostly attributed to the subject's interactions with its surroundings, e.g., crossing behind another object, entering a building, or getting into a vehicle. We introduce a Causal And-Or Graph (C-AOG) to represent the causal relations between an object's visibility fluent and its activities, and develop a probabilistic graph model to jointly reason about visibility fluent changes (e.g., from visible to invisible) and track humans in videos. We formulate this joint task as an iterative search for a feasible causal graph structure, which enables fast search algorithms such as dynamic programming. We apply the proposed method to challenging video sequences to evaluate its ability to estimate the visibility fluent changes of subjects and to track subjects of interest over time. Comparative results demonstrate that our method outperforms alternative trackers and can recover complete trajectories of humans in complicated scenarios with frequent human interactions.
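To make the joint reasoning concrete, here is a minimal sketch (not the authors' implementation) of the kind of dynamic-programming search the abstract alludes to: given per-frame scores for each visibility fluent, it recovers the most likely fluent sequence under transition costs. The fluent names, transition costs, and scoring interface below are illustrative assumptions.

import math

# Hypothetical visibility fluents; the paper's actual state space may differ.
FLUENTS = ["visible", "partially_occluded", "invisible"]

# Hypothetical causal transition costs: cheap to stay in a state, more
# expensive to change fluent (e.g., crossing behind another object).
TRANSITION_COST = {(a, b): 0.0 if a == b else 1.0
                   for a in FLUENTS for b in FLUENTS}

def best_fluent_sequence(frame_scores):
    """frame_scores: list of dicts mapping fluent -> log-likelihood."""
    n = len(frame_scores)
    # dp[t][s] = best cumulative score ending in fluent s at frame t
    dp = [{s: -math.inf for s in FLUENTS} for _ in range(n)]
    back = [{s: None for s in FLUENTS} for _ in range(n)]
    for s in FLUENTS:
        dp[0][s] = frame_scores[0][s]
    for t in range(1, n):
        for s in FLUENTS:
            for p in FLUENTS:
                cand = dp[t - 1][p] - TRANSITION_COST[(p, s)] + frame_scores[t][s]
                if cand > dp[t][s]:
                    dp[t][s], back[t][s] = cand, p
    # Trace back the optimal fluent sequence from the best final state.
    s = max(FLUENTS, key=lambda f: dp[-1][f])
    seq = [s]
    for t in range(n - 1, 0, -1):
        s = back[t][s]
        seq.append(s)
    return list(reversed(seq))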
Award ID(s):
1657600
PAR ID:
10056960
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
IEEE Conference on Computer Vision and Pattern Recognition
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In computer vision, tracking humans across camera views remains challenging, especially in complex scenarios with frequent occlusions, significant lighting changes, and other difficulties. Under such conditions, most existing appearance and geometric cues are not reliable enough to distinguish humans across camera views. To address these challenges, this paper presents a stochastic attribute grammar model that leverages complementary and discriminative human attributes to enhance cross-view tracking. The key idea of our method is to introduce a hierarchical representation, a parse graph, to describe a subject and its movement trajectory in both the space and time domains. This results in a hierarchical compositional representation comprising trajectory entities of varying levels, including human boxes, 3D human boxes, tracklets, and trajectories. We use a set of grammar rules to decompose a graph node (e.g., a tracklet) into a set of child nodes (e.g., 3D human boxes), and augment each node with a set of attributes, including geometry (e.g., moving speed, direction), accessories (e.g., bags), and/or activities (e.g., walking, running). These attributes serve as valuable cues, in addition to appearance features (e.g., colors), in determining the associations of human detection boxes across cameras. In particular, the attributes of a parent node are inherited by its child nodes, resulting in consistency constraints over the feasible parse graphs. Thus, we cast cross-view human tracking as finding the most discriminative parse graph for each subject in the videos. We develop a learning method to train this attribute grammar model from weakly supervised training data. To infer the optimal parse graph and its attributes, we develop an alternating parsing method that employs both top-down and bottom-up computations to search for the optimal solution. We also explicitly reason about the occlusion status of each entity in order to deal with significant changes of camera viewpoint. We evaluate the proposed method on public video benchmarks and demonstrate with extensive experiments that our method clearly outperforms state-of-the-art tracking methods.
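As an illustration of the parse-graph idea above, the following sketch, under assumed data structures that are not the paper's actual ones, shows how attributes can be passed from parent to child nodes and checked against the consistency constraint the abstract describes.

from dataclasses import dataclass, field

@dataclass
class ParseNode:
    level: str                                       # e.g., "trajectory", "tracklet", "box"
    attributes: dict = field(default_factory=dict)   # e.g., {"bag": True, "activity": "walking"}
    children: list = field(default_factory=list)

    def add_child(self, child):
        # Children inherit the parent's attributes; a child's own values
        # are kept, so conflicts remain detectable below.
        child.attributes = {**self.attributes, **child.attributes}
        self.children.append(child)

def consistent(node):
    """Check the inheritance constraint: every child must agree with
    the parent on all attributes they share."""
    for c in node.children:
        for k, v in node.attributes.items():
            if c.attributes.get(k, v) != v:
                return False
        if not consistent(c):
            return False
    return True

# Example: a tracklet node whose boxes must carry the same "bag" attribute.
# root = ParseNode("tracklet", {"bag": True}); root.add_child(ParseNode("box"))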
  2. In this work, we revisit the classical stochastic jump-diffusion process and develop an effective variant for estimating the visibility statuses of objects while tracking them in videos. Dealing with partial or full occlusions is a long-standing problem in computer vision that largely remains unsolved. We cast this problem as a Markov Decision Process and develop a policy-based jump-diffusion method to jointly track object locations in videos and estimate their visibility statuses. Our method employs a set of jump dynamics to change objects' visibility statuses and a set of diffusion dynamics to track objects in videos. Unlike the traditional jump-diffusion process, which generates dynamics stochastically, we utilize deep policy functions to determine the best dynamic for the present state, and we learn the optimal policies using reinforcement learning methods. Our method is capable of tracking objects with full or partial occlusions in crowded scenes. We evaluate the proposed method on challenging video sequences and compare it to alternative tracking methods. Significant improvements are achieved, particularly for videos with frequent interactions or occlusions.
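A minimal sketch of the policy-driven dynamics selection described above: a policy scores the available jump dynamics (which switch the visibility status) and diffusion dynamics (which move the track estimate), and the best-scoring dynamic is applied. The dynamics and the random stand-in policy are placeholder assumptions, not the paper's learned deep policy functions.

import random

JUMPS = ["set_visible", "set_occluded"]          # switch visibility status
DIFFUSIONS = ["shift_left", "shift_right", "stay"]  # adjust track location

def policy(state, dynamic):
    # Stand-in for a learned policy network: returns a preference score
    # for applying `dynamic` in `state`.
    return random.random()

def step(state):
    # Choose the single best dynamic for the current state, instead of
    # sampling stochastically as the classical jump-diffusion process would.
    best = max(JUMPS + DIFFUSIONS, key=lambda d: policy(state, d))
    if best in JUMPS:
        state = {**state, "visibility": best.removeprefix("set_")}
    else:
        dx = {"shift_left": -1, "shift_right": 1, "stay": 0}[best]
        state = {**state, "x": state["x"] + dx}
    return state, best

# Example: state = {"x": 0, "visibility": "visible"}; state, d = step(state)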
  3. Fluency, described as the "coordinated meshing of joint activities between members of a well-synchronized team," is essential to human-robot team success. Human teams achieve fluency through rich, often largely implicit, communication. A key challenge in bridging the gap between industry and academia is understanding what influences human perception of a fluent team experience, so that human-robot fluency can be better optimized in industrial environments. This paper addresses this challenge through an online experiment featuring videos that vary the timing of human and robot actions to influence perceived team fluency. Our results support three broad conclusions. First, we did not see differences across most subjective fluency measures. Second, people report interactions as more fluent when teammates stay more active. Third, reducing delays when humans' tasks depend on robots increases perceived team fluency.
  4. Global climate change has been shown to cause longer, more intense, and more frequent heatwaves, to which anthropogenic stressors concentrated in urban areas are a critical contributor. In this study, we investigate the causal interactions during heatwaves across 520 urban sites in the U.S. by combining complex-network and causal analysis. The presence of regional mediators is manifest in the constructed causal networks, together with long-range teleconnections. More importantly, megacities, such as New York City and Chicago, are causally connected with most other cities and mediate the structure of urban networks during heatwaves. We also identify a significantly positive correlation between causality strength and total population in megacities. These findings corroborate the contribution of human activities, e.g., anthropogenic emissions of greenhouse gases or waste heat, to urban heatwaves. The emergence of teleconnections and supernodes is informative for the prediction of, and adaptation to, heatwaves under global climate change.
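The analysis pipeline above could be prototyped roughly as follows. This sketch substitutes a simple lagged-correlation proxy for a proper causality test, and the city-series format, lag, and threshold are illustrative assumptions rather than the study's methodology.

import numpy as np

def lagged_corr(x, y, lag=1):
    """Correlation between x[t] and y[t+lag]: a crude stand-in for a
    directed causality test from x to y."""
    return float(np.corrcoef(x[:-lag], y[lag:])[0, 1])

def causal_network(series, threshold=0.5):
    """series: dict city -> 1-D numpy array of site measurements.
    Returns directed edges (a, b) -> strength for links above threshold."""
    edges = {}
    cities = list(series)
    for a in cities:
        for b in cities:
            if a != b:
                s = lagged_corr(series[a], series[b])
                if s > threshold:
                    edges[(a, b)] = s
    return edges

def supernodes(edges, top_k=3):
    # Rank cities by out-degree to surface the mediating "supernodes".
    out_degree = {}
    for (a, _b) in edges:
        out_degree[a] = out_degree.get(a, 0) + 1
    return sorted(out_degree, key=out_degree.get, reverse=True)[:top_k]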
  5. With the increasing reliance on small Unmanned Aerial Systems (sUAS) for emergency response scenarios, such as search and rescue, the integration of computer vision capabilities has become a key factor in mission success. Nevertheless, computer vision performance for detecting humans severely degrades when shifting from ground to aerial views. Several aerial datasets have been created to mitigate this problem; however, none of them has specifically addressed the issue of occlusion, a critical component in emergency response scenarios. The Natural, Occluded, Multi-scale Aerial Dataset (NOMAD) presents a benchmark for human detection under occluded aerial views, with five different aerial distances and rich imagery variance. NOMAD is composed of 100 different actors, all performing sequences of walking, lying down, and hiding. It includes 42,825 frames, extracted from 5.4K-resolution videos and manually annotated with a bounding box and a label describing ten different visibility levels, categorized according to the percentage of the human body visible inside the bounding box. This allows computer vision models to be evaluated on their detection performance across different ranges of occlusion. NOMAD is designed to improve the effectiveness of aerial search and rescue and to enhance collaboration between sUAS and humans by providing a new benchmark dataset for human detection under occluded aerial views.
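The occlusion-stratified evaluation that NOMAD enables might look like the following sketch: group annotations by visibility level and report detection recall per level. The field names, box format, and IoU threshold are assumptions for illustration, not the dataset's actual schema.

from collections import defaultdict

def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def recall_by_visibility(annotations, detections, iou_threshold=0.5):
    """annotations: list of dicts with 'frame', 'bbox', 'visibility' (1-10).
    detections: dict frame -> list of predicted boxes.
    Returns recall per visibility level."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ann in annotations:
        level = ann["visibility"]
        totals[level] += 1
        preds = detections.get(ann["frame"], [])
        if any(iou(ann["bbox"], p) >= iou_threshold for p in preds):
            hits[level] += 1
    return {lvl: hits[lvl] / totals[lvl] for lvl in totals}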