Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
In computer vision, tracking humans across camera views remains challenging, especially for complex scenarios with frequent occlusions, significant lighting changes and other difficulties. Under such conditions, most existing appearance and geometric cues are not reliable enough to distinguish humans across camera views. To address these challenges, this paper presents a stochastic attribute grammar model for leveraging complementary and discriminative human attributes for enhancing cross-view tracking. The key idea of our method is to introduce a hierarchical representation, parse graph, to describe a subject and its movement trajectory in both space and time domains. This results in a hierarchical compositional representation, comprising trajectory entities of varying level, including human boxes, 3D human boxes, tracklets and trajectories. We use a set of grammar rules to decompose a graph node (e.g. tracklet) into a set of children nodes (e.g. 3D human boxes), and augment each node with a set of attributes, including geometry (e.g., moving speed, direction), accessories (e.g., bags), and/or activities (e.g., walking, running). These attributes serve as valuable cues, in addition to appearance features (e.g., colors), in determining the associations of human detection boxes across cameras. In particular, the attributes of a parent node are inherited by its children nodes, resulting in consistency constraints over the feasible parse graph. Thus, we cast cross-view human tracking as finding the most discriminative parse graph for each subject in videos. We develop a learning method to train this attribute grammar model from weakly supervised training data. To infer the optimal parse graph and its attributes, we develop an alternative parsing method that employs both top-down and bottom-up computations to search the optimal solution. We also explicitly reason the occlusion status of each entity in order to deal with significant changes of camera viewpoints. We evaluate the proposed method over public video benchmarks and demonstrate with extensive experiments that our method clearly outperforms state-of-theart tracking methods.more » « less
Tracking humans that are interacting with the other subjects or environment remains unsolved in visual tracking, because the visibility of the human of interests in videos is unknown and might vary over time. In particular, it is still difficult for state-of-the-art human trackers to recover completely human trajectories in crowded scenes with frequent human interactions. In this work, we consider the visibility status of a subject as a fluent variable, whose change is mostly attributed to the subject’s interaction with the surrounding, e.g., crossing behind another object, entering a a building, or getting into a vehicle, etc. We introduce a Causal And-Or Graph (C-AOG) to represent the causal effect relations between an object’s visibility fluent and its activities, and develop a probabilistic graph model to jointly reason the visibility fluent change (e.g., from visible to invisible) and track humans in videos. We formulate this joint task as an iterative search of a feasible causal graph structure that enables fast search algorithm, e.g., dynamic programming method. We apply the proposed method to challenging video sequences to evaluate its capabilities of estimating visibility fluent changes of subjects and tracking subjects of interests over time. Results with comparisons demonstrate that our method outperforms the alternative trackers and can recover complete trajectories of humans in complicated scenarios with frequent human interactionsmore » « less
In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes 2D pose as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network which efficiently captures pose-aligned features and a hierarchy of Bi-directional RNNs (BRNN) on the top to explicitly incorporate a set of knowledge regarding human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. In learning, we develop a pose sample simulator to augment training samples in virtual camera views, which further improves our model generalizability. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol working on cross-view setting to verify the generalization capability of different methods.We empirically observe that most state-of-the-art methods encounter difficulty under such setting while our method can well handle such challenges.more » « less
This paper presents a unified framework to learn to quantify perceptual attributes (e.g., safety, attractiveness) of physical urban environments using crowd-sourced street-view photos without human annotations. The efforts of this work include two folds. First, we collect a large-scale urban image dataset in multiple major cities in U.S.A., which consists of multiple street-view photos for every place. Instead of using subjective annotations as in previous works, which are neither accurate nor consistent, we collect for every place the safety score from government’s crime event records as objective safety indicators. Second, we observe that the place-centric perception task is by nature a multi-instance regression problem since the labels are only available for places (bags), rather than images or image regions (instances). We thus introduce a deep convolutional neural network (CNN) to parameterize the instance-level scoring function, and develop an EM algorithm to alternatively estimate the primary instances (images or image regions) which affect the safety scores and train the proposed network. Our method is capable of localizing interesting images and image regions for each place.We evaluate the proposed method on a newly created dataset and a public dataset. Results with comparisons showed that our method can clearly outperform the alternative perception methods and more importantly, is capable of generating region-level safety scores to facilitate interpretations of the perception process.more » « less