Networked data involve complex information from multifaceted channels, including topology structures, node content, and/or node labels, where structure and content are often correlated but not always consistent. A typical scenario is citation relationships among scholarly publications, where a paper is cited by others not because they have the same content but because they share one or more subject matters. To date, while many network embedding methods take node content into consideration, they all treat node content as a simple flat word/attribute set, and nodes sharing connections are assumed to have dependency with respect to all…
A Stochastic Attribute Grammar for Robust Cross-View Human Tracking
In computer vision, tracking humans across camera views
remains challenging, especially for complex scenarios with frequent
occlusions, significant lighting changes and other difficulties. Under
such conditions, most existing appearance and geometric cues are
not reliable enough to distinguish humans across camera views. To
address these challenges, this paper presents a stochastic attribute
grammar model that leverages complementary and discriminative human
attributes to enhance cross-view tracking. The key idea of our
method is to introduce a hierarchical representation, the parse graph, to
describe a subject and its movement trajectory in both space and time
domains. This results in a hierarchical compositional representation,
comprising trajectory entities of varying levels, including human boxes,
3D human boxes, tracklets and trajectories. We use a set of grammar
rules to decompose a graph node (e.g., a tracklet) into a set of child
nodes (e.g., 3D human boxes), and augment each node with a set
of attributes, including geometry (e.g., moving speed, direction), accessories
(e.g., bags), and/or activities (e.g., walking, running). These
attributes serve as valuable cues, in addition to appearance features
(e.g., colors), in determining the associations of human detection boxes
across cameras. In particular, the attributes of a parent node are inherited
by its child nodes, resulting in consistency constraints over
the feasible parse graph. Thus, we cast cross-view human tracking as
finding the most discriminative parse graph for…
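
The inheritance constraint described in the abstract can be illustrated with a small sketch. This is not the authors' model, which is stochastic; it is a deterministic toy under stated assumptions. The class `ParseNode`, the helpers `add_child` and `inconsistencies`, the level names, and the strict one-level-down ordering of the hierarchy are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed ordering of the entity levels named in the abstract, root to leaf.
LEVELS = ["trajectory", "tracklet", "3d_box", "box"]


@dataclass
class ParseNode:
    """One node of a toy parse graph (hypothetical structure)."""
    level: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List["ParseNode"] = field(default_factory=list)


def add_child(parent: ParseNode, child: ParseNode) -> None:
    """Toy grammar rule: a node decomposes only into nodes one level below."""
    if LEVELS.index(child.level) != LEVELS.index(parent.level) + 1:
        raise ValueError(f"{parent.level} cannot decompose into {child.level}")
    parent.children.append(child)


def inconsistencies(node: ParseNode) -> List[Tuple[str, str, str]]:
    """Collect (attribute, parent value, child value) conflicts in the subtree.

    A child that carries an attribute also set on its parent must agree with it.
    """
    conflicts = []
    for child in node.children:
        for name, value in node.attributes.items():
            if name in child.attributes and child.attributes[name] != value:
                conflicts.append((name, value, child.attributes[name]))
        conflicts.extend(inconsistencies(child))
    return conflicts


tracklet = ParseNode("tracklet", {"bag": "yes", "activity": "walking"})
add_child(tracklet, ParseNode("3d_box", {"bag": "yes"}))
add_child(tracklet, ParseNode("3d_box", {"bag": "no"}))
print(inconsistencies(tracklet))  # [('bag', 'yes', 'no')]
```

In this toy version a parse graph is feasible only when `inconsistencies` returns an empty list; a probabilistic model would presumably score such disagreements rather than reject them outright.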
- Award ID(s):
- 1657600
- Publication Date:
- NSF-PAR ID:
- 10056964
- Journal Name:
- IEEE Transactions on Circuits and Systems for Video Technology
- ISSN:
- 1558-2205
- Sponsoring Org:
- National Science Foundation
More Like this
-
Attributed network embedding aims to learn low-dimensional node representations by combining the network's topological structure with node attributes. Most existing methods either propagate the attributes over the network structure or learn the node representations with an encoder-decoder framework. However, propagation-based methods tend to favor network structure over node attributes, whereas encoder-decoder methods tend to ignore longer connections beyond the immediate neighbors. To address these limitations while enjoying the best of both worlds, we design cross fusion layers for unsupervised attributed network embedding. Specifically, we first construct two separate views to handle…
-
This paper first proposes a method of formulating model interpretability in visual understanding tasks based on the idea of unfolding latent structures. It then presents a case study in object detection using popular two-stage region-based convolutional neural network (i.e., R-CNN) detection systems. The proposed method focuses on weakly-supervised extractive rationale generation, that is, learning to unfold latent discriminative part configurations of object instances automatically and simultaneously in detection, without using any supervision for part configurations. It utilizes a top-down hierarchical and compositional grammar model embedded in a directed acyclic AND-OR Graph (AOG) to explore and unfold the space of latent…
-
In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation. Our model directly takes 2D pose as input and learns a generalized 2D-3D mapping function. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of Bi-directional RNNs (BRNN) on top to explicitly incorporate a set of knowledge regarding human body configuration (i.e., kinematics, symmetry, motor coordination). The proposed model thus enforces high-level constraints over human poses. In learning, we develop a pose sample simulator to augment training samples in virtual camera views, which further improves…
-
Tracking humans who are interacting with other subjects or the environment remains unsolved in visual tracking, because the visibility of the human of interest in videos is unknown and might vary over time. In particular, it is still difficult for state-of-the-art human trackers to recover complete human trajectories in crowded scenes with frequent human interactions. In this work, we consider the visibility status of a subject as a fluent variable, whose change is mostly attributed to the subject’s interaction with its surroundings, e.g., crossing behind another object, entering a building, or getting into a vehicle. We introduce a…