skip to main content

Title: Joint 3D Human Shape Recovery and Pose Estimation from a Single Image with Bilayer Graph
The ability to estimate the 3D human shape and pose from images can be useful in many contexts. Recent approaches have explored using graph convolutional networks and achieved promising results. The fact that the 3D shape is represented by a mesh, an undirected graph, makes graph convolutional networks a natural fit for this problem. However, graph convolutional networks have limited representation power Information from nodes in the graph is passed to connected neighbors, and propagation of information requires successive graph convolutions. To overcome this limitation, we propose a dual-scale graph approach. We use a coarse graph, derived from a dense graph, to estimate the human’s 3D pose, and the dense graph to estimate the 3D shape. Information in coarse graphs can be propagated over longer distances compared to dense graphs. In addition, information about pose can guide to recover local shape detail and vice versa. We recognize that the connection between coarse and dense is itself a graph, and introduce graph fusion blocks to exchange information between graphs with different scales. We train our model end-to-end and show that we can achieve state-of-the-art results for several evaluation datasets. The code is available at the following link,
; ;
Award ID(s):
Publication Date:
Journal Name:
2021 International Conference on 3D Vision (3DV)
Page Range or eLocation-ID:
505 to 514
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract We define a novel type of ensemble graph convolutional network (GCN) model. Using optimized linear projection operators to map between spatial scales of graph, this ensemble model learns to aggregate information from each scale for its final prediction. We calculate these linear projection operators as the infima of an objective function relating the structure matrices used for each GCN. Equipped with these projections, our model (a Graph Prolongation-Convolutional Network) outperforms other GCN ensemble models at predicting the potential energy of monomer subunits in a coarse-grained mechanochemical simulation of microtubule bending. We demonstrate these performance gains by measuring an estimate of the Floating Point OPerations spent to train each model, as well as wall-clock time. Because our model learns at multiple scales, it is possible to train at each scale according to a predetermined schedule of coarse vs. fine training. We examine several such schedules adapted from the algebraic multigrid literature, and quantify the computational benefit of each. We also compare this model to another model which features an optimized coarsening of the input graph. Finally, we derive backpropagation rules for the input of our network model with respect to its output, and discuss how our method may be extendedmore »to very large graphs.« less
  2. Learning the low-dimensional representations of graphs (i.e., network embedding) plays a critical role in network analysis and facilitates many downstream tasks. Recently graph convolutional networks (GCNs) have revolutionized the field of network embedding, and led to state-of-the-art performance in network analysis tasks such as link prediction and node classification. Nevertheless, most of the existing GCN-based network embedding methods are proposed for unsigned networks. However, in the real world, some of the networks are signed, where the links are annotated with different polarities, e.g., positive vs. negative. Since negative links may have different properties from the positive ones and can also significantly affect the quality of network embedding. Thus in this paper, we propose a novel network embedding framework SNEA to learn Signed Network Embedding via graph Attention. In particular, we propose a masked self-attentional layer, which leverages self-attention mechanism to estimate the importance coefficient for pair of nodes connected by different type of links during the embedding aggregation process. Then SNEA utilizes the masked self-attentional layers to aggregate more important information from neighboring nodes to generate the node embeddings based on balance theory. Experimental results demonstrate the effectiveness of the proposed framework through signed link prediction task on several real-worldmore »signed network datasets.« less
  3. 3D object recognition accuracy can be improved by learning the multi-scale spatial features from 3D spatial geometric representations of objects such as point clouds, 3D models, surfaces, and RGB-D data. Current deep learning approaches learn such features either using structured data representations (voxel grids and octrees) or from unstructured representations (graphs and point clouds). Learning features from such structured representations is limited by the restriction on resolution and tree depth while unstructured representations creates a challenge due to non-uniformity among data samples. In this paper, we propose an end-to-end multi-level learning approach on a multi-level voxel grid to overcome these drawbacks. To demonstrate the utility of the proposed multi-level learning, we use a multi-level voxel representation of 3D objects to perform object recognition. The multi-level voxel representation consists of a coarse voxel grid that contains volumetric information of the 3D object. In addition, each voxel in the coarse grid that contains a portion of the object boundary is subdivided into multiple fine-level voxel grids. The performance of our multi-level learning algorithm for object recognition is comparable to dense voxel representations while using significantly lower memory.
  4. In this paper, we propose a supervised graph representation learning method to model the relationship between brain functional connectivity (FC) and structural connectivity (SC) through a graph encoder-decoder system. The graph convolutional network (GCN) model is leveraged in the encoder to learn lower-dimensional node representations (i.e. node embeddings) integrating information from both node attributes and network topology. In doing so, the encoder manages to capture both direct and indirect interactions between brain regions in the node embeddings which later help reconstruct empirical FC networks. From node embeddings, graph representations are learnt to embed the entire graphs into a vector space. Our end-to-end model utilizes a multi-objective loss function to simultaneously learn node representations for FC network reconstruction and graph representations for subject classification. The experiment on a large population of non-drinkers and heavy drinkers shows that our model can provide a characterization of the population pattern in the SC-FC relationship, while also learning features that capture individual uniqueness for subject classification. The identified key brain subnetworks show significant between-group difference and support the promising prospect of GCN-based graph representation learning on brain networks to model human brain activity and function.
  5. Event-based cameras have been designed for scene motion perception - their high temporal resolution and spatial data sparsity converts the scene into a volume of boundary trajectories and allows to track and analyze the evolution of the scene in time. Analyzing this data is computationally expensive, and there is substantial lack of theory on dense-in-time object motion to guide the development of new algorithms; hence, many works resort to a simple solution of discretizing the event stream and converting it to classical pixel maps, which allows for application of conventional image processing methods. In this work we present a Graph Convolutional neural network for the task of scene motion segmentation by a moving camera. We convert the event stream into a 3D graph in (x,y,t) space and keep per-event temporal information. The difficulty of the task stems from the fact that unlike in metric space, the shape of an object in (x,y,t) space depends on its motion and is not the same across the dataset. We discuss properties of of the event data with respect to this 3D recognition problem, and show that our Graph Convolutional architecture is superior to PointNet++. We evaluate our method on the state of themore »art event-based motion segmentation dataset - EV-IMO and perform comparisons to a frame-based method proposed by its authors. Our ablation studies show that increasing the event slice width improves the accuracy, and how subsampling and edge configurations affect the network performance.« less