Title: Joint 3D Human Shape Recovery and Pose Estimation from a Single Image with Bilayer Graph
The ability to estimate the 3D human shape and pose from images can be useful in many contexts. Recent approaches have explored using graph convolutional networks and achieved promising results. The fact that the 3D shape is represented by a mesh, an undirected graph, makes graph convolutional networks a natural fit for this problem. However, graph convolutional networks have limited representation power. Information from nodes in the graph is passed only to connected neighbors, so propagating information over longer distances requires successive graph convolutions. To overcome this limitation, we propose a dual-scale graph approach. We use a coarse graph, derived from a dense graph, to estimate the human’s 3D pose, and the dense graph to estimate the 3D shape. Information in coarse graphs can be propagated over longer distances compared to dense graphs. In addition, information about pose can guide the recovery of local shape detail and vice versa. We recognize that the connection between coarse and dense is itself a graph, and introduce graph fusion blocks to exchange information between graphs with different scales. We train our model end-to-end and show that we can achieve state-of-the-art results on several evaluation datasets. The code is available at the following link,
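The propagation argument in the abstract can be illustrated with a toy sketch. This is not the authors' model: the chain graph standing in for a mesh, the index-based grouping used to coarsen it, and the function names `chain_graph`, `coarsen`, and `layers_to_reach` are all assumptions made for illustration. The sketch only counts how many one-hop message-passing layers are needed before one node can influence another, on a dense graph and on a coarse graph derived from it.

```python
def chain_graph(n):
    """Adjacency list of a path graph with n nodes (a toy stand-in for a mesh)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def coarsen(adj, group_size):
    """Merge nodes with consecutive ids into clusters (hypothetical coarsening).

    Two clusters are connected if any of their member nodes were connected.
    """
    coarse = {}
    for u, neighbors in adj.items():
        cu = u // group_size
        coarse.setdefault(cu, set())
        for v in neighbors:
            cv = v // group_size
            if cv != cu:
                coarse[cu].add(cv)
    return {u: sorted(vs) for u, vs in coarse.items()}


def layers_to_reach(adj, source, target):
    """Count one-hop propagation layers until `source` can influence `target`."""
    reached = {source}
    layers = 0
    while target not in reached:
        expanded = reached | {v for u in reached for v in adj[u]}
        if expanded == reached:
            raise ValueError("target is not reachable from source")
        reached = expanded
        layers += 1
    return layers


dense = chain_graph(16)       # 16 nodes in a line
coarse = coarsen(dense, 4)    # 4 clusters of 4 nodes each

dense_layers = layers_to_reach(dense, 0, 15)
coarse_layers = layers_to_reach(coarse, 0, 3)
print(dense_layers, coarse_layers)
```

In this toy setting the two end nodes need far fewer layers to exchange information on the coarse graph than on the dense one, which is the intuition behind estimating pose on the coarse graph and shape detail on the dense graph.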
Journal Name: 2021 International Conference on 3D Vision (3DV)
Page Range / eLocation ID: 505 to 514
Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. In this paper, we propose a supervised graph representation learning method to model the relationship between brain functional connectivity (FC) and structural connectivity (SC) through a graph encoder-decoder system. The graph convolutional network (GCN) model is leveraged in the encoder to learn lower-dimensional node representations (i.e., node embeddings) integrating information from both node attributes and network topology. In doing so, the encoder manages to capture both direct and indirect interactions between brain regions in the node embeddings, which later help reconstruct empirical FC networks. From node embeddings, graph representations are learnt to embed the entire graphs into a vector space. Our end-to-end model utilizes a multi-objective loss function to simultaneously learn node representations for FC network reconstruction and graph representations for subject classification. The experiment on a large population of non-drinkers and heavy drinkers shows that our model can provide a characterization of the population pattern in the SC-FC relationship, while also learning features that capture individual uniqueness for subject classification. The identified key brain subnetworks show significant between-group differences and support the promising prospect of GCN-based graph representation learning on brain networks to model human brain activity and function.
  2. Robotic grasping typically follows five stages: object detection, object localisation, object pose estimation, grasp pose estimation, and grasp planning. We focus on object pose estimation. Our approach relies on three pieces of information: multiple views of the object, the camera’s extrinsic parameters at those viewpoints, and 3D CAD models of objects. The first step involves a standard deep learning backbone (FCN ResNet) to estimate the object label, semantic segmentation, and a coarse estimate of the object pose with respect to the camera. Our novelty is using a refinement module that starts from the coarse pose estimate and refines it by optimisation through differentiable rendering. This is a purely vision-based approach that avoids the need for other information such as point clouds or depth images. We evaluate our object pose estimation approach on the ShapeNet dataset and show improvements over the state of the art. We also show that the estimated object pose results in 99.65% grasp accuracy with the ground truth grasp candidates on the Object Clutter Indoor Dataset (OCID) Grasp dataset, as computed using standard practice.
  3. 3D object recognition accuracy can be improved by learning the multi-scale spatial features from 3D spatial geometric representations of objects such as point clouds, 3D models, surfaces, and RGB-D data. Current deep learning approaches learn such features either from structured data representations (voxel grids and octrees) or from unstructured representations (graphs and point clouds). Learning features from structured representations is limited by the restriction on resolution and tree depth, while unstructured representations create a challenge due to non-uniformity among data samples. In this paper, we propose an end-to-end multi-level learning approach on a multi-level voxel grid to overcome these drawbacks. To demonstrate the utility of the proposed multi-level learning, we use a multi-level voxel representation of 3D objects to perform object recognition. The multi-level voxel representation consists of a coarse voxel grid that contains volumetric information of the 3D object. In addition, each voxel in the coarse grid that contains a portion of the object boundary is subdivided into multiple fine-level voxel grids. The performance of our multi-level learning algorithm for object recognition is comparable to dense voxel representations while using significantly less memory.
  4. Measuring the importance of nodes in a graph is one of the key aspects of graph analysis. Betweenness centrality (BC) measures the amount of influence that a node has over the flow of information in a graph. However, the computational complexity of calculating BC is extremely high for large-scale graphs. This is especially true when analyzing road networks with millions of nodes and edges. In this study, we propose a deep learning architecture, RoadCaps, to estimate BC with sub-second latencies. RoadCaps aggregates features from neighbor nodes using Graph Convolutional Networks and estimates the node-level BC by mapping low-level concepts to high-level information using Capsule Networks. Our empirical benchmarks demonstrate that RoadCaps outperforms base models such as GCN and GCNFCL in both accuracy and robustness. On average, RoadCaps generates a node’s BC value in 7.5 milliseconds.
  5. We define a novel type of ensemble graph convolutional network (GCN) model. Using optimized linear projection operators to map between spatial scales of a graph, this ensemble model learns to aggregate information from each scale for its final prediction. We calculate these linear projection operators as the infima of an objective function relating the structure matrices used for each GCN. Equipped with these projections, our model (a Graph Prolongation-Convolutional Network) outperforms other GCN ensemble models at predicting the potential energy of monomer subunits in a coarse-grained mechanochemical simulation of microtubule bending. We demonstrate these performance gains by measuring an estimate of the floating-point operations (FLOPs) spent to train each model, as well as wall-clock time. Because our model learns at multiple scales, it is possible to train at each scale according to a predetermined schedule of coarse vs. fine training. We examine several such schedules adapted from the algebraic multigrid literature, and quantify the computational benefit of each. We also compare this model to another model which features an optimized coarsening of the input graph. Finally, we derive backpropagation rules for the input of our network model with respect to its output, and discuss how our method may be extended to very large graphs.