
This content will become publicly available on June 1, 2023

Title: Learning Local-Global Contextual Adaptation for Multi-Person Pose Estimation
This paper studies the problem of multi-person pose estimation in a bottom-up fashion. Motivated by the observation that, under ideal conditions, the localization error of the center-offset formulation can be remedied by a local-window search scheme, we propose a multi-person pose estimation approach, dubbed LOGO-CAP, that learns LOcal-GlObal Contextual Adaptation for human Pose. Specifically, our approach first learns keypoint attraction maps (KAMs) from local keypoint expansion maps (KEMs) in small local windows; the KAMs are then treated as dynamic convolutional kernels applied to keypoint-focused global heatmaps for contextual adaptation, achieving accurate multi-person pose estimation. Our method is end-to-end trainable with near real-time inference speed in a single forward pass, obtaining state-of-the-art performance on the COCO keypoint benchmark for bottom-up human pose estimation. With the COCO-trained model, our method also outperforms prior art by a large margin on the challenging OCHuman dataset.
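The local-then-global refinement described above can be illustrated with a minimal sketch. Note that `refine_keypoint`, the window size, and the element-wise use of the KAM as a re-weighting kernel are illustrative assumptions, not the paper's actual learned dynamic-convolution implementation: the idea shown is only that a coarse center-offset estimate is corrected inside a small window of the heatmap, with the KAM modulating the local scores before a soft-argmax readout.

```python
import numpy as np

def refine_keypoint(heatmap, init_xy, kam, window=5):
    """Refine a coarse center-offset keypoint estimate (hypothetical helper).

    1) Local step: restrict attention to a small window around the initial
       guess (the local-window remedy for the localization issue).
    2) Contextual step: use the keypoint attraction map (KAM) as a dynamic
       kernel re-weighting the local heatmap scores, then take a soft-argmax
       of the re-weighted window for a sub-pixel keypoint location.
    """
    h, w = heatmap.shape
    x0, y0 = init_xy
    r = window // 2
    ys = slice(max(0, y0 - r), min(h, y0 + r + 1))
    xs = slice(max(0, x0 - r), min(w, x0 + r + 1))
    patch = heatmap[ys, xs]

    # Element-wise modulation stands in for the paper's dynamic convolution;
    # crop the KAM if the window was clipped at the image border.
    kam = kam[: patch.shape[0], : patch.shape[1]]
    scores = patch * kam
    scores = scores - scores.min()
    if scores.sum() == 0:
        return float(x0), float(y0)  # degenerate window: keep the initial guess
    scores = scores / scores.sum()

    # Soft-argmax: expected (x, y) under the normalized score distribution.
    gy, gx = np.mgrid[ys, xs]
    return float((gx * scores).sum()), float((gy * scores).sum())
```

With a uniform KAM this reduces to a plain soft-argmax over the local window; a learned, keypoint-specific KAM would sharpen or shift the local distribution using global context.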
Award ID(s):
1909644 2024688 2013451
Sponsoring Org:
National Science Foundation
More Like this
  1. Tescher, Andrew G.; Ebrahimi, Touradj (Eds.)
    Vehicle pose estimation is useful for applications such as self-driving cars, traffic monitoring, and scene analysis. Recent developments in computer vision and deep learning have achieved significant progress in human pose estimation, but little of this work has been applied to vehicle pose. We propose VehiPose, an efficient architecture for vehicle pose estimation, based on a multi-scale deep learning approach that achieves high-accuracy vehicle pose estimation while maintaining manageable network complexity and modularity. The VehiPose architecture combines an encoder-decoder architecture with a waterfall atrous convolution module for multi-scale feature representation. Our approach aims to reduce the loss due to successive pooling layers and preserve the multi-scale contextual and spatial information in the encoder feature representations. The waterfall module generates multi-scale features, as it leverages the efficiency of progressive filtering while maintaining wider fields-of-view through the concatenation of multiple features. This multi-scale approach results in a robust vehicle pose estimation architecture that incorporates contextual information across scales and performs the localization of vehicle keypoints in an end-to-end trainable network.
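The waterfall-style cascade of atrous (dilated) convolutions described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the VehiPose module itself: `atrous_conv2d`, `waterfall_module`, the single-channel input, and the dilation rates (1, 2, 4) are all hypothetical choices; the real module operates on multi-channel feature maps with learned kernels.

```python
import numpy as np

def atrous_conv2d(x, k, dilation):
    """'Same'-padded 2D correlation of a single-channel map x with a
    dilated 3x3 kernel k (illustrative, not optimized)."""
    r = dilation  # effective radius of a dilated 3x3 kernel
    xp = np.pad(x, r)
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            # Tap offsets are spaced `dilation` pixels apart.
            dy, dx = i * dilation, j * dilation
            out += k[i, j] * xp[dy : dy + x.shape[0], dx : dx + x.shape[1]]
    return out

def waterfall_module(x, kernels, dilations=(1, 2, 4)):
    """Hypothetical waterfall sketch: unlike a parallel spatial pyramid,
    each branch filters the PREVIOUS branch's output (progressive
    filtering), and all branch outputs are kept and stacked, so later
    branches see progressively wider fields-of-view."""
    feats, cur = [], x
    for k, d in zip(kernels, dilations):
        cur = atrous_conv2d(cur, k, d)
        feats.append(cur)
    return np.stack(feats)  # (num_branches, H, W) multi-scale feature stack
```

The cascade-plus-concatenation structure is what distinguishes a waterfall module from a standard atrous spatial pyramid, where every branch filters the same input independently.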
  2. Reconstructing 4D vehicular activity (3D space and time) from cameras is useful for autonomous vehicles, commuters and local authorities to plan for smarter and safer cities. Traffic is inherently repetitious over long periods, yet current deep learning-based 3D reconstruction methods have not considered such repetitions and have difficulty generalizing to new intersection-installed cameras. We present a novel approach exploiting longitudinal (long-term) repetitious motion as self-supervision to reconstruct 3D vehicular activity from a video captured by a single fixed camera. Starting from off-the-shelf 2D keypoint detections, our algorithm optimizes 3D vehicle shapes and poses, and then clusters their trajectories in 3D space. The 2D keypoints and trajectory clusters accumulated over the long term are later used to improve the 2D and 3D keypoints via self-supervision without any human annotation. Our method improves reconstruction accuracy over the state of the art on scenes with a significant visual difference from the keypoint detector’s training data, and has many applications including velocity estimation, anomaly detection and vehicle counting. We demonstrate results on traffic videos captured at multiple city intersections, collected using our smartphones, YouTube, and other public datasets.
  3. We propose UniPose, a unified framework for human pose estimation, based on our “Waterfall” Atrous Spatial Pooling architecture, that achieves state-of-the-art results on several pose estimation metrics. Current pose estimation methods utilizing standard CNN architectures heavily rely on statistical postprocessing or predefined anchor poses for joint localization. UniPose incorporates contextual segmentation and joint localization to estimate the human pose in a single stage, with high accuracy, without relying on statistical postprocessing methods. The Waterfall module in UniPose leverages the efficiency of progressive filtering in the cascade architecture, while maintaining multi-scale fields-of-view comparable to spatial pyramid configurations. Additionally, our method is extended to UniPoseLSTM for multi-frame processing and achieves state-of-the-art results for temporal pose estimation in video. Our results on multiple datasets demonstrate that UniPose, with a ResNet backbone and Waterfall module, is a robust and efficient architecture for pose estimation, obtaining state-of-the-art results in single-person pose detection for both single images and videos.
  4. Prior work on 6-DoF object pose estimation has largely focused on instance-level processing, in which a textured CAD model is available for each object being detected. Category-level 6-DoF pose estimation represents an important step toward developing robotic vision systems that operate in unstructured, real-world scenarios. In this work, we propose a single-stage, keypoint-based approach for category-level object pose estimation that operates on unknown object instances within a known category using a single RGB image as input. The proposed network performs 2D object detection, detects 2D keypoints, estimates 6-DoF pose, and regresses relative bounding cuboid dimensions. These quantities are estimated in a sequential fashion, leveraging the recent idea of convGRU for propagating information from easier tasks to those that are more difficult. We favor simplicity in our design choices: generic cuboid vertex coordinates, single-stage network, and monocular RGB input. We conduct extensive experiments on the challenging Objectron benchmark, outperforming state-of-the-art methods on the 3D IoU metric (27.6% higher than the MobilePose single-stage approach and 7.1% higher than the related two-stage approach).
  5. Risks from human intervention in the climate system are raising concerns with respect to individual species and ecosystem health and resiliency. A dominant approach uses global climate models to predict changes in climate in the coming decades and then to downscale this information to assess impacts to plant communities, animal habitats, agricultural and urban ecosystems, and other parts of the Earth’s life system. To achieve robust assessments of the threats to these systems in this top-down, outcome vulnerability approach, however, requires skillful prediction, and representation of changes in regional and local climate processes, which has not yet been satisfactorily achieved. Moreover, threats to biodiversity and ecosystem function, such as from invasive species, are, in general, not adequately included in the assessments. We discuss a complementary assessment framework that builds on a bottom-up vulnerability concept that requires the determination of the major human and natural forcings on the environment including extreme events, and the interactions between these forcings. After these forcings and interactions are identified, then the relative risks of each issue can be compared with other risks or forcings in order to adopt optimal mitigation/adaptation strategies. This framework is a more inclusive way of assessing risks, including climate variability and longer-term natural and anthropogenic-driven change, than the outcome vulnerability approach which is mainly based on multi-decadal global and regional climate model predictions. We therefore conclude that the top-down approach alone is outmoded as it is inadequate for robustly assessing risks to biodiversity and ecosystem function. In contrast, the bottom-up, integrative approach is feasible and much more in line with the needs of the assessment and conservation community.
A key message of our paper is to emphasize the need to consider coupled feedbacks since the Earth is a dynamically interactive system. This should be done not just in the model structure, but also in its application and subsequent analyses. We recognize that the community is moving toward that goal and we urge an accelerated pace.