Title: Efficient Vision Transformer for Human Pose Estimation via Patch Selection (British Machine Vision Conference)
While Convolutional Neural Networks (CNNs) have been widely successful in 2D human pose estimation, Vision Transformers (ViTs) have emerged as a promising alternative to CNNs, boosting state-of-the-art performance. However, the quadratic computational complexity of ViTs has limited their applicability for processing high-resolution images. In this paper, we propose three methods for reducing ViT’s computational complexity, which are based on selecting and processing a small number of most informative patches while disregarding others. The first two methods leverage a lightweight pose estimation network to guide the patch selection process, while the third method utilizes a set of learnable joint tokens to ensure that the selected patches contain the most important information about body joints. Experiments across six benchmarks show that our proposed methods achieve a significant reduction in computational complexity, ranging from 30% to 44%, with only a minimal drop in accuracy between 0% and 3.5%.
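The core idea of scoring patches and keeping only the most informative ones before the full ViT encoder can be illustrated with a short sketch. The snippet below is a minimal, generic top-k patch selector; the scoring head, keep ratio, and module names are illustrative assumptions, not the exact guidance network or joint-token mechanism proposed in the paper.

```python
# Minimal sketch of score-based patch selection ahead of a ViT encoder.
# The lightweight scorer and keep ratio are illustrative stand-ins.
import torch
import torch.nn as nn

class PatchSelector(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.6):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Lightweight scoring head standing in for the guidance network.
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, N, D) patch embeddings
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.scorer(tokens).squeeze(-1)        # (B, N) per-patch scores
        top_idx = scores.topk(k, dim=1).indices         # (B, k) most informative patches
        idx = top_idx.unsqueeze(-1).expand(-1, -1, D)   # (B, k, D)
        selected = tokens.gather(1, idx)                # keep only selected patches
        return selected, top_idx

if __name__ == "__main__":
    x = torch.randn(2, 256, 192)            # 2 images, 256 patches, embedding dim 192
    selected, idx = PatchSelector(192)(x)
    print(selected.shape)                   # torch.Size([2, 153, 192])
```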
Award ID(s):
2124277
PAR ID:
10540592
Author(s) / Creator(s):
Publisher / Repository:
British Machine Vision Conference
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Vision transformers (ViTs) have recently set off a new wave in neural architecture design thanks to their record-breaking performance in various vision tasks. In parallel, to fulfill the goal of deploying ViTs in real-world vision applications, their robustness against potential malicious attacks has gained increasing attention. In particular, recent works show that ViTs are more robust against adversarial attacks than convolutional neural networks (CNNs), and conjecture that this is because ViTs focus more on capturing global interactions among different input/feature patches, leading to improved robustness to the local perturbations imposed by adversarial attacks. In this work, we ask an intriguing question: “Under what kinds of perturbations do ViTs become more vulnerable learners compared to CNNs?” Driven by this question, we first conduct a comprehensive experiment on the robustness of both ViTs and CNNs under various existing adversarial attacks to understand the underlying reasons for their robustness. Based on the drawn insights, we then propose a dedicated attack framework, dubbed Patch-Fool, that fools the self-attention mechanism by attacking its basic component (i.e., a single patch) with a series of attention-aware optimization techniques. Interestingly, our Patch-Fool framework shows for the first time that ViTs are not necessarily more robust than CNNs against adversarial perturbations. In particular, we find that ViTs are more vulnerable learners than CNNs against our Patch-Fool attack, a finding that is consistent across extensive experiments. Observations from Sparse and Mild Patch-Fool, two variants of Patch-Fool, further indicate that the perturbation density and strength on each patch seem to be the key factors influencing the robustness ranking between ViTs and CNNs. We expect our Patch-Fool framework to shed light on both future architecture designs and training schemes for robustifying ViTs toward their real-world deployment. Our code is available at https://github.com/RICE-EIC/Patch-Fool.
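As a rough illustration of the single-patch attack idea described above, the sketch below perturbs only the pixels of one chosen patch with gradient ascent on the classification loss. The patch choice, step size, and iteration count are assumptions for illustration; Patch-Fool's attention-aware patch selection and its full optimization procedure are not reproduced here.

```python
# Minimal sketch of a single-patch adversarial perturbation: only the pixels
# of one chosen patch are modified, mirroring the spirit of Patch-Fool.
import torch
import torch.nn.functional as F

def single_patch_attack(model, image, label, patch_idx, patch=16, steps=20, step_size=8 / 255):
    # image: (1, 3, H, W) in [0, 1]; label: (1,) long tensor; patch_idx is row-major.
    _, _, H, W = image.shape
    cols = W // patch
    r, c = divmod(patch_idx, cols)
    mask = torch.zeros_like(image)
    mask[:, :, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 1.0

    adv = image.clone()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(adv), label)      # maximize classification loss
        grad, = torch.autograd.grad(loss, adv)
        # Gradient ascent restricted to the chosen patch; no epsilon-ball projection,
        # reflecting a dense per-patch perturbation rather than a global L_inf attack.
        adv = (adv + step_size * grad.sign() * mask).clamp(0, 1)
    return adv.detach()
```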
  2. Vision Transformers (ViTs) are built on the assumption of treating image patches as “visual tokens” and learn patch-to-patch attention. The patch-embedding-based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic-complexity issue and also makes it non-trivial to explain learned ViTs. To address these issues, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT start with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and more interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation, and MIT-ADE20k semantic segmentation. Compared with the prior art, it obtains better performance than Swin [32] and the PVTs [47], [48] on all three benchmarks, by significant margins on ImageNet-1k and MIT-ADE20k. It is also significantly more efficient than the PVT models on MS-COCO and MIT-ADE20k thanks to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at https://github.com/iVMCL/PaCaViT.
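A hedged sketch of the patch-to-cluster idea follows: queries come from patches while keys and values come from a small set of cluster tokens, so attention cost scales with the number of clusters rather than quadratically with the number of patches. The clustering head, head count, and normalization choice below are illustrative assumptions rather than the exact PaCa module.

```python
# Minimal sketch of patch-to-cluster attention: N patch queries attend to M
# cluster tokens, giving O(N*M) cost instead of O(N^2).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchToClusterAttention(nn.Module):
    def __init__(self, dim: int, num_clusters: int = 49, num_heads: int = 4):
        super().__init__()
        self.h = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.cluster_head = nn.Linear(dim, num_clusters)  # soft cluster assignment (assumed form)
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        B, N, D = x.shape
        # Soft assignments (B, N, M) -> cluster tokens (B, M, D) as weighted patch averages.
        assign = F.softmax(self.cluster_head(x), dim=1)
        clusters = assign.transpose(1, 2) @ x

        q = self.q(x).view(B, N, self.h, -1).transpose(1, 2)        # (B, h, N, d)
        k, v = self.kv(clusters).chunk(2, dim=-1)
        k = k.view(B, -1, self.h, D // self.h).transpose(1, 2)      # (B, h, M, d)
        v = v.view(B, -1, self.h, D // self.h).transpose(1, 2)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.proj(out)

if __name__ == "__main__":
    y = PatchToClusterAttention(128)(torch.randn(2, 196, 128))
    print(y.shape)   # torch.Size([2, 196, 128])
```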
  3. Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost compared to convolutional neural networks (CNNs); one reason is that ViTs' attention measures global similarities and thus has quadratic complexity in the number of input tokens. Existing efficient ViTs adopt local attention or linear attention, which sacrifices ViTs' capability of capturing either global or local context. In this work, we ask an important research question: can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. We further simplify it with two techniques: (1) a novel linear-angular attention mechanism, in which we decompose the angular kernels into linear terms and high-order residuals and keep only the linear terms; and (2) two parameterized modules to approximate the high-order residuals, namely a depthwise convolution and an auxiliary masked softmax attention that helps learn global and local information, where the masks for the softmax attention are regularized to gradually become zeros and thus incur no overhead during inference. Extensive experiments validate the effectiveness of our Castling-ViT, e.g., achieving up to 1.8% higher accuracy or a 40% MACs reduction on classification and a 1.2 higher mAP on detection under comparable FLOPs, compared to ViTs with vanilla softmax-based attention. The project page is available here.
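The complexity argument behind linear attention variants such as the one above can be seen in a small sketch: if the key-value product is computed first, the cost grows linearly with the token count. The elu-plus-one feature map used here is a generic stand-in, not the paper's linear-angular kernel or its masked auxiliary attention.

```python
# Minimal sketch of linear attention: computing K^T V first makes the cost
# O(N * d^2) rather than O(N^2 * d) in the number of tokens N.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, heads, N, d)
    q = F.elu(q) + 1.0                      # positive feature map (illustrative choice)
    k = F.elu(k) + 1.0
    kv = k.transpose(-2, -1) @ v            # (B, h, d, d), independent of N^2
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (B, h, N, 1) normalizer
    return (q @ kv) / (z + eps)

if __name__ == "__main__":
    q = torch.randn(2, 4, 196, 32)
    print(linear_attention(q, q, q).shape)  # torch.Size([2, 4, 196, 32])
```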
  4. Image classification in remote sensing and geographic information system (GIS) data containing various land cover classes is essential for efficient and sustainable land use estimation and for other tasks such as object detection, localization, and segmentation. Deep learning (DL) techniques have shown tremendous potential in the GIS domain. While convolutional neural networks (CNNs) have dominated image analysis, transformers have proven to be a unifying solution for several AI-based processing pipelines. Vision transformers (ViTs) can achieve comparable, and in some cases better, accuracy than CNNs. However, they suffer from a significant drawback: the excessive use of trainable parameters. Using trainable parameters judiciously can have multiple advantages, ranging from better model scalability to explainability, and can significantly ease model deployment on edge devices with limited resources, such as drones. In this research, we explore, without using pre-trained weights, how the inherent structure of vision transformers behaves with custom modifications. To verify our proposed approach, these architectures are trained on multiple land cover datasets. Experiments reveal that a combination of lightweight convolutional layers, including ShuffleNet, along with depthwise separable convolutions and average pooling, can reduce the trainable parameters by 17.85% and yet achieve higher accuracy than the base mobile vision transformer (MViT). It is also observed that combining convolution layers with multi-headed self-attention layers in the MViT variants better captures local and global features, unlike the standalone ViT architecture, which uses almost 95% more parameters than the proposed MViT variant.
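For reference, a depthwise separable convolution block of the kind mentioned above might look like the sketch below. The channel sizes, normalization, and the use of average pooling are illustrative assumptions, not the exact MViT-variant configuration evaluated in the paper.

```python
# Minimal sketch of a depthwise separable convolution block with average
# pooling, the kind of lightweight building block the abstract refers to.
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Depthwise 3x3 (one filter per channel) followed by a 1x1 pointwise mix.
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.AvgPool2d(2)   # spatial downsampling via average pooling

    def forward(self, x):
        return self.pool(self.act(self.bn(self.pointwise(self.depthwise(x)))))

if __name__ == "__main__":
    print(DepthwiseSeparableBlock(32, 64)(torch.randn(1, 32, 64, 64)).shape)
    # torch.Size([1, 64, 32, 32])
```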
  5. Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs’ self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and their broader application to resource-constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs, because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns without severely hurting model accuracy (e.g., <=1.5% accuracy drop under a 90% pruning ratio), while NLP Transformers need to handle input sequences with varying numbers of tokens and rely on on-the-fly prediction of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. On the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns, regularizing two levels of workloads without hurting accuracy; this largely reduces the attention computations while leaving room for alleviating the remaining dominant data movements. On top of that, we further integrate a lightweight and learnable auto-encoder module to trade the dominant, high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator that simultaneously coordinates the enforced denser and sparser workloads for boosted hardware utilization, while integrating on-chip encoder and decoder engines to leverage ViTCoD’s algorithm pipeline for much reduced data movements. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3×, 142.9×, and 86.0× over general-purpose CPUs, EdgeGPUs, and GPUs, and of up to 10.1× and 6.8× over the prior-art Transformer accelerators SpAtten and Sanger, respectively, under an attention sparsity of 90%. Our code implementation is available at https://github.com/GATECH-EIC/ViTCoD.
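The prune-and-polarize step can be pictured with a small sketch that derives a fixed attention mask offline: a few highly attended key columns are kept dense, and each query keeps only its strongest remaining links. The ranking heuristic and thresholds below are illustrative assumptions, not ViTCoD's actual algorithm or its auto-encoder and hardware components.

```python
# Minimal sketch of deriving a fixed "denser vs. sparser" attention mask from
# an averaged attention map, in the spirit of prune-and-polarize.
import torch

def polarize_attention_mask(attn, dense_frac=0.1, sparse_keep=0.05):
    # attn: (N, N) attention map averaged over a calibration set.
    N = attn.shape[0]
    col_score = attn.sum(dim=0)                      # how much each key token is attended to
    n_dense = max(1, int(N * dense_frac))
    dense_cols = col_score.topk(n_dense).indices     # "denser" columns, kept entirely

    mask = torch.zeros_like(attn, dtype=torch.bool)
    mask[:, dense_cols] = True
    k = max(1, int(N * sparse_keep))
    row_thresh = attn.topk(k, dim=1).values[:, -1:]  # k-th strongest link per query row
    mask |= attn >= row_thresh                       # "sparser" part keeps only those links
    return mask                                      # fixed pattern, mostly zeros

if __name__ == "__main__":
    mask = polarize_attention_mask(torch.rand(196, 196).softmax(dim=-1))
    print(f"sparsity: {1 - mask.float().mean().item():.2f}")
```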