Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

Chowdhury, Arpita; Paul, Dipanjyoti; Mai, Zheda; Gu, Jianyang; Zhang, Ziheng; Mehrab, Kazi Sajeed; Campolongo, Elizabeth G; Rubenstein, Daniel; Stewart, Charles V; Karpatne, Anuj; Berger-Wolf, Tanya; Su, Yu; Chao, Wei-Lun


This content will become publicly available on June 1, 2026


We present a simple approach to make pre-trained Vision Transformers (ViTs) interpretable for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as bird species. Pre-trained ViTs, such as DINO, have demonstrated remarkable capabilities in extracting localized, discriminative features. However, saliency maps like Grad-CAM often fail to identify these traits, producing blurred, coarse heatmaps that highlight entire objects instead. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this limitation. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To correctly classify an image, the true-class prompt must attend to unique image patches not present in other classes' images (i.e., traits). As a result, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a "free lunch," requiring only a modification to the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM easy to train and apply, in stark contrast to other interpretable methods that require designing specific models and training processes. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate the superior interpretation capability of Prompt-CAM. The source code and demo are available at https://github.com/Imageomics/Prompt_CAM.
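The mechanism described in the abstract (one learnable prompt per class, classification from each prompt's output, and the predicted class's attention over patches read as a trait map) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a single attention step with a single head instead of a full multi-head ViT, uses hand-set or random vectors in place of learned prompts and DINO patch features, and assumes a shared scoring vector as the prediction head. The names `prompt_attention`, `classify`, and `score_vector` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def prompt_attention(prompts, patches):
    """One attention step: each class prompt queries the patch features.

    Returns (outputs, attn), where outputs[c] is the attended feature for
    class c and attn[c][i] is the weight prompt c places on patch i.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, attn = [], []
    for query in prompts:
        weights = softmax([dot(query, p) * scale for p in patches])
        out = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
        outputs.append(out)
        attn.append(weights)
    return outputs, attn


def classify(prompts, patches, score_vector):
    """Score every class from its own prompt output (assumed shared scorer).

    Returns the predicted class index, all logits, and the predicted
    class's attention over patches, which plays the role of the trait map.
    """
    outputs, attn = prompt_attention(prompts, patches)
    logits = [dot(score_vector, out) for out in outputs]
    predicted = max(range(len(logits)), key=logits.__getitem__)
    return predicted, logits, attn[predicted]


# Toy usage: 3 classes, 6 patches, feature dimension 4, random stand-in values.
rng = random.Random(0)
toy_prompts = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
toy_patches = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
toy_scorer = [rng.gauss(0, 1) for _ in range(4)]
toy_pred, toy_logits, toy_map = classify(toy_prompts, toy_patches, toy_scorer)
print("predicted class:", toy_pred)
print("attention over patches:", [round(w, 3) for w in toy_map])
```

In this sketch the attention row for the predicted class sums to 1 over the patches, so reshaping it to the patch grid would give a heatmap. In the method as described, the prompts are trained so that the true class's prompt attends to class-distinguishing patches, which the random values here do not capture.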

Award ID(s): 2118240

PAR ID: 10611506

Author(s) / Creator(s): Chowdhury, Arpita; Paul, Dipanjyoti; Mai, Zheda; Gu, Jianyang; Zhang, Ziheng; Mehrab, Kazi Sajeed; Campolongo, Elizabeth G; Rubenstein, Daniel; Stewart, Charles V; Karpatne, Anuj; Berger-Wolf, Tanya; Su, Yu; Chao, Wei-Lun

Publisher / Repository: Proceedings of the Computer Vision and Pattern Recognition Conference

Date Published: 2025-06-01

Page Range / eLocation ID: 4375-4385

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
The DOI is not currently available.
