<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A SIMPLE INTERPRETABLE TRANSFORMER FOR FINE-GRAINED IMAGE CLASSIFICATION AND ANALYSIS</title></titleStmt>
			<publicationStmt>
				<publisher>ICLR</publisher>
				<date>05/07/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10530247</idno>
					<idno type="doi"></idno>
					
					<author>Dipanjyoti Paul</author><author>Arpita Chowdhury</author><author>Xinqi Xiong</author><author>Feng-Ju Chang</author><author>David Carlyn</author><author>Samuel Stevens</author><author>Kaiya Provost</author><author>Anuj Karpatne</author><author>Bryan Carstens</author><author>Daniel Rubenstein</author><author>Charles Stewart</author><author>Tanya Berger-Wolf</author><author>Yu Su</author><author>Wei-Lun Chao</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We present a novel usage of Transformers to make image classification interpretable. Unlike mainstream classifiers that wait until the last fully connected layer to incorporate class information to make predictions, we investigate a proactive approach, asking each class to search for itself in an image. We realize this idea via a Transformer encoder-decoder inspired by DEtection TRansformer (DETR). We learn “class-specific” queries (one for each class) as input to the decoder, enabling each class to localize its patterns in an image via cross-attention. We name our approach INterpretable TRansformer (INTR), which is fairly easy to implement and exhibits several compelling properties. We show that INTR intrinsically encourages each class to attend distinctively; the cross-attention weights thus provide a faithful interpretation of the prediction. Interestingly, via “multi-head” cross-attention, INTR could identify different “attributes” of a class, making it particularly suitable for fine-grained classification and analysis, which we demonstrate on eight datasets. Our code and pre-trained models are publicly accessible at the Imageomics Institute GitHub site: https://github.com/Imageomics/INTR.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure 1: The eight-head cross-attention maps (column-wise) triggered by the query of the ground-truth class. Each head learns to attend to a different (across columns) but consistent (across rows) semantic cue in the image that is useful for recognizing this bird species (e.g., attributes). The exception is the last row, which shows inconsistent attention. Indeed, this is a misclassified case, showcasing how INTR interprets (wrong) predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Mainstream neural networks for image classification <ref type="bibr">(He et al., 2016;</ref><ref type="bibr">Simonyan &amp; Zisserman, 2015;</ref><ref type="bibr">Krizhevsky et al., 2017;</ref><ref type="bibr">Huang et al., 2019;</ref><ref type="bibr">Szegedy et al., 2015;</ref><ref type="bibr">Dosovitskiy et al., 2021;</ref><ref type="bibr">Liu et al., 2021)</ref> typically allocate most of their model capacity to extracting "class-agnostic" feature vectors from images, followed by a fully connected layer that compares image feature vectors with "class-specific" vectors to make predictions. While these models have achieved groundbreaking accuracy, their design cannot directly explain where the model looks when predicting a particular class.</p><p>In this paper, we investigate a proactive approach to classification, asking each class to look for itself in an image. We hypothesize that this "class-specific" search process would reveal where the model looks, offering a built-in interpretation of the prediction.</p><p>At first glance, implementing this idea may seem to require a significant redesign of the model architecture and a complex training process. However, we show that a novel usage of the Transformer encoder-decoder <ref type="bibr">(Vaswani et al., 2017)</ref> inspired by DEtection TRansformer (DETR) <ref type="bibr">(Carion et al., 2020)</ref> can essentially realize this idea, making our model fairly easy to reproduce and extend.</p><p>Concretely, the DETR encoder extracts patch-wise features from the image, and the decoder attends to them based on learnable queries. 
We propose to learn "class-specific" queries (one for each class) as input to the decoder, enabling the model to obtain "class-specific" image features via self-attention and cross-attention -self-attention encodes the contextual information among candidate classes, determining the patterns necessary to distinguish between classes; cross-attention then allows each class to look for the distinctive patterns in the image. The resulting "class-specific" image feature vectors (one for each class) are then compared with a shared "class-agnostic" vector to predict the label of the image. We name our model INterpretable TRansformer (INTR). Figure <ref type="figure">2</ref> illustrates the model architecture. In the training phase, we learn INTR by minimizing the cross-entropy loss. In the inference phase, INTR allows us to visualize the cross-attention maps triggered by different "class-specific" queries to understand why the model predicts or does not predict a particular class.</p><p>On the surface, INTR may fall into the debate of whether attention is interpretable <ref type="bibr">(Jain &amp; Wallace, 2019;</ref><ref type="bibr">Wiegreffe &amp; Pinter, 2019;</ref><ref type="bibr">Bibal et al., 2022)</ref>. However, we mathematically show that INTR offers faithful attention to distinguish between classes. In short, INTR computes logits by performing inner products between class-specific feature vectors and the shared class-agnostic vector. To classify an image correctly, the ground-truth class must obtain distinctive class-specific image features to claim the highest logit against other classes, which is possible only through distinct cross-attention weights. Minimizing the training loss thus encourages each class-specific query to produce distinct cross-attention weights. 
Manipulating the cross-attention weights at inference, as done in adversarial attacks on attention-based interpretation <ref type="bibr">(Serrano &amp; Smith, 2019)</ref>, would thus alter the prediction notably.</p><p>We extensively analyze INTR, especially its cross-attention. We find that the "multiple heads" in cross-attention could learn to identify different "attributes" of a class and consistently localize them in images, making INTR particularly well-suited for fine-grained classification. We validate this on multiple datasets, including CUB-200-2011 <ref type="bibr">(Wah et al., 2011)</ref>, Birds-525 <ref type="bibr">(Piosenka, 2023)</ref>, Oxford Pet <ref type="bibr">(Parkhi et al., 2012)</ref>, Stanford Dogs <ref type="bibr">(Khosla et al., 2011)</ref>, Stanford Cars <ref type="bibr">(Krause et al., 2013)</ref>, FGVC-Aircraft <ref type="bibr">(Maji et al., 2013)</ref>, iNaturalist-2021 <ref type="bibr">(Van Horn et al., 2021)</ref>, and Cambridge butterfly <ref type="bibr">(Montejo-Kovacevich et al., 2020)</ref>. Interestingly, by concentrating the decoder's input on visually similar classes (e.g., the mimicry in butterflies), INTR could attend to the nuances of patterns, even matching those found by biologists, suggesting its potential benefits to scientific discovery.</p><p>It is worth reiterating that INTR is built upon a widely used Transformer encoder-decoder architecture and can be easily trained end-to-end. What makes it interpretable is the novel usage: incorporating class-specific information at the decoder's input rather than its output. We view these as key strengths and contributions. They make INTR easily applicable, reproducible, and extendable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND AND RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">WHAT KIND OF INTERPRETATIONS ARE WE LOOKING FOR?</head><p>As surveyed in <ref type="bibr">(Zhang &amp; Zhu, 2018;</ref><ref type="bibr">Burkart &amp; Huber, 2021;</ref><ref type="bibr">Carvalho et al., 2019;</ref><ref type="bibr">Das &amp; Rad, 2020;</ref><ref type="bibr">Buhrmester et al., 2021;</ref><ref type="bibr">Linardatos et al., 2020)</ref>, various ways exist to explain or interpret a model's prediction (see Appendix A for more details). Among them, the most popular is localizing where the model looks for predicting a particular class. We follow this notion and focus on fine-grained classification (e.g., bird and butterfly species). That is, not only do we want to localize the coarse-grained objects (e.g., birds and butterflies), but we also want to identify the "attributes" (e.g., wing patterns) that are useful to distinguish between fine-grained classes. We note that an attribute can be decomposed into an "object part" (e.g., head, tail, wing, etc.) and a "property" (e.g., patterns on the wings), in which the former is commonly shared across all classes <ref type="bibr">(Wah et al., 2011)</ref>. We thus expect that our approach could identify the differences within a part between classes, not just localize parts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">BACKGROUND AND NOTATION</head><p>We denote an image and its ground-truth label by I and y, respectively. To perform classification over C classes, mainstream neural networks learn a feature extractor f_θ to obtain a feature map X = f_θ(I) ∈ R^{D×H×W}. Here, θ denotes the parameters; D denotes the number of channels; H and W denote the number of grids in the height and width dimensions. For instance, ResNet <ref type="bibr">(He et al., 2016)</ref> realizes f_θ by a convolutional neural network (ConvNet) with residual links; Vision Transformer (ViT) <ref type="bibr">(Dosovitskiy et al., 2021)</ref> realizes it by a Transformer encoder. Normally, this feature map is reshaped and/or pooled into a feature vector denoted by x = Vect(X), which then undergoes inner products with C class-specific vectors {w_c}_{c=1}^C. The class with the largest inner product is outputted as the predicted label,</p><p>ŷ = arg max_{c ∈ {1, · · · , C}} w_c^⊤ x. (1)</p></div>
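<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a toy illustration (not the paper's released code), the mainstream classification rule in Equation 1 can be sketched in a few lines of numpy; the dimensions D and C and all values here are hypothetical:</p><p>
```python
import numpy as np

# Toy sketch of the mainstream classification rule (Equation 1).
# All shapes and values are illustrative assumptions, not the paper's code.
rng = np.random.default_rng(0)
D, C = 16, 5                    # feature dimension, number of classes

x = rng.normal(size=D)          # pooled class-agnostic image feature, x = Vect(X)
W = rng.normal(size=(C, D))     # C class-specific vectors {w_c}, one row per class

logits = W @ x                  # inner products w_c^T x, one per class
y_hat = int(np.argmax(logits))  # Equation 1: class with the largest inner product
```
</p></div>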
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">RELATED WORK ON POST-HOC EXPLANATION AND SELF-INTERPRETABLE METHODS</head><p>Since this classification process does not explicitly localize where the model looks to make predictions, the model is often considered a black box. To explain the prediction, a post-hoc mechanism is needed <ref type="bibr">(Ribeiro et al., 2016;</ref><ref type="bibr">Koh &amp; Liang, 2017;</ref><ref type="bibr">Yuan et al., 2021;</ref><ref type="bibr">Qiang et al., 2022;</ref><ref type="bibr">Zhou et al., 2015)</ref>. For instance, CAM <ref type="bibr">(Zhou et al., 2016)</ref> and Grad-CAM <ref type="bibr">(Selvaraju et al., 2017)</ref> obtain class activation maps (CAM) by back-propagating class-specific gradients to the feature map. RISE <ref type="bibr">(Petsiuk et al., 2018)</ref> iteratively masks out image contents to identify essential regions for classification. These methods have been widely used. However, they are often low-resolution (e.g., blurred or indistinguishable across classes), computation-heavy, and not necessarily aligned with how models make predictions.</p><p>To address these drawbacks, another branch of work designs models with interpretable prediction processes, incorporating explicit mechanisms that allow for a direct understanding of the predictions <ref type="bibr">(Wang et al., 2021;</ref><ref type="bibr">Donnelly et al., 2022;</ref><ref type="bibr">Rigotti et al., 2021;</ref><ref type="bibr">Kim et al., 2022;</ref><ref type="bibr">Bau et al., 2017;</ref><ref type="bibr">Zhou et al., 2018)</ref>. For example, ProtoPNet <ref type="bibr">(Chen et al., 2019)</ref> compares the feature map X to "learnable prototypes" of each class, resulting in a feature vector x whose elements are semantically meaningful: the d-th dimension corresponds to a prototypical part of a certain class and x[d] indicates its activation in the image. 
By reading x and visualizing the activated prototypes, one could better understand the model's decision. Inspired by ProtoPNet, ProtoTree <ref type="bibr">(Nauta et al., 2021)</ref> arranges the comparison to prototypes in a tree structure to mimic human reasoning; ProtoPFormer <ref type="bibr">(Xue et al., 2022)</ref> presents a Transformer-based realization of ProtoPNet, which was originally based on ConvNets. Along with these interpretable decision processes, however, come specifically tailored architecture designs and increased complexity of the training process, often making them hard to reproduce, adapt, or extend. For instance, ProtoPNet requires a multi-stage training strategy, each stage taking care of a portion of the learnable parameters including the prototypes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">INTERPRETABLE TRANSFORMER (INTR)</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">MOTIVATION AND BIG PICTURE</head><p>Taking into account the pros and cons of the above two paradigms, we ask: can we obtain interpretability via standard neural network architectures and standard learning algorithms?</p><p>To respond to "interpretability", we investigate a proactive approach to classification, asking each class to search for its presence and distinctive patterns in an image. Denote by S the set of candidate classes; we propose a new classification rule,</p><p>ŷ = arg max_{c ∈ S} w^⊤ g_φ(f_θ(I), c, S), (2)</p><p>where g_φ(f_θ(I), c, S) represents the image feature vector extracted specifically for class c in the context of S, and w denotes a binary classifier determining whether class c is present in the image I. Compared to Equation 1, the new classification rule in Equation 2 incorporates class-specific information in the feature extraction stage, not in the final fully connected layer. As will be shown in subsection 3.4, this design is the key to generating faithful attention for interpretation.</p><p>To respond to "standard neural network architectures", we find that the Transformer encoder-decoder <ref type="bibr">(Vaswani et al., 2017)</ref>, which is widely used in object detection <ref type="bibr">(Carion et al., 2020;</ref><ref type="bibr">Zhu et al., 2021)</ref> and natural language processing <ref type="bibr">(Wolf et al., 2020)</ref>, could essentially realize Equation 2. Specifically, the encoder extracts the image feature map X = f_θ(I). For the decoder, we propose to learn C class-specific queries {z_in^(c)}_{c=1}^C as input, enabling it to extract the feature vector g_φ(f_θ(I), z_in^(c), S) for class c via cross-attention.</p><p>To ease the description, let us first focus on cross-attention, the key building block in Transformer decoders, in subsection 3.2. We then introduce our full model in subsection 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">INTERPRETABLE CLASSIFICATION VIA CROSS-ATTENTION</head><p>Cross-attention. Cross-attention can be seen as a (soft) retrieval process. Given an input query vector z_in ∈ R^D, it finds similar vectors from a vector pool and combines them via a weighted average. In our application, this pool corresponds to the feature map X. Without loss of generality, let us reshape the feature map X into a matrix of column vectors, X = [x_1, · · · , x_N] ∈ R^{D×N}, where N = H × W.</p><p>With z_in and X, cross-attention performs the following sequence of operations. First, it projects z_in and X to a common embedding space such that they can be compared, and separately projects X to another space to emphasize the information to be combined,</p><p>q = W_q z_in, K = W_k X, V = W_v X. (3)</p><p>Then, it performs an inner product between q and K, followed by Softmax, to compute the similarities between z_in and vectors in X, and uses the similarities as weights to combine vectors in V linearly,</p><p>z_out = V × Softmax(K^⊤ q / √D), (4)</p><p>where √D is a scaling factor based on the dimensionality of features. In other words, the output of cross-attention is a vector z_out that aggregates information in X according to the input query z_in.</p><p>Class-specific queries. Inspired by the inner workings of cross-attention, we propose to learn C "class-specific" query vectors Z_in = [z_in^(1), · · · , z_in^(C)] ∈ R^{D×C}, one for each class. We expect each of these queries to look for the "class-specific" distinctive patterns in X. The output vectors Z_out = [z_out^(1), · · · , z_out^(C)] ∈ R^{D×C} thus should encode whether each class finds itself in the image,</p><p>Z_out = V × Softmax(K^⊤ W_q Z_in / √D). (5)</p><p>We note that the Softmax is taken over elements of each column; i.e., in Equation <ref type="formula">5</ref>, each column in Z_in attends to X independently. We use superscripts/subscripts to index columns in Z/X.</p><p>Classification rule. We compare each vector in Z_out to a learnable "presence" vector w ∈ R^D to determine whether each class is found in the image. The predicted class is thus</p><p>ŷ = arg max_{c ∈ {1, · · · , C}} w^⊤ z_out^(c). (6)</p><p>Training. 
As each class obtains a logit w^⊤ z_out^(c), we employ the cross-entropy loss,</p><p>ℓ(I, y) = -log ( exp(w^⊤ z_out^(y)) / Σ_c exp(w^⊤ z_out^(c)) ), (7)</p><p>coupled with stochastic gradient descent (SGD) to optimize the learnable parameters, including Z_in, w, and the projection matrices W_q, W_k, and W_v in Equation 3. This design responds to the final piece of the question in subsection 3.1, "standard learning algorithms".</p><p>Inference and interpretation. We follow Equation 6 to make predictions. Meanwhile, each column of the cross-attention weights Softmax(K^⊤ W_q Z_in / √D) in Equation <ref type="formula">5</ref> reveals where each class looks to find itself, enabling us to understand why the model predicts or does not predict a class. We note that this built-in interpretation does not incur additional computation costs like post-hoc explanation.</p><p>Multi-head attention. It is worth noting that a standard cross-attention block has multiple heads. It learns multiple sets of matrices (W_{q,r}, W_{k,r}, W_{v,r}) in Equation 3, r ∈ {1, · · · , R}, to look for different patterns in X, resulting in multiple attention maps Softmax(K_r^⊤ W_{q,r} Z_in / √D) and outputs Z_{out,r} in Equation <ref type="formula">5</ref>. This enables the model to identify different "attributes" of a class and allows us to visualize them.</p><p>In training and inference, {Z_{out,r}}_{r=1}^R are concatenated row-wise, followed by another learnable matrix W_o to obtain a single Z_out as in Equation <ref type="formula">5</ref>,</p><p>Z_out = W_o [Z_{out,1}; · · · ; Z_{out,R}]. (8)</p><p>As such, Equation 7 and Equation 6 are still applicable to optimize the model and make predictions.</p></div>
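<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the single-head computation of Equations 3 to 7 concrete, here is a minimal numpy sketch (a toy illustration, not the actual implementation; all shapes, the random weights, and the label y are assumptions):</p><p>
```python
import numpy as np

def softmax(a, axis=0):
    # Column-wise softmax with the usual max-shift for numerical stability.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, N, C = 16, 12, 5             # feature dim, num patches (N = H*W), num classes

X = rng.normal(size=(D, N))     # encoder feature map, reshaped to D x N
Z_in = rng.normal(size=(D, C))  # learnable class-specific queries, one column per class
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
w = rng.normal(size=D)          # shared class-agnostic "presence" vector

K, V = Wk @ X, Wv @ X                        # Equation 3 (keys and values)
A = softmax(K.T @ (Wq @ Z_in) / np.sqrt(D))  # N x C attention weights (Equation 5)
Z_out = V @ A                                # class-specific output features
logits = w @ Z_out                           # Equation 6: one logit per class
y_hat = int(np.argmax(logits))

y = 2                                        # hypothetical ground-truth label
loss = -np.log(softmax(logits)[y])           # Equation 7: cross-entropy loss
```
</p><p>On this toy input the prediction is simply the argmax over the C logits; the full INTR additionally stacks decoder layers with self-attention (subsection 3.3).</p></div>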
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">OVERALL MODEL ARCHITECTURE (SEE FIGURE 2 FOR AN ILLUSTRATION)</head><p>We implement our full INterpretable TRansformer (INTR) model (cf. Equation <ref type="formula">2</ref>) using a Transformer decoder <ref type="bibr">(Vaswani et al., 2017)</ref> on top of a feature extractor f_θ that produces a feature map X. Without loss of generality, we use the DEtection TRansformer (DETR) <ref type="bibr">(Carion et al., 2020)</ref> as the backbone. DETR uses a Transformer decoder of multiple layers; each contains a cross-attention block. The output vectors of one layer become the input vectors of the next layer. In DETR, the input to the decoder (at its first layer) is a set of object proposal queries, and we replace it with our learnable "class-specific" query vectors Z_in ∈ R^{D×C}. The Transformer decoder then outputs the "class-specific" feature vectors Z_out that will be fed into Equation <ref type="formula">6</ref>. Using a Transformer decoder rather than a single cross-attention block has several advantages. First, with multiple decoder layers, the learned queries Z_in can improve over layers by grounding themselves on the image. Second, the self-attention block in each decoder layer allows class-specific queries to exchange information to encode the context. (See Appendix C for details.) As shown in Figure <ref type="figure">15</ref>, the cross-attention blocks in later layers can attend to more distinctive patterns.</p><p>Training. INTR has three sets of learnable parameters: a) the parameters in the DETR backbone, including f_θ; b) the class-specific input queries Z_in ∈ R^{D×C} to the decoder; and c) the class-agnostic vector w. We train all these parameters end-to-end via SGD, using the loss in Equation <ref type="formula">7</ref>.</p></div>
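<div xmlns="http://www.tei-c.org/ns/1.0"><p>The multi-head aggregation of Equation 8, used inside each decoder layer, can likewise be sketched in numpy (a toy illustration under an assumed head count R and per-head width; the real model learns these projections):</p><p>
```python
import numpy as np

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
D, N, C, R = 16, 12, 5, 8       # R attention heads, each of width d = D // R
d = D // R

X = rng.normal(size=(D, N))     # encoder feature map
Z_in = rng.normal(size=(D, C))  # class-specific queries

heads = []
for r in range(R):              # per-head projections (W_{q,r}, W_{k,r}, W_{v,r})
    Wq, Wk, Wv = (rng.normal(size=(d, D)) for _ in range(3))
    A_r = softmax((Wk @ X).T @ (Wq @ Z_in) / np.sqrt(d))  # per-head attention map
    heads.append((Wv @ X) @ A_r)                          # head output Z_{out,r}, d x C

Wo = rng.normal(size=(D, R * d))
Z_out = Wo @ np.concatenate(heads, axis=0)  # Equation 8: row-wise concat, then W_o
```
</p><p>Row-wise concatenation recovers a D-dimensional feature per class, so Equations 6 and 7 apply unchanged.</p></div>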
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">HOW DOES INTR LEARN TO PRODUCE INTERPRETABLE CROSS-ATTENTION WEIGHTS?</head><p>We analyze how INTR offers interpretability. For brevity, we focus on the model in subsection 3.2.</p><p>Attention vs. interpretation. There has been an ongoing debate on whether attention offers faithful interpretation <ref type="bibr">(Wiegreffe &amp; Pinter, 2019;</ref><ref type="bibr">Jain &amp; Wallace, 2019;</ref><ref type="bibr">Serrano &amp; Smith, 2019;</ref><ref type="bibr">Bibal et al., 2022)</ref>. Specifically, <ref type="bibr">Serrano &amp; Smith (2019)</ref> showed that significantly manipulating the attention weights at inference time does not necessarily change the model's prediction. Here, we provide a mathematical explanation for why INTR may not suffer from the same problem. The key is in our classification rule. In Equation <ref type="formula">6</ref>, we obtain the logit for class c by w^⊤ z_out^(c). If c is the ground-truth label, it must obtain a logit larger than other classes c′ ≠ c to make a correct prediction. This implies w^⊤ z_out^(c) > w^⊤ z_out^(c′), which is possible only if the cross-attention weights triggered by z_in^(c) and z_in^(c′) are distinct.</p><p>Unveiling the inner workings. We dig deeper to understand what INTR learns. For class c to obtain a high logit in Equation <ref type="formula">6</ref>, z_out^(c) must have a large inner product with the class-agnostic vector w. We note that z_out^(c) is a weighted average of the columns in V (cf. Equation <ref type="formula">4</ref>), and V is obtained by applying a projection matrix W_v to the feature map X. The logit w^⊤ z_out^(c) can thus be rewritten as</p><p>w^⊤ z_out^(c) = Σ_n α^(c)[n] × s_n, (9)</p><p>where α^(c) = Softmax(K^⊤ W_q z_in^(c) / √D) and s_n = w^⊤ W_v x_n. We note that s_n does not depend on the class-specific query z_in^(c). It only depends on the input image I, or more specifically, the feature map X and how it aligns with the vector w. In other words, we can view s_n as an "image-specific" salient score for patch n. In contrast, α^(c)[n] depends on the class-specific query z_in^(c); its value will be high if class c finds its distinctive patterns in patch n.</p><p>Building on this insight and Equation <ref type="formula">9</ref>, if class c is the ground-truth class, what its query z_in^(c) needs to do is put its attention weights α^(c) on those high-score patches. Namely, class c must find its distinctive patterns in the salient image regions. Putting things together, we can view the roles of W_v and W_k as "disentanglement". They disentangle the information in x_n into "image-specific" and "classification-specific" components; the former highlights "whether a patch should be looked at", and the latter highlights "what distinctive patterns it contains". When multi-head cross-attention is used, each pair of (W_v, W_k) can learn to highlight an object "part" and the distinctive "property" in it. These offer the opportunity to localize the "attributes" of a class. See Appendix B for more details.</p></div>
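<div xmlns="http://www.tei-c.org/ns/1.0"><p>The decomposition in Equation 9 can be verified numerically. The following toy numpy sketch (hypothetical shapes and random weights) checks that the logit w^⊤ z_out^(c) equals the attention-weighted sum of the salient scores s_n:</p><p>
```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, N = 16, 12

X = rng.normal(size=(D, N))     # feature map columns x_1, ..., x_N
z_in = rng.normal(size=D)       # one class-specific query z_in^(c)
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
w = rng.normal(size=D)          # class-agnostic presence vector

alpha = softmax((Wk @ X).T @ (Wq @ z_in) / np.sqrt(D))  # attention weights alpha^(c)
z_out = (Wv @ X) @ alpha                                # cross-attention output (Eq. 4)

s = w @ (Wv @ X)                # image-specific salient scores s_n = w^T W_v x_n
lhs = w @ z_out                 # the logit for class c
rhs = float(np.sum(alpha * s))  # Equation 9: sum_n alpha^(c)[n] * s_n
assert np.isclose(lhs, rhs)     # the decomposition holds (up to float error)
```
</p></div>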
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">COMPARISON TO CLOSELY RELATED WORK</head><p>ProtoPNet <ref type="bibr">(Chen et al., 2019)</ref> and Concept Transformers (CT) <ref type="bibr">(Rigotti et al., 2021)</ref>. INTR is fundamentally different in two aspects. First, both methods aim to represent image patches by a set of learnable vectors (e.g., prototypes in ProtoPNet; concepts in CT<ref type="foot">foot_0</ref>). The resulting features for image patches are then pooled into a vector x and undergo a fully connected layer for classification. In other words, their classification rules still follow Equation 1. In contrast, INTR extracts class-specific features from the image (one per class) and uses a new classification rule to make predictions (cf. Equation <ref type="formula">6</ref>). Second, both methods require specifically designed training strategies or signals. For example, CT needs human annotations to learn the concepts. In contrast, INTR is based on a standard model architecture and training algorithm and requires no additional human supervision.</p><p>DINO-v1 <ref type="bibr">(Caron et al., 2021)</ref>. DINO-v1 shows that the "[CLS]" token of a pre-trained ViT <ref type="bibr">(Dosovitskiy et al., 2021)</ref> can attend to different "parts" of objects via multi-head attention. While this shares some similarities with our findings in INTR, what INTR attends to are "attributes" that can be used to distinguish between fine-grained classes, not just "parts" that are shared among classes.</p><p>Model. We implement INTR on top of the DETR backbone <ref type="bibr">(Carion et al., 2020)</ref>. DETR stacks a Transformer encoder on top of a ResNet as the feature extractor. 
We use its DETR-ResNet-50 version, in which the ResNet-50 <ref type="bibr">(He et al., 2016)</ref> was pre-trained on ImageNet-1K <ref type="bibr">(Russakovsky et al., 2015;</ref><ref type="bibr">Deng et al., 2009)</ref> and the whole model, including the Transformer encoder-decoder <ref type="bibr">(Vaswani et al., 2017)</ref>, was further trained on MSCOCO <ref type="bibr">(Lin et al., 2014)</ref><ref type="foot">foot_1</ref>. We remove its prediction heads located on top of the decoder and add our class-agnostic vector w; we remove its object proposal queries and add our C learnable class-specific queries (e.g., for CUB, C = 200). See Figure <ref type="figure">2</ref> for an illustration and subsection 3.3 for more details. We further remove the positional encoding that was injected into the cross-attention keys in the DETR decoder: we find this information adversely restricts our queries to looking at particular grid locations and leads to artifacts. We note that DETR sets its feature map size D × H × W (at the encoder output) to 256 × (H_0/32) × (W_0/32), where H_0 and W_0 are the height and width resolutions of the input image. For example, a typical CUB image has a resolution of roughly 800 × 1200; thus, the resolution of the feature map and cross-attention map is roughly 25 × 38. We investigate other encoders and the number of attention heads and decoder layers in Appendix F.</p></div>
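<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the feature-map arithmetic above works out as follows (a sketch using ceiling division as an approximation of the encoder's 32-fold down-sampling; the exact size depends on padding):</p><p>
```python
import math

# DETR encoder output resolution: 256 x (H0/32) x (W0/32).
# Assumed typical CUB input resolution, as stated in the text.
H0, W0 = 800, 1200
h, w = math.ceil(H0 / 32), math.ceil(W0 / 32)
print(h, w)  # prints: 25 38 -- the approximate cross-attention map resolution
```
</p></div>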
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS</head><p>Visualization. We visualize the last (i.e., sixth) decoder layer, whose cross-attention block has eight heads. We superimpose the cross-attention weights (maps) on the input images.</p><p>Training details. The hyper-parameter details, such as epochs, learning rate, and batch size for training INTR, are reported in Appendix E. We use the Adam optimizer <ref type="bibr">(Kingma &amp; Ba, 2014)</ref> with its default hyper-parameters. We train INTR using the StepLR scheduler with a learning rate drop at 80 epochs. The rest of the hyper-parameters follow DETR.</p><p>Baselines. We consider two sets of baseline methods. First, we use a ResNet-50 <ref type="bibr">(He et al., 2016)</ref> pre-trained on ImageNet-1K and fine-tune it on each dataset. We then use Grad-CAM <ref type="bibr">(Selvaraju et al., 2017)</ref> and RISE <ref type="bibr">(Petsiuk et al., 2018)</ref> to construct post-hoc saliency maps; the results are provided in Appendix F. Second, we compare to models designed for interpretability, such as ProtoPNet <ref type="bibr">(Chen et al., 2019)</ref>, ProtoTree <ref type="bibr">(Nauta et al., 2021)</ref>, and ProtoPFormer <ref type="bibr">(Xue et al., 2022)</ref>. We understand that these are by no means a comprehensive set of existing works. Our purpose in including them is to treat them as references for what kind of interpretability INTR can offer with its simple design.</p><p>Evaluation. We reiterate that achieving a high classification accuracy is not the goal of this paper. The goal is to demonstrate interpretability. We thus focus our evaluation on qualitative results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">EXPERIMENTAL RESULTS</head><p>Table 2: Accuracy (%) comparison across the eight datasets.</p><p>Model | Bird | CUB | BF | Fish | Dog | Pet | Car | Craft
ResNet | 98.5 | 83.8 | 95.6 | 71.9 | 77.1 | 89.5 | 89.3 | 80.9
INTR | 97.4 | 71.8 | 95.0 | 81.1 | 72.5 | 90.4 | 86.8 | 76.1</p><p>Accuracy comparison. It is crucial to emphasize that the primary objective of INTR is to promote interpretability, not to claim high accuracy. Nevertheless, we report in Table <ref type="table">2</ref> the classification accuracy of INTR and ResNet-50 on all eight datasets. INTR obtains comparable accuracy on most of the datasets except for CUB (12% worse) and Fish (9.2% better). We note that both the CUB and Bird datasets focus on fine-grained bird species. The main difference is that the Bird dataset offers higher-quality images (e.g., cropped to focus on objects). INTR's accuracy drop on CUB thus more likely results from its inability to handle images with complex backgrounds or small objects, not its inability to recognize bird species.</p><p>Figure 4: INTR on all eight datasets. We show the top four cross-attention maps per test example triggered by the ground-truth classes (based on the peak un-normalized attention weights in the maps). As the indices of the top maps may not be the same across test examples, the attributes may not be the same in each column. Example classes shown: CUB (Baltimore Oriole), Bird (Blue Dacnis), Dog (Walker Hound), Fish (Damselfish / Mecaenichthys immaculatus), Craft (Spitfire), Car (Acura TSX Sedan 2012), Pet (German Shorthaired), and Butterfly (Godyris Zavaleta).</p><p>Figure 5: INTR can identify tiny image manipulations that distinguish between classes. Each row shows a reference image and the eight cross-attention heads (Head-1 through Head-8). On the top, we remove the red spots of the Red-winged Blackbird; after that, INTR cannot correctly classify the image (the parentheses in the Answer column highlight the predicted classes). On the bottom, we change the color of the bird's belly (Baltimore Oriole) to make it look like an Orchard Oriole; after that, INTR misclassifies it as Orchard Oriole. Both results demonstrate INTR's sensitivity to visual attributes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">FURTHER ANALYSIS AND DISCUSSION ABOUT INTR</head><p>INTR can consistently identify attributes. We first analyze whether different cross-attention heads identify different attributes of a class and whether those attributes are consistent across images of the same class. Figure <ref type="figure">1</ref> shows a result (please see the caption for details). Different columns correspond to different heads, and we see that each captures a distinct attribute that is consistent across images. Some of them are very fine-grained, such as Head-4 (tail pattern) and Head-5 (breast color). The reader may notice the less concentrated attention in the last row. Indeed, it is a misclassified case: the query of the ground-truth class (i.e., Painted Bunting) cannot find itself in the image. This showcases how INTR interprets incorrect predictions. We show more results in Appendix G.</p><p>INTR is applicable to a variety of domains. Figure <ref type="figure">4</ref> shows the cross-attention results on all eight datasets. (See the caption for details.) INTR can identify the attributes well in all of them, demonstrating its remarkable generalizability and applicability.</p><p>INTR offers meaningful interpretation about attribute manipulation. We investigate INTR's response to image manipulation by deleting (the first block of Figure <ref type="figure">5</ref>) and adding (the second block of Figure <ref type="figure">5</ref>) important attributes. We obtain human-identified attributes of the Red-winged Blackbird (the first block) and the Orchard Oriole (the second block) from (Cor) and manipulate them accordingly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure 6 (panel titles): Heliconius melpomene, compared with all queries; Heliconius melpomene, only compared with Heliconius elevatus.</p><p>As shown in Figure <ref type="figure">5</ref>, INTR is sensitive to the attribute changes; the cross-attention maps change drastically at the manipulated parts. These results suggest that INTR's inner workings depend heavily on attributes to make correct classifications.</p><p>INTR can attend differently based on the context. As mentioned in subsection 3.3, the self-attention block in INTR's decoder could encode the context of candidate classes to determine the patterns necessary to distinguish between them. When all the class-specific queries (e.g., 65 classes in the BF dataset) are input to the decoder, INTR needs to identify sufficient patterns (e.g., both coarse-grained and fine-grained) to distinguish between all of them. Here, we investigate whether limiting the input queries to visually similar ones would encourage the model to attend to finer-grained attributes. We focus on the BF dataset and compare two species, Heliconius melpomene (blue box in Figure <ref type="figure">6</ref>) and Heliconius elevatus (green box in Figure <ref type="figure">6</ref>), whose visual difference is very subtle. We limit the input queries by setting the other queries to zero vectors. As shown in Figure <ref type="figure">6</ref>, this modification does allow INTR to localize nuanced patterns that differ between the two classes.</p><p>Concerns regarding an MSCOCO-pre-trained backbone. We understand this may cause concern about data leakage and unfair comparison. We note that MSCOCO only offers bounding boxes for objects, not for parts, and it does not contain fine-grained labels. Regarding fair comparisons, our work is not to claim higher accuracy but to offer a new perspective. We use DETR to demonstrate that our idea is readily compatible with pre-trained encoder-decoder (foundation) models.</p><p>Limitations. 
INTR learns C class-specific queries that must be inputted to the Transformer decoder jointly. This could increase the training and inference time if C is huge, e.g., larger than the number of grids N in the feature map. Fortunately, fine-grained classification (e.g., for species in the same family or order) usually focuses on a small set of visually similar categories; C is usually not large.</p></div>
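The query-restriction analysis in subsection 4.2 zeroes out every class-specific query except those of the classes under comparison. A minimal sketch of that masking step, assuming a plain list-of-vectors layout of our own (not the authors' implementation):

```python
# Hypothetical sketch: restrict INTR's class-specific queries to a subset of
# visually similar classes by zeroing the remaining query vectors, as done for
# the Heliconius melpomene vs. Heliconius elevatus comparison. The data layout
# (lists of floats) and names are ours, for illustration only.

def restrict_queries(queries, keep_ids):
    """Zero every class query except those in keep_ids.

    queries  -- list of C query vectors (each a list of floats)
    keep_ids -- class indices whose queries are kept
    """
    keep = set(keep_ids)
    zero = [0.0] * len(queries[0])
    return [q if c in keep else zero for c, q in enumerate(queries)]

# Toy example: 4 classes, 3-dimensional queries; keep the two similar classes.
queries = [[0.2, 0.1, 0.0],
           [0.9, 0.3, 0.5],
           [0.8, 0.4, 0.5],   # visually similar to class 1
           [0.1, 0.0, 0.7]]
restricted = restrict_queries(queries, keep_ids=[1, 2])
```

With only the two similar queries left, the decoder's self-attention context contains just those candidates, which is what encourages finer-grained attention in Figure 6.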
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION</head><p>We present INterpretable TRansformer (INTR), a simple yet effective interpretable classifier built upon standard Transformer encoder-decoder architectures. INTR makes only two changes: learning class-specific queries (one for each class) as input to the decoder and learning a class-agnostic vector on top of the decoder output to determine whether a class is present in the image. As such, INTR can be easily trained end-to-end. During inference, the cross-attention weights triggered by the winning class-specific query indicate where the model looks to make the prediction. We conduct extensive experiments and analyses to demonstrate the effectiveness of INTR in interpretation. Specifically, we show that INTR can localize not only object parts like bird heads but also attributes (like patterns around the eyes) that distinguish one bird species from others. In addition, we present a mathematical explanation of why INTR can learn to produce interpretable cross-attention for each class without ad-hoc model design, complex training strategies, or auxiliary supervision. We hope that our study can offer a new way of thinking about interpretable machine learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX</head><p>We provide details omitted in the main paper.</p><p>&#8226; Appendix A: related work (cf. subsection 2.1 of the main paper).</p><p>&#8226; Appendix B: additional details of inner workings and visualization (cf. subsection 3.2 and subsection 3.4 of the main paper).</p><p>&#8226; Appendix C: additional details of model architectures (cf. subsection 3.3 of the main paper).</p><p>&#8226; Appendix D: details of datasets (cf. section 4 of the main paper).</p><p>&#8226; Appendix E: details of experimental setup (cf. section 4 of the main paper).</p><p>&#8226; Appendix F: additional experimental results (cf. subsection 4.1 of the main paper).</p><p>&#8226; Appendix G: additional qualitative results and analysis (cf. subsection 4.1 of the main paper).</p><p>&#8226; Appendix H: additional discussion (cf. section 5 of the main paper).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A RELATED WORK</head><p>In recent years, there has been a significant increase in the size and complexity of models, prompting a surge in research and development efforts focused on enhancing model interpretability. The need for interpretability arises not only from the goal of instilling trust in a model's predictions but also from the desire to comprehend the reasoning behind a model's predictions, gain insight into its internal mechanisms, and identify the specific input features it relies on to make accurate predictions. Numerous research directions have emerged to facilitate model interpretation for human understanding. One notable research direction involves extracting and visualizing the salient regions in an input image that contribute to the model's prediction. By identifying these regions, researchers aim to provide meaningful explanations that highlight the relevant aspects of the input that influenced the model's decision. Existing efforts in this domain can be broadly categorized into post hoc methods and self-interpretable models.</p><p>Post hoc methods involve applying interpretation techniques after a model has been trained. These methods focus on analyzing the model's behavior without modifying its architecture or training process. Most CNN-based classification processes lack explicit information on where the model focuses its attention during prediction. Post hoc methods address this limitation by providing interpretability and explanations for pre-trained black box models without modifying the model itself. For instance, CAM <ref type="bibr">(Zhou et al., 2016)</ref> computes a weighted sum of feature maps from the last convolutional layer based on learned fully connected layer weights, generating a single heat map highlighting relevant regions for the predicted class. 
GRAD-CAM <ref type="bibr">(Selvaraju et al., 2017)</ref> employs gradient information flowing into the last convolutional layer to produce a heatmap, with the gradients serving as importance weights for feature maps, emphasizing regions with the greatest impact on the prediction. <ref type="bibr">Koh &amp; Liang (2017)</ref> introduce influence functions, which analyze gradients of the model's loss function with respect to training data points, providing a measure of their influence on predictions. Another approach in post hoc methods involves perturbing or sampling the input image. For example, LIME <ref type="bibr">(Ribeiro et al., 2016)</ref> utilizes superpixels to generate perturbations of the input image and explain predictions of a black box model. RISE <ref type="bibr">(Petsiuk et al., 2018)</ref> iteratively blocks out parts of the input image, classifies the perturbed image using a pre-trained model, and reveals the blocked regions that lead to misclassification. However, post hoc methods for model interpretation can be computationally expensive, making them less scalable for real-world applications. Moreover, these methods may not provide precise explanations or a comprehensive understanding of how the model makes decisions, affecting the reliability and robustness of the interpretation results obtained.</p><p>Self-interpretable models are designed with interpretability as a core principle. These models incorporate explicit mechanisms or structures that allow for a direct understanding of their decisionmaking process. One direction is prototype-based models. Prototypes are visual representations of concepts that can be used to explain how a model works. The first work of using prototypes to describe the DNN model's prediction is ProtoPNet <ref type="bibr">(Chen et al., 2019)</ref>, which learns a predetermined number of prototypical parts (prototypes) per class. 
To classify an image, the model calculates the similarity between a prototype and a patch in the image, measured by the distance between the two in latent space. Inspired by ProtoPNet, ProtoTree <ref type="bibr">(Nauta et al., 2021)</ref> is a hierarchical neural network architecture that learns class-agnostic prototypes approximated by a decision tree. This significantly decreases the number of prototypes required for interpreting a prediction compared with ProtoPNet.</p><p>ProtoPNet and its variants were originally designed to work with CNN-based backbones. However, they can also be used with ViTs (Vision Transformers) by removing the class token. This approach, however, has several limitations. First, prototypes are more likely to activate in the background than in the foreground; when activated in the foreground, their activation is often scattered and fragmented. Second, prototype-based methods are computationally heavy and require domain knowledge to set their parameters. With the widespread use of Transformers in computer vision, many approaches have been proposed to interpret their classification predictions. These methods often rely on attention weights to visualize the important regions in the image that contribute to the prediction. ProtoPFormer <ref type="bibr">(Xue et al., 2022)</ref> addresses this problem by applying the prototype-based method to ViTs; it nevertheless inherits the computational cost and parameter-tuning burden of prototype-based methods. ViT-Net <ref type="bibr">(Kim et al., 2022)</ref> integrates ViTs and trainable neural trees based on ProtoTree, but it only uses ViTs as feature extractors without fully exploiting their architectural characteristics. 
Another recent work, Concept Transformer <ref type="bibr">(Rigotti et al., 2021)</ref>, utilizes patch embeddings of an image as queries and attributes from the dataset as keys and values within a transformer. This approach allows the model to obtain multi-head attention weights, which are then used to interpret the model's predictions. However, a drawback of this method is that it relies on human-defined attribute annotations for the dataset, which can be prone to errors and is costly as it necessitates domain expert involvement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B ADDITIONAL DETAILS OF INNER WORKINGS AND VISUALIZATION</head><p>Interpretability vs. model capacity. We investigate whether the conventional classification rule in Equation 1 induces the same property discussed in subsection 3.4. We replace w^⊤ z_out^(c) by w_c^⊤ z_out^(c); i.e., we learn for each class a class-specific vector w_c. This can be thought of as increasing the model capacity by introducing additional learnable parameters. The resulting classification rule is</p><p>ŷ = argmax_c w_c^⊤ z_out^(c). (10)</p><p>Here, even if the class-specific features z_out^(c) are identical across classes, class c can still claim the highest logit as long as z_out^(c) has a larger inner product with w_c than the other inner products w_{c'}^⊤ z_out^(c'). Namely, even if the cross-attention weights triggered by different class-specific queries are identical,<ref type="foot">foot_2</ref> as long as the extracted features in X are correlated strongly enough with class c, the model can still predict correctly. Thus, the learnable queries z_in^(c) need not necessarily learn to produce distinct and meaningful cross-attention weights. Indeed, as shown in Figure <ref type="figure">7</ref>, we implement a variant of our approach, INTR-FC, with its classification rule replaced by Equation <ref type="formula">10</ref>; INTR produces more distinctive (column-wise) and consistent (row-wise) attention than this variant.</p><p>Visualization. In subsection 3.4 of the main paper, we show how the logit of class c can be decomposed into</p><p>∑_{n=1}^{N} α_n^(c) s_n, where s_n = w^⊤ W_v x_n. (11)</p><p>The index n corresponds to a grid location (or column) in the feature map X ∈ R^{D×N}. Based on Equation 11, to predict an input image as class c, the cross-attention map α^(c) triggered by the class-specific query z_in^(c) should align with the image-specific scores [s_1, ⋯, s_N].</p><p>In other words, for an image that is predicted as class c, the cross-attention map α^(c) very much implies which grids in an image have higher scores. 
Hence, in the qualitative visualizations, we only show the cross-attention map α^(c) rather than the image-specific scores.</p><p>We note that throughout the whole paper, INTR learns to identify attributes that are useful to distinguish classes without relying on the annotations of human experts.</p></div>
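The decomposition in Equation 11 can be checked numerically: attending first and then scoring gives the same logit as scoring each grid first and then taking the attention-weighted sum. A toy verification with made-up dimensions and values (pure Python, no framework assumed):

```python
import math

# Toy check of Equation 11 (Appendix B): the logit of class c, w^T z_out^(c),
# equals the attention-weighted sum of per-grid scores s_n = w^T W_v x_n.
# All values below are illustrative, not from the paper.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

D = 2                                        # feature dimension
X = [[1.0, 0.5], [0.2, 0.8], [0.4, 0.1]]     # N = 3 grid features x_n
W_v = [[0.3, -0.1], [0.6, 0.2]]              # value projection
w = [1.0, -0.5]                              # class-agnostic vector
alpha = [0.2, 0.5, 0.3]                      # cross-attention weights (sum to 1)

# Route 1: attend first (z_out = sum_n alpha_n * W_v x_n), then score with w.
z_out = [sum(a * matvec(W_v, x)[d] for a, x in zip(alpha, X)) for d in range(D)]
logit_a = dot(w, z_out)

# Route 2: score each grid first (s_n = w^T W_v x_n), then weight by alpha.
s = [dot(w, matvec(W_v, x)) for x in X]
logit_b = sum(a * s_n for a, s_n in zip(alpha, s))

assert math.isclose(logit_a, logit_b)        # the two routes agree by linearity
```

The equality holds for any values by linearity of the inner product, which is why the attention map alone (Route 1's weights) is a faithful summary of which grids drive the logit.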
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C ADDITIONAL DETAILS OF MODEL ARCHITECTURES</head><p>Our idea in subsection 3.2 can be realized by the standard Transformer decoder <ref type="bibr">(Vaswani et al., 2017)</ref> on top of any "class-agnostic" feature extractor that produces a feature map X (e.g., ResNet <ref type="bibr">(He et al., 2016)</ref> or ViT <ref type="bibr">(Dosovitskiy et al., 2021)</ref>). A Transformer decoder often stacks M layers of the same decoder architecture, denoted by {L_m}_{m=1}^{M}. Each layer L_m takes a set of C vector tokens as input and produces another set of C vector tokens as output, which can then be used as the input to the subsequent layer L_{m+1}. In our application, the learnable "class-specific" query vectors serve as the input tokens to the first decoder layer L_1.</p><p>Within each decoder layer is a sequence of building blocks. Without loss of generality, let us omit the layer normalization, residual connections, and the Multi-Layer Perceptron (MLP) operating on each token independently, and focus on the Self-Attention (SA) and the subsequent Cross-Attention (CA) blocks.</p><p>An SA block is very similar to the CA block introduced in subsection 3.1. The only difference is the pool of vectors to be retrieved: while a CA block attends to the feature map extracted from the image, the SA block attends to its input tokens. That is, in an SA block, the matrix X ∈ R^{D×N} in Equation 3 is replaced by the input matrix Z_in ∈ R^{D×C}. This allows each query token z_in^(c) ∈ R^D to combine information from other query tokens, resulting in a new set of C query tokens. This new set of query tokens is then fed into a CA block that attends to the image features in X to generate the "class-specific" feature tokens.</p><p>As a Transformer decoder stacks multiple layers, the input tokens to the second layer and beyond possess not only the "learnable" class-specific information in Z_in but also the class-specific feature information from X. 
We note that an SA block can aggregate information not only from similar tokens<ref type="foot">foot_3</ref> but also from dissimilar tokens. For example, when W_q is an identity matrix and W_k = -W_q, a pair of similar tokens in Z_in will receive smaller weights than a pair of dissimilar tokens. This allows similar query tokens to be differentiated if their relationships to other tokens are different, enabling the model to distinguish between semantically or visually similar fine-grained classes.</p></div>
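The cross-attention step described above can be illustrated with a deliberately stripped-down forward pass: each class query attends over the grid features, and a single class-agnostic vector scores each class-specific output. This is a sketch under our own simplifications (no self-attention, layer normalization, or learned projections), not the authors' implementation:

```python
import math

# Minimal illustration of the INTR idea: C class-specific queries cross-attend
# to N grid features; a class-agnostic vector w scores each attended feature.
# W_q / W_k / W_v are dropped (treated as identity) to keep the sketch short;
# all numbers below are toy values of our own.

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def intr_logits(queries, X, w):
    """queries: C class queries; X: N grid features; w: class-agnostic vector."""
    logits = []
    for q in queries:                               # each class searches for itself
        alpha = softmax([dot(q, x) for x in X])     # cross-attention weights
        z_out = [sum(a * x[d] for a, x in zip(alpha, X))
                 for d in range(len(w))]            # attended class-specific feature
        logits.append(dot(w, z_out))                # shared scoring vector
    return logits

# Toy example: 2 classes, 3 grids, 2-D features.
X = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
queries = [[3.0, 0.0], [0.0, 3.0]]                  # class 0 "looks for" grid 0, etc.
w = [1.0, -1.0]
logits = intr_logits(queries, X, w)
pred = max(range(len(logits)), key=lambda c: logits[c])
```

Because w is shared across classes, a class can only win by producing distinctive attention, which is the property the main paper analyzes.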
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D DETAILS OF DATASETS</head><p>We present the detailed dataset statistics in Table <ref type="table">3</ref>. We download the butterfly (BF) dataset from the Heliconiine Butterfly Collection Records<ref type="foot">foot_4</ref> at the University of Cambridge. The downloaded data exhibit class imbalance. To address this, we performed a selection process as follows. First, we considered classes with a minimum of B images, where B is set to 20. Subsequently, for each class, we retained at least K images for testing, with K set to 3. Throughout this process, we also ensured that each class had no more than M training images, where M is defined as 5 times the quantity (B - K).</p></div>
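The selection rule above (B = 20, K = 3, M = 5 × (B − K) = 85) can be sketched as follows; the dict-based data layout is our own assumption, not the authors' code:

```python
# Sketch of the class-balancing rule for the butterfly (BF) data: keep classes
# with at least B images, reserve at least K per class for testing, and cap
# training images at M = 5 * (B - K). Structure and names are hypothetical.

def filter_and_split(images_per_class, B=20, K=3):
    M = 5 * (B - K)              # training-image cap: 85 with the defaults
    splits = {}
    for cls, imgs in images_per_class.items():
        if len(imgs) < B:        # drop under-represented classes
            continue
        test = imgs[:K]          # retain at least K images for testing
        train = imgs[K:K + M]    # at most M training images
        splits[cls] = {"train": train, "test": test}
    return splits

# Toy example with fake image ids.
data = {"melpomene": [f"m{i}" for i in range(120)],
        "elevatus":  [f"e{i}" for i in range(25)],
        "rare_sp":   [f"r{i}" for i in range(10)]}   # fewer than B: dropped
splits = filter_and_split(data)
```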
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E DETAILS OF EXPERIMENTAL SETUP</head><p>For all datasets except Bird, we set the learning rate to 1 × 10^-4; for Bird, we use a learning rate of 5 × 10^-5. Additionally, we use a batch size of 16 for the Bird, Dog, and Fish datasets, and a batch size of 12 for the other datasets. The number of training epochs is 100 for the BF and Pet datasets, 170 for Dog, and 140 for the remaining datasets. We further perform ablation studies on different numbers of attention heads and decoder layers. The results are reported in Table <ref type="table">5</ref>. We find that the setup used by DETR (i.e., 8 heads and 6 decoder layers) performs the best. Comparisons to post-hoc explanation methods are presented in Appendix F.</p></div>
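The per-dataset hyperparameters above can be collected into a single defaults-plus-overrides table; this is our restatement of the reported setup, not the authors' configuration file:

```python
# Per-dataset training hyperparameters from Appendix E, restated as a
# defaults dict with dataset-specific overrides (our own organization).

DEFAULTS = {"lr": 1e-4, "batch_size": 12, "epochs": 140}

OVERRIDES = {
    "Bird": {"lr": 5e-5, "batch_size": 16},   # only Bird uses the smaller lr
    "Dog":  {"batch_size": 16, "epochs": 170},
    "Fish": {"batch_size": 16},
    "BF":   {"epochs": 100},
    "Pet":  {"epochs": 100},
}

def config_for(dataset):
    cfg = dict(DEFAULTS)
    cfg.update(OVERRIDES.get(dataset, {}))
    return cfg
```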
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F ADDITIONAL EXPERIMENTAL RESULTS</head><p>We use Grad-CAM <ref type="bibr">(Selvaraju et al., 2017)</ref> and RISE <ref type="bibr">(Petsiuk et al., 2018)</ref> to construct post-hoc saliency maps on the ResNet-50 <ref type="bibr">(He et al., 2016)</ref> classifiers. We also report the insertion and deletion metric scores <ref type="bibr">(Petsiuk et al., 2018)</ref> to quantify the results. It is worth mentioning that the insertion and deletion metrics were designed to quantify post-hoc explanation methods; here, however, we show a comparison between INTR, RISE, and Grad-CAM. We examine the CUB dataset images that are accurately classified by both ResNet-50 and INTR, resulting in a reduced count of 3,582 validation images. We generate saliency maps using Grad-CAM and INTR and then rank the patches to assess the insertion and deletion metrics. For a fair comparison, we employ ResNet-50 as a shared classifier for evaluation. The results are reported in Table <ref type="table">6</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>G ADDITIONAL QUALITATIVE RESULTS AND ANALYSIS</head><p>Figure 9 offers additional results for Figure 3. In Figure 3 of the main paper, we visualize the top three cross-attention heads or prototypes. Figure 9 further shows all the prototypes or attention heads for the same test image featuring the Painted Bunting species.</p><p>To gain further insights into the detected attributes, we compare INTR with ProtoPFormer, a prominent method in our previous evaluations. We randomly picked five images from each of four species sampled uniformly from the CUB dataset. Figure <ref type="figure">10</ref> shows the attention heads detected by these methods for four images, each from a different species. We validate the attributes detected by these methods through a human study. We provide the detected attention heads and image-level attribute information (available in the CUB metadata) to seven individuals who are unfamiliar with the work. 
We instruct them to list all attributes they believe are captured by the attention heads.</p><p>An attribute is deemed detected if more than half of the individuals identify it from the attention heads.</p><p>Figure 13: INTR's class-specific query can discriminate similar species. The test image (first row) is Heermann Gull and the most similar candidate class (second row) is Ring Billed Gull. We show the cross-attention maps of the ground-truth class image (first row) and the candidate class image (second row) triggered by the test class's ground-truth query. The query searches for class-specific attributes in both species. For instance, in Head-1 to Head-4 (purple box), both rows detect the common back, breast, tail, and belly patterns, respectively. Head-8 (brown box) detects the red black-tipped bill from the test class but not the yellow ring bill from the candidate class.</p><p>In the main paper, we demonstrate the capability of INTR in detecting tiny image manipulations, focusing on the species Red-winged Blackbird and Orchard Oriole as detailed in Figure <ref type="figure">5</ref>. We further extend our analysis to another species, Scarlet Tanager, in Figure <ref type="figure">11</ref>. Specifically, we modified the Scarlet Tanager by altering its wing and tail colors to red, resembling the Summer Tanager. These alterations were conducted following the attribute guidelines (Cor). To quantitatively measure the effects, we randomly selected ten images from each species for manipulation. 
Our observations revealed that twenty-nine out of thirty cases resulted in a change in classification post-manipulation, indicating a success rate of 96.7%. This underscores INTR's ability to discern tiny image modifications that differentiate between distinct classes.</p></div>
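The insertion metric used for the quantitative comparison in Appendix F (after Petsiuk et al., 2018) can be sketched in simplified form. Here `predict` is a placeholder for a real classifier such as ResNet-50, and the toy "model" below is entirely ours:

```python
# Simplified sketch of the insertion metric: reveal patches in decreasing
# saliency order, query the model's class probability at each step, and
# average the resulting curve (an AUC). The deletion metric is analogous,
# removing patches instead. `predict` stands in for a real classifier.

def insertion_score(saliency, patches, predict):
    order = sorted(range(len(patches)), key=lambda i: -saliency[i])
    revealed = [0.0] * len(patches)          # start from an empty image
    curve = []
    for i in order:
        revealed[i] = patches[i]             # insert the next-most-salient patch
        curve.append(predict(revealed))
    return sum(curve) / len(curve)           # higher is better

# Toy model: "probability" is just the fraction of total evidence revealed.
patches = [0.1, 0.4, 0.3, 0.2]
toy_predict = lambda img: sum(img) / sum(patches)

good_saliency = patches                      # ranks the best patches first
bad_saliency = [-p for p in patches]         # worst-first ranking
assert insertion_score(good_saliency, patches, toy_predict) > \
       insertion_score(bad_saliency, patches, toy_predict)
```

A saliency map that ranks truly informative patches first recovers the prediction sooner, hence the higher insertion score.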
<div xmlns="http://www.tei-c.org/ns/1.0"><head>How does INTR differentiate similar classes?</head><p>We explore the predictive capabilities of INTR and investigate its ability to recognize classes that share similar attributes. In Figure <ref type="figure">12</ref>, we show the top classes predicted by INTR for a test image of Heermann Gull. The top ten predictions indeed exhibit similar appearances and genera, indicating the meaningfulness of these predictions.</p><p>We further investigate the attributes detected by INTR that are responsible for distinguishing similar species. The attributes that INTR captures are local patterns (a specific shape, color, or texture) useful for characterizing a species or differentiating between species. These attributes can be shared across species if the species are visually similar, as seen in Figure <ref type="figure">13</ref> and Figure <ref type="figure">14</ref>. In Figure <ref type="figure">13</ref>, we apply the query of Heermann Gull to images of Heermann Gull (first row) and Ring Billed Gull (second row). Since these two species are visually similar, several attention heads identify similar attributes in both images. However, at Head-8, the attention maps clearly identify the unique attribute of Heermann Gull that is not present in Ring Billed Gull. Please see the caption for details. In Figure <ref type="figure">14</ref>, we present the cross-attention maps activated by the ground-truth queries for two closely related species, Baltimore Oriole and Orchard Oriole. Additionally, we manually document some of the attributes by checking whether the attention maps align with the human-annotated attributes in the CUB dataset. This reveals that INTR can identify both shared and discriminative attributes in similar classes.</p><p>Class-specific queries improve over decoder layers. As mentioned in section 4 and Appendix B, our implementation of INTR has six decoder layers, each containing one cross-attention block. 
In the qualitative results, we only show the cross-attention maps from the sixth layer, which produces the class-specific features that will be compared with the class-agnostic vector for prediction (cf. Equation <ref type="formula">6</ref>). For the cross-attention blocks in other decoder layers, their output feature tokens become the input (query) tokens to the subsequent decoder layers. That is, the class-specific queries will change (and, perhaps, improve) over layers.</p><p>To illustrate this, we visualize the cross-attention maps produced by each decoder layer. The results are in Figure <ref type="figure">15</ref>. The attention maps improve over layers in terms of the attributes they identify so as to differentiate different classes.</p><p>Figure 16: Illustration of INTR. We show four images (row-wise) of the same bird species Bali Starling and the eight-head cross-attention maps (column-wise) triggered by the query of the ground-truth class. Each head learns to attend to a different (across columns) but consistent (across rows) semantic cue in the image that is useful for recognizing this bird species.</p><p>Figure 17: Illustration of INTR. 
We show three images (row-wise) of the same butterfly species Paititia Neglecta and the eight-head cross-attention maps (column-wise) triggered by the query of the ground-truth class.</p><p>Each head learns to attend to a different (across columns) but consistent (across rows) semantic cue in the image that is useful for recognizing this butterfly species.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Even though CT applies cross-attention, it uses image patches as queries to attend to the concept embeddings; the outputs of cross-attention are thus features for image patches.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Please see subsection 4.2 for a discussion on concerns about data leakage and unfair comparison.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>In the extreme case, one may consider the weights to be uniform, i.e., 1 N , at all spatial grids for all classes.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>In this paragraph, this refers to the similarity in the inner product space.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>https://zenodo.org/record/3477412</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>This field guide presents the idea of highlighting key features from images to identify species and distinguish them from closely related species. In the field guide, Peterson used expert knowledge to draw a synthetic representation of each species, with arrows pointing to the key features that focus a birder's attention in the field on a few defining traits, helping the observer correctly identify the species.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>Please refer to (Cor) and (Bir).</p></note>
		</body>
		</text>
</TEI>
