<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Gradient Frequency Modulation for Visually Explaining Video Understanding Models</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/22/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10358727</idno>
					<idno type="doi"></idno>
					<title level='j'>British Machine Vision Conference</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>X Lin</author><author>W Bao</author><author>M Wright</author><author>Y Kong</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In many applications, it is essential to understand why a machine learning model makes the decisions it does, but this is inhibited by the black-box nature of state-of-the-art neural networks. Because of this, increasing attention has been paid to explainability in deep learning, including in the area of video understanding. Due to the temporal dimension of video data, the main challenge of explaining a video action recognition model is to produce spatiotemporally consistent visual explanations, which has been ignored in the existing literature. In this paper, we propose Frequency-based Extremal Perturbation (F-EP) to explain a video understanding model's decisions. Because the explanations given by perturbation methods are noisy and non-smooth both spatially and temporally, we propose to modulate the frequencies of gradient maps from the neural network model with a Discrete Cosine Transform (DCT). We show in a range of experiments that F-EP provides more spatiotemporally consistent explanations that more faithfully represent the model's decisions compared to the existing state-of-the-art methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Since the first major success of deep learning in 2012, there has been an explosion of applications of these models, such as in image classification <ref type="bibr">[8]</ref>, machine translation <ref type="bibr">[5]</ref> and tumor detection <ref type="bibr">[30]</ref>. As people interact more frequently with these models, and as the models are applied in critical applications like medical diagnosis <ref type="bibr">[25]</ref> and terrorism detection <ref type="bibr">[41]</ref>, it is becoming increasingly important to understand their decisions. Otherwise, it can be hard for people to trust the models, since blind trust could result in catastrophic consequences. A large body of evidence in cognitive psychology <ref type="bibr">[21]</ref>, philosophy <ref type="bibr">[38]</ref> and machine learning <ref type="bibr">[10,</ref><ref type="bibr">20]</ref> shows the importance of explanation for understanding and building trust.</p><p>Numerous works have emerged, especially in the computer vision field, to explain the decisions of deep neural networks. Researchers have come to a consensus that a model explanation should be interpretable and faithful to the model <ref type="bibr">[14,</ref><ref type="bibr">28,</ref><ref type="bibr">37]</ref>. A common type of model explanation in the image domain takes the form of a heatmap or saliency map that localizes the areas of an input that are salient for a model's decision. Among these approaches, perturbation-based methods <ref type="bibr">[11,</ref><ref type="bibr">12,</ref><ref type="bibr">42]</ref> produce explanations that are more faithful and fine-grained than CAM- and backpropagation-based methods <ref type="bibr">[22,</ref><ref type="bibr">28,</ref><ref type="bibr">35,</ref><ref type="bibr">48]</ref>, but they also produce noise that appears to humans as randomly selected pixels, which hurts their interpretability. See Fig. 
<ref type="figure">6</ref> for a comparison of these explanation methods.</p><p>Another challenge arises when the aforementioned methods for the image domain are applied to the video domain: the explanations fail to capture the motion dynamics. We argue that spatiotemporal consistency should be added as another criterion for model explanation in the video domain. A spatiotemporally consistent explanation should focus on the target object/scene and follow its spatiotemporal trajectory, such as the gymnast in Fig. <ref type="figure">6</ref>. This greatly helps the interpretability of the explanation. At the same time, video understanding models may in fact not attend to the target object/scene in every frame because of moving edges, shot changes, and other issues <ref type="bibr">[39,</ref><ref type="bibr">40]</ref>. Spatiotemporally consistent explanations help the end-users or the developers of models to better assess the model's understanding of the dataset and improve upon it. However, explanations that are faithful to the model/data may not be spatiotemporally consistent, and explanations that are more spatiotemporally consistent may lose faithfulness. Thus, the challenge resides in finding a balance between the faithfulness of an explanation and its interpretability.</p><p>Therefore, we propose the Frequency-based Extremal Perturbation method (F-EP), which extends the EP method <ref type="bibr">[11]</ref> to the video domain. F-EP, illustrated in Fig. <ref type="figure">1</ref>, aims to find the salient areas of the input by perturbing a mask, where the perturbations are the gradients of the output label with respect to the input. The masks, composed of the aggregated perturbations, become the explanations for the video input. At each iteration of the optimization process, F-EP transforms the gradients into the frequency space using a common Fourier-related transform, the Discrete Cosine Transform (DCT). 
Then, F-EP modulates the frequency signals and transforms them back to the gradient space to update the masks.</p><p>This approach is inspired by works <ref type="bibr">[43,</ref><ref type="bibr">45]</ref> showing that neural networks learn not only the low-frequency components of an image, such as faces and shapes, but also the high-frequency components that appear like noise to humans. Therefore, the gradients of the salient high-frequency components will be high and make the gradient maps noisy. To overcome this problem, F-EP transforms the gradients into the frequency domain and selects a combination of low- and high-frequency components of the gradients to achieve semantically meaningful explanations that are also faithful to the model. Meanwhile, because the explanations focus on the semantic features of each frame, such as a person performing an action, and the majority of frames contain these features, the explanations naturally become spatiotemporally consistent.</p><p>Since no existing metrics address the spatiotemporal consistency of explanations, we propose the Spatiotemporal Consistency (STC) metric to measure how well the explanations align with the ground truth bounding boxes. Because bounding boxes indicate the spatial position of the foreground object/scene, spatiotemporally consistent explanations will have high alignment with them. We experimentally show that F-EP outperforms the state-of-the-art methods in terms of spatiotemporal consistency and faithfulness. In summary, our contributions in this paper are as follows.</p><p>&#8226; We propose the F-EP method, which can simultaneously reduce the noise in the explanations and make them spatiotemporally consistent by modulating the gradients in the frequency space.</p><p>&#8226; We propose a new metric, Spatiotemporal Consistency (STC), to assess the spatiotemporal consistency of an explanation in the video domain.</p><p>&#8226; In the ablation studies section, we show that F-EP using only low-frequency gradients is able to achieve state-of-the-art performance, while the addition of high-frequency gradients can improve its performance even further.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Model explainability is increasingly popular in a diverse range of deep learning applications. In this section, we present only the methods for image and video recognition models.</p><p>Class Activation Map (CAM)-based. CAM-based methods, such as CAM <ref type="bibr">[48]</ref>, Grad-CAM <ref type="bibr">[28]</ref>, Grad-CAM++ <ref type="bibr">[4]</ref>, and Score-CAM <ref type="bibr">[44]</ref>, use weighted sums of either activation maps or gradients of the class label with respect to the input to produce a heatmap. Although CAM-based methods are efficient <ref type="bibr">[9,</ref><ref type="bibr">27]</ref>, usually requiring just a few forward and backward passes, the resulting heatmaps are coarse due to the upsampling process. The Saliency Tubes approach <ref type="bibr">[36]</ref> adapts Grad-CAM to the video domain.</p><p>Backprop-based. Backprop-based methods <ref type="bibr">[2,</ref><ref type="bibr">13,</ref><ref type="bibr">22,</ref><ref type="bibr">31,</ref><ref type="bibr">33,</ref><ref type="bibr">37,</ref><ref type="bibr">46,</ref><ref type="bibr">47]</ref> compute saliency scores for each input pixel by starting with the final layer and, one layer at a time, inferring the importance of the inputs to the outputs of the layer until the input is reached. For example, Layerwise Relevance Backpropagation (LRP)-based methods <ref type="bibr">[13,</ref><ref type="bibr">22]</ref> assign a relevance score to each neuron and backpropagate the relevance scores to the input layer. The explanations of backprop-based methods have been criticized for being similar to the results of an edge detector <ref type="bibr">[1]</ref> and for being insensitive to randomization of the model's parameters, which should intuitively alter the results <ref type="bibr">[32]</ref>.</p><p>Perturbation-based. Perturbation-based explanation methods <ref type="bibr">[11,</ref><ref type="bibr">12,</ref><ref type="bibr">18,</ref><ref type="bibr">23,</ref><ref type="bibr">24,</ref><ref type="bibr">31,</ref><ref type="bibr">42]</ref> add perturbations to the input features to quantify the importance of each pixel. For example, RISE <ref type="bibr">[23]</ref> measures the importance of pixels by randomly masking areas of the input features and observing changes to the output. The Meaningful Perturbation (MP) <ref type="bibr">[12]</ref> and Extremal Perturbation (EP) <ref type="bibr">[11]</ref> methods optimize masks as explanations to localize salient areas. STEP <ref type="bibr">[18]</ref> extends EP to the video domain by adding spatiotemporal smoothness constraints, and it is the current state of the art for explaining video understanding models.</p><p>Although perturbation-based methods <ref type="bibr">[11,</ref><ref type="bibr">18]</ref> are able to produce more fine-grained explanations and are more sensitive to changes in the model's weights than the CAM-based and backpropagation-based methods <ref type="bibr">[4,</ref><ref type="bibr">28,</ref><ref type="bibr">31,</ref><ref type="bibr">37]</ref>, their explanations are still noisy and not spatiotemporally consistent (see Fig. <ref type="figure">6</ref>). Our proposed F-EP approach probes the frequency signals in the gradients in order to denoise the explanations and make them focus more tightly on the target object in every frame.</p><p>Fourier Transforms. Fourier transforms decompose a data point in the spatial or temporal domain into a combination of functions in the corresponding frequency domain. The Discrete Cosine Transform (DCT) is a Fourier-related transform that uses only real-valued cosine basis functions. DCT was first proposed for efficient image compression, where only the content-defining features are preserved <ref type="bibr">[43]</ref>. DCT is used in machine learning for image compression with SVM <ref type="bibr">[17]</ref>, image classification <ref type="bibr">[26]</ref>, and more. Recently, <ref type="bibr">[15,</ref><ref type="bibr">29]</ref> explored the effectiveness of DCT in crafting adversarial samples. Although F-EP also uses perturbations to produce explanations, the perturbations are added to the masks and not to the inputs as in <ref type="bibr">[15,</ref><ref type="bibr">29]</ref>. The optimization objectives are also different: adversarial attacks aim to generate samples that change the model's prediction, while explanations aim to faithfully represent a model's decision. Moreover, adversarial samples are sparse, noisy and hard to interpret, while the explanations given by F-EP focus tightly on the target object/scene spatially and temporally.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>In this paper, we evaluate F-EP on video classification models, as in EP <ref type="bibr">[11]</ref>. Evaluating F-EP on other types of video understanding models is a direction for future research.</p><p>Denote the video classification model by &#934;, the original input video clip by X &#8712; R^{T&#215;C&#215;H&#215;W}, and the predicted label by y = &#934;(X), where y is one of Y classes, T is the number of frames in a video clip, H and W are the height and width of the frame/mask, and C is the number of channels. We first briefly introduce the baseline method, Extremal Perturbation (EP) <ref type="bibr">[11]</ref>, and then explain our proposed Frequency-based Extremal Perturbation (F-EP) method, which is shown in Fig. <ref type="figure">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Extremal Perturbation</head><p>Extremal Perturbation (EP) was proposed to explain image understanding models and aims to localize the salient areas in the input for a model's decision through perturbations on the masks. Let the masks be M &#8712; R^{T&#215;1&#215;H&#215;W}. The optimization objective of EP is:</p><p><formula notation="TeX" n="1">M^{*} = \operatorname*{argmax}_{M} \; \Phi(M \otimes X) - \lambda R_a(M)</formula></p><p>The left term in Eq. (<ref type="formula">1</ref>) maximizes the classification confidence of the model &#934; on the perturbed input (M &#8855; X). The operator &#8855; is the perturbation operator, which applies local Gaussian blurring to each pixel in M. The second term in Eq. (<ref type="formula">1</ref>) is R_a(M) = &#8214;vecsort(M) &#8722; r_a&#8214;&#178;, which regularizes the area of the mask M with respect to the area constant a, and &#955; is the regularization constant. r_a is a binary vector with a fraction a of ones and a fraction (1 &#8722; a) of zeros, and vecsort(M) is the sorted vector of the masks M. Thus, R_a(M) computes the loss of the area of the masks with respect to the area constraint a. To solve the optimization problem (<ref type="formula">1</ref>), we can derive the gradients of the output class label with respect to the masks, where &#949; is the learning rate of the optimization (details in <ref type="bibr">[11]</ref>):</p><p><formula notation="TeX" n="2">\nabla M = \epsilon \, \frac{\partial}{\partial M} \left[ \Phi(M \otimes X) - \lambda R_a(M) \right]</formula></p><p>The optimal solution M* can thus be approximated by an iterative gradient ascent process. In each iteration, the masks M are updated as follows:</p><p><formula notation="TeX" n="3">M \leftarrow M + \nabla M</formula></p></div>
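<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the area regularizer concrete, the following NumPy sketch evaluates R_a(M) for a mask with values in [0, 1]. It is an illustration under stated assumptions rather than the implementation of <ref type="bibr">[11]</ref>: the function name area_regularizer, the ascending sort order, and the rounding of the number of ones to an integer are choices made for this example.</p>
<p><code lang="python"><![CDATA[
```python
import numpy as np


def area_regularizer(mask, area):
    """Sketch of R_a(M) = ||vecsort(M) - r_a||^2 for a mask with values in [0, 1].

    `area` is the target fraction a of the mask that should be "on".
    """
    flat = np.sort(np.asarray(mask, dtype=float).ravel())  # vecsort(M), ascending
    n_ones = int(round(area * flat.size))                   # assumed rounding of a * |M|
    template = np.zeros_like(flat)                          # r_a: zeros first, then ones
    if n_ones > 0:
        template[flat.size - n_ones:] = 1.0
    return float(np.sum((flat - template) ** 2))


# A binary mask that covers exactly a quarter of the pixels incurs no penalty.
quarter_mask = np.zeros((2, 1, 2, 4))
quarter_mask[0, 0, 0, :] = 1.0
penalty = area_regularizer(quarter_mask, area=0.25)
```
]]></code></p>
<p>Because the sorted mask is compared against a sorted binary template, the penalty depends only on how many mask values are close to one, not on where they are located; the location is driven by the confidence term of the objective.</p></div>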
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Frequency-based Extremal Perturbation</head><p>Motivation. A model explanation is expected to be interpretable by people and to faithfully represent a model's inference process <ref type="bibr">[14]</ref>. This becomes a challenge when the neural network uses not only low-frequency features for learning larger shapes and objects like faces, but also high-frequency features that look like noise <ref type="bibr">[45]</ref>. Because EP aims to achieve higher model confidence regardless of the frequencies chosen, high-frequency features are included whenever they are more salient than the low-frequency features, which makes the explanations noisy and not spatiotemporally consistent.</p><p>Because the gradients encode the importance of each pixel (Eq. (<ref type="formula">2</ref>)), important high-frequency features would have high gradients in &#8711;M, making &#8711;M noisy. A noisy &#8711;M leads to noisy explanations M, since the latter is iteratively updated with the former, as shown in Eq. (<ref type="formula">3</ref>). We propose to transform the gradients into the frequency domain using the Discrete Cosine Transform (DCT), such that the signals related to the content-defining features, which reside in the lower end of the frequency spectrum <ref type="bibr">[43]</ref>, are preserved. High-frequency signals are associated not only with noise, but also with fine-grained details such as textures and edges. Thus, the low-frequency signals allow the explanations to focus on the target object/person and become spatiotemporally consistent, while the high-frequency signals make the explanations more faithful to the model and more fine-grained (see Fig. <ref type="figure">3</ref>). Fig. <ref type="figure">2</ref> shows that the frequency-modulated gradient maps focus more on the foreground object and become spatiotemporally consistent when consecutive frames contain the target object. 
In the following, we present our method, Frequency-based Extremal Perturbation (F-EP), and show theoretically that modulating the gradients is equivalent to modulating the masks.</p><p>Gradient Frequency Modulation. F-EP is presented in Fig. <ref type="figure">1</ref>. At each iteration of the optimization, F-EP transforms the gradient maps &#8711;M into the frequency domain using DCT. DCT is chosen over the Discrete Fourier Transform (DFT) because the basis functions used in DFT are complex-valued, while &#8711;M only contains real numbers, meaning that DFT would compute redundant complex-valued coefficients <ref type="bibr">[3]</ref>. Because &#8711;M is three-dimensional, the DCT is also three-dimensional. Denote the DCT as &#8459; and the frequency map of the gradient &#8711;M as G &#8712; R^{H&#215;W&#215;T}. The (i, j, k)-th spatiotemporal entry of G (i &#8712; {1, . . . , H}, j &#8712; {1, . . . , W}, and k &#8712; {1, . . . , T}) is computed by the three-dimensional DCT:</p><p><formula notation="TeX" n="4">G_{i,j,k} = a \sum_{x=1}^{H} \sum_{y=1}^{W} \sum_{z=1}^{T} (h_i)_x \, (h_j)_y \, (h_k)_z \, \nabla M_{x,y,z}</formula></p><p>where a is a constant normalization factor and h_i is the vector form of the element-wise product between the vectors c_x and d_{x,i}, that is, the scaled cosine basis vector for frequency index i. Thus, Eq. (<ref type="formula">4</ref>) shows that the DCT operation is an intrinsically linear system, and according to the gradient ascent rule in Eq. (<ref type="formula">3</ref>), we have the following equality:</p><p><formula notation="TeX" n="5">\mathcal{H}(M^{*}) = \mathcal{H}(M_0) + \sum_{n} \mathcal{H}(\nabla M_n)</formula></p><p>where M_0 is the initial mask and &#8711;M_n is the gradient at iteration n. This equation shows that applying frequency modulation to the gradient map &#8711;M is equivalent to applying frequency modulation to the optimal mask M*. We provide a more detailed derivation in the supplementary material. Therefore, it is straightforward to linearly modulate the gradient frequency map G. To this end, we perform the frequency modulation on G as follows:</p><p><formula notation="TeX" n="6">\nabla \tilde{M} = \mathcal{H}^{-1}(A \odot G) + \mathcal{H}^{-1}(B \odot G)</formula></p><p>where &#8459;^{-1} is the inverse DCT (IDCT) function and &#8857; is the element-wise product. Note that both &#8459; and &#8459;^{-1} are linear systems. 
The low-frequency mask A &#8712; R^{H&#215;W&#215;T} is defined as A_{i,j,k} = 1 if i &#8804; r_l&#183;H, j &#8804; r_l&#183;W, and k &#8804; r_l&#183;T, and zero otherwise. The high-frequency mask B &#8712; R^{H&#215;W&#215;T} is defined analogously at the high end of the spectrum (see the A and B matrices in Fig. <ref type="figure">1</ref>). r_l and r_h denote the ratios of low- and high-frequency components to be preserved, respectively. Note that r_l, r_h &#8805; 0 and r_l + r_h &#8804; 1.</p><p>Eq. (<ref type="formula">6</ref>) first selects a ratio r_l of low-end frequencies and a ratio r_h of high-end frequencies from G, then performs IDCT on the selected high and low frequencies separately. By summing up the gradient maps modulated at different frequencies, we ensure that the gradient information contains both the low-end and high-end features.</p><p>Finally, instead of using the raw gradient &#8711;M in Eq. (<ref type="formula">3</ref>), we propose to use the frequency-modulated gradient <formula notation="TeX">\nabla \tilde{M}</formula> such that:</p><p><formula notation="TeX" n="7">M \leftarrow M + \nabla \tilde{M}</formula></p><p>Since DCT and IDCT are computationally efficient, our proposed modulation does not incur much additional computational cost.</p></div>
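<div xmlns="http://www.tei-c.org/ns/1.0"><p>The modulation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes an orthonormal DCT-II built from explicit matrices, and it assumes that the high-frequency mask B mirrors the low-frequency mask A at the upper end of each axis. The function names dct_matrix, dct3, idct3, frequency_masks and modulate_gradient are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix D of size n x n, so that D @ D.T is the identity."""
    k = np.arange(n)[:, None]                      # frequency index
    x = np.arange(n)[None, :]                      # sample index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)                        # rescale the constant (DC) row
    return d


def dct3(vol):
    """Separable 3D DCT: apply the 1D transform along every axis in turn."""
    for axis, n in enumerate(vol.shape):
        vol = np.moveaxis(np.tensordot(dct_matrix(n), vol, axes=(1, axis)), 0, axis)
    return vol


def idct3(freq):
    """Inverse 3D DCT, using the transpose of each orthonormal matrix."""
    for axis, n in enumerate(freq.shape):
        freq = np.moveaxis(np.tensordot(dct_matrix(n).T, freq, axes=(1, axis)), 0, axis)
    return freq


def frequency_masks(shape, r_low, r_high):
    """Binary low-frequency mask A and high-frequency mask B over DCT coefficients."""
    low = np.ones(shape, dtype=bool)
    high = np.ones(shape, dtype=bool)
    for axis, n in enumerate(shape):
        idx = np.arange(1, n + 1)                  # 1-based indices, as in the text
        view = [1] * len(shape)
        view[axis] = n
        low &= (idx <= r_low * n).reshape(view)
        # Assumption: B keeps the top r_high fraction of indices along each axis.
        high &= (idx > (1.0 - r_high) * n).reshape(view)
    return low.astype(float), high.astype(float)


def modulate_gradient(grad, r_low, r_high):
    """Keep low- and high-end DCT coefficients of a 3D gradient map and sum the results."""
    g = dct3(grad)
    a, b = frequency_masks(grad.shape, r_low, r_high)
    return idct3(a * g) + idct3(b * g)


# Example: a random gradient volume with H = 8, W = 8, T = 4.
rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 8, 4))
smoothed = modulate_gradient(grad, r_low=0.5, r_high=0.0)
```
]]></code></p>
<p>Setting r_low = 1 and r_high = 0 keeps every coefficient and returns the original gradient, which is a convenient sanity check. In practice, a library routine such as scipy.fft.dctn with norm="ortho" would typically replace the explicit matrices.</p></div>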
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiment</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Settings</head><p>Evaluation Metrics. Evaluation has always been one of the most challenging parts of studying model explainability. Due to the lack of ground truth explanations, existing methods can only be evaluated in a post hoc manner. We follow prior work in using two standard metrics that also apply to static images. First, Drop in Confidence (DC) measures the decrease in the model's confidence on the explanations compared to the original input <ref type="bibr">[4]</ref> (smaller DC is better). It is computed as <formula notation="TeX">\frac{1}{N} \sum_{i=1}^{N} \max(0,\, y - y_e)</formula>, where y_e = &#934;(X_e) and X_e = M &#8855; X as in Eq. (<ref type="formula">1</ref>). Second, Accuracy (Acc.) measures the model's classification accuracy on the explanations X_e.</p><p>In addition, since no existing evaluation metric is suitable for assessing the spatiotemporal consistency of an explanation, we propose a new metric called Spatiotemporal Consistency (STC). STC calculates the overlap between the explanations and the ground truth bounding boxes. In its definition, O denotes the ground truth bounding boxes, where O_{i,j,k} = 1 for all pixels inside the bounding boxes and 0 otherwise, and &#964; denotes the threshold applied to the masks M during evaluation, because M takes continuous values within the range [0, 1]. Because the bounding boxes focus tightly on the foreground object/person in each frame, a more spatiotemporally consistent explanation will have higher overlap with them.</p><p>Experimental Details. We evaluate F-EP against CAM-based methods, Grad-CAM <ref type="bibr">[28]</ref> and Grad-CAM++ <ref type="bibr">[4]</ref>, and backpropagation-based methods, Gradients <ref type="bibr">[31]</ref>, Integrated Grad <ref type="bibr">[37]</ref> and Smooth Grad <ref type="bibr">[33]</ref>. We extend the original implementation of these methods<ref type="foot">foot_0</ref> to the video domain. EP <ref type="bibr">[11]</ref> was extended to the video domain in STEP <ref type="bibr">[18]</ref><ref type="foot">foot_1</ref>. The target models to be explained are R(2+1)D <ref type="bibr">[40]</ref> and TSM <ref type="bibr">[19]</ref>.<ref type="foot">3</ref> For video datasets, we use UCF101-24 <ref type="bibr">[34]</ref> and Epic-Kitchens-Object <ref type="bibr">[7]</ref>. UCF101-24 is a video understanding dataset of 24 actions with annotated bounding boxes, and Epic-Kitchens-Object records cooking activities from a first-person point of view. The hyperparameters used are the same as in STEP <ref type="bibr">[18]</ref>, except that the step size is 13 and sigma is 23. The parameters r_l and r_h are selected based on the best performance on the validation set.</p></div>
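<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Drop in Confidence computation can be sketched as below. The second function is only a stand-in for an overlap score between a thresholded mask and the bounding-box indicator O: it uses intersection over union, which is an assumption made for illustration and is not claimed to be the STC definition. Both function names and the default threshold tau = 0.5 are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import numpy as np


def drop_in_confidence(conf_original, conf_explained):
    """Mean of max(0, y - y_e) over N samples (smaller is better)."""
    y = np.asarray(conf_original, dtype=float)
    y_e = np.asarray(conf_explained, dtype=float)
    return float(np.mean(np.maximum(0.0, y - y_e)))


def box_overlap(mask, boxes, tau=0.5):
    """Hypothetical overlap score: IoU between the mask thresholded at tau and O.

    This is an illustrative stand-in, not the paper's STC formula.
    """
    m = np.asarray(mask) >= tau                    # binarize the continuous mask
    o = np.asarray(boxes).astype(bool)             # O is 1 inside the boxes, 0 outside
    union = np.logical_or(m, o).sum()
    return float(np.logical_and(m, o).sum() / union) if union else 0.0


# Two samples: confidence drops by 0.3 on the first and rises on the second.
dc = drop_in_confidence([0.9, 0.8], [0.6, 0.9])
```
]]></code></p>
<p>Clamping at zero means that an explanation on which the model becomes more confident than on the original input does not offset drops on other samples.</p></div>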
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Quantitative Results</head><p>UCF101-24 Results. Table <ref type="table">1</ref> compares F-EP against SOTA methods on the UCF101-24 <ref type="bibr">[34]</ref> dataset with two target models, R(2+1)D <ref type="bibr">[40]</ref> and TSM <ref type="bibr">[19]</ref>. On the R(2+1)D model, F-EP (r_l = 0.5, r_h = 0.2) outperforms all the other methods, including the baselines EP <ref type="bibr">[11]</ref> and STEP <ref type="bibr">[18]</ref>, which shows that it is able to produce more faithful and spatiotemporally consistent explanations. On TSM, F-EP (r_l = 0.5, r_h = 0.1) achieves the best result on the STC metric and the second best on the DC and Acc. metrics. The first reason is that the UCF101-24 dataset has high scene representation bias <ref type="bibr">[6]</ref>, so the model does not need to look at the actual activity to give a correct prediction. Because the explanations given by F-EP focus more on the person performing the activity, which are at the lower end of the frequency domain, the model will have lower confidence on these explanations. Second, R(2+1)D uses 3D convolution filters to extract spatiotemporal features, while TSM uses 2D convolution filters that capture less temporal information. Thus, a more spatiotemporally consistent explanation will have higher model confidence in R(2+1)D than in TSM.</p><table><head>Table 1: Results (%) on UCF101-24 <ref type="bibr">[34]</ref> with R(2+1)D <ref type="bibr">[40]</ref> and TSM <ref type="bibr">[19]</ref> models. The best and second best performing methods are shown in red and blue. Our method F-EP shows consistent improvement over the state-of-the-art methods on different models.</head>
<row role="label"><cell>Method</cell><cell>R(2+1)D DC (&#8595;)</cell><cell>R(2+1)D Acc. (&#8593;)</cell><cell>R(2+1)D STC (&#8593;)</cell><cell>TSM DC (&#8595;)</cell><cell>TSM Acc. (&#8593;)</cell><cell>TSM STC (&#8593;)</cell></row>
<row><cell>Gradients <ref type="bibr">[31]</ref></cell><cell>89.4</cell><cell>6.0</cell><cell>0.4</cell><cell>90.8</cell><cell>2.1</cell><cell>0.6</cell></row>
<row><cell>Integrated Grad <ref type="bibr">[37]</ref></cell><cell>89.5</cell><cell>6.9</cell><cell>0.5</cell><cell>89.5</cell><cell>4.0</cell><cell>1.2</cell></row>
<row><cell>Smooth Grad <ref type="bibr">[33]</ref></cell><cell>90.7</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row>
</table><p>Fig. <ref type="figure">4</ref> illustrates the Deletion metric <ref type="bibr">[23]</ref>, which measures the drop in the model's confidence when the most important pixels are removed. The importance of the pixels is given by the saliency maps. A lower Area Under the Curve (AUC) implies that the explanation method is better and that the salient features it contains are more faithful to the model. We see that F-EP has the lowest AUC compared to the other explanation methods.</p><p>Figure <ref type="figure">4</ref>: DAUC of all the compared methods, which plots the model's confidence drop with respect to the percentage of pixels deleted. Note that the most salient pixels are deleted first. Lower Area Under the Curve (AUC) is better. F-EP has the lowest AUC compared to other methods, which indicates that the features contained in the explanations are more salient.</p><p>Epic-Kitchens-Object Results. Table <ref type="table">2</ref> reports the results of R(2+1)D on the Epic-Kitchens-Object <ref type="bibr">[7]</ref> dataset. F-EP (r_l = 0.7, r_h = 0) performs slightly worse on DC and Acc. than Grad-CAM, but much better on STC (qualitative results are shown in the supplemental material). The explanations given by F-EP have lower model confidence because videos in the Epic-Kitchens-Object dataset do not contain the target object in every frame, but F-EP produces an explanation for each frame. Therefore, non-target objects are included in the explanations, which lowers the model confidence. A future research direction is to have the optimization algorithm attend only to the frames that are salient for the output decision.</p><table><head>Table 2: Results (%) on Epic-Kitchens <ref type="bibr">[7]</ref> with R(2+1)D <ref type="bibr">[40]</ref>. The best and second best performance are shown in red and blue. The explanations given by F-EP are more spatiotemporally consistent than those of the SOTA methods while being comparable in the metrics of faithfulness (DC and Acc.).</head>
<row role="label"><cell>Method</cell><cell>DC (&#8595;)</cell><cell>Acc. (&#8593;)</cell><cell>STC (&#8593;)</cell></row>
<row><cell>Gradients <ref type="bibr">[31]</ref></cell><cell>48.1</cell><cell>16.0</cell><cell>0.4</cell></row>
<row><cell>Integrated Grad <ref type="bibr">[37]</ref></cell><cell>49.0</cell><cell>5.5</cell><cell>0.4</cell></row>
<row><cell>Smooth Grad <ref type="bibr">[33]</ref></cell><cell>49.3</cell><cell>6.1</cell><cell>0.8</cell></row>
<row><cell>Grad-CAM <ref type="bibr">[28]</ref></cell><cell>28.3</cell><cell>48.0</cell><cell>27.2</cell></row>
<row><cell>Grad-CAM++ <ref type="bibr">[4]</ref></cell><cell>54.</cell><cell></cell><cell></cell></row>
</table><p>Tables <ref type="table">1</ref> and <ref type="table">2</ref> show that the backpropagation-based methods, Gradients <ref type="bibr">[31]</ref>, Integrated Grad <ref type="bibr">[37]</ref> and Smooth Grad <ref type="bibr">[33]</ref>, perform poorly compared to CAM- and perturbation-based methods. For example, Fig. <ref type="figure">6</ref>'s second row (more results in the supplement) shows that Integrated Grad's explanations consist of sparse pixels that align poorly with the ground truth. These explanations are often perceived by the model as adversarial samples and have lower model confidence or even incorrect predicted labels <ref type="bibr">[16]</ref>. Smooth Grad and Gradients produce similar explanations and hence face the same problems.</p></div>
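<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Deletion metric discussed above can be sketched as follows. This is a simplified illustration under several assumptions: the model is any callable that returns a scalar confidence, the saliency map has the same shape as the video, deleted pixels are replaced by a constant fill value, and the area is computed with the trapezoidal rule. The names deletion_curve and deletion_auc are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import numpy as np


def deletion_curve(model, video, saliency, steps=10, fill=0.0):
    """Model confidence after deleting a growing fraction of the most salient pixels."""
    order = np.argsort(saliency.ravel())[::-1]     # most salient pixels first
    fractions = np.linspace(0.0, 1.0, steps + 1)
    scores = []
    for frac in fractions:
        deleted = video.ravel().copy()
        n_deleted = int(round(frac * order.size))
        deleted[order[:n_deleted]] = fill          # assumed deletion: constant fill
        scores.append(float(model(deleted.reshape(video.shape))))
    return fractions, np.array(scores)


def deletion_auc(fractions, scores):
    """Area under the deletion curve (trapezoidal rule); lower is better."""
    return float(np.sum((scores[1:] + scores[:-1]) * np.diff(fractions)) / 2.0)


# Toy example: the "model" only looks at one pixel of a tiny all-ones video.
video = np.ones((2, 2, 2))
toy_model = lambda v: v[0, 0, 0]
good_saliency = np.zeros((2, 2, 2))
good_saliency[0, 0, 0] = 1.0                       # ranks the decisive pixel first
auc_good = deletion_auc(*deletion_curve(toy_model, video, good_saliency, steps=4))
```
]]></code></p>
<p>A saliency map that ranks the decisive pixels first makes the confidence collapse early, which yields a small area; a map that ranks them last keeps the confidence high until the end and yields a large area.</p></div>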
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Ablation Studies</head><p>Without High Frequency. In Fig. <ref type="figure">5</ref>, when r_h = 0, the STC performance increases as r_l increases, because the added higher-frequency signals encode details such as edges and textures, which increase the faithfulness of the explanations. There is a sharp drop in performance after r_l = 0.7, because as &#8711;M contains more and more higher-frequency signals, it is overwhelmed by the excess amount of detail. The explanations also become noisier and less spatiotemporally consistent, because higher-frequency features are preferred over the low-frequency features if they increase the model confidence more.</p><p>Combination of High and Low Frequencies. We conduct experiments with varying r_h and fixed r_l = {0.4, 0.5, 0.6}. When r_l = 0.3, STC increases as r_h increases, because the amount of frequency signals at the lower end of the frequency spectrum is not sufficient and higher-frequency signals add valuable information to the explanations. However, when r_l = 0.4, STC increases less quickly than with r_l = 0.3. This could mean that r_l = 0.4 is a threshold at which the utility of high-frequency signals decreases. Finally, when r_l = 0.5, STC actually decreases as r_h increases. Also, F-EP is able to outperform the baseline method EP at multiple combinations of frequencies, which confirms our hypothesis that not all frequency signals are salient. One direction for future work is to search for the appropriate ratios r_l and r_h for each sample, because samples vary across datasets and actions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Qualitative Results</head><p>In Fig. <ref type="figure">6</ref>, we see that the explanations produced by F-EP are much easier to interpret than those of the prior works. Integrated Grad explanations are noisy, sparse and difficult to interpret, because the method directly uses gradients, which contain a substantial amount of noise. Grad-CAM produces explanations that fail to capture the activity dynamics. EP and STEP have difficulty correctly capturing the person's movement across consecutive frames. In comparison, F-EP is able to produce explanations that focus on the person performing the activity in each frame (more qualitative results are in the supplement).</p><p>Figure <ref type="figure">6</ref>: Visual comparison of explanation methods on the UCF101-24 dataset <ref type="bibr">[34]</ref>. The input (first row) contains consecutive frames from the activity FloorGymnastics. On the right of each method, the first word is the predicted label on the explanation, where FG = FloorGymnastics, BB = BalanceBeam, WB = WritingOnBoard, and the number denotes the predicted probability. Rows: Input: FG, [0.63]; Grad-CAM <ref type="bibr">[28]</ref>: BB, [0.19]; Integrated <ref type="bibr">[37]</ref>: WB, [0.37]; EP <ref type="bibr">[11]</ref>: BB, [0.99]; STEP <ref type="bibr">[18]</ref>: BB, [0.87]; F-EP (ours): FG, [0.76].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we propose an explanation method for video understanding models called Frequency-based Extremal Perturbation (F-EP), which aims to produce spatiotemporally consistent explanations that are faithful to the model and interpretable by humans. F-EP transforms the gradients to the frequency domain and modulates the frequency signals in order to preserve those gradient components that help to explain a model's decision. We evaluate F-EP on various datasets and models and show that it is able to outperform existing state-of-the-art works.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://github.com/jacobgil/pytorch-grad-cam</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://github.com/shinkyo0513/Video-Visual-Explanations</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>https://github.com/open-mmlab/mmaction2</p></note>
		</body>
		</text>
</TEI>
