<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Robustness Analysis for Convolutional Neural Networks with Uncertainty Quantification</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10259931</idno>
					<idno type="doi"></idno>
					<title level='j'>Proc. of the International Forum on Signal Processing</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>L. Mihaylova M. Javed</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper presents a novel framework for training convolutional neural networks (CNNs) to quantify the impact of gradual and abrupt uncertainties in the form of adversarial attacks. Uncertainty quantification is achieved by combining the CNN with a Gaussian process (GP) classifier algorithm. The variance of the GP quantifies the impact of the uncertainties and especially their effect on the object classification tasks. Learning from uncertainty provides the proposed CNN-GP framework with flexibility, reliability and robustness to adversarial attacks. The proposed approach includes training the network under noisy conditions. This is accomplished by comparing predictions with classification labels via the Kullback-Leibler divergence, Wasserstein distance and maximum correntropy. The network performance is tested on the classical MNIST, Fashion-MNIST, CIFAR10 and CIFAR100 datasets. Further tests on robustness to both black-box and white-box attacks are also carried out for MNIST. The results show that the testing accuracy improves for networks that backpropagate uncertainty as compared to methods that do not quantify the impact of uncertainties. A comparison with a state-of-the-art Monte Carlo dropout method is also presented, and the CNN-GP framework is shown to outperform it with respect to reliability and computational efficiency.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Robustness in artificial intelligence (AI) is related to reliability and explainability, especially when deep neural networks (DNNs) are applied in uncertain environments <ref type="bibr">[1]</ref>. DNNs learn complex representations sequentially, through layers of linear computations followed by non-linear transformations. Over the previous decade, this form of hierarchical learning brought a giant leap in accuracy, with systems achieving near human-level performance on tasks such as image classification <ref type="bibr">[2]</ref>. The first half of this decade saw a surge of machine learning algorithms that encouraged the development of DNNs which not only predict but also quantify the impact of uncertainties on their predictions <ref type="bibr">[3]</ref>. Although it is difficult to foresee what the next big leap in AI will be, there is now a growing motivation towards developing AI systems that are robust to adversarial attacks <ref type="bibr">[4]</ref>.</p><note place="foot">Mahed Javed and Lyudmila Mihaylova are with the Department of Automatic Control and Systems Engineering, University of Sheffield, UK (e-mail: mjaved1@sheffield.ac.uk, l.s.mihaylova@sheffield.ac.uk).</note><p>Developing robust AI systems entails plenty of challenges. These include tackling human user errors, misspecified goals, incorrect models and unmodeled phenomena <ref type="bibr">[5]</ref>. Adversarial attacks can be of two types: black box or white box <ref type="bibr">[6]</ref>. These attacks challenge the network's learned capabilities. Black-box attacks only have access to the inputs of the network. White-box attacks <ref type="bibr">[6]</ref>, on the other hand, have full access to the DNN architecture: the inputs, the outputs and the gradient information in each of the nodes. Misspecified goals often arise because the original intended AI system design goals do not meet the end-user goals. 
The reverse of this situation results in incorrect models. Another cause of incorrect models is the lack of representation of model uncertainty. If a model is highly uncertain when solving a problem, it is likely not suitable for the task. Model uncertainty is also referred to as epistemic uncertainty <ref type="bibr">[7]</ref>.</p><p>Finally, challenges from unmodeled phenomena arise because not all AI systems can incorporate prior knowledge of everything in the environment. This is also known as aleatoric uncertainty and is present within the inputs of the AI system <ref type="bibr">[7]</ref>. Accounting for uncertainty in an AI system also improves its explainability, since it allows the model to explicate its predictions. This is essential for critical decision-making systems. Previous approaches to building robust AI systems rarely considered such aspects. This is the research challenge that this paper focuses on.</p><p>This paper explores the possibility of building a robust AI system with only two convolutional layers and validates it on both white-box and black-box attacks. The tests are carried out on the relatively simple datasets MNIST <ref type="bibr">[8]</ref> and FMNIST <ref type="bibr">[9]</ref>, as well as on the more complex CIFAR10 <ref type="bibr">[10]</ref> and the larger CIFAR100 <ref type="bibr">[10]</ref>. The main idea is to use a similarity cost as a tool to backpropagate the uncertainty information. This has a regularization effect on the loss functions. The entire problem of learning from uncertainty is cast as an instance of backpropagation. The proposed framework trains a simple convolutional neural network (CNN) <ref type="bibr">[11]</ref> feature extractor with a Gaussian process (GP) classifier <ref type="bibr">[12]</ref> at a higher level. The GP is introduced for two purposes: first, to characterize uncertainty and, second, to use the features from the CNN for classifying the input images. 
The uncertainty is characterized by the variance of the GP. The CNN model transforms large, complex input spaces into simple, low-dimensional features for the GP to interpret.</p><p>The CNN-GP training is carried out in two stages: backpropagation of epistemic uncertainty and then of aleatoric uncertainty. The validation results demonstrate that these two stages influence each other and cannot achieve good results when used in isolation. The main contributions of this work are highlighted below.</p><p>&#8226; A CNN-GP framework is proposed for classification and uncertainty quantification. The framework performance is validated with both gradual and abrupt uncertainties (random attacks in the data) and is compared with a state-of-the-art approach with dropout</p><p>&#8226; The framework has been extensively tested on four datasets of increasing complexity: MNIST, FMNIST, CIFAR10 and CIFAR100. The framework demonstrates that backpropagation of uncertainty is vital for developing CNNs and DNNs with strong robustness against black-box and white-box adversarial attacks</p><p>&#8226; The uncertainty quantification is based on the GP variance. The analysis charts show reduced uncertainty in predictions. Precision-recall and receiver operating characteristic (ROC) curves characterize the accuracy of the results</p><p>&#8226; The proposed framework provides reliable uncertainty estimates and has an increased computational efficiency compared with a state-of-the-art Monte Carlo dropout approach <ref type="bibr">[23]</ref>. The validation is performed with increasing strength of the black-box and white-box attacks</p><p>&#8226; The paper shows that explainable AI is linked to robust AI, such that robustness in AI can be achieved by accounting for uncertainty measures in both the model and datasets</p><p>The rest of the paper is organized as follows. 
Section II gives a brief overview of recent methods from the fields of meta-learning and adversarial learning in deep learning. Section III presents the proposed framework. Section IV follows with the robustness analysis and tests on the accuracy of the framework. These tests include black-box and white-box attacks on four different datasets varying in complexity and size. The propagated uncertainty is analyzed with the GP variance, precision-recall and ROC curves. The variance information is further tested with increasing attack strength. Section V presents a discussion of the results, and Section VI concludes with future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORKS</head><p>Learning from uncertainty is an actively developing field in Bayesian deep learning. It is practised in many forms and under several learning monikers, the most popular being meta-learning <ref type="bibr">[13]</ref> and adversarial learning <ref type="bibr">[14]</ref>. The meta-learning treatment of uncertainty-based learning recognizes that learning from uncertainty is a "meta" step operating on top of the main learning step (i.e. backpropagation of gradients). Adversarial learning, on the other hand, treats uncertainty as a means of generating attacks that may be black-box or white-box. There is a plethora of techniques in both regimes <ref type="bibr">[15]</ref>, as well as defence strategies. However, only a few leverage uncertainty. Amongst these is the work of <ref type="bibr">[1]</ref>, which focuses on the detection of attacks, while <ref type="bibr">[16]</ref> and <ref type="bibr">[17]</ref> focus more on their mitigation. Some methods have even merged the two fields. For example, in <ref type="bibr">[18]</ref>, a generative adversarial network (GAN) based discrimination is used to backpropagate epistemic uncertainty. In this section, we study the literature and compare the latest techniques to our approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Comparison of Current Approaches</head><p>Uncertainty-related research in meta-learning is usually applied to semi-supervised tasks. These tasks entail learning from a dataset with limited labels. Learning is also carried out under noisy, uncertain conditions. Examples in the literature include <ref type="bibr">[19]</ref> and <ref type="bibr">[20]</ref>. The main difference between these approaches is that <ref type="bibr">[19]</ref> adopts a global averaging scheme on DNN weights as a means of modelling noise in the labels, while <ref type="bibr">[20]</ref> generates an external noise model and uses a student-teacher learning scheme to teach the network to be consistent in its predictions. Methods that involve external noise generation do not require alteration of their training architecture. They are also easy to scale.</p><p>Research in the field of adversarial learning, such as <ref type="bibr">[16]</ref> and <ref type="bibr">[17]</ref>, aims to reduce the effects of adversarial attacks. The major differences between the approaches are that <ref type="bibr">[17]</ref> uses a GAN to train the main network to resist attacks, while <ref type="bibr">[21]</ref> and <ref type="bibr">[22]</ref> use Bayesian methods. Specifically, <ref type="bibr">[21]</ref> uses softmax variance to account for uncertainty, while <ref type="bibr">[22]</ref> uses Monte Carlo (MC) dropout. MC dropout quantifies uncertainty by sampling via multiple forward passes and then computing the variance from these samples <ref type="bibr">[23]</ref>. GAN methods do not discriminate between black-box and white-box attacks. Therefore, such methods are flexible and applicable to any form of classifier. MC dropout, on the other hand, scales well with network architecture but at the price of computational cost. 
Additionally, <ref type="bibr">[21]</ref> shows that softmax variance is an approximation to mutual information. Comparing this with the predictive entropy obtained from MC dropout, <ref type="bibr">[21]</ref> proves that mutual information is more informative for detecting attacks. Here, information criteria characterize how well the uncertainty is represented and its sensitivity to adversarial attacks.</p><p>These approaches <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref> have several drawbacks. GAN-based methods are difficult to train since they involve optimizing two separate DNN models (discriminator and generator). MC dropout is relatively slow at uncertainty computation, and the quality of the uncertainty measure depends on the sampling rate. Another important factor is calibration. Both GAN and MC dropout methods have a poorly calibrated representation of uncertainty, as opposed to the better calibrated softmax variance in <ref type="bibr">[21]</ref>.</p></div>
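<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sampling idea behind MC dropout described above (multiple stochastic forward passes, then the variance of the samples) can be illustrated with a minimal sketch. It is not the implementation of [22] or [23]: the toy linear model, the names dropout_forward and mc_dropout, the dropout probability and the seed are assumptions made purely for illustration. The default of 100 samples mirrors the sampling rate quoted later in the paper.</p>
<p><code lang="python"><![CDATA[
```python
import random
import statistics


def dropout_forward(weights, x, drop_prob, rng):
    """One stochastic forward pass of a toy linear model (hypothetical stand-in
    for a network) with dropout applied to the inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for w, xi in zip(weights, x):
        if rng.random() < keep:
            # Inverted dropout: rescale so the expected output is unchanged.
            total += w * xi / keep
    return total


def mc_dropout(weights, x, drop_prob=0.5, n_samples=100, seed=0):
    """Return the mean and variance over n_samples stochastic forward passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(weights, x, drop_prob, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pvariance(samples)


mean, variance = mc_dropout([0.2, -0.5, 1.0], [1.0, 2.0, 3.0])
print(mean, variance)
```
]]></code></p>
<p>The variance estimate only stabilizes as the number of forward passes grows, which is why the cost of this approach scales with the sampling rate.</p></div>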
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Comparison with Proposed Methods</head><p>The aforementioned techniques <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref> and <ref type="bibr">[21]</ref> provide sound solutions for uncertainty-based robustness. However, they only consider the forward propagation of uncertainty. In this work, we confirm the theory posed by <ref type="bibr">[1]</ref> and improve on the methods of both <ref type="bibr">[16]</ref> and <ref type="bibr">[18]</ref>, which indirectly accomplish backpropagation of uncertainty. The framework proposed in this paper is faster and less computationally expensive than <ref type="bibr">[16]</ref> and <ref type="bibr">[18]</ref>. This is because GANs require training two separate networks, and MC dropout methods require a long sampling time. The proposed framework uses a Gaussian process classifier that allows fast quantification of uncertainties. By backpropagating the uncertainty information, it is possible to reduce the uncertainty in the predictions as well as improve the sensitivity of the framework to adversarial attack strength.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROPOSED FRAMEWORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Notations</head><p>This subsection describes the main notations (in Table <ref type="table">I</ref>) used in this paper and especially in the CNN-GP framework, described in Section IV. The next subsections introduce the CNN and the GP parts of our proposed framework, treating their formal definitions and descriptions separately.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Convolutional Neural Networks</head><p>CNNs are a specific type of neural network that learns features from images in a hierarchical fashion <ref type="bibr">[11]</ref>. The main idea is to use convolutional kernels that adapt to the input image. Given a loss function, learning in CNNs is performed by differentiating the loss function with respect to the weights of each kernel and updating those weights using this gradient, scaled by the learning rate &#120574;.</p><p>The proposed framework combines a CNN feature extractor and a GP after it, in one architecture (Figure <ref type="figure">1</ref>). The CNN has two convolutional layers with 32 and 64 filters, respectively, each with a 3x3 kernel. The padding size of the convolutional layers varies because the MNIST and FMNIST datasets share an input size of 28x28x1, whereas CIFAR10 and CIFAR100 have an input size of 32x32x3. The padding size is set to 2 for MNIST and FMNIST and to 1 for CIFAR10 and CIFAR100. A max-pooling layer is introduced between the second layer and the final fully connected layer. Pooling layers downsample the features and dropout layers are used as a regularizer. The fully connected layer, on the other hand, flattens the features to a 128x10 feature vector for MNIST and FMNIST (128x16 for CIFAR10 and 128x100 for CIFAR100). These features are then fed to the GP half of the framework discussed in the next subsection.</p></div>
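<div xmlns="http://www.tei-c.org/ns/1.0"><p>The effect of the two padding sizes can be checked with the standard output-size formula for a convolution. The following is a worked sketch under stated assumptions: the text does not give the stride or say whether the padding is applied to both convolutional layers, so a stride of 1 and identical padding in both layers are assumed here, and the helper names are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
def conv_output_size(size, kernel, padding, stride=1):
    """Spatial output size of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def after_two_convs(size, padding, kernel=3):
    """Side length after two stacked 3x3 convolutions (assumed stride 1)."""
    first = conv_output_size(size, kernel, padding)
    return conv_output_size(first, kernel, padding)


mnist_side = after_two_convs(28, padding=2)  # 28 -> 30 -> 32
cifar_side = after_two_convs(32, padding=1)  # 32 -> 32 -> 32
print(mnist_side, cifar_side)
```
]]></code></p>
<p>Under these assumptions both input sizes reach the pooling layer with the same 32x32 spatial extent, which is one plausible reason for choosing different padding sizes per dataset.</p></div>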
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Gaussian Process</head><p>A Gaussian process is a Bayesian nonparametric approach <ref type="bibr">[12]</ref> which can represent highly nonlinear phenomena. The GP approach models a distribution over functions. Learning a GP is similar to learning in CNNs, in the sense that it involves a kernel learning process. However, the choice of the kernel and, correspondingly, of the likelihood function is problem-dependent. In our framework, we use a squared exponential kernel and a softmax likelihood, which squashes the posterior mean of the output distribution to probabilities. For the GP model, we use Massively Scalable Gaussian Processes (MSGP).</p><p>MSGPs are preferred in many applications thanks to their scalability. MSGPs were introduced in <ref type="bibr">[24]</ref> and have achieved notable results as sparse GP models with inducing points. The computational load of inverting the covariance matrix is reduced by using an eigendecomposition of the covariance matrix into a series of Toeplitz matrices.</p><p>Within the architecture, the output from the GP is a categorical distribution, from which a 1x&#119873; vector (&#119873; is the batch size) is then estimated via maximum likelihood.</p></div>
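<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two modelling choices named above, a squared exponential kernel and a softmax likelihood, can be written out as a short sketch. This is not the MSGP implementation of [24]: it only evaluates the standard kernel and softmax formulas on small lists. The function names and the convention of scaling the kernel by the squared amplitude are assumptions for illustration.</p>
<p><code lang="python"><![CDATA[
```python
import math


def squared_exponential(x1, x2, lengthscale=1.0, amplitude=1.0):
    """k(x1, x2) = amplitude^2 * exp(-||x1 - x2||^2 / (2 * lengthscale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return amplitude ** 2 * math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def kernel_matrix(points, lengthscale=1.0, amplitude=1.0):
    """Covariance matrix of the kernel evaluated on all pairs of points."""
    return [[squared_exponential(p, q, lengthscale, amplitude) for q in points]
            for p in points]


def softmax(scores):
    """Squash real-valued class scores into probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


K = kernel_matrix([[0.0], [1.0], [3.0]], lengthscale=1.0, amplitude=1.0)
probs = softmax([2.0, 1.0, 0.1])
print(K[0], probs)
```
]]></code></p>
<p>The lengthscale controls how quickly the covariance decays with the distance between feature vectors, and the amplitude sets its overall scale. These correspond to the two GP parameters that the training algorithm updates.</p></div>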
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Training Algorithm for the Proposed Framework</head><p>The training algorithm for the proposed framework consists of two halves: a) backpropagation of epistemic uncertainty and b) backpropagation of aleatoric uncertainty. Both are carried out independently. In step a), the prediction from the GP classifier is compared with the labels using the maximum likelihood loss &#8466; &#119898;&#119886;&#119909; . The error obtained is then backpropagated through the parameters of the GP (the lengthscale &#955; and amplitude &#120590; &#119864; ) and the CNN (convolutional layers). For inference, we use approximate variational inference, since a categorical likelihood is used for the classification.</p><p>In step b), synthetic training samples are created. This step is inspired by the work of <ref type="bibr">[20]</ref>, where randomly sampled mini-batches are ranked. This is followed by a random selection stage, where the top &#119896; nearest neighbors of the mini-batch samples are selected to replace the original samples. These synthetic samples are then fed to both the CNN and the GP. The loss function &#8466; &#119866;&#119875; is then used to backpropagate aleatoric uncertainty by encouraging the GP classifier to remain consistent in its predictions. These losses encourage the development of noise-tolerant weights and also have a regularization effect. Three functions characterize the similarity losses: a) the Kullback-Leibler divergence, b) the Wasserstein distance and c) the maximum correntropy.</p></div>
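<div xmlns="http://www.tei-c.org/ns/1.0"><p>The synthetic-sample step in b) can be sketched as follows. The text leaves several details open, so this is only one possible reading: candidates are ranked by squared Euclidean distance, and each mini-batch sample is replaced by a randomly chosen member of its top k neighbours. The function names, the candidate pool, the distance measure and the seed are all assumptions and are not taken from the paper or from [20].</p>
<p><code lang="python"><![CDATA[
```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def synthetic_minibatch(batch, pool, k=3, seed=0):
    """Replace every sample in batch by one of its k nearest neighbours in pool.

    The neighbours are ranked by distance and one of the top k is picked at
    random (assumed reading of the ranking and random selection stages).
    """
    rng = random.Random(seed)
    synthetic = []
    for sample in batch:
        ranked = sorted(pool, key=lambda cand: squared_distance(sample, cand))
        synthetic.append(rng.choice(ranked[:k]))
    return synthetic


pool = [[0.0], [1.0], [2.0], [10.0]]
batch = [[0.1], [9.0]]
print(synthetic_minibatch(batch, pool, k=2))
```
]]></code></p>
<p>Feeding such neighbours in place of the original samples, while keeping the original labels, is what would push the classifier towards consistent predictions under input perturbations.</p></div>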
<div xmlns="http://www.tei-c.org/ns/1.0"><table><head>TABLE I. NOTATIONS</head>
<row role="label"><cell>Symbol</cell><cell>Description</cell></row>
<row><cell>&#120590; &#770;&#119894;2</cell><cell>Aleatoric variance (uncertainty) for the i-th batch</cell></row>
<row><cell>&#120575;&#119909; &#119894;</cell><cell>The difference between the i-th data point and the GP prediction</cell></row>
<row><cell>&#119891; &#119866;&#119875;</cell><cell>The Gaussian process function</cell></row>
<row><cell>&#119891; &#119862;&#119873;&#119873;</cell><cell>The convolutional neural network function</cell></row>
<row><cell>&#119910; &#770;&#119894; &#119862;&#119873;&#119873;</cell><cell>Softmax prediction from the CNN base feature extractor</cell></row>
<row><cell>&#119910; &#770;&#119911;&#119862;&#119873;&#119873;</cell><cell>Prediction from the z-th node of the CNN base feature extractor</cell></row>
<row><cell>&#8466; &#119898;&#119886;&#119909;</cell><cell>Maximum likelihood loss</cell></row>
<row><cell>&#8466; &#119866;&#119875;</cell><cell>Similarity loss penalizing output from the GP classifier and labels</cell></row>
</table></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>and provide the full algorithm description below. The notations that are used in the algorithm section are also provided in Section III. Algorithm 1 presented below summarizes the implemented CNN and GP framework for characterizing the uncertainties. Different loss functions are used and these are described in Section IV.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Loss Functions</head><p>Consider two probability mass functions &#119901;(&#119909;) and &#119902;(&#119909;) that take a data point &#119909;. Finding the shift of mass from one to the other requires calculating the discrepancy between the two. The Kullback-Leibler divergence <ref type="bibr">[25]</ref> &#119863; &#119870;&#119871; , shown in <ref type="bibr">(1)</ref>, represents this discrepancy as a measure of entropy. It quantifies the shift of probability mass by taking the difference of entropy across the distributions.</p><p>The Wasserstein distance <ref type="bibr">[26]</ref> approaches the problem from the point of view of optimal transport. Such problems are divided into two parts: assignment and cost. The assignment strategy determines how much mass is moved across the supports of the distributions. The cost measures the effort required for the assignment strategy. Both are represented as matrices, &#119875; and &#119862;, respectively. The total cost is obtained by taking the Frobenius inner product of the two (i.e. &#10216;&#119862;, &#119875;&#10217;). The objective is then to minimize this product with a regularized entropy term subtracted, as given in <ref type="bibr">(2)</ref>. Here, &#120578; denotes the regularization term. For our experiments, we choose the default value &#120578; = 0.1 and a quadratic distance-based cost function as an approximation to the primal Wasserstein distance formulation <ref type="bibr">[26]</ref>.</p><p>Finally, the maximum correntropy loss function <ref type="bibr">[27]</ref> has also been implemented in the backpropagation step. It uses a kernel to compute the difference across two variables instead of using entropy-based methods, as the KLD and Wasserstein functions do. The formulation can be seen in equation <ref type="bibr">(3)</ref>. 
The Gaussian kernel is a popular choice, with &#120590;&#178; representing the variance of the distribution. The considered cost functions are given below.</p><p>The term &#119881; &#120590; refers to the maximum correntropy (MC) between two masses &#119901;(&#119909;) and &#119902;(&#119909;), where &#120124; denotes the expected value. This measure has been proven to be less sensitive to outliers, whereas sensitivity to outliers is found in many second-order statistical measures such as cross-entropy. It is heavily studied in outlier suppression <ref type="bibr">[27]</ref> and is ideally suited to robust algorithm design. Section V presents the results and analyses them.</p></div>
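<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three similarity measures discussed in this subsection can be sketched in a few lines using their standard textbook forms: the discrete Kullback-Leibler divergence, the sample mean of a Gaussian kernel as a correntropy estimate, and an entropy-regularized transport cost computed with Sinkhorn iterations. These are not the paper's equations (1) to (3), which are not reproduced here. The function names, the small constant eps, the number of iterations and the use of Sinkhorn scaling as the approximation are assumptions; only the regularization value 0.1 and the quadratic cost come from the text.</p>
<p><code lang="python"><![CDATA[
```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence: sum over x of p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def gaussian_kernel(u, sigma=1.0):
    """Gaussian kernel with variance sigma^2, evaluated at the difference u."""
    return math.exp(-u * u / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def correntropy(p, q, sigma=1.0):
    """Sample estimate of E[kernel(p - q)]; it is largest when p equals q."""
    return sum(gaussian_kernel(pi - qi, sigma) for pi, qi in zip(p, q)) / len(p)


def sinkhorn_cost(p, q, cost, eta=0.1, n_iter=500):
    """Transport cost <C, P> of an entropy-regularized plan P (Sinkhorn scaling)."""
    n, m = len(p), len(q)
    K = [[math.exp(-cost[i][j] / eta) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [p[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [q[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(cost[i][j] * plan[i][j] for i in range(n) for j in range(m))
    return total, plan


points = [0.0, 0.5, 1.0]                      # support of both distributions
cost = [[(a - b) ** 2 for b in points] for a in points]  # quadratic cost
p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(kl_divergence(p, q), correntropy(p, q), sinkhorn_cost(p, q, cost)[0])
```
]]></code></p>
<p>The divergence and the transport cost are minimized during training, whereas correntropy is maximized. A loss built on correntropy would therefore typically use its negative or a constant minus its value.</p></div>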
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. PERFORMANCE VALIDATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Validation Accuracy, Precision-Recall and ROC Curves</head><p>Before the experiments, the CNN-GP classifier was trained with the three different similarity losses. The purpose was to observe the accuracy as a means of performance evaluation. The accuracy was calculated by dividing the number of correctly classified samples by the total number of samples. Experiments were run ten times and the accuracy values were averaged. The standard deviation was &#177; 2%. Then, the system was disrupted using black-box attacks of two types: a) additive white Gaussian noise (AWGN) and b) motion blur (MB). The results were compared with the version of the system where no similarity losses were used (i.e. without regularization). These results are presented in Table <ref type="table">II</ref>. Next, the precision-recall and ROC results characterize the accuracy of the proposed CNN-GP framework. These results are plotted for each dataset side by side in Figure <ref type="figure">2</ref>. The average precision (AP) and ROC area are two quantities that are obtained by  </p></div>
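<div xmlns="http://www.tei-c.org/ns/1.0"><p>The AWGN black-box attack and the accuracy measure described above can be sketched as follows. This is a simplified illustration, not the experimental code: the pixel range of 0 to 1, the clipping step, the noise level and the function names are assumptions, and the image is represented as a plain nested list.</p>
<p><code lang="python"><![CDATA[
```python
import random


def add_awgn(image, sigma, seed=0):
    """Add zero-mean Gaussian noise of standard deviation sigma to every pixel.

    Pixels are assumed to lie in [0, 1] and are clipped back to that range.
    """
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in image]


def accuracy(predictions, labels):
    """Number of correctly classified samples divided by the total number."""
    correct = sum(1 for pred, label in zip(predictions, labels) if pred == label)
    return correct / len(labels)


image = [[0.0, 0.5], [1.0, 0.25]]
noisy = add_awgn(image, sigma=0.3, seed=1)
print(noisy, accuracy([1, 2, 3], [1, 2, 0]))
```
]]></code></p>
<p>Increasing sigma corresponds to increasing the attack strength, which is the quantity varied in the variance sensitivity tests.</p></div>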
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Computational Time</head><p>The computational time of the proposed CNN-GP framework is compared with that of the MC dropout method <ref type="bibr">[23]</ref>. Both models are made to output variance information on simple MNIST input images. The sampling rate for the MC dropout method is set to 100. The respective run-time of each is then computed on the GPU cluster (NVIDIA K80) provided by the University of Sheffield. The testing time is measured in minutes and the results are tabulated in Table <ref type="table">III</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. DISCUSSION</head><p>Considering the results from Table <ref type="table">II</ref>, we see that when there is no attack, the CNN-GP configurations that backpropagate both epistemic and aleatoric uncertainties, excluding the case with the Wasserstein metric, perform better than the configuration without backpropagation (no regularization). Furthermore, backpropagation of epistemic uncertainty influences the backpropagation of aleatoric uncertainty, since the networks perform much worse when each of the processes is done separately (bracketed accuracies represent aleatoric only). This confirms that both stages of the training are necessary for reliable results. This is further demonstrated by the uncertainty charts in Figure <ref type="figure">4</ref>, where the uncertainty measures of KLD (row 2) and MC (row 4) have lower bar heights for incorrect sample variance (yellow bars) than those for the case that backpropagates epistemic uncertainty only (row 1).</p><p>We further see that the prediction results with the Wasserstein metric are comparable with the other results regardless of the attack. However, when tested on the complex CIFAR10 dataset (40%), it performs more poorly than expected. This agrees with the hypothesis of <ref type="bibr">[29]</ref>, which claims that the Wasserstein metric yields biased gradients that have a higher chance of leading to a false local minimum than the KLD during optimization. This may also explain why the KLD results on the backpropagation of aleatoric uncertainty are higher (75% and 60%) than those of the Wasserstein metric (11% and 13%).</p><p>This result may be due to the fact that an approximated version of the Wasserstein metric is implemented. The approximation is used to avoid the complexity and intractability of computing the infimum of double integrals in the primal version <ref type="bibr">[26]</ref>. 
This is further supported by the precision-recall diagram for the Wasserstein metric for all attacks, which shows that the precision slowly drops as the dataset complexity increases (from MNIST to CIFAR10). The downward shift of the blue, black and yellow dashed lines in Figure <ref type="figure">2</ref> visualizes these drops.</p><p>In order to characterize the robustness of the approaches, the recall is calculated. Precision is heavily affected by the uncertainties, which impacts the results of all methods. However, the approaches with the MC dropout and KLD maintain a good level of precision despite having poor recall (e.g. in AWGN attacks for MC and KLD). Hence, recall can be used as a measure of sensitivity to the attack.</p><p>Then, considering the MC and KLD results, it is evident that using these losses yields high accuracies under motion blur when compared with the Wasserstein metric results. The performances of the MC and KLD are similar. This is further evident in Figure <ref type="figure">4</ref>, where the uncertainty charts for both KLD and MC have a greater number of correct sample variances (blue) than those for the Wasserstein metric (row 3). For MC, this was expected since this type of loss is ideal for robust algorithm design. This is further supported by Figure <ref type="figure">2</ref>, where the precision-recall curves for both KLD and MC under motion blur (MB) remain the highest (solid blue, green and yellow lines) as the dataset complexity increases (MNIST to CIFAR10).</p><p>Regarding the variance sensitivity to attack strength, we can see from Figures <ref type="figure">3A</ref> and <ref type="figure">3C</ref> that the CNN-GP trained on the MC similarity loss is more responsive than both the KLD and the no-regularization configurations. This also demonstrates that the MC is suitable for robust algorithm design. 
The graphs show that both the MC and KLD functions start with higher confidence in predictions (i.e. low variance) before the attack strength is increased, when compared with the case without regularization. This confirms both our hypothesis and our results in Figure <ref type="figure">4</ref> that backpropagation in the CNN-GP framework reduces the impact of uncertainties and attacks on the classification results and characterizes the model's confidence. For the MC dropout method, it is seen from both Figures <ref type="figure">3B</ref> and 3C that this model does not represent the uncertainty estimates well when compared with the CNN model. Hence, it is not reliable for uncertainty quantification. The computational complexity of the compared approaches is characterized by Table III, which shows that the MC dropout method is much slower than the CNN-GP framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. CONCLUSIONS AND FUTURE WORKS</head><p>This paper proposes a CNN-GP framework that can characterize the impact of uncertainties on the classification results. Three loss functions, the Kullback-Leibler divergence, the Wasserstein distance and the maximum correntropy, were embedded in the backpropagation step of the CNN-GP and their performance was compared. The GP layer serves to quantify the uncertainty, based on the GP variance. A small variance corresponds to a small uncertainty, whereas a high variance means high uncertainty and hence that the classification result cannot be trusted. The proposed CNN-GP framework is compared with Monte Carlo dropout, and it is shown that the CNN-GP is more efficient than the MC dropout method, especially with respect to computational time. The results show that, after learning from uncertainty, the models become robust and reliable and can cope with attacks. The main limitation of the framework is that it cannot achieve high accuracies on large and complex datasets, e.g. CIFAR10 and CIFAR100. This points to the architecture rather than the algorithm, since the state-of-the-art architecture for CIFAR10 uses more than 15 convolutional layers <ref type="bibr">[30]</ref>. In future, we will focus on training large, complex networks. We will also consider the possibility of feeding the CNN feature extractor as a covariance kernel to the GP. This may be computationally more feasible and may also improve the uncertainty representation in the GP, since it will give the GP a holistic view of the impact of the dataset on the performance of the CNN. This work also investigates the relationship between</p></div></body>
		</text>
</TEI>
