<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>An empirical study on evaluation metrics of generative adversarial networks</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018 Spring</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10064659</idno>
					<idno type="doi"></idno>
					<title level='j'>arXiv.org</title>
<idno>2331-8422</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Q Xu</author><author>G Huang</author><author>Y Yan</author><author>C Guo</author><author>Y Sun</author><author>F Wu</author><author>K Weinberger</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Evaluating generative adversarial networks (GANs) is inherently challenging. In this paper, we revisit several representative sample-based evaluation metrics for GANs, and address the problem of how to evaluate the evaluation metrics. We start with a few necessary conditions for metrics to produce meaningful scores, such as distinguishing real from generated samples, identifying mode dropping and mode collapsing, and detecting overfitting. With a series of carefully designed experiments, we comprehensively investigate existing sample-based metrics and identify their strengths and limitations in practical settings. Based on these results, we observe that kernel Maximum Mean Discrepancy (MMD) and the 1-Nearest- Neighbor (1-NN) two-sample test seem to satisfy most of the desirable properties, provided that the distances between samples are computed in a suitable feature space. Our experiments also unveil interesting properties about the behavior of several popular GAN models, such as whether they are memorizing training samples, and how far they are from learning the target distribution.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Generative adversarial networks (GANs) <ref type="bibr">(Goodfellow et al., 2014)</ref> have been studied extensively in recent years. Besides producing surprisingly plausible images <ref type="bibr">(Radford et al., 2015;</ref><ref type="bibr">Larsen et al., 2015;</ref><ref type="bibr">Karras et al., 2017;</ref><ref type="bibr">Arjovsky et al., 2017;</ref><ref type="bibr">Gulrajani et al., 2017)</ref>, they have also been innovatively applied in, for example, semi-supervised learning <ref type="bibr">(Odena, 2016;</ref><ref type="bibr">Makhzani et al., 2015)</ref>, image-to-image translation <ref type="bibr">(Isola et al., 2016;</ref><ref type="bibr">Zhu et al., 2017)</ref>, and simulated image refinement <ref type="bibr">(Shrivastava et al., 2016)</ref>. However, despite the availability of a plethora of GAN models <ref type="bibr">(Arjovsky et al., 2017;</ref><ref type="bibr">Qi, 2017;</ref><ref type="bibr">Zhao et al., 2016)</ref>, their evaluation is still predominantly qualitative, very often resorting to manual inspection of the visual fidelity of generated images. Such evaluation is time-consuming, subjective, and possibly misleading. Given the inherent limitations of qualitative evaluations, proper quantitative metrics are crucial for the development of GANs to guide the design of better models.</p><p>Possibly the most popular metric is the Inception Score <ref type="bibr">(Salimans et al., 2016)</ref>, which measures the quality and diversity of the generated images using an external model, the Google Inception network <ref type="bibr">(Szegedy et al., 2014)</ref>, trained on the large scale ImageNet dataset <ref type="bibr">(Deng et al., 2009)</ref>. Some other metrics are less widely used but still very valuable. <ref type="bibr">Wu et al. 
(2016)</ref> proposed a sampling method to estimate the log-likelihood of generative models, by assuming a Gaussian observation model with a fixed variance. <ref type="bibr">Bounliphone et al. (2015)</ref> proposed to use maximum mean discrepancies (MMDs) for model selection in generative models. <ref type="bibr">Lopez-Paz &amp; Oquab (2016)</ref> applied the classifier two-sample test, a well-studied tool in statistics, to assess the difference between the generated and target distributions. Although these evaluation metrics have been shown to be effective on various tasks, it is unclear in which scenarios their scores are meaningful, and in which scenarios they are prone to misinterpretation. Given that evaluating GANs is already challenging, it can only be more difficult to evaluate the evaluation metrics themselves. Most existing works attempt to justify their proposed metrics by showing a strong correlation with human evaluation <ref type="bibr">(Salimans et al., 2016;</ref><ref type="bibr">Lopez-Paz &amp; Oquab, 2016)</ref>. However, human evaluation tends to be biased towards the visual quality of generated samples and to neglect the overall distributional characteristics, which are important for unsupervised learning.</p><p>In this paper, we comprehensively examine the existing literature on sample-based quantitative evaluation of GANs. We address the challenge of evaluating the metrics themselves by carefully designing a series of experiments through which we hope to answer the following questions: 1) What are reasonable characterizations of the behavior of existing sample-based metrics for GANs? 2) What are the strengths and limitations of these metrics, and which metrics should be preferred accordingly? 
Our empirical observations suggest that the kernel MMD and the 1-NN two-sample test are best suited as evaluation metrics, as they satisfy useful properties such as discriminating real from generated images, sensitivity to mode dropping and mode collapse, and computational efficiency.</p><p>Ultimately, we hope that this paper will establish good principles for choosing, interpreting, and designing evaluation metrics for GANs in practical settings. We will also release the source code for all experiments and metrics examined (<ref type="url">https://github.com/xuqiantong/GAN-Metrics</ref>), providing the community with off-the-shelf tools to debug and improve their GAN algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background</head><p>We briefly review the original GAN framework proposed by <ref type="bibr">Goodfellow et al. (2014)</ref>. Descriptions of the GAN variants used in our experiments are deferred to Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Generative adversarial networks</head><p>Let X = R d&#215;d be the space of natural images. Given i.i.d. samples S r = {x r 1 , . . . , x r n } drawn from a real distribution P r over X , we would like to learn a parameterized distribution P g that approximates the distribution P r .</p><p>A generative adversarial network has two components, the discriminator D : X &#8594; [0, 1) and the generator G : Z &#8594; X , where Z is some latent space. Given a distribution P z over Z (usually an isotropic Gaussian), the distribution P g is defined as G(P z ). Optimization is performed with respect to a joint loss for D and G:</p><p>min G max D E x&#8764;P r [log D(x)] + E z&#8764;P z [log(1 &#8722; D(G(z)))].</p><p>Intuitively, the discriminator D outputs a probability for every x &#8712; X that corresponds to its likelihood of being drawn from P r , and the loss function encourages the generator G to produce samples that maximize this probability. Practically, the loss is approximated with finite samples from P r and P g , and optimized with alternating steps for D and G using gradient descent.</p><p>To evaluate the generator, we would like to design a metric &#961; that measures the "dissimilarity" between P g and P r .<ref type="foot">foot_0</ref> In theory, with both distributions known, common choices of &#961; include the Kullback-Leibler divergence (KLD), Jensen-Shannon divergence (JSD) and total variation. However, in practical scenarios, P r is unknown and only the finite samples in S r are observed. Furthermore, it is almost always intractable to compute the exact density of P g , but much easier to sample S g = {x g 1 , . . . , x g m } &#8764; P m g (especially so for GANs). Given these limitations, we focus on empirical measures &#961; : X n &#215; X m &#8594; R of "dissimilarity" between samples from two distributions.</p></div>
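As a concrete illustration of the minimax objective in Section 2.1, the finite-sample value of the original GAN loss can be sketched in a few lines of Python. The function name and numpy-based formulation below are illustrative choices, not code from the paper:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Finite-sample estimate of the GAN value function
    E_{x~Pr}[log D(x)] + E_{z~Pz}[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) in (0, 1) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# When D is maximally confused (D = 0.5 everywhere), the value is 2*log(0.5);
# a discriminator that separates the samples well pushes the value towards 0.
```

In practice the two expectations are estimated over minibatches, and D ascends this value while G descends it in alternating gradient steps.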
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Sample based metrics</head><p>We mainly focus on sample based evaluation metrics that follow a common setup illustrated in Figure <ref type="figure">1</ref>. The metric calculator is the key element, for which we briefly introduce six representative methods: Inception Score <ref type="bibr">(Salimans et al., 2016)</ref>, Mode Score <ref type="bibr">(Che et al., 2016)</ref>, Kernel MMD <ref type="bibr">(Gretton et al., 2007)</ref>, Wasserstein distance, Fr&#233;chet Inception Distance (FID) <ref type="bibr">(Heusel et al., 2017)</ref>, and the 1-nearest neighbor (1-NN) two-sample test <ref type="bibr">(Lopez-Paz &amp; Oquab, 2016)</ref>. All of them are model agnostic and require only finite samples from the generator.</p><p>The Inception Score is arguably the most widely adopted metric in the literature. It uses an image classification model M, the Inception network <ref type="bibr">(Szegedy et al., 2016)</ref>, pre-trained on the ImageNet <ref type="bibr">(Deng et al., 2009)</ref> dataset, to compute</p><p>IS(P g ) = exp ( E x&#8764;P g [ KL(p M (y|x) || p M (y)) ] ),</p><p>where p M (y|x) denotes the label distribution of x as predicted by M, and p M (y) = &#8747; x p M (y|x) dP g , i.e. the marginal of p M (y|x) over the probability measure P g . The expectation and the integral defining p M (y) can be approximated with i.i.d. samples from P g . A higher IS requires p M (y|x) to be close to a point mass, which happens when the Inception network is very confident that the image belongs to a particular ImageNet category, and requires p M (y) to be close to uniform, i.e. all categories are equally represented. This suggests that the generative model has both high quality and diversity. <ref type="bibr">Salimans et al. (2016)</ref> show that the Inception Score has a reasonable correlation with human judgment of image quality. 
We would like to highlight two specific properties: 1) the distributions on both sides of the KL are dependent on M, and 2) the distribution of the real data P r , or even samples thereof, are not used anywhere.</p><p>The Mode Score<ref type="foot">foot_1</ref> is an improved version of the Inception Score. Formally, it is given by</p><p>MS(P g ) = exp ( E x&#8764;P g [ KL(p M (y|x) || p M (y * )) ] &#8722; KL(p M (y) || p M (y * )) ),</p><p>where p M (y * ) = &#8747; x p M (y|x) dP r is the marginal label distribution for the samples from the real data distribution. Unlike the Inception Score, it is able to measure the dissimilarity between the real distribution P r and the generated distribution P g through the term KL(p M (y)||p M (y * )).</p><p>Kernel MMD (Maximum Mean Discrepancy), defined as</p><p>MMD 2 (P r , P g ) = E x,x &#8242; &#8764;P r [k(x, x &#8242; )] &#8722; 2 E x&#8764;P r , y&#8764;P g [k(x, y)] + E y,y &#8242; &#8764;P g [k(y, y &#8242; )],</p><p>measures the dissimilarity between P r and P g for some fixed kernel function k. Given two sets of samples from P r and P g , the empirical MMD between the two distributions can be computed with a finite sample approximation of the expectations. A lower MMD means that P g is closer to P r . The Parzen window estimate <ref type="bibr">(Gretton et al., 2007)</ref> can be viewed as a specialization of Kernel MMD.</p><p>The Wasserstein distance between P r and P g is defined as</p><p>WD(P r , P g ) = inf &#947;&#8712;&#915;(P r ,P g ) E (x r ,x g )&#8764;&#947; [d(x r , x g )],</p><p>where &#915;(P r , P g ) denotes the set of all joint distributions (i.e. probabilistic couplings) whose marginals are respectively P r and P g , and d(x r , x g ) denotes the base distance between the two samples. For discrete distributions with densities p r and p g , the Wasserstein distance is often referred to as the Earth Mover's Distance (EMD), and corresponds to the solution of the optimal transport problem</p><p>EMD(S r , S g ) = min w ij &#8805;0 &#8721; i,j w ij d(x r i , x g j ) subject to &#8721; j w ij = p r (x r i ) and &#8721; i w ij = p g (x g j ).</p><p>This is the finite sample approximation of WD(P r , P g ) used in practice. Similar to MMD, the Wasserstein distance is lower when the two distributions are more similar.</p><p>The Fr&#233;chet Inception Distance (FID) was recently introduced by <ref type="bibr">Heusel et al. (2017)</ref> to evaluate GANs. 
For a suitable feature function &#966; (by default, the Inception network's convolutional features), FID models &#966;(P r ) and &#966;(P g ) as Gaussian random variables with empirical means &#181; r , &#181; g and empirical covariances C r , C g , and computes</p><p>FID(P r , P g ) = ||&#181; r &#8722; &#181; g || 2 + Tr ( C r + C g &#8722; 2(C r C g ) 1/2 ),</p><p>which is the Fr&#233;chet distance (or equivalently, the Wasserstein-2 distance) between the two Gaussian distributions <ref type="bibr">(Heusel et al., 2017)</ref>.</p><p>The 1-Nearest Neighbor classifier is used in two-sample tests to assess whether two distributions are identical. Given two sets of samples S r &#8764; P n r and S g &#8764; P m g , with |S r | = |S g |, one can compute the leave-one-out (LOO) accuracy of a 1-NN classifier trained on S r and S g with positive labels for S r and negative labels for S g . Unlike in the most common uses of classification accuracy, here the 1-NN classifier should yield &#8764;50% LOO accuracy when |S r | = |S g | is large; this is achieved when the two distributions match. The LOO accuracy can be lower than 50%, which happens when the GAN overfits P g to S r . In the (hypothetical) extreme case, if the GAN were to memorize every sample in S r and re-generate it exactly, i.e. S g = S r , the accuracy would be 0%, as every sample from S r would have its nearest neighbor from S g at zero distance. The 1-NN classifier belongs to the two-sample test family, for which any binary classifier can be adopted in principle. We consider only the 1-NN classifier because it requires no special training and little hyperparameter tuning.</p><p>Lopez-Paz &amp; Oquab (2016) considered the 1-NN accuracy primarily as a statistic for two-sample testing. In fact, it is more informative to analyze it for the two classes separately. For example, a typical outcome of GANs is that, for both real and generated images, the majority of their nearest neighbors are generated images due to mode collapse. 
In this case, the LOO 1-NN accuracy of the real images would be relatively low (desired): the mode(s) of the real distribution are usually well captured by the generative model, so a majority of real samples from S r are surrounded by generated samples from S g , leading to low LOO accuracy; whereas the LOO accuracy of the generated images is high (not desired): generative samples tend to collapse to a few mode centers, thus they are surrounded by samples from the same class, leading to high LOO accuracy. For the rest of the paper, we distinguish these two cases as 1-NN accuracy (real) and 1-NN accuracy (fake).</p></div>
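To make the metrics of Section 2.2 concrete, below is a minimal numpy/scipy sketch of four of them: the Inception Score computed from precomputed softmax outputs, a biased estimate of the squared kernel MMD, the empirical EMD via optimal assignment, and the LOO 1-NN accuracy. The function names, kernel bandwidth, and use of raw Euclidean distance are illustrative choices, not the paper's exact implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def inception_score(probs):
    # probs: (n, k) softmax outputs p_M(y|x) of a pretrained classifier.
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)                       # p_M(y)
    kl = np.sum(probs * (np.log(probs) - np.log(marginal)), axis=1)
    return float(np.exp(kl.mean()))                     # exp(E_x KL(p(y|x) || p(y)))

def kernel_mmd2(X, Y, sigma=1.0):
    # Biased estimate of MMD^2 with Gaussian kernel exp(-||x-y||^2 / (2 sigma^2)).
    def gram(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return float(gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean())

def emd(X, Y):
    # Empirical Wasserstein distance for |X| == |Y| via optimal assignment; O(n^3).
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(d)
    return float(d[rows, cols].mean())

def one_nn_accuracy(X, Y):
    # Leave-one-out 1-NN accuracy: ~0.5 when the two distributions match,
    # ~0.0 when the generator memorizes X exactly.
    Z = np.vstack([X, Y])
    labels = np.concatenate([np.ones(len(X)), np.zeros(len(Y))])
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                         # exclude each point itself
    preds = labels[d.argmin(axis=1)]
    return float((preds == labels).mean())
```

Note that the paper computes these distances in the convolutional feature space of a pretrained network rather than in raw pixel space.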
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Other metrics</head><p>All of the metrics above are what we refer to as "model agnostic": they use the generator as a black box to sample the generated images S g , and do not require a density estimate from the model. We choose to experiment only with model agnostic metrics, which allows us to support as many generative models as possible for evaluation without modification to their structure. We briefly mention some other evaluation metrics not included in our experiments.</p><p>Kernel density estimation (KDE, or Parzen window estimation) is a well-studied method for estimating the density function of a distribution from samples. For a probability kernel K (most often an isotropic Gaussian) and i.i.d. samples x 1 , . . . , x n , we can define the density function at x as p(x) &#8776; (1/z) &#8721; i K(x &#8722; x i ), where z is a normalizing constant. This allows the use of classical metrics such as KLD and JSD. However, despite the widespread adoption of this technique in various applications, its suitability for estimating the density of P r or P g for GANs has been questioned by <ref type="bibr">Theis et al. (2015)</ref>, since the probability kernel depends on the Euclidean distance between images.</p><p>More recently, <ref type="bibr">Wu et al. (2016)</ref> applied annealed importance sampling (AIS) to estimate the marginal distribution p(x) of a generative model. This method is most natural for models that define a conditional distribution p(x|z), where z is the latent code, a condition not satisfied by most GAN models. Nevertheless, AIS has been applied to GAN evaluation by assuming a Gaussian observation model. We exclude this method from our experiments as it needs access to the generative model to compute the likelihood, instead of depending only on a finite sample set S g .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiments with GAN evaluation metrics</head></div>
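The Parzen window estimate described in Section 2.3 can be sketched as follows, in a log-domain version for numerical stability; the bandwidth &#963; and the function name are illustrative assumptions:

```python
import numpy as np

def parzen_log_density(x, samples, sigma=0.2):
    """Log of the KDE density p(x) ~= (1/n) * sum_i N(x; x_i, sigma^2 I),
    i.e. an isotropic Gaussian kernel with normalizer z = n (2 pi sigma^2)^(d/2)."""
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    sq_dists = np.sum((samples - x) ** 2, axis=1)
    log_kernels = -sq_dists / (2 * sigma**2) - (d / 2) * np.log(2 * np.pi * sigma**2)
    m = log_kernels.max()                      # log-sum-exp trick
    return float(m + np.log(np.exp(log_kernels - m).sum()) - np.log(n))
```

As Theis et al. (2015) argue, the estimate hinges entirely on Euclidean distances between images, which is the main objection to using it for GAN evaluation.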
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Feature space</head><p>All the metrics introduced in the previous section, except for the Inception Score and Mode Score, access the samples x only through pair-wise distances. The Kernel MMD requires a fixed kernel function k, typically set to an isotropic Gaussian; the Wasserstein distance and 1-NN accuracy use the underlying distance metric d directly; all of these methods are highly sensitive to the choice of that distance.</p><p>We expect &#961;(S g , S val r ) to be larger than &#961;(S g , S tr r ) when P g memorizes a part of S tr r . The difference between them can informally be viewed as a form of "generalization gap".</p><p>We simulate the overfitting process by defining S &#8242; r as a mix of samples from the training set S tr r and a second holdout set, disjoint from both S tr r and S val r . Figure <ref type="figure">8</ref> shows the gap &#961;(S g , S val r ) &#8722; &#961;(S g , S tr r ) of the various metrics as a function of the overlap ratio between S &#8242; r and S tr r . The leftmost point of each curve can be viewed as the score &#961;(S &#8242; r , S val r ) computed on a validation set, since the overlap ratio is 0. For better visualization, we normalize the Wasserstein distance and MMD by dividing by their corresponding scores when S &#8242; r and S tr r have no overlap. As shown in Figure <ref type="figure">8</ref>, all the metrics except RIS and RMS reflect that the "generalization gap" increases as S &#8242; r overlaps more with S tr r . The failure of RIS is not surprising: it totally ignores the real data distribution, as discussed in Section 2.2. The reason that RMS also fails to detect overfitting may again be its lack of generalization to datasets with classes not contained in ImageNet. In addition, RMS operates in the softmax space, whose features might be too specific compared to those in the convolutional space.</p></div>
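The "generalization gap" experiment above can be simulated directly. In the sketch below, a hypothetical generator memorizes a growing fraction of the training set, and the gap &#961;(S g , S val r ) &#8722; &#961;(S g , S tr r ) is computed with a simple stand-in dissimilarity &#961; (average nearest-neighbor distance) rather than any of the paper's six metrics; the dimensions and sample sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(S_g, S_r):
    # Stand-in dissimilarity: average distance from each generated sample
    # to its nearest real sample (0 if S_g memorizes S_r exactly).
    d = np.linalg.norm(S_g[:, None, :] - S_r[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

n, dim = 200, 8
S_tr = rng.normal(size=(n, dim))    # "training" samples from P_r
S_val = rng.normal(size=(n, dim))   # holdout samples from P_r

gaps = []
for frac in (0.0, 0.5, 1.0):
    n_mem = int(frac * n)
    # hypothetical generator: memorizes n_mem training points, draws the rest from P_r
    S_g = np.vstack([S_tr[:n_mem], rng.normal(size=(n - n_mem, dim))])
    gaps.append(rho(S_g, S_val) - rho(S_g, S_tr))
```

The gap stays near zero when S g consists of fresh samples from P r and grows as more of S tr r is memorized, which is the qualitative behavior the paper reports for all metrics except RIS and RMS.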
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Discussions and Conclusion</head><p>Based on the above analysis, we can summarize the advantages and inherent limitations of the six evaluation metrics, and the conditions under which they produce meaningful results. With some of the metrics, we are able to study the problem of overfitting (see Appendix C), and to perform model selection on GAN models and compare them without resorting to human evaluation based on cherry-picked samples (see Appendix D).</p><p>The Inception Score does show a reasonable correlation with the quality and diversity of generated images, which explains its wide usage in practice. However, it is ill-posed, mostly because it only evaluates P g as an image generation model rather than measuring its similarity to P r . Blunt violations, like mixing in natural images from an entirely different distribution, completely deceive the Inception Score. As a result, it may encourage models to simply learn sharp and diversified images (or even some adversarial noise), instead of P r . This also applies to the Mode Score. Moreover, the Inception Score is unable to detect overfitting, since it cannot make use of a holdout validation set.</p><p>Kernel MMD works surprisingly well when it operates in the feature space of a pre-trained ResNet. It is always able to distinguish generated/noise images from real images, and both its sample complexity and computational complexity are low. Given these advantages, even though MMD is biased, we recommend its use in practice.</p><p>Wasserstein distance works well when the distance is computed in a suitable feature space. However, it has a high sample complexity, a fact that has also been observed by <ref type="bibr">Arora et al. (2017)</ref>. Another key weakness is that computing the exact Wasserstein distance has a time complexity of O(n 3 ), which is prohibitively expensive as the sample size increases. 
Compared to other methods, the Wasserstein distance is less appealing as a practical evaluation metric.</p><p>Fr&#233;chet Inception Distance performs well in terms of discriminability, robustness and efficiency. It serves as a good metric for GANs, despite only modeling the first two moments of the distributions in feature space.</p><p>The 1-NN classifier appears to be an ideal metric for evaluating GANs. Not only does it enjoy all the advantages of the other metrics, it also outputs a score in the interval [0, 1], similar to the accuracy/error in classification problems. When the generative distribution perfectly matches the true distribution, a perfect score (i.e., 50% accuracy) is attainable. From Figure <ref type="figure">2</ref>, we find that typical GAN models tend to achieve lower LOO accuracy for real samples (1-NN accuracy (real)), but higher LOO accuracy for generated samples (1-NN accuracy (fake)). This suggests that GANs are able to capture modes from the training distribution, such that the majority of training samples distributed around the mode centers have their nearest neighbor among the generated images, yet most of the generated images are still surrounded by generated images, as they are collapsed together. This observation indicates that the mode collapse problem is prevalent in typical GAN models. We also note that this problem cannot be effectively detected by human evaluation or the widely used Inception Score.</p><p>Overall, our empirical study suggests that the choice of feature space in which to compute the various metrics is crucial. In the convolutional space of a ResNet pretrained on ImageNet, both MMD and 1-NN accuracy appear to be good metrics in terms of discriminability, robustness and efficiency. Wasserstein distance has very poor sample efficiency, while the Inception Score and Mode Score appear to be unsuitable for datasets that are very different from ImageNet. 
We will release our source code for all these metrics, providing researchers with an off-the-shelf tool to compare and improve GAN algorithms.</p><p>Based on the two most prominent metrics, MMD and 1-NN accuracy, we study the overfitting problem of DCGAN and WGAN (in Appendix C). Despite the widespread belief that GANs are overfitting to the training data, we find that this does not occur unless there are very few training samples. This raises an interesting question regarding the generalization of GANs in comparison to the supervised setting. We hope that future work can contribute to explaining this phenomenon. </p></div>
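The Fr&#233;chet Inception Distance discussed above reduces to a closed-form distance between two Gaussians fit in feature space, which can be sketched as follows. Feature extraction with the Inception network is omitted; the inputs are assumed to be precomputed feature vectors, and the function name is an illustrative choice:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_r, feat_g):
    """FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})
    between Gaussians fit to real and generated feature vectors."""
    mu_r, mu_g = feat_r.mean(axis=0), feat_g.mean(axis=0)
    C_r = np.cov(feat_r, rowvar=False)
    C_g = np.cov(feat_g, rowvar=False)
    covmean = sqrtm(C_r @ C_g)
    if np.iscomplexobj(covmean):       # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(C_r + C_g - 2 * covmean))
```

Identical feature sets give an FID of 0 up to numerical error, and a pure mean shift contributes exactly its squared Euclidean norm.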
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Comparison of popular GAN models based on quantitative evaluation metrics</head><p>Based on our analysis, we chose MMD and 1-NN accuracy in the feature space of a 34-layer ResNet trained on ImageNet to compare several state-of-the-art GAN models. All scores are computed using 2000 samples from the holdout set and 2000 generated samples. The GAN models evaluated include DCGAN <ref type="bibr">(Radford et al., 2015)</ref>, WGAN <ref type="bibr">(Arjovsky et al., 2017)</ref>, WGAN with gradient penalty (WGAN-GP) <ref type="bibr">(Gulrajani et al., 2017)</ref>, and LSGAN <ref type="bibr">(Mao et al., 2016)</ref>, all trained on the CelebA dataset. The results are reported in Table <ref type="table">1</ref>, from which we highlight three observations:</p><p>&#8226; WGAN-GP performs the best under most of the metrics.</p><p>&#8226; DCGAN achieves 0.759 overall 1-NN accuracy on real samples, slightly better than the 0.765 achieved by WGAN-GP; meanwhile, the 1-NN accuracy on generated (fake) samples achieved by DCGAN is higher than that of WGAN-GP (0.892 vs. 0.860). This seems to suggest that DCGAN is better at capturing modes in the training data distribution, while its generated samples are more collapsed compared to WGAN-GP. Such a subtle difference is unlikely to be discovered by the Inception Score or human evaluation.</p><p>&#8226; The 1-NN accuracy for all evaluated GAN models is higher than 0.8, far above the ground truth of 0.5. The MMD scores of the four GAN models are also much larger than that of the ground truth (0.019). This indicates that even state-of-the-art GAN models are far from learning the true distribution.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Note that &#961; need not satisfy symmetry or the triangle inequality, so it is not, mathematically speaking, a distance metric between Pg and Pr. We still call it a metric throughout this paper for simplicity.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>We use a modified version here, as the original one reduces to the Inception Score.</p></note>
		</body>
		</text>
</TEI>
