<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Understanding Symmetric Smoothing Filters: A Gaussian Mixture Model Perspective</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>11/01/2017</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10055642</idno>
					<idno type="doi">10.1109/TIP.2017.2731208</idno>
					<title level='j'>IEEE Transactions on Image Processing</title>
<idno type="issn">1057-7149</idno>
<biblScope unit="volume">26</biblScope>
<biblScope unit="issue">11</biblScope>					

					<author>Stanley H. Chan</author><author>Todd Zickler</author><author>Yue M. Lu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Many patch-based image denoising algorithms can be formulated as applying a smoothing filter to the noisy image. Expressed as matrices, the smoothing filters must be row normalized so that each row sums to unity. Surprisingly, if we apply a column normalization before the row normalization, the performance of the smoothing filter can often be significantly improved. Prior works showed that such performance gain is related to the Sinkhorn-Knopp balancing algorithm, an iterative procedure that symmetrizes a row-stochastic matrix to a doubly-stochastic matrix. However, a complete understanding of the performance gain phenomenon is still lacking. In this paper, we study the performance gain phenomenon from a statistical learning perspective. We show that Sinkhorn-Knopp is equivalent to an Expectation-Maximization (EM) algorithm of learning a Gaussian mixture model of the image patches. By establishing the correspondence between the steps of Sinkhorn-Knopp and the EM algorithm, we provide a geometrical interpretation of the symmetrization process. This observation allows us to develop a new denoising algorithm called the Gaussian mixture model symmetric smoothing filter (GSF). GSF is an extension of Sinkhorn-Knopp and a generalization of the original smoothing filters. Despite its simple formulation, GSF outperforms many existing smoothing filters and has performance similar to several state-of-the-art denoising algorithms.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Smoothing filters are a class of linear and nonlinear operators that gains significant attentions in image denoising recently. The formulations of these operators are simple: Consider a noisy observation y &#8712; R n of a clean image z &#8712; R n corrupted by additive i.i.d. Gaussian noise. A smoothing filter is a matrix W &#8712; R n&#215;n that generates a denoised estimate z as</p><p>where D def = diag {W 1} is a diagonal matrix for normalization so that each row of D -1 W sums to unity. The formulation in <ref type="bibr">(1)</ref> is very general, and many denoising algorithms are smoothing filters, e.g., Gaussian filter <ref type="bibr">[1]</ref>, bilateral filter <ref type="bibr">[2]</ref>, non-local means (NLM) <ref type="bibr">[3]</ref>, locally adaptive regression kernel (LARK) <ref type="bibr">[4]</ref>, etc. Note that some of these filters are linear (e.g., Gaussian filter) whereas some are nonlinear (e.g. non-local means). There are interesting graph-theoretic interpretations of the smoothing filters <ref type="bibr">[5]</ref>- <ref type="bibr">[10]</ref>, and there are also fast algorithms to compute the smoothing filters <ref type="bibr">[11]</ref>- <ref type="bibr">[16]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Motivation: A Surprising Phenomenon</head><p>While smoothing filters work well for many denoising problems, it was observed in <ref type="bibr">[17]</ref>- <ref type="bibr">[19]</ref> that their performance can be further improved by modifying <ref type="bibr">(1)</ref> as</p><p>where D c def = diag W T 1 is a diagonal matrix that normalizes the columns of W , and D r def = diag W D -1 c 1 is a diagonal matrix that normalizes the rows of W D -1  c . In other words, we modify (1) by introducing a column normalization step before applying the row normalization.</p><p>Before discussing the technical properties of (1) and ( <ref type="formula">2</ref>), we first provide some numerical results to demonstrate an interesting phenomenon. In Figure <ref type="figure">1</ref>, we crop the center 100 &#215; 100 region of 10 standard clean images. We generate noisy observations by adding i.i.d. Gaussian noise of standard deviation &#963; = 20/255 to each clean image. These noisy images are then denoised by <ref type="bibr">(1)</ref> and <ref type="bibr">(2)</ref>, respectively. The weight matrix W is chosen as the one defined in the non-local means (NLM) <ref type="bibr">[3]</ref>. To ensure fair comparison, we choose the best parameter h r , the range parameter in NLM, for both methods. The patch size is set as 5 &#215; 5 and the neighborhood search window is set as 21 &#215; 21. The experiment is repeated for 20 independent Monte-Carlo trials to average out the randomness caused by different realizations of the i.i.d. Gaussian noise.</p><p>The results of this experiment are shown at the bottom of Figure <ref type="figure">1</ref>. It is perhaps a surprise to see that <ref type="bibr">(2)</ref>, which is a simple modification of (1), improves the PSNR by more than 0.23 dB on average. Another puzzling observation is that if we repeatedly apply the column-row normalization, the PSNR does not always increase as more iterations are used. Figure <ref type="figure">2</ref> presents the result. In this experiment, we fix the NLM parameter h r to its optimal value when using the column-row normalization, i.e., <ref type="bibr">(2)</ref>. For 5 out of the 10 images we tested, the PSNR values actually drop after the first column-row normalization.</p><p>The above experiment piqued our curiosity and led us to a basic question: Why would the column-row normalization improve the denoising performance? <ref type="bibr">Insights</ref>  0.71&#963; 0.77&#963; 0.77&#963; 0.77&#963; 0.71&#963; 0.71&#963; 0.79&#963; 0.77&#963; 0.77&#963; 0.79&#963; PSNR Improvement +0.10 +0.34 +0.23 +0.17 +0.29 +0.07 +0.33 +0.28 +0.05 +0.37  a better understanding of the underlying mechanism could potentially lead to a more systematic procedure that can generalize the operations in <ref type="bibr">(2)</ref> and further improve the denoising performance. The goal of this paper is to address this issue and propose a new algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Sinkhorn-Knopp Balancing Algorithm</head><p>To the best of our knowledge, the above performance gain phenomenon was first discussed by Milanfar in <ref type="bibr">[17]</ref>, where it was shown that if we repeatedly apply the columnrow normalization we would obtain an iterative procedure called the Sinkhorn-Knopp balancing algorithm <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref> (or Sinkhorn-Knopp, for short.) As illustrated in Algorithm 1, Sinkhorn-Knopp is a simple algorithm that repeatedly applies the column and row normalization until the smoothing filter converges. For example, the smoothing filter defined by ( <ref type="formula">2</ref>) is the result of applying Sinkhorn-Knopp for one iteration.</p><p>Sinkhorn-Knopp has many interesting properties. First,</p><p>when Sinkhorn-Knopp converges, the converging limit is a doubly-stochastic matrix -a symmetric non-negative matrix with unit column and row (also called a symmetric smoothing filter.) A doubly stochastic matrix has all of its eigenvalue's magnitudes bounded in [0, 1] so that repeated multiplications always attenuate the eigenvalues <ref type="bibr">[22]</ref>. Moreover, the estimate z formed by a doubly stochastic matrix is admissible in the sense that no other estimates are uniformly better <ref type="bibr">[23]</ref>.</p><p>To explain the performance gain, Milanfar <ref type="bibr">[17]</ref> considered a notion called "effective degrees of freedom", defined as</p><p>where z j is the jth pixel of the estimate and y j is the jth pixel of the input. df measures how an estimator trades bias against variance. Larger values of df imply a lower bias but higher variance. Milanfar argued that the overall mean squared error, which is the sum of the bias and the variance, is reduced because one can prove that symmetric smoothing filters have high effective degrees of freedom. However, effective degrees of freedom is not easy to interpret. It will be more useful if we can geometrically describe the actions to which the columnrow normalization are applying.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Contributions</head><p>The present paper is motivated by our wish to further understand the mechanism behind the performance gain phenomenon. Our approach is to study an Expectation-Maximization (EM) algorithm for learning a Gaussian mixture model (GMM.) By analyzing the E-step and the M-step of the EM algorithm, we find that the actions of the symmetrization is a type of data clustering. This observation echoes with a number of recent work that shows ordering and grouping of non-local patches are key to high-quality image denoising <ref type="bibr">[24]</ref>- <ref type="bibr">[26]</ref>. There are two contributions of this paper, described as follows.</p><p>First, we generalize the symmetrization process by reformulating the denoising problem as a maximum-a-posteriori (MAP) estimation under a Gaussian mixture model. We show that the original smoothing filter in (1), the one-step Sinkhorn-Knopp in <ref type="bibr">(2)</ref>, and the full Sinkhorn-Knopp (i.e., iterative applications of Sinkhorn-Knopp until convergence) are all subroutines of the EM algorithm to learn the GMM. By showing that each method supersedes its preceding counterpart, we provide a possible explanation for the performance gain phenomenon.</p><p>Second, based on the analysis of the GMM, we propose a new denoising algorithm called the GMM symmetric smoothing filter (GSF). We show that GSF does not only subsume a number of smoothing filters, but also has a performance similar to some state-of-the-art denoising algorithms. We will discuss implementation and parameter selections for the GSF algorithm.</p><p>This paper is an extension of a conference article presented in <ref type="bibr">[19]</ref>. In <ref type="bibr">[19]</ref>, the linkage between the GMM and the MAP denoising step was not thoroughly discussed. Specifically, the MAP was formulated as a weighted least squares step but the weights were indirectly obtained through a by-product of the EM algorithm. In this paper, we show that the quadratic cost function in the weighted least squares is indeed a surrogate function for solving the MAP problem. Therefore, minimizing the cost function of the weighted least squares is equivalent to minimizing an upper bound of the MAP objective function.</p><p>The rest of the paper is organized as follows. First, we provide a brief introduction to GMM and the EM algorithm in Section II. In Section III we discuss the generalization of different symmetrizations using the EM algorithm. The new GSF is discussed in Section IV and experimental results are shown in Section V. We conclude in Section VI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. PRELIMINARIES</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Notations</head><p>Throughout this paper, we use n to denote the number of pixels in the noisy image, and k to denote the number of mixture components in the GMM model. To avoid ambiguity, we call a mixture component of a GMM as a cluster. Clusters are tracked using the running index i &#8712; {1, . . . , k}, whereas patches (or pixels) are tracked using the running index j &#8712; {1, . . . , n}. Without loss of generality, we assume that all pixel intensity values have been normalized to the range [0, 1].</p><p>We use bold letters to denote vectors and the symbol 1 to denote a constant vector of all ones. The vector x j &#8712; R 2 represents the two-dimensional spatial coordinate of the jth pixel, and the vector y j &#8712; R d represents a d-dimensional patch centered at the jth pixel of the noisy image y. For a clean image z, the jth patch is denoted by z j . A scalar y j &#8712; R refers </p><p>to the intensity value of the center pixel of y j . Therefore, a d-dimensional patch y j (assume d is an odd number) is a vector y j = [y j-(d-1)/2 , . . . , y j , . . . , y j+(d-1)/2 ] T . To extract y j from the noisy image y, we use a linear operator P j &#8712; R d&#215;n such that y j = P j y. For some smoothing filters, the spatial coordinate x j is used together with a patch z j (or y j ). Therefore, for generality we define a generalized patch p j &#8712; R p , which could be x j , z j (or y j ), or a concatenation of both:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Smoothing Filters</head><p>The results presented in this paper are applicable to smoothing filters W taking the following form:</p><p>where &#954; i def = (2&#960;) p |&#931; i | is a normalization constant, and N (&#8226;) denotes a p-dimensional Gaussian:</p><p>with mean &#181; i &#8712; R p and covariance matrix &#931; i &#8712; R p&#215;p . We note that (3) is general and covers a number of widely used filters as shown in Table <ref type="table">I</ref>.</p><p>Example 1 (Standard NLM): For the standard NLM <ref type="bibr">[3]</ref>, we have p j = y j . The ith mean vector is &#181; i = y i , i.e., the noisy patch is the mean vector. The ith covariance matrix is</p><p>where h r is the NLM parameter.</p><p>In practice, to reduce computational complexity, a search window &#8486; is often introduced so that neighboring patches are searched within &#8486;. This is equivalent to multiplying an indicator function to the weight as</p><p>where I {x &#8712; &#8486;} = 1 if x &#8712; &#8486;, and is zero otherwise.</p><p>Example 2 (Spatially Regulated NLM): As an alternative to the hard thresholding introduced by the indicator function in <ref type="bibr">(6)</ref>, one can also consider the spatially regulated NLM <ref type="bibr">[17]</ref>. In this case, we can define p j = x T j , y T j T . The mean vector &#181; i and the covariance matrices will consist of two parts:</p><p>where h s and h r are the spatial and the range parameters, respectively. This leads to the weight</p><p>It is not difficult to see that the spatially regulated NLM coincides with the standard NLM as h s &#8594; &#8734; and |&#8486;| &#8594; &#8734;.</p><p>In fact, the former uses a soft search window and the latter uses a hard search window. From our experience, we find that the spatially regulated NLM typically has better performance than the standard NLM when the best choices of h s and &#8486; are used in either case. Therefore, in the rest of this paper we will focus on the spatially regulated NLM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Gaussian Mixture Model</head><p>The Gaussian mixture model (GMM) plays an important role in this paper. Consider a set of n generalized patches P def = {p 1 , . . . , p n } where p j &#8712; R p . We say that P is generated from a Gaussian mixture model of k clusters if p j is sampled from the distribution</p><p>where &#960; i &#8712; R is the weight of the ith cluster, &#181; i &#8712; R p is the mean vector, and &#931; i &#8712; R p&#215;p is the covariance matrix. For the purpose of analyzing smoothing filters in this paper, we shall assume that the covariance matrices &#931; i are fixed according to the underlying smoothing filter. For example, in spatially regulated NLM we fix the covariance matrices as &#931; i = &#931; NLM . When &#931; i 's are fixed, we denote</p><p>Learning the model parameters &#920; from P is typically done using the Expectation-Maximization (EM) algorithm <ref type="bibr">[27]</ref>. The EM algorithm consists of two major steps: the expectation step (E-step) and the maximization step (M-step). The E-step is used to compute the conditional expected log-likelihood, often called the Q-function. The M-step is used to maximize the Q-function by seeking the optimal parameters &#920;. The algorithm iterates until the log-likelihood converges. Since the EM algorithm is widely used, we skip the introduction and refer readers to <ref type="bibr">[27]</ref> for a comprehensive tutorial. The EM algorithm for learning a GMM is summarized in Algorithm 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. GENERALIZATIONS OF SYMMETRIC FILTERS</head><p>In this section, we discuss how various symmetric smoothing filters are generalized by the EM algorithm. We begin by discussing how the GMM can be used for denoising.</p><p>Algorithm 2 EM Algorithm for Learning a GMM with a known covariance matrix &#931; <ref type="bibr">[27]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Input: Patches</head><p>, and the number of clusters k.</p><p>for i = 1, . . . , k, and set t = 0. while not converge do E-step: Compute, for i = 1, . . . , k and j = 1, . . . , n &#947;</p><p>Update Counter: t &#8592; t + 1. end while</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. MAP Denoising Using GMM</head><p>We first specify the denoising algorithm. Given the noisy image y, we formulate the denoising problem by using the maximum-a-posterior (MAP) approach:</p><p>where the first term specifies the data fidelity with &#955; as a parameter. The second term, which is a sum of overlapping patches (thus are dependent), is called the expected patch loglikelihood (EPLL) <ref type="bibr">[25]</ref>. Note that EPLL is a general prior that uses the expectation of the log-likelihood of overlapping patches. It is not limited to a particular distribution for f (p j | &#920;), although in <ref type="bibr">[25]</ref> the GMM was used. Note also that the patch p j in ( <ref type="formula">13</ref>) is extracted from the optimization variable z. Thus, by minimizing over z we also minimize over p j . Substituting ( <ref type="formula">9</ref>) into ( <ref type="formula">13</ref>), we obtain a GMM-based MAP denoising formulation</p><p>The computational challenge of ( <ref type="formula">14</ref>) is the sum of exponentials inside the logarithm, which hinders closed-form solutions. In <ref type="bibr">[25]</ref>, Zoran and Weiss proposed to use a half quadratic splitting method to alternatingly minimize a sequence of subproblems, and select the mode of the GMM in each subproblem.</p><p>Our solution to handle the optimization problem in ( <ref type="formula">14</ref>) is to use a surrogate function under the Majorization-Maximization framework <ref type="bibr">[28]</ref>. The idea is to find a surrogate function</p><p>If we can find such function h(p j , p &#8242; j | &#920;), then the minimization in ( <ref type="formula">13</ref>) can be relaxed to minimizing an upper bound</p><p>For the GMM in ( <ref type="formula">9</ref>), Zhang et al. <ref type="bibr">[29]</ref> proposed one convenient surrogate function h(p j , p &#8242; j | &#920;) whose expression is given in Lemma 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 1 (Surrogate function for GMM):</head><p>The function</p><p>where</p><p>is a surrogate function that satisfies ( <ref type="formula">15</ref>) and ( <ref type="formula">16</ref>).</p><p>Proof: See Appendix A of <ref type="bibr">[29]</ref>.</p><p>This variable can be chosen as p &#8242; j = [x T j , y T j ] T , i.e., the generalized patch using the noisy image y. Thus, &#947; ij is independent of p j .</p><p>Substituting the result of Lemma 1 into (17), we observe that (17) can be rewritten as</p><p>where we dropped terms involving p &#8242; j as they are independent to the optimization variable z. Note that ( <ref type="formula">20</ref>) is a quadratic minimization, which is significantly easier than <ref type="bibr">(14)</ref>.</p><p>For the case of spatially regulated NLM, it is possible to further simplify <ref type="bibr">(20)</ref> by recognizing that p j involves both spatial coordinate x j and patch z j . Therefore, by expressing p j = [x T j , z T j ] T and by defining &#181; i using ( <ref type="formula">7</ref>), <ref type="bibr">(20)</ref> becomes</p><p>.</p><p>In this minimization, we observe that since x j is not an optimization variable, it can be eliminated without changing the objective function. By using a patch extract operator P j , i.e., z j = P j z, we obtain the minimization</p><p>where we have absorbed the NLM parameter 2h 2 r (the diagonal term of &#931;) into &#955;. Problem (P 1 ) is the main optimization of interest in this paper. We call (P 1 ) the GMM Symmetric Smoothing Filter (GSF). In the literature, there are other GMM based denoising algorithms. Their connections to GSF will be discussed in Section IV .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Original Smoothing Filter</head><p>We now discuss the role of symmetrization by studying (P 1 ) and the EM algorithm for learning the GMM. To keep track of the iterations of the EM algorithm, we use the running index t denotes the iteration number. For consistency, we will focus on the spatially regulated NLM.</p><p>In spatially regulated NLM, the ith pixel of the denoised image is</p><p>with W ij defined in <ref type="bibr">(8)</ref>. To obtain ( <ref type="formula">21</ref>) from (P 1 ), we consider the following choices of parameters</p><p>In this case, since the quadratic objective in (P 1 ) is separable, the ith term becomes</p><p>which coincides with <ref type="bibr">(21)</ref> as</p><p>It is important to study the conditions in <ref type="bibr">(22)</ref>. First, the patch extractor P j extracts a pixel z j from z. This step is necessary because NLM is a weighted average of individual pixels, even though the weights are calculated using patches. Accordingly, the mean vector &#181; i is also a pixel y i so that it matches with the optimization variable z i . From a clustering perspective, we can interpret &#181; i = y i as having one cluster center for one pixel, i.e., every pixel has its own cluster. Clearly, this is a suboptimal configuration, and we will address it when we present the proposed method. We also note that in <ref type="bibr">(22)</ref>, the parameter &#955; is 0. What it means is that the data fidelity term is not used and only the EPLL prior is required to obtain the NLM result.</p><p>The most interesting observation in <ref type="bibr">(22)</ref> is the choice of &#947; ij . In fact, the &#947; ij in ( <ref type="formula">22</ref>) is only the numerator of ( <ref type="formula">19</ref>) by assuming &#960; i = 1/k. If we let &#920; i = (&#960; i , &#181; i ) be the ith model parameter of a GMM, then the physical meaning of ( <ref type="formula">22</ref>) is a conditional probability of observing p j given that we pick the ith cluster, i.e., P(p j | &#920; i ). In contrast, ( <ref type="formula">19</ref>) is a posterior probability of picking the ith cluster given that we observe p j , i.e., P(&#920; i | p j ). We will discuss this subtle difference more carefully in the next subsection when we discuss the one-step Sinkhorn-Knopp.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. One-step Sinkhorn-Knopp</head><p>In one-step Sinkhorn-Knopp, we recognize that the ith pixel of the denoised image is</p><p>This result can be obtained from (P 1 ) by letting</p><p>Unlike the conditions posted by the original smoothing filter in <ref type="bibr">(22)</ref>, one-step Sinkhorn-Knopp defines &#947; ij according to <ref type="bibr">(19)</ref>, or with some substitutions that yields <ref type="bibr">(25)</ref>.</p><p>As mentioned in the previous subsection, the &#947; ij defined in ( <ref type="formula">22</ref>) is a conditional probability whereas that in ( <ref type="formula">25</ref>) is a posterior probability. The posterior probability, if we refer to the EM algorithm (see Algorithm 2), is in fact the E-step with initializations &#960;</p><p>For the original filter, the E-step is not executed because &#947; ij is defined through <ref type="bibr">(22)</ref>. Therefore, from a clustering perspective, it is reasonable to expect a better denoising result from one-step Sinkhorn-Knopp than the original smoothing filter because the clustering is better performed.</p><p>To further investigate the difference between the conditional probability P(p j | &#920; i ) and the posterior probability P(&#920; i | p j ), we adopt a graph perspective by treating each patch p j as a node, and &#947; ij as the weight on the edge linking node i and node j <ref type="bibr">[5]</ref>. In the original smoothing filter, the conditional probability P(p j | &#920; i ) causes a "majority vote" effect, meaning that the parameter &#920; i has a direct influence to every patch {p j } in the image. Therefore, if &#920; i has many "weak friends", the sum of these "weak friends" can possibly alter the denoising result which would have been better obtained from a few "good friends".</p><p>In contrast, the one-step Sinkhorn-Knopp uses the posterior probability P(&#920; i | p j ). From Bayes rule, the posterior probability is related to the conditional probability by</p><p>Since P(&#920; i ) = &#960; i and &#960; i = 1/k, we see that P(&#920; i | p j ) is the ratio of P(p j | &#920; i ) and P(p j ). P(p j ) measures the popularity of p j . Thus, if p j is a popular patch (i.e., it is a "friend" of many), then the normalization P(p j | &#920; i )/P(p j ) is a way to balance out the influence of p j . Interpreted in another way, it is equivalent to say that if &#920; i has many "weak friends", the influence of these "weak friends" should be reduced. Such intuition is coherent to many NLM methods that attempt to limit the number of nearby patches, e.g., <ref type="bibr">[30]</ref>, <ref type="bibr">[31]</ref>.</p><p>The definite answer to whether there is performance gain due to one-step Sinkhorn-Knopp is determined by the likelihood of obtaining "weak friends". This, in turn, is determined by the image content and the NLM parameter h r which controls the easiness of claiming "friends". For example, if an image contains many textures of a variety of content and if h r is large, then it is quite likely to obtain "weak friends". In this case, one-step Sinkhorn-Knopp will improve the denoising performance. On the other hand, if an image contains only a constant foreground and a constant background, then most patches are "good friends" already. Applying the one-step Sinkhorn-Knopp could possibly hurt the denoising performance. Figure <ref type="figure">3</ref> shows an example.   
To further justify the above claim that the performance gain is caused by the likelihood of obtaining "weak friends", we consider the 10 images in Figure <ref type="figure">1</ref> and plot the PSNR gain as a function of h_r. Our hypothesis is that for small h_r, the performance gain should be small or even negative because it is difficult for a patch to find its "friends". When h_r is large, the performance gain should become significant because many "weak friends" will be balanced out by the one-step Sinkhorn-Knopp. As shown in Figure <ref type="figure">4</ref>, this is in fact the case: For h_r between 0 and a certain threshold (around 0.65σ, where σ is the noise standard deviation), the PSNR gain is always zero or negative. When h_r increases, the PSNR gain becomes positive. The result is consistent for all 10 images we tested.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Full Sinkhorn-Knopp</head><p>The full Sinkhorn-Knopp (Algorithm 1) is an iterative algorithm that repeatedly applies the one-step Sinkhorn-Knopp until convergence. To analyze the full Sinkhorn-Knopp algorithm, we first recognize that the algorithm can be effectively described by two steps:</p><p>where the first equation is a row normalization and the second equation is a column normalization. This pair of normalization can be linked to the EM algorithm in the following sense. First, in the EM algorithm we fix the mixture weight &#960; (t) i as &#960;</p><p>(t) i = 1/k for all clusters i = 1, . . . , k and all iteration numbers t = 1, . . . , t max . This step is necessary because the full Sinkhorn-Knopp does not involve mixture weights. Intuitively, setting &#960; (t) i = 1/k ensures that all clusters have equal probability to be selected.</p><p>When</p><p>is fixed, the EM algorithm in Algorithm 2 has only two steps: Update of &#947; (t) ij in <ref type="bibr">(10)</ref> and update of &#181; (t) i in <ref type="bibr">(12)</ref>. Inspecting this pair of equations, we observe that (10) appears a column normalization whereas (12) appears a row normalization. However, since full Sinkhorn-Knopp does not have a mean vector &#181; (t) i , we have to modify the EM algorithm in order to link the two. To this end, we modify ( <ref type="formula">12</ref>) by defining a sequence &#946;</p><p>and modify <ref type="bibr">(10)</ref> by updating of &#947;</p><p>Under this setting, ( <ref type="formula">27</ref>)-( <ref type="formula">28</ref>) becomes exactly <ref type="bibr">(26)</ref>.</p><p>It is important to understand the difference between the original EM algorithm using (10)-( <ref type="formula">12</ref>) and the modified EM algorithm using ( <ref type="formula">27</ref>)- <ref type="bibr">(28)</ref>. In the M-step of the modified EM algorithm, the mean vector &#181; (t) i is absent. However, &#181; (t) i is still updated, though implicitly, because</p><p>where (a) follows from <ref type="bibr">(27)</ref>. Therefore, &#946;</p><p>ij are the coefficients that form &#181; (t) i through a linear combination of p j 's.</p><p>In the E-step of the modified EM algorithm, since &#181; (t) i is absent, one cannot compute the conditional probability N (p j | &#181; (t) i , &#931;) and hence the posterior probability in <ref type="bibr">(10)</ref>. To resolve this issue, we recognize that since</p><p>ij is the coefficient for the mean vector as shown in <ref type="bibr">(29)</ref> The fact that the full Sinkhorn-Knopp does not resemble a complete EM algorithm offers some insights into the perfor-mance gain phenomenon. Recall from the results in Figure <ref type="figure">2</ref>, we observe that the first Sinkhorn-Knopp iteration always increases the PSNR except the artificial images we discussed in Section III.C. This can be interpreted as that the clustering is properly performed by the EM algorithm. However, as more Sinkhorn-Knopp iterations are performed, some images show reduction in PSNR, e.g., images 3, 4, 8, 10. A close look at these images suggests that they contain complex texture regions that are difficult to form few but distinctive clusters. In this case, the approximation of</p><p>ij is weak and hence the denoising performance drops.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Summary</head><p>To summarize our findings, we observe that the performance of the normalization is related to how the EM algorithm is being implemented. A summary of these findings is shown in Table <ref type="table">II</ref>. For all the three algorithms we considered: the original filter, the one-step Sinkhorn-Knopp, and the full Sinkhorn-Knopp algorithm, the EM algorithm is not completely performed or in-properly configured. For example, setting k = n causes excessive number of clusters and should be modified to k &lt; n; the MAP parameter &#955; is always 0 and should be changed to a positive value to utilize the data fidelity term in (P 1 ); the E-step and the M-step are not performed as it should be. Therefore, in the following section we will propose a new algorithm that completely utilizes the EM steps for problem (P 1 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. GMM SYMMETRIC SMOOTHING FILTER</head><p>The proposed denoising algorithm is called the Gaussian Mixture Model Symmetric Smoothing Filter (GSF). The overall algorithm of GSF consists of two steps:</p><p>&#8226; Step 1: Estimate the GMM parameter &#181; (r) i and &#947; ij from the noisy image the by EM algorithm.</p><p>&#8226; Step 2: Solve Problem (P 1 ), which has a closed form solution. In the followings we discuss how GSF is implemented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Closed Form Solution of GSF</head><p>First of all, we recall that since (P 1 ) is a quadratic minimization, it is possible to derive a closed form solution by considering the first order optimality condition, which yields a normal equation</p><p>where the vector w j is defined as</p><p>Equations ( <ref type="formula">30</ref>)-( <ref type="formula">31</ref>) has a simple interpretation: The intermediate vector w j is a weighted average of the mean vectors {&#181; (r) i } k i=1 . These {w j } n j=1 represent a collection of (denoised) overlapping patches. The operation P T j on the right hand side of (30) aggregates these overlapping patches, similar to the TABLE II: Generalization and comparisons using EM algorithm for learning GMM with a known &#931;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Original</head><p>One-step Full GSF Filter <ref type="bibr">[3]</ref> Sinkhorn-Knopp <ref type="bibr">[18]</ref> Sinkhorn-Knopp <ref type="bibr">[21]</ref> (Proposed) No. Clusters</p><p>aggregation step in BM3D <ref type="bibr">[32]</ref>. The addition of &#955;y regulates the final estimate by adding a small amount of fine features, depending on the magnitude of &#955;.</p><p>In order to use ( <ref type="formula">30</ref>)-( <ref type="formula">31</ref>), we must resolve two technical issues related to the EM algorithm and Problem (P 1 ): (i) How to determine &#955;; (ii) How to determine k.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Parameter &#955;</head><p>Ideally, &#955; should be chosen as the one that minimizes the mean squared error (MSE) of the denoised image. However, in the absence of the ground truth, MSE cannot be calculated directly. To alleviate this difficulty, we consider the Stein's Unbiased Risk Estimator (SURE) <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>. SURE is a consistent and unbiased estimator of the MSE. That is, SURE converges to the true MSE as the number of observations grows. Therefore, when there are sufficient number of observed pixels (which is typically true for images), minimizing the SURE is equivalent to minimizing the true MSE.</p><p>In order to derive SURE for our problem, we make an assumption about the boundary effect of P j .</p><p>Assumption 1: We assume that the patch-extract operator {P j } n j=1 satisfies the following approximation:</p><p>We note Assumption 1 only affects the boundary pixels and not the interior pixels. Intuitively, what Assumption 1 does is to require that the boundary pixels of the image are periodically padded instead of zero-padded. In the image restoration literature, periodic boundary padding is common when analyzing deblurring methods, e.g., <ref type="bibr">[35]</ref>.</p><p>Under Assumption 1, we can substitute (32) into (30) and take the matrix inverse. This would yield</p><p>where Then, we can derive the SURE of z as follows.</p><p>Proposition 1: Under Assumption 1, the SURE of z(&#955;) is</p><p>where &#963; 2 def = 1 n uy 2 , and</p><p>where e j &#8712; R d is the jth standard basis. Proof: See Appendix B. The SURE given in ( <ref type="formula">35</ref>) is a one-dimensional function in &#955;. The minimizer can be determined in closed-form.</p><p>Corollary 1: The optimal &#955; that minimizes SURE(&#955;) is</p><p>Proof: <ref type="bibr">(35)</ref> is a differentiable function in &#955;. Therefore, the minimizer can be determined by considering the first order optimality and set the derivative of SURE(&#955;) to zero. The projection operator max(&#8226;, 0) is placed to ensure that &#955; * &#8805; 0.</p><p>Example 3: To demonstrate the effectiveness of SURE, we show a typical MSE and a typical SURE curve of a denoising problem. In this example, we consider a 128 &#215; 128 image "Baboon", with noise standard deviation of &#963; = 30/255. The non-local means parameters are h r = &#963; and h s = 10. The number of clusters is k = 50, and the patch size is 5 &#215; 5. The results are shown in Figure <ref type="figure">5</ref>, where we observe that the SURE curve and the true MSE curve are very similar. In fact, the minimizer of the true MSE is &#955; = 8.0080 with a PSNR of 24.5143dB whereas the minimizer of SURE is &#955; = 7.9145 with a PSNR of 24.5141dB.</p><p>Remark 2: Careful readers may notice that in <ref type="bibr">(36)</ref>, we implicitly assume that &#947; ij is independent of y j . This implicit assumption is generally not valid if &#947; ij is learned from y. However, in practice, we find that if we feed the EM algorithm with some initial estimate (e.g., by running the algorithm with &#955; = 0), then the dependence of &#947; ij from y j becomes negligible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Number of k</head><p>The number of clusters k is another important parameter. We estimate k based on the concept of cross validation <ref type="bibr">[36]</ref>.</p><p>Our proposed cross-validation method is based on comparing the estimated covariance with &#931; NLM . More specifically, we compute the estimated covariance</p><p>where</p><p>i ] T is the mean returned by the EM algorithm, and</p><p>is the converged weight. Then, we compute the ratio of the deviation</p><p>Ideally, if &#931; i = &#931; NLM , then by (39) we have &#948; i (k) = 1. However, if the ith estimated Gaussian component has a radius significantly larger than h r (or, h s for the spatial components), then the covariance &#931; i would deviate from &#931; NLM and hence &#948; i (k) &gt; 1. Conversely, if the ith estimated Gaussian component has a radius significantly smaller than h r , then we will have &#948; i (k) &lt; 1. Therefore, the goal of the cross validation is to find a k such that &#948; i (k) is close to 1.</p><p>To complete the cross-validation setup, we average &#948; i (k) over all k clusters to obtain an averaged ratio</p><p>The parenthesis (k) in (40) emphasizes that both &#948;(k) and &#948; i (k) are functions of k. With (40), we seek the root k of the equation &#948;(k) = 1.</p><p>The root finding process for &#948;(k) = 1 can be performed using the secant method. Secant method is an extension of</p><p>Fig. <ref type="figure">6</ref>: Illustration of the secant method. Given k (a) and k (b) , we compute k (c) according to the slope defined by the line linking &#948; (a) and &#948; (b) .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 3 Cross Validation to Determine k</head><p>Input: k (a) and k (b) such that &#948; (a) &gt; 1 and &#948; (b) &lt; 1.</p><p>. end if end while the bisection method in which the bisection step size (i.e., 1/2) is now replaced by an adaptive step size determined by the local derivative of the function. Let k (a) and k (b) be two number of clusters, and &#948; (a) and &#948; (b) be the corresponding cross-validation scores, i.e., &#948; (a) = &#948;(k (a) ). If &#948; (a) &gt; 1 and &#948; (b) &lt; 1, the secant method computes the new k as</p><p>If &#948;(k (c) ) &gt; 1, then we replace k (a) by k (c) ; Otherwise, we replace k (b) by k (c) . The process repeats until the |k (a) -</p><p>A pictorial illustration of the secant method is shown in Figure <ref type="figure">6</ref>. A pseudo code is given in Algorithm 3.</p><p>Example 4: To verify the effectiveness of the proposed cross validation scheme, we consider a 128 &#215; 128 "House" image with noise &#963; = 60/255. The patch size is 5&#215;5, h r = &#963;, and h s = 10. Figure <ref type="figure">7</ref> shows the PSNR value of the denoised image and the corresponding cross validation score &#948;(k) as a function of k. For this experiment, the maximum PSNR is achieved at k = 144, where PSNR = 26.0257dB. Using the cross-validation score &#948;(k), we find that &#948;(k) is closest to 1 when k = 130. The corresponding PSNR value is 25.9896dB, which is very similar to the true maximum PSNR.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENTS</head><p>In this section, we present additional simulation results to evaluate the proposed GSF. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experiment Settings</head><p>We consider 10 testing images, each of which is resized to 128 &#215; 128 (so n = 16384) for computational efficiency. The noise standard deviations are set as &#963; &#8712; {20/255, 40/255, 60/255, 80/255, 100/255}. Several existing denoising algorithms are studied, namely the NLM <ref type="bibr">[3]</ref>, One-step Sinkhorn-Knopp <ref type="bibr">[18]</ref>, BM3D <ref type="bibr">[32]</ref>, EPLL <ref type="bibr">[25]</ref>, Global image denoising (GLIDE) <ref type="bibr">[12]</ref>, NL-Bayes <ref type="bibr">[37]</ref>, and PLE <ref type="bibr">[38]</ref>. The parameters of the methods are configured as shown in Table <ref type="table">III</ref>.</p><p>For NLM and One-step Sinkhorn-Knopp (One-step, in short), we use the spatially regulated version due to its better performance over the standard NLM. We implement the algorithms by setting the patch size as 5&#215;5 (i.e., d = 25). The parameters are h s = 10 and h r = &#963; &#8730; d. The full Sinkhorn-Knopp algorithm is implemented using GLIDE <ref type="bibr">[12]</ref>, where the source code is downloaded from the author's website <ref type="foot">1</ref> . Default settings of GLIDE are used in our experiment.</p><p>For the proposed GSF, we keep the same settings as NLM except for the intensity parameter h r where we set h r = &#963;. The omission of the factor &#8730; d is due to the fact that each Gaussian component is already a d-dimensional multivariate distribution. It is therefore not necessary to normalize the distance y iy j 2 by the factor d. For BM3D, EPLL, NL-Bayes, we downloaded the original source code from the author's website 2,3,4 . For PLE, we modified an inpainting version of the source code provided by the authors. Default settings of these algorithms are used.</p><p>Among these methods, we note that EPLL is an external denoising algorithm where a Gaussian mixture is learned from a collection of 2 million clean patches. All other methods (including GSF) are single image denoising algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>TABLE III: Configurations of Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Method</head><p>Configuration NLM <ref type="bibr">[3]</ref> Patch size 5 &#215; 5, h s = 10, h r = &#963; &#8730; d One-step <ref type="bibr">[18]</ref> Patch size 5 &#215; 5, h s = 10, h r = &#963; &#8730; d GSF (Ours) Patch size 5 &#215; 5, h s = 10, h r = &#963; GLIDE <ref type="bibr">[12]</ref> Default settings. Pilot estimate uses NLM. NL-Bayes <ref type="bibr">[37]</ref> Base mode. Default settings. PLE <ref type="bibr">[38]</ref> Default settings. Default initializations. BM3D <ref type="bibr">[32]</ref> Default settings. EPLL <ref type="bibr">[25]</ref> Default settings. External Database.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Comparison with NLM, One-step and Full Sinkhorn-Knopp</head><p>The overall results of the experiment are shown in Table <ref type="table">VI</ref>. We first compare the PSNR values of GSF with NLM, Onestep and full Sinkhorn-Knopp.</p><p>In Table <ref type="table">IV</ref> we show the average PSNR over the 10 testing images. In this table, we observe that on average One-step has a higher PSNR than NLM by 0.12dB to 1.12dB, with more significant improvements at low noise levels. This implies that the "grouping" action by the column normalization becomes less influential when noise increases. Moreover, if we compare GSF with NLM and One-step, we observe that the PSNR gain is even larger. Even at a high noise level (e.g., &#963; = 80/255 or &#963; = 100/255), the average gain from NLM is 2.5dB or more. Besides studying the trend of PSNR as a function of &#963;, it is also interesting to compare the PSNR when we increase the spatial parameter h s . In Table <ref type="table">V</ref>, we show the PSNR improvement when we use different h s &#8712; {5, 10, 20, 50, 100} for a 128 &#215; 128 image. The results show that when h s increases, the PSNR improvement also increases. One reason is that in <ref type="bibr">(7)</ref>, the spatial parameter h s controls the diagonal bandwidth of the smoothing filter W . That is, a small h s leads to a banded diagonal W with small bandwidth. In the limit when h s &#8594; 0, W will become a diagonal matrix, and hence is immune to any column normalization. Therefore, the effectiveness of the column normalization in the One-step depends on how large h s is.</p><p>The full Sinkhorn-Knopp algorithm is implemented using GLIDE <ref type="bibr">[12]</ref>. GLIDE consists of multiple steps: It first determines the weight matrix, followed by a full Sinkhorn-Knopp algorithm that symmetrizes the weight matrix. Then, it incorporates an estimator to optimally determine the number of non-zero eigenvalues and the power of eigenvalues of the smoothing filter. GLIDE can use any denoising result as its pilot estimate. For the fairness of the experiment we follow the default setting of GLIDE and use the standard NLM as the  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Comparison with NL-Bayes, PLE and EPLL</head><p>Since the proposed GSF uses a Gaussian mixture model, we compare it with three other algorithms that also use Gaussian mixture models. These algorithms are the NL-Bayes <ref type="bibr">[37]</ref>, the piecewise linear estimator (PLE) <ref type="bibr">[38]</ref> and the EPLL <ref type="bibr">[25]</ref>.</p><p>Methodologically, there are some important differences between GSF, NL-Bayes, PLE and EPLL. NL-Bayes has a grouping procedure that groups similar patches, a process similar to BM3D. The grouped patches are used to estimate the empirical mean and covariance of a single Gaussian. In GSF, there is no grouping. The other difference is that the denoising of NL-Bayes is performed by a conditional expectation over the single Gaussian. In GSF, the denoising is performed over all clusters of the GMM. Experimentally, we observe that GSF and NL-Bayes have similar performance, with NL-Bayes better in the low noise conditions and GSF better in the high noise conditions. One possible reason is that the grouping of NL-Bayes uses the standard Euclidean distance as a metric, which is not robust to noise.</p><p>PLE first estimates the model parameters using the noisy image. Then, for every patch, the algorithm selects a single Gaussian. The denoising is performed by solving a quadratic minimization and is performed for each patch individually. The algorithm iterates by improving the model parameters and the denoised estimate until convergence. GSF does not have this iterative procedure. Once the GMM is learned, the denoising is performed in closed-form. The other difference is that PLE requires a good initialization and is very sensitive to the initialization. Experimentally, we find that GSF performs better than PLE using a MATLAB code provided the authors of PLE. In this MATLAB code, the initialization was originally designed for image inpainting at a particular image resolution. Because of the sensitivity of PLE to the initializations, its performance on denoising is not particularly good. With a better initialization, we believe that PLE would improve. However, even so the gap between GSF and PLE will unlikely be significant because PLE performs worse that BM3D and EPLL.</p><p>The closest comparison to GSF is EPLL as both algorithms solve a whole-image MAP minimization with a Gaussian mixture model. To evaluate the difference between the two algorithms, we consider an experiment by feeding the noisy patches the EM algorithm to learn a GMM. The patch size is fixed at 5 &#215; 5, and the number of clusters is fixed as k = 100. We repeat the experiment by inputting the denoised result of BM3D and the oracle clean image into the EM algorithm.</p><p>From Table <ref type="table">VIII</ref>, we observe that EPLL with a noisy input performs poorly. The reason is that the original EPLL trains the GMM from 2 million clean patches. When feeded with noisy images, the GMM trained becomes a non-informative prior distribution. Moreover, in EPLL the GMM involves (&#960; i , &#181; i , &#931; i ) whereas in GSF the GMM only involves (&#960; i , &#181; i ). This is a significant reduction in the number of model parameters. When feeded with only a single image, there is insufficient training sample for EPLL.</p><p>Another observation from Table VIII is that the performance of EPLL depends heavily on the quality of the GMM. 
For example, if we use the result of BM3D as a pilot estimate for learning the GMM, the performance of EPLL is similar to the oracle case where we use the clean image. However, using BM3D as a pilot estimate is not a plausible approach because by running BM3D alone we can get an even higher PSNR (See Table <ref type="table">VI</ref>). This result further shows the effectiveness of the proposed GSF for single image denoising. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Complexity and Limitations</head><p>Finally, we discuss the complexity and limitations of the proposed GSF.  GSF is a one-step denoising algorithm when &#947; ij and &#181; (r) i are known. However, learning the GMM using the EM algorithm is time-consuming, and the complexity depends on the number of clusters k. In addition, since k needs to be estimated through a cross-validation scheme, the actual complexity also depends on the number of cross-validation steps. To provide readers an idea of how k changes with other system parameters, we conduct two experiments.</p><p>In Table <ref type="table">IX</ref> we show the number of clusters returned by the cross-validation scheme as we increase the noise level. As shown, the number of clusters increases when noise level reduces. This result is consistent with our intuition: As noise reduces, the grouping of patches becomes less important. In the limit when the image is noise-free, every patch will become its own cluster center. Therefore, one limitation of GSF is that for low-noise images the computing time could be very long. However, GSF is still a useful tool as its simple structure offers new insights to denoising. Now, if we fix the noise level but change the image size, the complexity of GSF also varies. In Table <ref type="table">X</ref>, we show the number of clusters as a function of image size. As a reference we also show the PSNR values of GSF and that of BM3D. The result in Table X indicates that the number of clusters increases with the image size. In the table, we also observe that BM3D performs worse than GSF for small images, but becomes better as image size increases.</p><p>Remark 3 (Subtraction of mean): It is perhaps interesting to ask whether it is possible to subtract the mean before learning the GMM, as it could reduce the number of clusters. However, from our experience, we find that this actually degrades the denoising performance. If the GMM is learned from a collection of zero-mean patches, the denoising step in (P 1 ) can only be used to denoise zero-mean patches. The mean values, which are also noisy, are never denoised. This phenomenon does not appear in EPLL (in which the GMM has a zero-mean) because the means are iteratively updated. We followed the same approach to iteratively update the means. However, we find that in general the denoising performance is still worse than the original GMM with means included. Further exploration on this would likely provide more insights into the complexity reduction issue.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION</head><p>Motivated by the performance gain due to a column normalization step in defining the smoothing filters, we study the origin of the symmetrization process. Previous studies have shown that the symmetrization process is related to the Sinkhorn-Knopp balancing algorithm. In this paper, we further showed that the symmetrization is equivalent to an EM algorithm of learning a Gaussian mixture model (GMM). This observation allows us to generalize various symmetric smoothing filters including the Non-Local Means (NLM), the one-step Sinkhorn-Knopp and the full Sinkhorn-Knopp, and allows us to geometrically interpret the performance gain phenomenon.</p><p>Based on our findings, we proposed a new denoising algorithm called the Gaussian mixture model symmetric smoothing filters (GSF). GSF is a simple modification of the denoising framework by using the GMM prior for the maximum-aposteriori estimation. Equipped with a cross-validation scheme which can automatically determine the number of clusters, GSF shows consistently better denoising results than NLM, One-step Sinkhorn-Knopp and full Sinkhorn-Knopp. While GSF has slightly worse performance than state-of-the-art methods such as BM3D, its simple structure highlights the importance of clustering in image denoising, which seems to be a plausible direction for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGEMENT</head><p>We thank Dr. Guosheng Yu and Prof. Guillermo Sapiro for sharing the code of PLE <ref type="bibr">[38]</ref>. We also thank the anonymous reviewers for the constructive feedback that helps to significantly improve the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Proof of Proposition 1</head><p>Given an estimator z of some observation y, the SURE is defined as</p><p>Substituting ( <ref type="formula">33</ref>) into (42) yields</p><p>where &#963; 2 def = 1 n uy 2 . So it remains to determine div( z). </p><p>Substituting ( <ref type="formula">45</ref>) and ( <ref type="formula">43</ref>) into (42) completes the proof.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>GLIDE: https://users.soe.ucsc.edu/ &#8764; htalebi/GLIDE.php</p></note>
		</body>
		</text>
</TEI>
