<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Local Centroids Structured Non-negative Matrix Factorization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2017</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10041967</idno>
					<idno type="doi"></idno>
					<title level='j'>Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>H Gao</author><author>F Nie</author><author>H Huang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Non-negative Matrix Factorization (NMF) has attracted much attention and been widely used in real-world applications. As a clustering method, it fails to handle the case where data points lie in a complicated geometry structure. Existing methods adopt single global centroid for each cluster, failing to capture the manifold structure. In this paper, we propose a novel local centroids structured NMF to address this drawback. Instead of using single centroid for each cluster, we introduce multiple local centroids for individual cluster such that the manifold structure can be captured by the local centroids. Such a novel NMF method can improve the clustering performance effectively. Furthermore, a novel bipartite graph is incorporated to obtain the clustering indicator directly without any post process. Experiments on both toy datasets and real-world datasets have verified the effectiveness of the proposed method.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Non-negative Matrix Factorization (NMF) has attracted extensive attention during the past two decades. It is to factorize a non-negative matrix into two non-negative factor matrices. Such a non-negativity property makes it easy to interpret for real-world data mining applications, compared with the general matrix factorization method, such as SVD and QR whose elements can be both positive and negative.</p><p>The seminal works <ref type="bibr">(Lee and Seung 1999;</ref><ref type="bibr">2001)</ref> give a parts-based-representation explanation about the factor matrices and propose how to compute them efficiently, which facilitates the development of NMF. Furthermore, the relationship between NMF and K-means, spectral clustering is disclosed in <ref type="bibr">(Ding, He, and Simon 2005)</ref> where NMF is proved to be a clustering method. To improve its clustering performance, <ref type="bibr">(Ding et al. 2006)</ref> proposed the orthogonal NMF where the indicator matrix has the orthogonal constraint. Such a constraint makes the indicator matrix unique and be easily interpreted in a clustering view <ref type="bibr">(Ding et al. 2006)</ref>. Then, NMF is widely used for clustering problems.</p><p>To address different problems, numerous variants have been proposed in the past ten years. For instance, semi-NMF and convex-NMF are proposed in <ref type="bibr">(Ding, Li, and Jordan 2010)</ref>. Semi-NMF relaxes the non-negative constraint of data matrix and basis matrix, while convex-NMF restricts the basis is a convex combination of the data points. Tri-Factorization is proposed in <ref type="bibr">(Ding et al. 2006)</ref> to factorize a non-negative data matrix into three non-negative factor matrices. This model performs clustering on both row space and column space simultaneously, which makes it suitable for document analysis. To tackle noises and outliers existing in data matrix, 2,1 -norm NMF and capped norm NMF are proposed to solve different applications <ref type="bibr">(Kong, Ding, and Huang 2011;</ref><ref type="bibr">Gao et al. 2015;</ref><ref type="bibr">Huang et al. 2014;</ref><ref type="bibr">Wang, Nie, and Huang 2015)</ref>. With these models, NMF has been widely used for many data mining applications, such as face recognition <ref type="bibr">(Gao et al. 2015</ref>), text and document mining <ref type="bibr">(Ding, Li, and Peng 2006)</ref>, human action analysis <ref type="bibr">(Wang, Nie, and Huang 2016</ref>) and so on.</p><p>As a clustering method, the two factor matrices of NMF correspond to the clustering centroid and indicator respectively. The clustering result is indicated by the maximal element of each indicator vector, which means the corresponding centroid can be viewed as the representative of that cluster. Almost all the existing NMF methods use only one centroid to represent each cluster. Such a centroid can be viewed as a global representative and it depicts its cluster very coarsely due to losing a lot of local manifold structure. Data points in real-world applications usually lie in a complicated geometry structure, the global centroid is easy to fail in depicting its cluster. Therefore, how to capture the complicated local manifold structure is important to NMF.</p><p>In this paper, we propose a novel local centroids structured NMF to handle data points lying in a complicated geometry structure. The novelty lies in three aspects. First, instead of using single centroid for each cluster, we introduce multiple local centroids for individual cluster such that the manifold structure can be captured by the local centroids. Second, although there exists multiple local centroids in each cluster, we do not employ all of them to represent each data point. To preserve the local manifold information, we only adopt the nearby centroids to represent each data point. Third, since each data point is encoded by a few nearby centroids, the clustering indicator cannot be obtained directly. Our method employs a bipartite graph and enforces it with exact c (c is the number of clusters) connected components to get the clustering indicator directly. Extensive experiments have been performed to verify the correctness and effectiveness of our novel method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>In this section, we will review the related work and give the motivation of our method.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non-negative Matrix Factorization</head><p>The classic NMF <ref type="bibr">(Lee and Seung 1999;</ref><ref type="bibr">2001)</ref> is to factorize X into two factor matrices as follows:</p><p>where X &#8712; d&#215;n is a non-negative data matrix, F &#8712; d&#215;k and G &#8712; n&#215;k . As a clustering method <ref type="bibr">(Ding, He, and Simon 2005)</ref>, k is set as the number of clusters. Therefore, each column of F corresponds to a cluster centroid, and each row of G is the clustering indicator of each sample. The final clustering result is indicated by the index of maximal element in each row of G.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Graph Regularized NMF</head><p>Actually, data points in real applications usually lie in a low dimensional manifold rather than distribute uniformly in the high dimensional ambient space, which means that data points have complicated relationships, i.e. the similarity should be measured via geometry structure or manifold, not the standard Euclidean distance. If the method can capture the local manifold structure, the similarity between data points can be properly measured such that the clustering results are enhanced. Graph regularized NMF (GNMF) proposed in <ref type="bibr">(Cai et al. 2011)</ref> tackles the local manifold information by incorporating a nearest neighbor graph into NMF as follows:</p><p>where L &#8712; n&#215;n is Laplacian matrix defined as L = D -W . W &#8712; n&#215;n is the affinity matrix of data points, and D &#8712; n&#215;n is a diagonal matrix defined as</p><p>With the graph regularization, two close data points in the original space are enforced to be close in the low dimensional space which is spanned by columns of F . In other words, GNMF preserves the local manifold information during factorization so that the clustering performance will be improved.</p><p>However, data points in real applications usually lie in a complicated geometry structure, such as two crescent shapes in Figure <ref type="figure">1</ref>, which is difficult to existing NMF methods. In Figure <ref type="figure">2</ref>, we show the clustering result of NMF and GNMF running on this toy data set. Both NMF and GNMF fail to cluster these data points correctly. Although GNMF incorporates local manifold information and achieves better result, yet it still fails to find the correct clustering result.</p><p>Why do these methods fail? From Figure <ref type="figure">2</ref>, we can find the centroids found by NMF and GNMF are not proper to represent its own group. Thus, some data points near to the opposite centroid are mis-clustered. In more details, there is only one centroid to represent each cluster globally. Such a global centroid can only represent a cluster very coarsely so that the local manifold structure cannot be captured. That is the reason of GNMF's failure. GNMF only incorporates local manifold information into the new representation G but not the centroid F . In this paper, we will propose a novel NMF method to address this difficult problem so that data points in Figure <ref type="figure">1</ref> can be grouped correctly. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Local Centroids Structured Non-negative Matrix Factorization</head><p>In this section, we will propose our novel non-negative matrix factorization, that is Local Centroids Structured Nonnegative Matrix Factorization. Given a non-negative dataset</p><p>d&#215;n whose distribution may have a complicated geometry structure, and assume it has c clusters. Existing NMF methods fail to tackle such a dataset since their centroids cannot capture the local manifold structure. Instead of employing single centroid for each cluster, we introduce multiple local centroids for individual cluster such that the manifold structure can be captured by the local centroids. A direct implementation is as follows:</p><p>where F &#8712; d&#215;k and G &#8712; n&#215;k . Here, k is a hyperparameter, and we set k = mc where m &gt; 1 is an integer so that each cluster will have multiple centroids. This is the difference between Eq. (3) and Eq. ( <ref type="formula">1</ref>) whose k = c. With multiple centroids, the new basis F will have more powerful ability to catpure the manifold information than existing NMF methods. Moreover, existing methods usually incorporate manifold information by the regularization term, which is not enough to capture it. Here, we incorporate manifold information into the cost function, which is the first work to do so as far as we know. However, there are two weaknesses in this model. First, although the basis F incorporates the local manifold structure, yet the new representation G ignores the local manifold information. As a result, the clustering result is suboptimal. Second, due to the multiple centroids, we cannot directly get the clustering indicator from G like existing methods. We have to perform K-means on G to get the final discrete clustering indicator. However, K-means is sensitive with initialization, and it is a non-convex problem so that it may converge to a local suboptimal solution. To address these problems, we propose the following model:</p><p>where &#955; &gt; 0, G i. is the i-th row of G, ||.|| 0 denotes 0norm and s &gt; 0 is a hyper-parameter. R(G) is the regularization with respect to G to obtain the clustering result directly, which will be defined later.</p><p>Here, to incorporate the local manifold information into new representation G, we restrict ||G i. || 0 &#8804; s which means each data point is represented by at most s centroids. Intuitively, for a data point locating in a complicated geometry structure, if using all of centroids in this geometry structure to represent it, it may be disturbed by some far away centroids, destroying the local manifold structure. On the other hand, the nearby centroids have most of manifold information so that a data point can be represented by only a few nearby centroids.</p><p>When adopting multiple local centroids, the largest element in each row of G cannot indicate the final clustering result as the conventional NMF method. To obtain the clustering indicator directly, we construct a bipartite graph with the affinity matrix S as follows:</p><p>where S &#8712; (n+k)&#215;(n+k) . Ideally, data points from a certain cluster will be represented by only the centroids from the same cluster. Therefore, they will constitute a connected components in this graph. By enforcing the bipartite graph with exact c connected components, then we can directly obtain the clustering indicator based on the graph partition.</p><p>We show an illustration of bipartite graph in Figure <ref type="figure">3</ref>. Data points in each group are represented by their own centroids, thus each group is a connected component. Furthermore, to enforce a graph with exact c connected components, we need the following theorem <ref type="bibr">(Mohar et al. 1991)</ref>. Theorem 1. The number of connected components in the graph is equal to the multiplicity c of the eigenvalue 0 of its Laplacian matrix L S . </p><p>Then, to directly obtain the final clustering result, we have our final model as follows:</p><p>where &#955; &gt; 0 is a hyper-parameter. With the second term in the objective function, the bipartite graph is enforced to have c connected components. As a result, we can directly obtain the clustering indicator according to the bipartite graph's partition .</p><p>In a summary, our proposed method adopts multiple local centroids to capture the local manifold structure. At the same time, each data point is restricted to be represented by only a few nearby centroids to preserve the local manifold information. Furthermore, a bipartite graph is constructed to be enforced with exact c connected components, thus our method can obtain the clustering result directly from the partition of this graph.</p><p>In the next section, we will propose an efficient algorithm to solve this non-convex problem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimization Algorithm</head><p>The problem defined in Eq. ( <ref type="formula">7</ref>) is a non-convex problem. Here, we adopt the alternating algorithm to solve it efficiently.</p><p>Update P When fixing F and G, we have the following problem with respect to P :</p><p>The optimal solution is the eigenvectors corresponding to the c smallest eigenvalues of L S .</p><p>Update F When fixing P and G, we have min</p><p>which can be solved by multiplicative method (Lee and Seung 2001) as follows:</p><p>Algorithm 1 Algorithm to solve Eq. ( <ref type="formula">15</ref>)</p><p>Step 1: compute the gradient</p><p>Step 2: projected gradient descent g = max(g&#951; * &#8711; g , 0)</p><p>Step 3: solve 0 -norm constraint</p><p>Update G When fixing P and F , we have</p><p>To solve Eq. ( <ref type="formula">11</ref>), we need the following lemma, which is proved in (Von Luxburg 2007). Lemma 1. Given an affinity matrix W &#8712; n&#215;n and a diagonal matrix D defined as D i,i = j W ij , for its Laplacian matrix L &#8712; n&#215;n defined as L = D -W and a matrix P &#8712; n&#215;c , we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>T r(P</head><p>where</p><p>, and p i is the i-th row of matrix P . According to Lemma 1, we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>T r(P T L</head><p>where A = Q T 21 + Q 12 . Therefore, Eq. ( <ref type="formula">11</ref>) can be reformulated as follows,</p><p>where &#945; = &#955;/2 is a large number to ensure there exists c connected components. It can be further decoupled to solve each column g i of G T separately as follows,</p><p>where x i is the i-th corresponding column of X, a i is the i-th row of A. This problem can be solved by projected gradient descent method as shown in Algorithm 1. Finally, the optimization algorithm to solve Eq. ( <ref type="formula">7</ref>) is summarized in Algorithm 2.</p><p>Algorithm 2 Algorithm to solve Eq. ( <ref type="formula">7</ref>) Initialize F and G by K-means. repeat Update P by solving Eq. ( <ref type="formula">8</ref>):</p><p>min</p><p>Update F by Eq. ( <ref type="formula">10</ref>):</p><p>Update G: solve each row of G by Algorithm 1. until Converges Initialization Since Eq. ( <ref type="formula">7</ref>) is a non-convex problem, it is important to give it a good initialization. Following <ref type="bibr">(Gao et al. 2015)</ref>, K-means is taken to initialize it. Specifically, one can perform K-means on data points to get k clusters where k = mc so that there should be multiple local centroids. F can be initialized with these centroids, and G can be initialized by the similarity between each data point and the nearest s centroids.</p><p>Determination of &#955; The stop criterion is that there are exact c connected components. Therefore, it is important to tune the parameter &#955;. Here, we use the method proposed in <ref type="bibr">(Nie, Wang, and Huang 2014)</ref> where &#955; can be automatically adjusted to obtain exact c connected components. More details can be found in the corresponding literature.</p><p>Complexity Analysis When updating P , the complexity is dominated by eigenvalue decomposition which is O((n + k) 3 ). The complexity of updating F is O(ndk), and it is also O(ndk) for updating G. Compared with conventional NMF whose complexity is O(ndc), the increased computation is mainly due to updating P .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment</head><p>In this section, we will verify the correctness and effectiveness of our proposed method on both toy dataset and real world datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Toy Dataset</head><p>Here, we run our proposed method on the complicated toy dataset as shown in Figure <ref type="figure">1</ref> where the first 201 data points belong to group 1, and the rest belong to group 2. In our experiment, we use K-means to initialize F and G, and k is set as 10. Thus, this toy dataset is clustered into 10 groups by Kmeans, just as shown in Figure <ref type="figure">4</ref>. These centroids are used to initialize F . Additionally, the parameter s in Eq. ( <ref type="formula">7</ref>) is set as 2, which means each data point is represented by only two nearest centroids, just as shown in Figure <ref type="figure">5</ref>. G is then initialized with the similarity between data points and their two nearest centroids. Apparently, there exists some connections between these two groups. After running our proposed method with the above initialization, it can be clustered perfectly.</p><p>To verify the result, we show the image of GG T in Figure <ref type="figure">6</ref>. Note that GG T denotes the similarity matrix between data points. In Figure <ref type="figure">6</ref>(a), there exist non-zero connections between the first 201 data points and the remained 201 data points. In Figure <ref type="figure">6</ref>(b), there is no connection between two groups which means that our method can cluster this complicated toy dataset correctly.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Real World Dataset</head><p>To further verify the effectiveness of our proposed method, we evaluate it on four real world benchmark datasets.</p><p>Data Description The details about four benchmark dataset are described as follows.  &#8226; ORL <ref type="bibr">(Samaria and Harter 1994</ref>) is a face recognition benchmark dataset. It consits of 40 subjects, and each subject has 10 images taken under various conditions. The task is to cluster images from the same subject together.</p><p>Here, each image is scaled to 28 &#215; 23.</p><p>&#8226; UMIST <ref type="bibr">(Graham and Allinson 1998)</ref> is also a face recognition benchmark dataset. There are 20 different subjects.</p><p>Each subject is taken images from different views, obtaining 575 images totally. Each image in our experiment is rescaled to 28 &#215; 23.</p><p>&#8226; PIE <ref type="bibr">(Sim, Baker, and Bsat 2002)</ref> is another face recognition benchmark dataset, which contains 68 subjects. Each subject is taken 42 images under different conditions.</p><p>Here, we randomly select 10 images for each subject, and each images is rescaled to 32 &#215; 32.</p><p>&#8226; COIL20 <ref type="bibr">(Nene et al. 1996</ref>) is an object recognition benchmark dataset. There are totally 20 objects, and each object has 72 images. Each image is resized to 32 &#215; 32 in our experiment.</p><p>In our experiment, each data point is normalized to have an unit length. We summarize these datasets in Table <ref type="table">1</ref>.</p><p>Experiment Setup To evaluate the performance of our proposed method, we compare it with five state-of-the-art methods as follows. With in line mode this is typeset as lim x&#8594;2 f (x) = 5</p><p>&#8226; NMF <ref type="bibr">(Lee and Seung 2001)</ref> is defined in Eq. (1). In our experiment, we set k as the number of classes.</p><p>&#8226; Orthogonal NMF (ONMF) <ref type="bibr">(Ding et al. 2006</ref>) is defined as min</p><p>, where F &#8712; d&#215;k , G &#8712; n&#215;k . Theoretically, ONMF has better clustering performance than conventional NMF. We also set k as the number of classes. </p><p>where F &#8712; d&#215;k , G &#8712; n&#215;k . Here, k is set as the number of classes. This method is robust to noises and outliers theoretically.</p><p>&#8226; Graph regularized NMF (GNMF) <ref type="bibr">(Cai et al. 2011</ref>) is defined in Eq. ( <ref type="formula">2</ref>). k is also set as the number of classes in our experiment.</p><p>Other than these NMF methods, we also compare it with K-means. For all the above NMF methods except GNMF, the clustering result is directly indicated by the largest element in each row of G. GNMF needs K-means as the postprocessing step to get clustering indicator since its G has no clear structure. We run all the methods for 10 times and report the average clustering accuracy (ACC) and normalized mutual information (NMI) in Table <ref type="table">2</ref>.</p><p>Here, we set the number of centroids k in our method around 80%-90% of the number of data points in each cluster, and each data point is restricted to be represented by 3-5 nearby centroids. This setting is consistent with our intuition and verified by our experiment on the toy dataset. Only with a lot of local centroids can a complicated geometry structure be characterized correctly. Furthermore, for a data point lying a complicated structure, only nearby centroids can provide its manifold information and the far away centroids provide considerably few information. Therefore, the setting of our method is reasonable and further verified by Table <ref type="table">2</ref>. From Table <ref type="table">2</ref>, we can find that our method and GNMF have better performance than the others due to incorporating manifold information, and our method is better than GNMF due to the multiple local centroids.</p><p>To further verify the performance of our method, we plot GG T of different methods running on the ORL dataset in Figure <ref type="figure">7</ref>. Apparently, our method has much clearer block diagonal structure than others which means our method has better clustering result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>In this paper, we propose a novel local centroids structured non-negative matrix factorization. This method can successfully handle the dataset with a complicated geometry structure. Specifically, our method use multiple local centroids to capture the local manifold structure. At the same time, the new representation preserves local manifold information by enforcing each data point to be represented by only a few nearby centroids. Furthermore, a novel bipartite graph is adopted to ensure there exist exact c connected components so that our method can get clustering result directly. The experimental results on both toy dataset and real world datasets have shown the success of our proposed method.  </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence </p></note>
		</body>
		</text>
</TEI>
