<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Exponentially Convergent Algorithms for Supervised Matrix Factorization</title></titleStmt>
			<publicationStmt>
				<publisher>Advances in Neural Information Processing Systems</publisher>
				<date>08/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10534723</idno>
					<idno type="doi"></idno>
					
					<author>Joowon Lee</author><author>Hanbaek Lyu</author><author>Weixin Yao</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Supervised matrix factorization (SMF) is a classical machine learning method that simultaneously performs feature extraction and classification, two tasks that are not necessarily a priori aligned objectives. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. Training an SMF model involves solving a nonconvex and possibly constrained optimization problem with at least three blocks of parameters. Known algorithms are either heuristic or provide weak convergence guarantees for special cases. In this paper, we provide a novel framework that 'lifts' SMF as a low-rank matrix estimation problem in a combined factor space and propose an efficient algorithm that provably converges exponentially fast to a global minimizer of the objective with arbitrary initialization under mild assumptions. Our framework applies to a wide range of SMF-type problems for multi-class classification with auxiliary features. To showcase an application, we demonstrate that our algorithm successfully identified well-known cancer-associated gene groups for various cancers.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In classical classification models, such as logistic regression, a conditional class-generating probability distribution is modeled as a simple function of the observed features with unknown parameters to be trained. However, the raw observed features may be high-dimensional, and most of them might be uninformative and hard to interpret (e.g., pixel values of an image). Therefore, it would be desirable to extract more informative and interpretable low-dimensional features prior to the classification task. For instance, the multi-layer perceptron or deep neural networks (DNN) in general <ref type="bibr">[8,</ref><ref type="bibr">9]</ref> use additional feature extraction layers prior to the logistic regression layer. This allows the model itself to learn the most effective (supervised) feature extraction mechanism and the association of the extracted features with class labels simultaneously.</p><p>Matrix factorization (MF) is a classical unsupervised feature extraction framework, which learns latent structures of complex datasets and is regularly applied in the analysis of text and images <ref type="bibr">[13,</ref><ref type="bibr">32,</ref><ref type="bibr">44]</ref>. Various matrix factorization models such as singular value decomposition (SVD), principal component analysis (PCA), and nonnegative matrix factorization (NMF) provide fundamental tools for unsupervised feature extraction tasks <ref type="bibr">[17,</ref><ref type="bibr">55,</ref><ref type="bibr">2,</ref><ref type="bibr">25]</ref>. Extensive research has been conducted to adapt matrix factorization models to perform classification tasks by supervising the matrix factorization process using additional class labels. Note that matrix factorization and classification are not necessarily aligned objectives, so some degree of trade-off is necessary when seeking to achieve both goals simultaneously. 
Supervised matrix factorization (SMF) provides systematic approaches for such multi-objective tasks. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. The general framework of SMF was introduced in <ref type="bibr">[33]</ref>. A similar SMF-type framework of discriminative K-SVD was proposed for face recognition <ref type="bibr">[61]</ref>. A stochastic formulation of SMF was proposed in <ref type="bibr">[30]</ref>. SMF has also found numerous applications in various other problem domains, including speech and emotion recognition <ref type="bibr">[16]</ref>, music genre classification <ref type="bibr">[62]</ref>, concurrent brain network inference <ref type="bibr">[62]</ref>, structure-aware clustering <ref type="bibr">[59]</ref>, and object recognition <ref type="bibr">[27]</ref>. Recently, supervised variants of NMF, as well as PCA, were proposed in <ref type="bibr">[5,</ref><ref type="bibr">26,</ref><ref type="bibr">47]</ref>. See also the survey work of <ref type="bibr">[15]</ref> on SMF.</p><p>Various SMF-type models have been proposed in the past two decades. We divide them into two categories depending on whether the extracted low-dimensional feature or the feature extraction mechanism itself is supervised. We refer to them as feature-based and filter-based SMF, respectively. Feature-based SMF models include the classical ones by Mairal et al. (see, e.g., <ref type="bibr">[33,</ref><ref type="bibr">30]</ref>) as well as the more recent model of convolutional matrix factorization by <ref type="bibr">[22]</ref> for a contextual text recommendation system. 
Filter-based SMF models have been studied more recently in the supervised matrix factorization literature, most notably from supervised nonnegative matrix factorization <ref type="bibr">[5,</ref><ref type="bibr">26]</ref> and supervised PCA <ref type="bibr">[47]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions</head><p>In spite of the vast literature on SMF, due to the high non-convexity of the associated optimization problem (see ( <ref type="formula">4</ref>)), algorithms for SMF mostly lack rigorous convergence analysis, and there has not been any algorithm that provably converges to a global minimizer of the objective at an exponential rate. We summarize our contributions below.</p><p>&#8226; We formulate a general class of SMF-type models (including both the feature- and the filter-based ones) with high-dimensional features as well as low-dimensional auxiliary features (see ( <ref type="formula">4</ref>)). &#8226; We provide a novel framework that 'lifts' SMF as a low-rank matrix estimation problem in a combined factor space and propose an efficient algorithm that converges exponentially fast to a global minimizer of the objective with an arbitrary initialization (Theorem 3.5). We numerically validate our theoretical results (see Fig. <ref type="figure">2</ref>). &#8226; We theoretically compare the robustness of filter-based and feature-based SMF, establishing that the former is computationally more robust (see Theorem 3.5) while the latter is statistically more robust (see Theorem 4.1). &#8226; Applying our method to microarray datasets for cancer classification, we show that not only is it competitive with benchmark methods, but it is also able to identify groups of genes including well-known cancer-associated genes (see Fig. <ref type="figure">3</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Notations</head><p>Throughout this paper, we denote by R^p the ambient space for data, equipped with the standard inner product ⟨·, ·⟩ that induces the Euclidean norm ∥·∥. We denote by {0, 1, . . . , κ} the space of class labels with κ + 1 classes. For a convex subset Θ of a Euclidean space, we denote by Π_Θ the projection operator onto Θ. For an integer r ≥ 1, we denote by Π_r the rank-r projection operator for matrices. For a matrix A = (a_ij) ∈ R^{m×n}, we denote its Frobenius, operator (2-), and supremum norms by ∥A∥_F, ∥A∥_2, and ∥A∥_∞, respectively. For each 1 ≤ i ≤ m and 1 ≤ j ≤ n, we write A[i, :] and A[:, j] for the ith row and the jth column of A, respectively. For each integer n ≥ 1, I_n denotes the n × n identity matrix. For square symmetric matrices A, B ∈ R^{n×n}, we write A ⪯ B if v^T A v ≤ v^T B v for all unit vectors v ∈ R^n. For two matrices A and B, we denote by [A, B] and [A ∥ B] the matrices obtained by concatenating (stacking) them horizontally and vertically, respectively, assuming matching dimensions.</p></div>
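For concreteness, the rank-r projection Π_r can be implemented via a truncated SVD (Eckart–Young). The following is a minimal NumPy sketch (our own illustrative code; the function name is ours, not the paper's):

```python
import numpy as np

def rank_r_projection(A, r):
    """Pi_r: project A onto the set of matrices of rank at most r
    by keeping only its r largest singular values (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
A2 = rank_r_projection(A, 2)  # best rank-2 approximation of A in Frobenius norm
```

Note that Π_r is idempotent and, by Eckart–Young, the residual ∥A − Π_r(A)∥_F equals the norm of the discarded singular values.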
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Model setup</head><p>Suppose we are given n labeled signals (y_i, x_i, x′_i) for i = 1, . . . , n, where y_i ∈ {0, 1, . . . , κ} is the label, x_i ∈ R^p is a high-dimensional feature of i, and x′_i ∈ R^q is a low-dimensional auxiliary feature of i (p ≫ q). For a vivid context, think of x_i as the X-ray image of a patient i and x′_i as some biological measurements, such as gender, smoking status, and body mass index. When predicting y_i, we use a suitable r (≪ p)-dimensional compression of the high-dimensional feature x_i as well as the low-dimensional feature x′_i as-is. We assume such compression is done by some matrix of (latent) factors W = [w_1, . . . , w_r] ∈ R^{p×r} that is reconstructive in the sense that the observed signals x_i can be reconstructed as (or approximated by) a linear transform of the 'atoms' w_1, . . . , w_r ∈ R^p with some suitable 'code' h_i ∈ R^r. More concisely, X_data = [x_1, . . . , x_n] ≈ WH, where H = [h_1, . . . , h_n] ∈ R^{r×n}. In practice, we can choose r to be the approximate rank of the data matrix X_data (e.g., by finding the elbow of the scree plot).</p><p>Figure <ref type="figure">1</ref>: Overall scheme of the proposed method for SMF-H. Now, we state our probabilistic modeling assumption. Fix parameters W ∈ R^{p×r}, h_i ∈ R^r, β ∈ R^{r×κ}, and γ ∈ R^{q×κ}. Let h : R → [0, ∞) be a score function (e.g., h(·) = exp(·) for multinomial logistic regression). We assume y_i is a realization of a random variable whose conditional distribution is specified as [P(y_i = 0 | x_i, x′_i), . . . , P(y_i = κ | x_i, x′_i)] = g(a_i) := C [1, h(a_i,1), . . . , h(a_i,κ)], (1) where C is the normalization constant and a_i = (a_i,1, . . . , a_i,κ) ∈ R^κ is the activation for y_i, defined in two ways depending on whether we use a 'feature-based' or a 'filter-based' SMF model: a_i := β^T h_i + γ^T x′_i (SMF-H, feature-based) or a_i := β^T (W^T x_i) + γ^T x′_i (SMF-W, filter-based). (2)</p><p>One may regard (β, γ) as the 'multinomial regression coefficients' with input feature (h_i, x′_i) or (W^T x_i, x′_i). In (2), we may regard the code h_i (coming from x_i ≈ W h_i) or the 'filtered signal' W^T x_i as the r-dimensional compression of x_i. Note that these two coincide if we have the perfect factorization x_i = W h_i and the factor matrix W is orthonormal, i.e., W^T W = I_r, but we do not necessarily make such an assumption.</p><p>There are some notable differences between SMF-H and SMF-W when predicting the unknown label of a test point. Given a test point (x_test, x′_test), the predictive probabilities for its unknown label y_test are given by (1) with the activation a computed as in (2). This only involves straightforward matrix multiplications for SMF-W, which can also be viewed as forward propagation in a multilayer perceptron <ref type="bibr">[35]</ref> with W acting as the first-layer weight matrix (hence the name 'filter'). However, for SMF-H, one needs to solve an additional optimization problem for testing. Namely, for every single test signal (x_test, x′_test), its correct code representation h_test needs to be learned by solving the 'supervised sparse coding' problem (3) (see <ref type="bibr">[33]</ref>). A more efficient heuristic testing method for SMF-H is to approximately compute h_test by minimizing only the second (reconstruction) term in <ref type="bibr">(3)</ref>.</p><p>In order to estimate the model parameters (W, H, β, γ) from the observed training data (x_i, y_i) for i = 1, . . . , n, we consider the following multi-objective optimization problem: min_{W,H,β,γ} Σ_{i=1}^n ℓ(y_i, a_i) + ξ ∥X_data − WH∥_F², (4) where a_i is as in (2) and ℓ(·,·) is the classification loss measured by the negative log-likelihood: ℓ(y, a) := −log g(a)_y, with g as in (1). (5) In ( <ref type="formula">4</ref>), the tuning parameter ξ controls the trade-off between the two objectives of classification and matrix factorization. The above is a nonconvex problem involving four blocks of parameters that could have additional constraints (e.g., bounded norm). This problem entails many classical models as special cases. When ξ ≫ 1, so that effectively only the second term in ( <ref type="formula">4</ref>) is being minimized with respect to W and H, it becomes the classical matrix factorization problem <ref type="bibr">[31,</ref><ref type="bibr">28,</ref><ref type="bibr">29]</ref>, where one seeks to find a factor matrix W that can best reconstruct the feature vectors X_data via the factorization X_data ≈ WH. In Figure <ref type="figure">3</ref>, we will demonstrate that the best reconstructive factor matrix can be significantly different from the supervised factor matrix learned by SMF and may not be very effective for classification tasks.</p></div>
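To make the model (1)–(2) and the loss (5) concrete, here is a small self-contained sketch (our own illustrative code, not the authors'; we take h(·) = exp(·), so g is a softmax with class 0 fixed as the reference class, and the code h_i for SMF-H is drawn at random purely for illustration):

```python
import numpy as np

def activation(x, x_aux, W, beta, gamma, h_code=None):
    """Activation a in (2): SMF-H (feature-based) uses a code h_i;
    SMF-W (filter-based) uses the filtered signal W^T x_i."""
    compressed = h_code if h_code is not None else W.T @ x
    return beta.T @ compressed + gamma.T @ x_aux

def predict_proba(a):
    """Predictive distribution (1) with h(.) = exp(.): class 0 has
    unnormalized score 1; class j >= 1 has score exp(a_j)."""
    scores = np.concatenate(([1.0], np.exp(a)))
    return scores / scores.sum()

def nll(y, a):
    """Classification loss (5): negative log-likelihood of label y."""
    return -np.log(predict_proba(a)[y])

rng = np.random.default_rng(1)
p, q, r, kappa = 6, 2, 3, 2
W = rng.standard_normal((p, r))
beta = rng.standard_normal((r, kappa))
gamma = rng.standard_normal((q, kappa))
x, x_aux = rng.standard_normal(p), rng.standard_normal(q)

a_w = activation(x, x_aux, W, beta, gamma)                       # SMF-W
a_h = activation(x, x_aux, W, beta, gamma,
                 h_code=rng.standard_normal(r))                  # SMF-H (illustrative code h_i)
probs = predict_proba(a_w)
```

The two activations coincide only when x = W h and W^T W = I_r, as noted above.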
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Related works</head><p>The SMF training problem ( <ref type="formula">4</ref>) is a nonconvex and possibly constrained optimization problem, generally with non-unique minimizers. Since it is difficult to solve exactly, approximate procedures such as block coordinate descent (BCD) (see, e.g., <ref type="bibr">[57]</ref>) are often used. Such methods utilize the fact that the objective function in ( <ref type="formula">4</ref>) is convex in each of the four (matrix) variables, and proceed by iteratively optimizing over one block while fixing the others (see <ref type="bibr">[33,</ref><ref type="bibr">5,</ref><ref type="bibr">26,</ref><ref type="bibr">47]</ref>). However, convergence analyses and statistical estimation bounds for such algorithms are quite limited. Appealing to general convergence results for BCD methods (e.g., <ref type="bibr">[18,</ref><ref type="bibr">58]</ref>), one can at most guarantee asymptotic convergence to stationary points, or polynomial convergence to Nash equilibria, of the objective (4), modulo carefully verifying the assumptions of these general results. We also remark that <ref type="bibr">[30]</ref> provided a rigorous justification of the differentiability of a feature-based SMF model.</p><p>The main finding of our work is that the non-convexity of the SMF problem ( <ref type="formula">4</ref>) is 'benign', in the sense that there exists an algorithm globally convergent to a global optimum at an exponential rate. We use a 'double-lifting' technique that converts the nonconvex SMF problem (4) into a low-rank matrix estimation problem with a convex objective. 
This is reminiscent of the tight relation between low-rank matrix estimation and nonconvex factored estimation problems, which has been actively employed in a body of works in statistics and optimization <ref type="bibr">[3,</ref><ref type="bibr">45,</ref><ref type="bibr">36,</ref><ref type="bibr">64,</ref><ref type="bibr">53,</ref><ref type="bibr">56,</ref><ref type="bibr">42,</ref><ref type="bibr">41,</ref><ref type="bibr">51]</ref>. Our exponentially convergent SMF algorithms are versions of low-rank projected gradient descent, as in ( <ref type="formula">44</ref>), that operate in the double-lifted space.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Sketch of key idea</head><p>Our key idea to solve ( <ref type="formula">4</ref>) is to transform it into a variant of the low-rank matrix estimation problem (6) and then use a Low-rank Projected Gradient Descent (LPGD) algorithm (7). In (6), one seeks to minimize an objective f with respect to a paired matrix parameter Z = [Φ, Γ] within a convex constraint set Θ under an additional rank constraint rank(Φ) ≤ r. In (7), Π_r denotes applying the rank-r projection to the first factor Φ while keeping Γ the same. Problem (6) encompasses a variety of significant problems, such as matrix regression <ref type="bibr">[11,</ref><ref type="bibr">38]</ref> and matrix completion <ref type="bibr">[48,</ref><ref type="bibr">37]</ref>. For these problems, algorithms of type (7) have been studied in <ref type="bibr">[41,</ref><ref type="bibr">56]</ref>.</p><p>To illustrate how the SMF problem (4) transforms into a low-rank matrix estimation (6), we consider a much simpler version of SMF-H. That is, instead of combining matrix factorization with multinomial logistic regression for multi-class classification, we combine it with linear regression; the response variable y in this discussion thus takes values in the whole real line. For additional simplicity, we assume there are no auxiliary features X_aux. We seek to solve the matrix factorization and linear regression problems simultaneously for a data matrix X_data ∈ R^{p×n} and a response vector Y ∈ R^{1×n}: min_{W,H,β} ∥Y − β^T H∥² + ∥X_data − WH∥_F². (8) This is a three-block optimization problem involving three factors W ∈ R^{p×r}, H ∈ R^{r×n}, and β ∈ R^{r×1}, which is nonconvex and computationally challenging to solve exactly. 
Instead, consider reformulating this nonconvex problem as the following matrix factorization problem: min_{W,H,β} ∥ [Y ∥ X_data] − [β^T ∥ W] H ∥_F². (9) Indeed, we now seek to find two decoupled matrices (instead of three): one obtained by stacking β^T and W vertically, and the other being H. A similar matrix-stacking idea was used in <ref type="bibr">[61]</ref> for discriminative K-SVD. Proceeding one step further, another important observation is that (9) is also equivalent to finding a single matrix Φ := [β^T ∥ W] H ∈ R^{(1+p)×n} of rank at most r that minimizes the function f in <ref type="bibr">(9)</ref>, which is convex (specifically, quadratic) in Φ (see Fig. <ref type="figure">1</ref> Training).</p><p>For SMF-W, consider the analogous linear regression model (10), whose right-hand side is obtained by replacing H with W^T X_data in <ref type="bibr">(9)</ref>. Note that the objective function depends only on the product of the two matrices W and [β, H]. Then, we may further lift it as the low-rank matrix estimation problem by seeking a single matrix Φ := [Wβ, WH] ∈ R^{p×(1+n)} of rank at most r that solves <ref type="bibr">(6)</ref> with f being the function in <ref type="bibr">(10)</ref>.</p></div>
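The equivalence between the three-block problem and its lifted form can be checked numerically. The sketch below (our own illustrative code) compares the factored objective ∥Y − β^T H∥² + ∥X_data − WH∥_F² with its stacked counterpart ∥[Y ∥ X_data] − [β^T ∥ W]H∥_F², and verifies that the lifted matrix has rank at most r:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, r = 8, 12, 3
X_data = rng.standard_normal((p, n))   # data matrix
Y = rng.standard_normal((1, n))        # real-valued responses

W = rng.standard_normal((p, r))
H = rng.standard_normal((r, n))
beta = rng.standard_normal((r, 1))

# Three-block objective: regression loss + reconstruction loss.
obj_factored = (np.linalg.norm(Y - beta.T @ H) ** 2
                + np.linalg.norm(X_data - W @ H) ** 2)

# Lifted objective: stack beta^T on top of W, then multiply by H.
Phi = np.vstack([beta.T, W]) @ H       # shape (1 + p, n), rank at most r
obj_lifted = np.linalg.norm(np.vstack([Y, X_data]) - Phi) ** 2
```

The two objectives agree exactly because the squared Frobenius norm decomposes over the stacked row blocks.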
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Algorithm</head><p>Motivated by the observations made above, we rewrite SMF-H in (4) as the rank-constrained problem (11) over the pair [Φ, Γ], where Φ = [A ∥ B] ∈ R^{(κ+p)×n}, and Θ is a convex subset of R^{(κ+p)×n} × R^{q×κ}. We have added an L_2-regularization term for A and Γ with coefficient ν ≥ 0, which will play a crucial role in well-conditioning <ref type="bibr">(11)</ref>.</p><p>For solving <ref type="bibr">(11)</ref>, we propose to use the LPGD algorithm (7): we iterate a gradient descent step, followed by projection onto the convex constraint set Θ of the combined factor [Φ, Γ], and then perform the rank-r projection of the first factor Φ = [A ∥ B] via truncated SVD until convergence. Once we have a solution [Φ⋆, Γ⋆] to <ref type="bibr">(11)</ref>, we can use the SVD of Φ⋆ to obtain a solution to (4). Let Φ⋆ = UΣV^T denote the SVD of Φ⋆. Since rank(Φ⋆) ≤ r, Σ is an r × r diagonal matrix of singular values of Φ⋆, and U ∈ R^{(κ+p)×r} and V ∈ R^{n×r} are semi-orthonormal matrices, that is, U^T U = V^T V = I_r. Then, since Φ⋆ = [(β⋆)^T ∥ W⋆] H⋆, we can take H⋆ = Σ^{1/2} V^T and [(β⋆)^T ∥ W⋆] = UΣ^{1/2}. We summarize this approach of solving (4) for SMF-H in Algorithm 1. Here, SVD_r denotes the rank-r truncated SVD, and the projection operators Π_Θ and Π_r are defined in Subsection 1.1.</p><p>As for SMF-W, we can rewrite (4), with an additional L_2-regularizer for A = Wβ and Γ, as the analogous problem (12), where Φ = [A, B] ∈ R^{p×(κ+n)}, and Θ ⊆ R^{p×(κ+n)} × R^{q×κ} is a convex set. 
Algorithm 1 for SMF-W follows similar reasoning with the reformulation above.</p><p>Using randomized truncated SVD for the efficient low-rank projection in Algorithm 1, the per-iteration complexity is O(pn min(n, p)), while that of the nonconvex algorithm is O((pr + q)n).</p><p>While the LPGD algorithm is in general more expensive per iteration than the nonconvex method, its iteration complexity is only O(log ε⁻¹) thanks to the exponential convergence to the global optimum (to be discussed in Theorem 3.5). To the best of our knowledge, the nonconvex algorithm for SMF does not have any guarantee of converging to a global optimum, and the iteration complexity of the nonconvex SMF method to reach an ε-stationary point is at best O(ε⁻¹) using standard analysis. Hence, for ε small enough, Algorithm 1 achieves an ε-accurate global optimum for SMF with a total computational cost comparable to that of a nonconvex SMF algorithm reaching an ε-stationary point.</p><p>Algorithm 1 Lifted PGD for SMF. Input: X_data ∈ R^{p×n}; X′_aux ∈ R^{q×n} (auxiliary features); Y_label ∈ {0, 1, . . . , κ}^n. Parameters: η &gt; 0 (stepsize); N ∈ N (number of iterations); r ≥ 1 (rank); ν ≥ 0 (L_2-regularization parameter).</p></div>
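As a toy illustration of the LPGD iteration (7), consider the lifted quadratic objective f(Φ) = ∥[Y ∥ X_data] − Φ∥_F² from Section 2.1: each iteration takes a gradient step and then a rank-r projection via truncated SVD. This sketch (our own code) omits the constraint set Θ, the auxiliary features, and the Γ block, so it mirrors Algorithm 1 in spirit only:

```python
import numpy as np

def rank_r_proj(A, r):
    """Rank-r projection via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :r] * s[:r] @ Vt[:r, :]

rng = np.random.default_rng(3)
p, n, r = 10, 15, 2
target = np.vstack([rng.standard_normal((1, n)),    # responses Y
                    rng.standard_normal((p, n))])   # data X_data

Phi = np.zeros((1 + p, n))   # arbitrary initialization
eta = 0.4                    # stepsize
losses = []
for _ in range(50):
    grad = 2.0 * (Phi - target)             # gradient of f at Phi
    Phi = rank_r_proj(Phi - eta * grad, r)  # LPGD step as in (7)
    losses.append(np.linalg.norm(target - Phi) ** 2)

best = rank_r_proj(target, r)  # best rank-r approximation of the target
```

On this quadratic instance the iterates contract geometrically toward the best rank-r approximation of the stacked matrix, matching the exponential convergence behavior that Theorem 3.5 establishes for the full problem.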
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Global convergence guarantee</head><p>We have discussed that one can cast the SMF problem (4) as the 'factored estimation problem' min_{T,S,Γ} f(TS^T, Γ). Note that such problems generally do not have a unique minimizer due to 'rotation invariance': if R is any r × r orthonormal (rotation) matrix (i.e., RR^T = R^T R = I_r), then TS^T = (TR)(SR)^T. Hence, the best one can hope for is to obtain parameters, up to rotation, that globally minimize the objective value. Our main result, Theorem 3.5, establishes that this can be achieved by Algorithm 1 at an exponential rate. First, we introduce the following technical assumptions (3.1-3.3). Assumption 3.1 (Bounded activation). The activation a ∈ R^κ defined in (2) has bounded norm, i.e., ∥a∥ ≤ M for some constant M ∈ (0, ∞). Assumption 3.2 (Bounded eigenvalues of covariance matrix). Denote Ξ = [ξ_1, . . . , ξ_n] ∈ R^{(p+q)×n}, where ξ_i = [x_i ∥ x′_i], so that Ξ = [X_data ∥ X_aux] with X_aux = [x′_1, . . . , x′_n]. Then there exist constants λ⁻, λ⁺ &gt; 0 such that for all n ≥ 1, the eigenvalues of (1/n) ΞΞ^T lie in [λ⁻, λ⁺]. (Assumption 3.3, whose precise statement involves the observed information matrix, is given in the Appendix.)</p><p>Assumption 3.1 requires that the norm of the activation a, the input to the classification model in ( <ref type="formula">4</ref>), be bounded. This is standard in the literature (see, e.g., <ref type="bibr">[36,</ref><ref type="bibr">60,</ref><ref type="bibr">23]</ref>) in order to uniformly bound the eigenvalues of the Hessian of the (multinomial) logistic regression model. Assumption 3.2 introduces uniform bounds on the eigenvalues of the covariance matrix. Assumption 3.3 introduces uniform bounds on the eigenvalues of the κ × κ observed information as well as on the first derivative of the predictive probability distribution (see <ref type="bibr">[10]</ref> and Sec. D in the Appendix for more details). Under Assumption 3.1 and the multinomial logistic regression model h(·) = exp(·), one can derive Assumption 3.3 with a simple expression for the bounds δ±, as discussed in the following remark. 
Remark 3.4 (Multinomial Logistic Classifier). Let ℓ denote the negative log-likelihood function in <ref type="bibr">(5)</ref>, where we take the multinomial logistic model with the score function h(·) = exp(·). Denote (ℓ̇_1, . . . , ℓ̇_κ) := ∇_a ℓ(y, a) and ℓ̈(y, a) := ∇_a ∇_{a^T} ℓ(y, a). Then, in this special case, we have ℓ̇_j(y, a) = g_j(a) − 1(y = j) and ℓ̈(y, a)_{i,j} = g_i(a) (1(i = j) − g_j(a)) (see ( <ref type="formula">28</ref>) and ( <ref type="formula">30</ref>) in the Appendix).</p></div>
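The formulas in Remark 3.4 can be sanity-checked against finite differences. A short sketch (our own code; labels y ∈ {0, . . . , κ}, with class 0 as the reference class):

```python
import numpy as np

def g(a):
    """Predictive probabilities (1) with h(.) = exp(.)."""
    scores = np.concatenate(([1.0], np.exp(a)))
    return scores / scores.sum()

def loss(y, a):
    """Negative log-likelihood (5)."""
    return -np.log(g(a)[y])

def grad(y, a):
    """Remark 3.4: (d/da_j) loss = g_j(a) - 1(y = j), for j = 1..kappa."""
    return g(a)[1:] - (np.arange(1, len(a) + 1) == y)

def hessian(a):
    """Remark 3.4: entry (i, j) is g_i(a) (1(i = j) - g_j(a)); y-independent."""
    gj = g(a)[1:]
    return np.diag(gj) - np.outer(gj, gj)

rng = np.random.default_rng(4)
kappa, y = 3, 2
a = rng.standard_normal(kappa)

eps = 1e-6  # central finite-difference check of the gradient
fd_grad = np.array([(loss(y, a + eps * e) - loss(y, a - eps * e)) / (2 * eps)
                    for e in np.eye(kappa)])
```

The Hessian diag(g) − g g^T is symmetric positive semidefinite, which is what allows the uniform eigenvalue bounds δ± under Assumption 3.1.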
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Now define the following quantities:</head><p>Now, we state a special case of our first main result, specifically when the model is 'correctly specified', allowing the rank-r SMF model to effectively account for the observed data. This implies the existence of a 'low-rank stationary point' of F, as also demonstrated in <ref type="bibr">[56]</ref>. We handle the general case in the Appendix (see Theorem D.1). Theorem 3.5 (Exponential convergence). Let Z_t := [Φ_t, Γ_t] denote the iterates of Algorithm 1. Assume 3.1-3.3 hold. Let μ and L be as in <ref type="bibr">(15)</ref>, fix η ∈ (1/(2μ), 3/(2L)), and let ρ := 2(1 − ημ) ∈ (0, 1). Suppose L/μ &lt; 3. Then Z_t converges exponentially fast, with contraction coefficient ρ, to a global minimizer; see the Appendix for the full quantitative statement.</p><p>Note that we may view the ratio L/μ appearing in Theorem 3.5 as the condition number of the SMF problem in (4), whereas the ratio L̄/μ̄, for μ̄ := λ⁻δ⁻ and L̄ := λ⁺δ⁺, is the condition number of the multinomial classification problem. These two condition numbers are closely related. First, note that for any given μ̄, L̄, and sample size n, we can always make L/μ &lt; 3 by choosing sufficiently large ξ and ν, so that Theorem 3.5 holds. However, using a large L_2-regularization parameter ν may perturb the original objective in (4) so much that the converged solution may not be close to the optimal solution. Hence, we may want to take ν as small as possible. Setting ν = 0 leads to the condition (16), which requires the ratio ξ/n to lie in a certain interval. For SMF-W, if the multinomial classification problem is well-conditioned (L̄/μ̄ &lt; 3) and the ratio ξ/n is in this interval, then SMF-W enjoys exponential convergence in Theorem 3.5. 
However, the condition in ( <ref type="formula">16</ref>) is violated for SMF-H, so L_2-regularization (ν &gt; 0) is necessary to guarantee exponential convergence of SMF-H.</p><p>The proof of Theorem 3.5 involves two steps: (1) we establish a general exponential convergence result for the general LPGD algorithm <ref type="bibr">(7)</ref> in Theorem C.2 in the Appendix; (2) we compute the Hessian eigenvalues of the SMF objectives ( <ref type="formula">11</ref>)-( <ref type="formula">12</ref>) and apply that result to obtain Theorem 3.5. The proof contains two challenges. First, the low-rank projection in ( <ref type="formula">7</ref>) is not non-expansive in general. To overcome this, we show that the iterates closely approximate certain 'auxiliary iterates' that exhibit exponential convergence towards the global optimum. Second, the second-order analysis is highly non-trivial, since the SMF problem (4) has a total of four unknown matrix factors that are intertwined through the joint multi-class classification and matrix factorization tasks. See Appendix D for the details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Statistical estimation guarantee</head><p>In this section, we formulate a generative model for SMF (4) and state a statistical parameter estimation guarantee. Fix dimensions p ≫ q, let n ≥ 1 be a (possibly growing) sample size, and fix unknown true parameters B⋆ ∈ R^{p×n}, C⋆ ∈ R^{q×n}, and Γ⋆ ∈ R^{q×κ}. In addition, fix A⋆ ∈ R^{κ×n} for SMF-H and A⋆ ∈ R^{p×κ} for SMF-W. Now suppose that the class labels, data, and auxiliary features are drawn i.i.d. according to the joint distribution in (17), where each ε_i (resp., ε′_i) is a p × 1 (resp., q × 1) vector of i.i.d. mean-zero Gaussian entries with variance σ² (resp., (σ′)²). We call the above the generative SMF model. In what follows, we assume that the noise levels σ and σ′ are known and focus on estimating the four parameter matrices.</p><p>The (L_2-regularized) negative log-likelihood of observing the triples (y_i, x_i, x′_i) for i = 1, . . . , n is given, up to an additive constant c, by F as in <ref type="bibr">(11)</ref> or ( <ref type="formula">12</ref>), depending on the activation type, with tuning parameter ξ = (2σ²)⁻¹.<ref type="foot">foot_0</ref> The L_2-regularizer in F can be understood as a Gaussian prior on the parameters, interpreting the right-hand side above as the negative logarithm of the posterior distribution (up to a constant) in a Bayesian framework. Note that the problems of estimating A and B are coupled due to the low-rank model assumption in <ref type="bibr">(17)</ref>, while the problem of estimating C is standard and separable, so it is not of our interest. The joint estimation problem for [A, B, Γ] is equivalent to the corresponding SMF problem (4) with tuning parameter ξ = (2σ²)⁻¹. This and Theorem 3.5 motivate us to estimate the true parameters A⋆, B⋆, and Γ⋆ by the output of Algorithm 1 with ξ = (2σ²)⁻¹ run for O(log n) iterations. Now we give the second main result. Roughly speaking, it states that the estimated parameter Z_t is within O(log n/√n) of the true parameter Z⋆ = [A⋆, B⋆, Γ⋆] with high probability, provided that the noise variance σ² is small enough and the SMF objective ( <ref type="formula">11</ref>)-( <ref type="formula">12</ref>) is well-conditioned. The precise statement, Theorem 4.1, gives a bound that holds with probability at least 1 − 1/n for all t ≥ 1 and n ≥ 1.</p><p>We remark that Theorem 4.1 implies that SMF-H is statistically more robust than SMF-W. Namely, in order to have an arbitrarily accurate estimate with high probability, one needs μ ≫ √n log n. Combining this with the expression in <ref type="bibr">(15)</ref> and the well-conditioning assumption L/μ &lt; 3, one needs to require ξ = Ω(n), hence a small noise variance σ² = O(1/n), for SMF-W. For SMF-H, however, this is guaranteed whenever σ² = o(1/(√n log n)) and ξ ≈ μ.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Simulation and Numerical Validation</head><p>Figure 2: (a) semi-synthetic MNIST data; (b) job postings data; (c) LPGD vs. BCD on MNIST and job postings for SMF-H and SMF-W. We numerically verify Theorem 3.5 on a semi-synthetic dataset generated from the MNIST image dataset <ref type="bibr">[24]</ref> (p = 28² = 784, q = 0, n = 500, κ = 1) and a text dataset named 'Real / Fake Job Posting Prediction' [1] (p = 2840, q = 72, n = 17880, κ = 1). Details about these datasets are in Sec. G in the Appendix. We used Algorithm 1 with rank r = 2 for MNIST and r = 20 for the job postings dataset. For all experiments, ν = 2 and stepsize η = 0.01 were used.</p><p>We validate the theoretical exponential convergence results of our LPGD algorithm (Algorithm 1) in Figure <ref type="figure">2</ref>. Note that the convexity and smoothness parameters μ and L in Theorem 3.5 are difficult to compute exactly; in practice, cross-validation of hyperparameters is usually employed. 
For ξ ∈ {0.1, 1, 5, 10, 20} in Figure <ref type="figure">2</ref>, we indeed observe exponential decay of the training loss, as dictated by our theoretical results for Algorithm 1. We also observe that the exponential rate of decay in the training loss increases as ξ increases. According to Theorem 3.5, the contraction coefficient is ρ = 2(1 − ημ), which decreases in ξ since μ increases in ξ (see <ref type="bibr">(15)</ref>). The decay for large ξ ∈ {10, 20} appears even superexponential. Furthermore, Figure 2c shows that our LPGD algorithm converges significantly faster than BCD for training both the SMF-H and SMF-W models at ξ ∈ {5, 10} (other values of ξ omitted).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Application: Microarray Analysis for Cancer Classification</head><p>We apply the proposed methods to two datasets from the Curated Microarray Database (CuMiDa) <ref type="bibr">[14]</ref>. CuMiDa provides well-preprocessed microarray data for various cancer types, suitable for a range of machine-learning approaches. One dataset consists of 54,676 gene expressions from 51 subjects with binary labels indicating pancreatic cancer; the other consists of 35,982 gene expressions from 289 subjects with binary labels indicating breast cancer. The primary purpose of the analysis is to classify cancer patients solely based on their gene expression. We compare the accuracies of the proposed methods (SMF-W and SMF-H with a binary logistic classifier, trained using Algorithm 1) against the following benchmarks: SMF-W and SMF-H trained using BCD; a one-dimensional seven-layer Convolutional Neural Network (CNN); a three-layer Feed-Forward Neural Network (FFNN); Naive Bayes (NB); Support Vector Machine (SVM); Random Forest (RF); and Logistic Regression with Matrix Factorization by truncated SVD (MF-LR). For the last benchmark, MF-LR, we use the rank-r SVD to factorize X_train ≈ UΣV^T and take W = UΣ and H = V^T. For testing, we use W^T X_test as the input to logistic regression for both the filter and feature methods.</p><p>We normalize gene expression for stable matrix factorization and interpretability of the regression coefficients. We split each dataset into a 50% training set and a 50% test set and repeat the comparison procedure 5 times. A scree plot is used to determine the rank r. Other parameters are chosen through 5-fold cross-validation (ξ ∈ {0.1, 1, 10} and the other tuning parameter in {0.1, 1, 10}), and the algorithms are run for 1,000 iterations or until convergence. 
As can be seen in the table in Figure <ref type="figure">3a</ref>, the proposed methods show the best performance for both types of cancer.</p><p>An important advantage of SMF methods is that they provide interpretable results in the form of 'supervised factors'. Each supervised factor consists of a latent factor and the associated regression coefficient. That is, once we train the SMF model (for κ = 1) and learn the factor matrix W = [w_1, . . . , w_r] ∈ R^{p×r} and the vector of regression coefficients β = [β_1, . . . , β_r] ∈ R^{1×r}, each column w_j of W describes a latent factor, and the corresponding regression coefficient β_j tells us how w_j is associated with the class labels. The pairs (w_j, β_j), which form supervised latent factors, provide insight into how the trained SMF model perceives the classification task. See Fig. <ref type="figure">1</ref> for an illustration.</p><p>In the context of microarray analysis for cancer research, each w_j corresponds to a weighted group of genes (which we call a 'principal gene group') and β_j represents the strength of its association with cancer. SMF learns supervised gene groups (Fig. <ref type="figure">3a</ref>, <ref type="figure">c</ref>) with significantly higher classification accuracy than the unsupervised gene groups (Fig. <ref type="figure">3b</ref>, <ref type="figure">d</ref>). In Fig. <ref type="figure">3a</ref>, <ref type="figure">c</ref>, both gene groups (consisting of p genes) have positive regression coefficients, so they are positively associated with the log-odds of the predictive probability of the corresponding cancer. Remarkably, among the five most extreme coefficients in these groups, our method detected the well-known breast-cancer oncogene BRCA1 as well as various other genes (Fig. <ref type="figure">3e</ref>) that are known to be prognostic markers of breast/pancreatic cancer (see the Human Protein Atlas <ref type="bibr">[49]</ref>). 
The high classification accuracy suggests that the identified supervised principal gene groups may be associated with the occurrence of breast/pancreatic cancer.</p></div>
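The MF-LR baseline above is simple enough to sketch end-to-end. The snippet below is our own minimal reimplementation under stated assumptions, not the authors' code: a synthetic stand-in replaces the CuMiDa expression matrices, a hand-rolled gradient-descent logistic regression replaces whatever LR solver was used, and the names `truncated_svd_factors` and `fit_logistic` are hypothetical. It shows the pipeline described in Section 6: factorize X_train ≈ UΣV^T, set W = UΣ and H = V^T, feed W^T X to a logistic classifier, and read off the factor/coefficient pairs (w_j, β_j).

```python
import numpy as np

def truncated_svd_factors(X, r):
    """Rank-r factorization X ~= W @ H with W = U*Sigma, H = V^T (MF-LR style)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]

def fit_logistic(Z, y, lr=0.1, iters=2000):
    """Plain gradient-descent logistic regression on features Z of shape (n, r)."""
    beta, b = np.zeros(Z.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta + b)))  # predicted probabilities
        beta -= lr * Z.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return beta, b

# Hypothetical stand-in for a genes-by-subjects expression matrix:
# low-rank signal plus noise, with labels driven by the first latent factor.
rng = np.random.default_rng(0)
p_genes, n_subj, r = 200, 120, 2
W_true = rng.standard_normal((p_genes, r))
H_true = rng.standard_normal((r, n_subj))
X = W_true @ H_true + 0.1 * rng.standard_normal((p_genes, n_subj))
y = (H_true[0] > 0).astype(float)

W, H = truncated_svd_factors(X, r)          # W = U*Sigma, H = V^T
Z = (W.T @ X).T                             # use W^T X as classifier inputs
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)    # standardize features for stable GD
beta, b = fit_logistic(Z, y)
acc = np.mean(((Z @ beta + b) > 0).astype(float) == y)
print("training accuracy:", acc)
# Each pair (W[:, j], beta[j]) is a factor plus its regression weight, the
# read-out used for the 'principal gene group' interpretation in the text.
```

For SMF-W/SMF-H, W and β are learned jointly rather than sequentially; here the pairing merely illustrates how a factor and its coefficient are read together.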
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion and Limitations</head><p>We propose an exponentially convergent algorithm for nonconvex SMF training based on a new lifting technique. Our analysis establishes strong convergence and estimation guarantees. Comparing the robustness of filter-based and feature-based SMF, we find that the former is computationally more robust while the latter is statistically more robust. The algorithm's exponential convergence is verified numerically. In cancer classification using microarray data, our algorithm successfully identifies discriminative gene groups for various cancers and shows potential for identifying important gene groups, such as protein complexes or pathways, in biomedical research. Our analysis framework could be extended to more complex classification models, for example by combining a feed-forward deep neural network with a matrix factorization objective. Our convergence analysis holds only in certain parameter regimes, which we discuss in detail. We have tested our method and convergence analysis on various real-world datasets, but recommend further numerical verification on a wider range of datasets.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We provide our implementation of Algorithm 1 in our code repository https://github.com/ljw9510/SMF/tree/main.</p></note>
		</body>
		</text>
</TEI>
