<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Global Geometric Analysis of Maximal Coding Rate Reduction</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of Machine Learning Research</publisher>
				<date>06/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10517691</idno>
					<idno type="doi"></idno>
					<title level='j'>International Conference on Machine Learning (ICML)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Peng Wang</author><author>Huikang Liu</author><author>Druv Pai</author><author>Yaodong Yu</author><author>Zhihui Zhu</author><author>Qing Qu</author><author>Yi Ma</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The maximal coding rate reduction (MCR²) objective for learning structured and compact deep representations is drawing increasing attention, especially after its recent usage in the derivation of fully explainable and highly effective deep network architectures. However, it lacks a complete theoretical justification: only the properties of its global optima are known, and its global landscape has not been studied. In this work, we give a complete characterization of the properties of all its local and global optima, as well as other types of critical points. Specifically, we show that each (local or global) maximizer of the MCR² problem corresponds to a low-dimensional, discriminative, and diverse representation, and furthermore, each critical point of the objective is either a local maximizer or a strict saddle point. Such a favorable landscape makes MCR² a natural choice of objective for learning diverse and discriminative representations via first-order optimization methods. To validate our theoretical findings, we conduct extensive experiments on both synthetic and real data sets.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>In the past decade, deep learning has exhibited remarkable empirical success across a wide range of engineering and scientific applications <ref type="bibr">(LeCun et al., 2015)</ref>, such as computer vision <ref type="bibr">(He et al., 2016;</ref><ref type="bibr">Simonyan &amp; Zisserman, 2014)</ref>, natural language processing <ref type="bibr">(Sutskever et al., 2014;</ref><ref type="bibr">Vaswani et al., 2017)</ref>, and health care <ref type="bibr">(Esteva et al., 2019)</ref>, to name a few. As argued by <ref type="bibr">Bengio et al. (2013)</ref>; <ref type="bibr">Ma et al. (2022)</ref>, one major factor contributing to the success of deep learning is the ability of deep networks to perform powerful nonlinear feature learning by converting the data distribution to a compact and structured representation. This representation greatly facilitates various downstream tasks, including classification <ref type="bibr">(Dosovitskiy et al., 2020)</ref>, segmentation <ref type="bibr">(Kirillov et al., 2023)</ref>, and generation <ref type="bibr">(Saharia et al., 2022)</ref>.</p><p>Based on the theory of data compression and optimal coding <ref type="bibr">(Ma et al., 2007)</ref>, <ref type="bibr">Chan et al. (2022)</ref>; <ref type="bibr">Yu et al. (2020)</ref> proposed a principled and unified framework for deep learning to learn a compact and structured representation. Specifically, they proposed to maximize the difference between the coding rate of all features and the sum of the coding rates of the features in each class, which is referred to as maximal coding rate reduction (MCR²). This problem is presented in Problem (4) and visualized in Figure <ref type="figure">1</ref>(a). Here, the coding rate measures the "compactness" of the features, which is interpreted as the volume of a particular set spanned by the learned features: a lower coding rate implies a more compact feature set.<ref type="foot">foot_5</ref> Consequently, the MCR² objective aims to maximize the volume of the set of all features while minimizing the volumes of the sets of features from each class. Motivated by the structural similarities between deep networks and unrolled optimization schemes for sparse coding <ref type="bibr">(Gregor &amp; LeCun, 2010;</ref><ref type="bibr">Monga et al., 2021)</ref>, <ref type="bibr">Chan et al. (2022)</ref> constructed a new deep network based on an iterative gradient descent scheme to maximize the MCR² objective.<ref type="foot">foot_6</ref> Notably, each component of this deep network has precise optimization and geometric interpretations. Moreover, it has achieved strong empirical performance on various vision and language tasks <ref type="bibr">(Chu et al., 2023;</ref><ref type="bibr">Yu et al., 2023a)</ref>.</p><p>Although the MCR²-based approach to deep learning is conceptually "white-box" and has achieved remarkable empirical performance, its theoretical foundations have been relatively under-explored. In fact, the effective feature learning mechanism and "white-box" network architecture design based on MCR² are direct consequences of these foundations, and understanding them will pave the way to improving the model interpretability and training efficiency of deep networks. Nevertheless, a comprehensive theoretical understanding of the MCR² problem remains lacking. In this work, we take a step towards filling this gap by studying its optimization properties. Notably, analyzing these properties of the MCR² objective, including its local optimality and global landscape, is extremely challenging. To be precise, its objective function (see Problem (<ref type="formula">4</ref>)) is highly non-concave<ref type="foot">foot_7</ref> and complicated, as it involves quadratic functions and the difference between log-determinant functions. To the best of our knowledge, characterizing the local optimality and global optimization landscape of the MCR² problem remains an open question.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Our Contributions</head><p>In this work, we study the optimization foundations of the MCR 2 -based approach to deep learning. Towards this goal, we characterize the local and global optimality of the regularized MCR 2 problem and analyze its global optimization landscape (see Problem ( <ref type="formula">5</ref>)). Our contributions can be highlighted as follows.</p><p>Characterizing the local and global optimality. For the regularized MCR 2 problem, we derive the closed-form expressions for its local and global optima for the first time.</p><p>Our characterization shows that each local maximizer of the regularized MCR 2 problem is within-class compressible and between-class discriminative in the sense that features from the same class belong to a low-dimensional subspace, while features from different classes belong to different orthogonal subspaces. Besides these favorable properties, each global maximizer corresponds to a maximally diverse representation, which attains the highest possible dimension in the space.</p><p>Studying the global optimization landscape. Next, we show that the regularized MCR 2 function possesses a benign global optimization landscape, despite its complicated structures. More precisely, each critical point is either a local maximizer or strict saddle point of the regularized MCR 2 problem; see Figure <ref type="figure">1</ref>(b). Consequently, any gradient-based optimization, such as (stochastic) gradient descent, with random initialization can escape saddle points and at least converge to a local maximizer efficiently.</p><p>Finally, we conduct extensive numerical experiments on synthetic data sets to validate our theoretical results. Moreover, we use the regularized MCR 2 objective to train deep networks on real data sets. These experimental results constitute an application of the rigorously derived MCR 2 theory to more realistic and complex deep learning problems.</p><p>Our results not only establish optimization foundations for the MCR 2 problem but also yield some important implications for the MCR 2 -based approach to deep learning. Namely, our theoretical characterizations of local and global optimality offer a compelling explanation for the empirical observations that both deep networks constructed via gradient descent applied to the MCR 2 objective and overparameterized deep networks trained by optimizing the MCR 2 objective learn low-dimensional, discriminative, and diverse representations. These results align with the motivations of <ref type="bibr">Chan et al. (2022)</ref>; <ref type="bibr">Yu et al. (2020)</ref> for employing the MCR 2 principle for deep learning, and elucidate the outstanding performance of MCR 2 -based neural networks across a wide range of vision and language tasks <ref type="bibr">(Chu et al., 2023;</ref><ref type="bibr">Yu et al., 2024)</ref>. Moreover, our results underscore the potential of MCR 2 -based approaches to serve as a cornerstone for future advancements in deep learning, offering a principled approach to pursuing structured and compact representations in practical applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Related Work</head><p>Low-dimensional structures in deep representation learning. In the literature, it has long been believed that the role of deep networks is to learn certain (nonlinear) low-dimensional and informative representations of the data <ref type="bibr">(Hinton &amp; Salakhutdinov, 2006;</ref><ref type="bibr">Ma et al., 2022)</ref>. For example, <ref type="bibr">Papyan et al. (2020)</ref> showed that the features learned by cross-entropy (CE) loss exhibit a neural collapse phenomenon during the terminal phase of training, where the features from the same class are mapped to a vector while the features from different classes are maximally linearly separable. The MCR 2 -based approach to deep learning. The MCR 2 -based approach to deep learning for seeking structured and compact representations was first proposed by <ref type="bibr">Yu et al. (2020)</ref>. Notably, they provided a global optimality analysis of the MCR 2 problem (4) with additional rank constraints on the feature matrix of each class. <ref type="bibr">Chan et al. (2022)</ref> designed a new multi-layer deep network architecture, named ReduNet, based on an iterative gradient descent scheme for maximizing the MCR 2 objective. To learn selfconsistent representations, <ref type="bibr">Dai et al. (2022)</ref> extended this approach to the closed-loop transcription (CTRL) framework, which is formulated as a max-min game to optimize a modified MCR 2 objective. This game was shown to have global equilibria corresponding to compact and structured representations <ref type="bibr">(Pai et al., 2022)</ref>. Recently, <ref type="bibr">Yu et al. (2023b)</ref> showed that a transformer-like architecture named CRATE, which obtains strong empirical performance <ref type="bibr">(Chu et al., 2023;</ref><ref type="bibr">Yu et al., 2023a;</ref><ref type="bibr">2024)</ref>, can be naturally derived through an iterative optimization scheme for maximizing the sparse rate reduction objective, which is an adaptation to sequence data of the MCR 2 objective studied in this work.</p><p>Notation. Given a matrix A &#8712; R m&#215;n , we use &#8741;A&#8741; to denote its spectral norm, &#8741;A&#8741; F to denote its Frobenius norm, and a ij its (i, j)-th element. Given a vector a &#8712; R d , we use &#8741;a&#8741; to denote its &#8467; 2 -norm, a i its i-th element, and diag(a) the diagonal matrix with a on its diagonal. Given a positive integer n, we denote by [n] the set {1, . . . , n}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Given a set of integers {n</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Problem Setup</head><p>In this section, we first review the basic concepts of MCR 2 for deep representation learning in Section 2.1, and then introduce our studied problem in Section 2.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">An Overview of MCR 2</head><p>In deep representation learning, given data {x i } m i=1 &#8838; R n from multiple (say K) classes, the goal is to learn neuralnetwork representations of these samples that facilitate downstream tasks. Recent empirical studies have shown that good features can be learned for tasks such as classification or autoencoding by using heuristics to promote either the contraction of samples in the same class <ref type="bibr">(Rifai et al., 2011)</ref> or the contrast of samples between different classes <ref type="bibr">(He et al., 2020;</ref><ref type="bibr">Oord et al., 2018)</ref> during the training of neural networks. Notably, <ref type="bibr">Chan et al. (2022)</ref>; <ref type="bibr">Yu et al. (2020)</ref> unified and formalized these practices and demonstrated that the MCR 2 objective is an effective objective to learn within-class compressible and between-class discriminative representations of the data.</p><p>The formulation of MCR 2 . In this work, we mainly consider an MCR 2 objective for supervised learning problems. Specifically, let For each k &#8712; [K], let Z k &#8712; R d&#215;m k be the matrix whose columns are the features in the k-th class. Without loss of generality, we reorder the samples in a class-by-class manner, so that we can write the matrix of all features as</p><p>(1)</p><p>On one hand, to make features between different classes discriminative or contrastive, one can maximize the lossy coding rate of all features in Z, as argued in <ref type="bibr">(Chan et al., 2022;</ref><ref type="bibr">Yu et al., 2020)</ref>, as follows:</p><p>where &#1013; &gt; 0 is a prescribed quantization error. <ref type="foot">4</ref> On the other hand, to make features from the same class compressible or contractive, one can minimize the average lossy coding rate of features in the k-th class as follows:</p><p>Consequently, a good representation tends to maximize the difference between the coding rate for the whole and that for each class as follows:</p><p>This is referred to as the principle of maximal coding rate reduction in <ref type="bibr">(Chan et al., 2022;</ref><ref type="bibr">Yu et al., 2020)</ref>. It is worth mentioning that this principle can be extended to selfsupervised and even unsupervised learning settings, where we learn the label vectors {&#960; k } K k=1 during training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">The Regularized MCR 2 Problem</head><p>Due to the Frobenius norm constraints, it is a tremendously difficult task to analyze Problem (4) from an optimizationtheoretic perspective, as all the analysis would occur on a product of spheres instead of on Euclidean space. Therefore, we consider the Lagrangian formulation of (4). This can be viewed as a tight relaxation or even an equivalent problem of (4) whose optimal solutions agree under specific settings of the regularization parameter; see Proposition 3.2. Specifically, the formulation we study, referred to henceforth as the regularized MCR 2 problem, is as follows:</p><p>where &#955; &gt; 0 is the regularization parameter. Remark that our study on this problem applies meaningfully to at least two approaches to learning deep representations using the MCR 2 principle.</p><p>Applications of our formulation to deep representation learning via unrolled optimization. The first approach, as argued by <ref type="bibr">Chan et al. (2022)</ref>; <ref type="bibr">Yu et al. (2023a)</ref>, is to construct a new deep network architecture, i.e., ReduNet <ref type="bibr">(Chan et al., 2022)</ref> or CRATE <ref type="bibr">(Yu et al., 2023a)</ref>, based on an iterative gradient descent scheme to optimize the MCR 2 -type objective. In this approach, each layer of the constructed network approximates a gradient descent step to optimize the MCR 2 -type objective given the input representation. The key takeaway is that these networks approximately implement gradient descent directly on the representations, so our analysis of the optimization properties of the MCR 2 -type objective translates to explanations of the corresponding properties of the learned representations and architectures of these deep networks. In particular, our argument that the optima and optimization landscape of (5) are favorable directly translates to a justification of the correctness of learned representations of the ReduNet and a validation of its architecture design. Moreover, this study enables principled improvement of deep network architectures constructed via unrolled optimization by leveraging more advanced optimization techniques better suited for problems with benign landscapes. This can improve model interpretability and efficiency.</p><p>Applications of our formulation to deep representation learning with standard neural networks. In the second approach, one parameterizes the feature mapping f &#920; (&#8226;) via standard deep neural networks such as a multi-layer perceptron or ResNet <ref type="bibr">(He et al., 2016)</ref>, and treats the MCR 2 -type objective like other loss functions applied to outputs of a neural network, such as mean-squared error or cross-entropy loss. Studying Problem (5) from this perspective would require us to optimize over &#920; instead of over Z. This new optimization problem would be extraordinarily difficult to analyze, because modern neural networks have nonlinear interactions across many layers, so the parameters &#920; would affect the final representation Z in a complex way. 
Fortunately, since modern neural networks are often highly over-parameterized, they can interpolate or approximate any continuous function in the feature space <ref type="bibr">(Lu et al., 2017)</ref>, so we may omit these constraints by assuming the unconstrained feature model, where the features z_i for all i ∈ [m] are treated as free optimization variables <ref type="bibr">(Mixon et al., 2020;</ref><ref type="bibr">Yaras et al., 2022;</ref><ref type="bibr">Zhu et al., 2021;</ref><ref type="bibr">Wang et al., 2022a)</ref>. Consequently, studying the optimization properties of Problem (5) provides valuable insights into the structures of learned representations and the efficiency of training deep networks using MCR²-type objectives.</p><p>Difficulties of analyzing Problem (5). Although Problem (5) has no constraints, one can observe that it is highly non-concave due to the quadratic forms Z_kZ_kᵀ and the difference of log-determinant functions. Notably, this problem shares similarities with low-rank matrix factorization problems. However, it employs the log-determinant function instead of the Frobenius norm, and the computation of the objective gradient involves matrix inverses. Therefore, from an optimization point of view, it is extremely challenging to analyze Problem (5).</p></div>
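<div xmlns="http://www.tei-c.org/ns/1.0"><p>Under the unconstrained feature model, the feature matrix Z itself is the optimization variable, so Problem (5) can be optimized by plain gradient ascent on Z. The following PyTorch sketch is a minimal illustration under that model (the step size and the values of ϵ and λ are our own illustrative choices):</p><p><![CDATA[
import torch

def coding_rate(Z, eps):
    d, m = Z.shape
    alpha = d / (m * eps ** 2)
    return 0.5 * torch.logdet(torch.eye(d) + alpha * Z @ Z.T)

def regularized_mcr2(Z, sizes, eps, lam):
    """F(Z) = R(Z) - sum_k (m_k/m) R(Z_k) - (lam/2) ||Z||_F^2; see (5)."""
    m = Z.shape[1]
    Rc, start = 0.0, 0
    for mk in sizes:
        Rc = Rc + (mk / m) * coding_rate(Z[:, start:start + mk], eps)
        start += mk
    return coding_rate(Z, eps) - Rc - 0.5 * lam * Z.pow(2).sum()

d, sizes, eps, lam = 20, [10, 10, 10], 0.5, 0.01
Z = torch.randn(d, sum(sizes), requires_grad=True)  # free features, random init
opt = torch.optim.SGD([Z], lr=0.1)
for _ in range(1000):
    opt.zero_grad()
    (-regularized_mcr2(Z, sizes, eps, lam)).backward()  # ascent = minimize -F
    opt.step()
]]></p></div>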
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Main Results</head><p>In this section, we first characterize the local and global optimal solutions of Problem (5) in Section 3.1, and then analyze the global landscape of the objective function in Section 3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Characterization of Local and Global Optimality</head><p>Although Problem (5) is highly non-concave and involves matrix inverses in its gradient computation, we can still explicitly characterize its local and global optima as follows. </p><p>where (a) </p><p>We defer the proof to Section 4.1 and Appendix D.1. In this theorem, we explicitly characterize the local and global optima of Problem (5). Intuitively, this demonstrates that the features represented by each local maximizer of Problem ( <ref type="formula">5</ref>) are low-dimensional and discriminative in the sense that (i) Within-class compressible: According to (7), at each local maximizer, the features from the same class belong to the same low-dimensional linear subspace. (ii) Between-class discriminative: It follows from ( <ref type="formula">7</ref>) and U T k U l = 0 for all k &#824; = l that, at each local maximizer, the features from different classes belong to different subspaces that are orthogonal to each other. Moreover, the features represented by each global maximizer of Problem ( <ref type="formula">5</ref>) are not only low-dimensional and discriminative but also diverse in the sense that (iii) Maximally Diverse Representation: According to K k=1 r k = min{m, d}, at each global maximizer, the total dimension of all features is maximized to match the highest dimension that it can achieve in the feature space.</p><p>Quality of local versus global optima. Our above discussion explains the merits of achieving both local and global optima. At each maximizer, the representations are within-class compressible and between-class discriminative (Theorem 3.1 (i)). Moreover, global maximizers further satisfy that the representations are all maximally diverse (Theorem 3.1 (ii) (a)). If all classes were balanced, i.e., m 1 = &#8226; &#8226; &#8226; = m K , then Theorem 3.1 (ii) (b) would not apply, and these properties would be all that Theorem 3.1 asserts. In this case, global optima would clearly be desired over local optima. However, in the unbalanced case, the situation is more complex, because Theorem 3.1 (ii) (b) would apply. It says that for global optima, the classes with the smallest numbers of samples would fill to the largest dimension possible, and the very largest classes could collapse to 0, an undesirable situation. A dramatic example of this is when m 1 &gt; &#8226; &#8226; &#8226; &gt; m K &gt; d, for then any global optimum would have rank(Z K ) = d and Z 1 , . . . , Z K-1 all collapse to 0. Overall, in the unbalanced case, global optima may not always correspond to the best representations. In particular, local optima with more equitable rank distributions (like bigger classes span more dimensions) which are still maximally diverse (i.e., ranks of each class sum to the dimension d) could be preferred in applications. As demonstrated in Section 5.1, these kinds of potentially useful local optima are realized in experiments, even with unbalanced classes.</p><p>Relation between Problems (4) and (5). Based on the characterization of global optimality in Theorem 3.1, we show the following proposition that establishes the relationship between the constrained MCR 2 problem (4) and the regularized MCR 2 problem (5) in terms of their global solutions under an appropriate choice of the regularization parameter. The proof of this result can be found in Appendix D.2. Proposition 3.2. 
Suppose that the number of training samples in each class is the same, i.e., m_1 = ⋯ = m_K, and the coding precision ϵ > 0 satisfies</p><p>The following statements hold: (i) If m < d and the regularization parameter in Problem (5) is set as</p><p>Problems (4) and (5) have the same global solution set.</p><p>(ii) If m ≥ d, d/K is an integer, and the regularization parameter in Problem (5) is set as</p><p>the global solution set of Problem (4) is a subset of that of Problem (5).</p><p>According to this proposition, if ϵ and λ are appropriately chosen for Problem (5), then when m < d, Problems (4) and (<ref type="formula">5</ref>) are equivalent in terms of their global optimal solutions; when m ≥ d, Problem (<ref type="formula">5</ref>) is a tight Lagrangian relaxation of Problem (4), such that the global solution set of the former contains that of the latter.</p></div>
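<div xmlns="http://www.tei-c.org/ns/1.0"><p>Numerically, the three properties discussed after Theorem 3.1 are easy to check on any candidate solution. The sketch below is our own evaluation code: it estimates per-class ranks with the stable-rank estimator from footnote 6 and then tests between-class orthogonality and diversity.</p><p><![CDATA[
import numpy as np

def stable_rank(Zk):
    # Footnote 6: r_k = round(||Z_k||_F^2 / ||Z_k||^2).
    return int(round(np.linalg.norm(Zk, 'fro') ** 2 / np.linalg.norm(Zk, 2) ** 2))

def check_optimality_geometry(Zs, tol=1e-6):
    d = Zs[0].shape[0]
    m = sum(Zk.shape[1] for Zk in Zs)
    ranks = [stable_rank(Zk) for Zk in Zs]
    # (i) within-class compressible: each class spans a low-dimensional subspace.
    print("class ranks:", ranks)
    # (ii) between-class discriminative: Z_k^T Z_l = 0 for all k != l.
    ortho = all(np.abs(Zs[k].T @ Zs[l]).max() < tol
                for k in range(len(Zs)) for l in range(k + 1, len(Zs)))
    print("between-class orthogonal:", ortho)
    # (iii) maximally diverse: at a global maximizer the ranks sum to min{m, d}.
    print("maximally diverse:", sum(ranks) == min(m, d))
]]></p></div>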
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Analysis of Global Optimization Landscape</head><p>While we have characterized the local and global optimal solutions in Theorem 3.1, it remains unknown whether these solutions can be computed efficiently using GD to solve Problem (5), as GD may get stuck at a saddle point. Fortunately, <ref type="bibr">Sun et al. (2015)</ref>; <ref type="bibr">Lee et al. (2016)</ref> showed that if a function is twice continuously differentiable and satisfies strict saddle property, i.e., each critical point is either a local minimizer or a strict saddle point<ref type="foot">foot_9</ref> , GD converges to its local minimizer almost surely with random initialization. We investigate the global optimization landscape of Problem ( <ref type="formula">5</ref>) by characterizing all of its critical points as follows.</p><p>Theorem 3.3 (Benign optimization landscape). Suppose that the number of training samples in the k-th class is</p><p>Given a coding precision &#1013; &gt; 0, if the regularization parameter satisfies (6), it holds that any critical point Z of Problem (5) that is not a local maximizer is a strict saddle point.</p><p>We defer the proof to Section 4.2 and Appendix D.3. Here, we make some remarks on this theorem and also on the consequences of the results derived so far.</p><p>Differences from existing results on the MCR 2 problem.  <ref type="formula">4</ref>) with Frobenius norm constraints on each Z k in the UFM. However, their analysis requires an additional rank constraint on each Z k and only characterizes globally optimal representations. In contrast, our analysis eliminates the need for the rank constraint, and we characterize local and global optimality in Problem (5), as well as its optimization landscape. Interestingly, we demonstrate that the features represented by each local maximizer -not just global maximizers -are also compact and structured. Furthermore, we demonstrate that the regularized MCR 2 objective ( <ref type="formula">5</ref>) is a strict saddle function. To the best of our knowledge, Theorems 3.1 and 3.3 constitute the first analysis of local optima and optimization landscapes for MCR 2 objectives. According to <ref type="bibr">Daneshmand et al. (2018)</ref>; <ref type="bibr">Lee et al. (2016)</ref>; <ref type="bibr">Xu et al. (2018)</ref>, Theorems 3.1 and 3.3 imply that low-dimensional and discriminative representations can be efficiently found by (stochastic) GD on Problem (5) from a random initialization.</p><p>Comparison to existing landscape analyses in nonconvex optimization. In recent years, there has been a growing body of literature exploring optimization landscapes of non-convex problems in machine learning and deep learning. These include low-rank matrix factorization <ref type="bibr">(Ge et al., 2017;</ref><ref type="bibr">Sun et al., 2018;</ref><ref type="bibr">Chi et al., 2019;</ref><ref type="bibr">Zhang et al., 2020)</ref>, community detection <ref type="bibr">(Wang et al., 2021;</ref><ref type="bibr">2022b)</ref>, dictionary learning <ref type="bibr">(Sun et al., 2017;</ref><ref type="bibr">Qu et al., 2020;</ref><ref type="bibr">Zhai et al., 2020)</ref>, and deep neural networks <ref type="bibr">(Sun et al., 2020;</ref><ref type="bibr">Yaras et al., 2022;</ref><ref type="bibr">Zhou et al., 2022;</ref><ref type="bibr">Zhu et al., 2021;</ref><ref type="bibr">Jiang et al., 2024)</ref>. 
The existing analyses in the literature cannot be applied to the MCR² problem due to its special structure, which involves the log-determinant of all the features minus the sum of the log-determinants of the features in each class. Our work contributes to the literature on the landscape analysis of non-convex problems by showing that the MCR² problem has a benign optimization landscape. Our approach may be of interest for analyses of the landscapes of other intricate loss functions in practical applications.</p></div>
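<div xmlns="http://www.tei-c.org/ns/1.0"><p>The strict-saddle property can also be probed empirically. The sketch below is our own experiment, not one from the paper: it evaluates F at a configuration with one class collapsed to zero, which the analysis suggests is saddle-like, and checks whether tiny random perturbations reveal ascent directions; all numerical values are illustrative assumptions.</p><p><![CDATA[
import torch

def coding_rate(Z, eps):
    d, m = Z.shape
    return 0.5 * torch.logdet(torch.eye(d) + (d / (m * eps ** 2)) * Z @ Z.T)

def F(Z, sizes, eps=0.5, lam=0.01):
    m = Z.shape[1]
    Rc, start = 0.0, 0
    for mk in sizes:
        Rc = Rc + (mk / m) * coding_rate(Z[:, start:start + mk], eps)
        start += mk
    return coding_rate(Z, eps) - Rc - 0.5 * lam * Z.pow(2).sum()

d, sizes = 20, [10, 10]
Z = torch.zeros(d, sum(sizes))
Z[:, 10:] = torch.linalg.qr(torch.randn(d, 10))[0]  # class 2 features; class 1 collapsed

F0, hits = F(Z, sizes), 0
for _ in range(200):
    D = torch.randn_like(Z)
    if F(Z + 1e-3 * D / D.norm(), sizes) > F0:
        hits += 1
print(f"{hits}/200 tiny random perturbations increased F")  # > 0 reveals ascent directions
]]></p></div>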
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Proofs of Main Results</head><p>In this section, we sketch the proofs of our main theorems in Section 3. The complete proofs can be found in Sections B and C of the appendix. For ease of exposition, let</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Analysis of Optimality Conditions</head><p>Our goal in this subsection is to characterize the local and global optima of Problem (5). Towards this goal, we first provide an upper bound on the objective function F in Problem (5). In particular, this upper bound is tight when the blocks {Z k } K k=1 are orthogonal to each other. This result is a direct consequence of <ref type="bibr">(Chan et al., 2022, Lemma 10)</ref></p><p>where the inequality becomes equality if and only if</p><p>Next, we study the following set of critical points, which are between-class discriminative (i.e., Z T k Z l = 0): </p><p>, and (iii) the singular values satisfy  where</p><p>and</p><p>This proposition shows that each critical point that is between-class discriminative (i.e., Z T k Z l = 0) exhibits a specific structure: the singular values of Z k can only take on two possible values, &#963; k and &#963; k . We will leverage this structure and further show that Z is a strict saddle if there exists a Z k with a singular value &#963; k .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Analysis of Optimization Landscape</head><p>Our goal in this subsection is to show that the function F in Problem (5) has a benign optimization landscape. Towards this goal, we denote the set of critical point of F by</p><p>According to (13), we divide the critical point set X into two disjoint sets Z and Z c , i.e., X = Z &#8746; Z c , where</p><p>Moreover, according to Proposition 4.2, we further divide Z into two disjoint sets Z 1 and Z 2 , i.e., Z = Z 1 &#8746; Z 2 . Here,</p><p>where &#963; k,i (Z k ) denotes the i-th largest singular value of Z k . Our first step is to show that any point belonging to Z 1 is a local maximizer, while any point belonging to Z 2 is a strict saddle point.</p><p>Proposition 4.3. Consider the setting of Theorem 3.3. Suppose that Z &#8712; Z. Then, the following statements hold:</p><p>Next, we proceed to the second step to show that any point belonging to Z c is a strict saddle point. It suffices to find</p><p>With the above preparations that characterize all the critical points, we can prove Theorem 3.1 and Theorem 3.3. We refer the reader to Appendix D for the detailed proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experimental Results</head><p>In this section, we first conduct numerical experiments on synthetic data in Section 5.1 to validate our theoretical results, and then on real-world data sets using deep neural networks in Section 5.2 to further support our theory. All codes are implemented in Python mainly using NumPy and PyTorch. All of our experiments are executed on a computing server equipped with NVIDIA A40 GPUs. Due to space limitations, we defer some implementation details and additional experimental results to Appendix E.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Validation of Theory for Solving Problem (5)</head><p>In this subsection, we employ GD for solving Problem (5) with different parameter settings. We visualize the optimization dynamics and structures of the solutions returned by GD to verify and validate Theorems 3.1 and 3.3.</p><p>Verification of Theorem 3.1. In this experiment, we set the parameters in Problem (5) as follows: the dimension of features d = 100, the number of classes K = 4, the num-  Verification of Theorem 3.3. In this experiment, we maintain the same setting as above, except that the number of samples in each class is equal. We first fix m = 200 and vary d &#8712; {40, 80, 120}, and then fix d = 50 and vary d &#8712; {100, 200, 400} to run GD. We plot the distances between function values of the iterates to the optimal value, which is computed according to (7) in Theorem 3.1, against the iteration numbers in Figure <ref type="figure">3</ref>. We observe that GD with random initialization converges to an optimal solution at a linear rate. This indicates that the MCR 2 has a benign global landscape, which supports Theorem 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Training Deep Networks Using Regularized MCR 2</head><p>In this subsection, we conduct numerical experiments on the image datasets <ref type="bibr">MNIST (LeCun et al., 1998)</ref> and <ref type="bibr">CIFAR-10 (Krizhevsky et al., 2009)</ref> to provide evidence that our theory also applies to deep networks. More specifically, we employ a multi-layer perceptron network with ReLU activation as the feature mapping z = f &#920; (x) with output dimension 32 for MNIST and 128 for CIFAR-10. Then, we train the network parameters &#920; via Adam <ref type="bibr">(Kingma &amp; Ba, 2014</ref>) by optimizing Problem (5).</p><p>Experimental setting and results. In the experiments, we randomly sample a balanced subset with K classes and m samples from MNIST or CIFAR-10, where each class has the same number of samples. We set &#955; = 0.001 and = 0.5. For different subsets with corresponding values of (m, K), we run experiments and report the function value F obtained by training deep networks and the optimal value F * computed using the closed-form solution in Theorem 3.1  <ref type="bibr">(Bj&#246;rck &amp; Golub, 1973)</ref> between the class subspaces:</p><p>where the columns of U k &#8712; R d&#215;r k are the right singular vectors corresponding to the top r k singular values of Z k defined in ( <ref type="formula">14</ref>) and r k is its rank<ref type="foot">foot_10</ref> for each k &#8712; [K]. In particular, when s is smaller, the spaces spanned by each pair Z k and Z l for k &#824; = l are closer to being orthogonal to each other. Then, we record the value s in Table <ref type="table">1</ref> in different settings. Moreover, we visualize the pairwise cosine similarities between learned features on MNIST and CIFAR-10 when (m, K) = (1500, 6) and (2500, 10) in Figure <ref type="figure">4</ref>.</p><p>We observe from Table <ref type="table">1</ref> that the function value returned by training deep networks is extremely close to the global optimal value of Problem (5) and from the value s and Figure <ref type="figure">4</ref> that the features from different classes are nearly orthogonal to each other. These observations, together with Theorems 3.1 and 3.3, indicate that Problem (5) retains its optimization properties even when Z is parameterized by a neural network. Our theoretical analysis of Problem (5) thus illustrates a qualitative picture of training deep networks with the regularized MCR 2 objective.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this work, we provided a complete characterization of the global landscape of the MCR 2 objective, a highly nonconcave and nonlinear function used for representation learning. We characterized all critical points, including the local and global optima, of the MCR 2 objective, and showed that -surprisingly -it has a benign global optimization landscape. These characterizations provide rigorous justifications for why such an objective can be optimized well using simple algorithms such as gradient-based methods. In particular, we show that even local optima of the objective leads to geometrically meaningful representations. Our experimental results on synthetic and real-world datasets clearly support this new theoretical characterization. With the global landscape clearly revealed, our work paves the way for exploring better optimization strategies, hence better deep neural network architectures, for optimizing the MCR 2 objective more efficiently and effectively. For future work, it is natural to extend our analysis to Problem (4) with deep network parameterizations. It is also interesting to study the sparse MCR 2 objective, which has led to high-performance transformer-like architectures <ref type="bibr">(Yu et al., 2023a;</ref><ref type="bibr">b)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Impact Statement</head><p>This paper advances the state of knowledge in machine learning theory, including non-convex optimization for representation learning. We do not anticipate any particular societal or ethical impacts of this paper, besides those usually associated with theoretically oriented machine learning papers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Supplementary Material</head><p>The organization of the supplementary material is as follows: In Appendix A, we introduce preliminary setups and auxiliary results for studying the MCR 2 problem. Then, we prove the technical results concerning the global optimality of Problem (5) in Appendix B and the optimization landscape of Problem (5) in Appendix C, respectively. In Appendix D, we prove the main theorems in Theorem 3.1 and Theorem 3.3. Finally, we provide more experimental setups and results in Appendix E.</p><p>Besides the notions introduced earlier, we shall use BlkDiag(X 1 , . . . , X K ) to denote the block diagonal matrix whose diagonal blocks are X 1 , . . . , X K .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Preliminaries</head><p>In this section, we first introduce the first-order optimality condition and the concept of a strict saddle point for F (&#8226;) in Problem (5) in Section A.1, and finally present auxiliary results about matrix computations and properties of the logdeterminant function in Section A.2. Recall that</p><p>and &#945;, &#945; k are defined in (11). To simplify our development, we write R c (Z, &#960; k ) in (3) as</p><p>Therefore, we can write F (Z) in Problem (5) into</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1. Optimality Conditions and Strict Saddle Points</head><p>To begin, we compute the gradient and Hessian (in bilinear form along a direction D &#8712; R d&#215;m ) of R(&#8226;) in (2) as follows:</p><p>where X := I n + &#945;ZZ T and &#945; is defined in (11). Note that we can compute the gradient and Hessian of R c (&#8226;) in (21) using the same approach. Based on the above setup, we define the first-order optimality condition of Problem (5) as follows.</p><p>Definition A.1. We say that Z &#8712; R d&#215;m is a critical point of Problem (5) if &#8711;F (Z) = 0, i.e.,</p><p>where &#945; and &#945; k are defined in (11).</p><p>According to <ref type="bibr">Jin et al. (2017)</ref>; <ref type="bibr">Lee et al. (2019)</ref>, we define the strict saddle point, i.e., a critical point that has a direction with strictly positive curvature<ref type="foot">foot_11</ref> , of Problem (5) as follows: Definition A.2. Suppose that Z &#8712; R d&#215;m is a critical point of Problem (5). We say that Z is its strict saddle point if there exists a direction</p><p>where</p><p>Remark that for the MCR 2 problem, strict saddle points include classical saddle points with strictly positive curvature as well as local minimizers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Auxiliary Results</head><p>We provide a matrix inversion lemma, which is also known as Sherman-Morrison-Woodbury formula.</p><p>Lemma A.3 (Matrix inversion lemma). For any Z &#8712; R d&#215;m , we have</p><p>We next present a commutative property for the log-determinant function and the upper bound for the coding rate function.</p><p>We refer the reader to <ref type="bibr">(Chan et al., 2022, Lemma 8 &amp; Lemma 10)</ref> for the detailed proofs. Here, let Z = U &#931;V T be a singular value decompositon (SVD) of Z &#8712; R d&#215;m , where r = rank(Z) &#8804; min{m, d}, U &#8712; O d&#215;r , &#931; &#8712; R r&#215;r is a diagonal matrix, and V &#8712; O m&#215;r .</p><p>Lemma A.4 (Commutative property). For any Z &#8712; R d&#215;m and &#945; &gt; 0, we have</p><p>where the equality holds if and only if</p><p>Finally, we show that the objective function of Problem ( <ref type="formula">5</ref>) is invariant under the block diagonal orthogonal matrices.</p><p>Lemma A.6. For any</p><p>we have </p><p>This implies &#8711;F (ZO) = &#8711;F (Z). Finally, using (24), we have </p><p>where the equality holds if and only if Z T k Z l = 0 for all 1 &#8804; k &#824; = l &#8804; K. Substituting this into (5) directly yields (12).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Proof of Proposition 4.2</head><p>Proof of Proposition 4.2. Let Z &#8712; Z be arbitrary, where Z is defined in (13). It follows from</p><p>This, together with ( <ref type="formula">23</ref>), yields that &#8711;F (Z) = 0 is equivalent to</p><p>Obviously, Z k = 0 is a solution of the above equation for each k &#8712; [K], which satisfies Z T k Z l = 0 for all l &#824; = k. Now, we consider Z k &#824; = 0, and thus</p><p>&#931;k 0 0 0</p><p>be a singular value decomposition (SVD) of</p><p>. Substituting this SVD into (33) yields for all k &#8712; [K],</p><p>which is equivalent to</p><p>Using &#931; k = BlkDiag( &#931;k , 0), we further obtain</p><p>Since &#931;k is a diagonal matrix with diagonal entries being positive, we have for all k &#8712; [K],</p><p>This implies for each i &#8712;</p><p>Therefore, we obtain that &#963; 2 k,i &gt; 0 for each i &#8712; [r k ] is a positive root of the following quadratic equation with a variable x &#8712; R:</p><p>where</p><p>According to (6), one can verify that for each k &#8712; [K],</p><p>This yields that the above quadratic equation has positive roots as follows. For each i &#8712; [r k ] and k &#8712; [K], we have</p><p>Finally, using Z T k Z l = 0 and (34), we obtain</p><p>These, together with (34), yields ( <ref type="formula">14</ref>).</p><p>Conversely, suppose that each block Z k of Z satisfies Z k = 0 or takes the form ( <ref type="formula">14</ref>) for some</p><p>and &#963; k,i &gt; 0 satisfying (15). We are devoted to showing Z &#8712; Z. It is straightforward to verify that Z T k Z l = 0 for all 1 &#8804; k &#824; = l &#8804; K. This, together with Lemma 4.1, implies</p><p>where the last equality is due to ( <ref type="formula">15</ref>), ( <ref type="formula">16</ref>), ( <ref type="formula">17</ref>), and (36). Then, we compute</p><p>where the second equality follows from ( <ref type="formula">27</ref>). This, together with ( <ref type="formula">23</ref>), yields</p><p>where the last equality follows from ( <ref type="formula">14</ref>) and ( <ref type="formula">39</ref>). Therefore, we have &#8711;F (Z) = 0 as desired. This, together with</p><p>Proof of Proposition 4.3. For each Z &#8712; Z, it follows from Lemma 4.1 that</p><p>where f k is defined in (32). Suppose that there exists k &#8712; [K] such that r k = 0, i.e., Z k = 0. According to ( <ref type="formula">24</ref>) and (32), we compute for any</p><p>where the second equality follows from m k &#945; k /m = &#945; according to (11). This implies 0 is a local maximizer of f k (Z k ). Suppose to the contrary that r k &gt; 0 for all k &#8712; [K]. For each Z &#8712; Z, using Lemma 4.1 with Z T k Z l = 0 for all k &#824; = l, (14), and (32), we have</p><p>where the second equality is due to (14) and Lemma A.4. For ease of exposition, let</p><p>Using ( <ref type="formula">15</ref>), ( <ref type="formula">36</ref>), (37), and (38), one can verify that h</p><p>. This yields that h k (&#963; k ) is a local minimizer and h(&#963; k ) is a local maximizer. This, together with (42) and the fact that 0 is a local maximizer of f k (Z k ), implies (i) and (ii).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2. Proof of Proposition 4.4</head><p>Proof of Proposition 4.4. Note that Z &#8712; R d&#215;m is a critical point that satisfies (25). Suppose that rank (Z) = r and rank (Z k ) = r k for all k &#8712; [K]. Obviously, we have r k &#8804; min{m k , d} for all k &#8712; [K] and K k=1 r k &#8804; r &#8804; min{m, d}. Now, let ZZ T = Q&#923;Q T be an eigenvalue decomposition of ZZ T &#8712; S d + , where Q &#8712; O d&#215;r and &#923; &#8712; R r&#215;r is a diagonal matrix with diagonal entries being positive eigenvalues of ZZ T . Suppose that ZZ T has p distinct positive eigenvalues, where 1 &#8804; p &#8804; r. Let &#955; 1 &gt; &#8226; &#8226; &#8226; &gt; &#955; p &gt; 0 be its distinct eigenvalue values with the corresponding multiplicities being h 1 , . . . , h p &#8712; N + , respectively. Obviously, we have p i=1 h i = r. Therefore, we write</p><p>where</p><p>According to Lemma A.6, we can see that Z is a critical point with curvature if and only if ZO is a critical point with the same curvature for each</p><p>. Substituting this into (25) in Definition A.1 gives</p><p>This is equivalent to</p><p>This yields that each column of Z k is an eigenvector of Z for each k &#8712; [K]. This, together with the decomposition in ( <ref type="formula">44</ref>), yields that we can permute the columns of Z k such that the columns belonging to the space spanned by Q i are rearranged together. Let s k,i &#8712; N denote the number of columns of Z k that belong to the space spanned by</p><p>there exists an a column permutation matrix</p><p>where</p><p>This, together with (21) and Lemma A.5, yields</p><p>Moreover, let</p><p>Using this and (45), we have</p><p>This, together with (2), Lemma A.5, and</p><p>Characterize the structure of critical points. Now, for each k &#8712;</p><p>be a singular value decomposition (SVD) of</p><p>According to ( <ref type="formula">5</ref>), ( <ref type="formula">23</ref>), ( <ref type="formula">45</ref>), and ( <ref type="formula">46</ref>), we have for all k &#8712; [K] and i &#8712; [p],</p><p>Substituting the block forms of</p><p>k in (49) into the above equation and rearranging the terms, we obtain for all k &#8712; [K] and i &#8712; [p],</p><p>Using X = I + &#945;ZZ T and rearranging the terms, we have for all k &#8712; [K] and i &#8712; [p],</p><p>Since</p><p>0 for all i &#824; = j, and (50), we have for all k &#8712; [K] and i &#8712; [p],</p><p>Rearranging the terms in the above equation, we obtain for each k &#8712; [K] and i &#8712; [p],</p><p>Substituting this back to (52) yields for each k &#8712; [K] and i &#8712; [p],</p><p>This, together with ( <ref type="formula">49</ref>) and ( <ref type="formula">53</ref>), yields</p><p>It follows from this and (47) that</p><p>Since there exists k &#824; = l &#8712; [K] such that Z T k Z l &#824; = 0, we can assume without loss of generality that</p><p>This, together with Z (i) T Z (j) = 0 for all i &#824; = j, implies that there exists i * &#8712; [p] such that z 1,i1 , z 2,i2 are both columns of Z (i * ) . Without loss of generality, suppose that z 1,i1 and z 2,i2 are Since Z</p><p>, where</p><p>, the second equality follows from</p><p>k,1 , and (60). Summing up the above equality for all k &#8712;</p><p>where the second equality follows from the definition of &#946; i in (53). 
Finally, we compute</p><p>where the inequality is due to ‖q_2‖ = ‖c_2‖ ≠ 0 and</p><p>Given a matrix Z ∈ R^{d×m}, let ZZᵀ = QΛQᵀ be an eigenvalue decomposition of ZZᵀ ∈ S^d_+, where Q ∈ O^{d×r} and Λ ∈ R^{r×r} is a diagonal matrix whose diagonal entries are the positive eigenvalues of ZZᵀ. Suppose that ZZᵀ has p distinct positive eigenvalues, where 1 ≤ p ≤ r. Let λ_1 > ⋯ > λ_p > 0 be its distinct eigenvalues with corresponding multiplicities h_1, …, h_p ∈ N_+, respectively. Obviously, we have Σ_{i=1}^p h_i = r. Therefore, we write</p><p>According to (<ref type="formula">19</ref>), (<ref type="formula">20</ref>), (ii) in Proposition 4.3, and Proposition 4.4, if Z ∈ Z^c ∪ Z_2, then Z is a strict saddle point. This, together with X = Z^c ∪ Z_1 ∪ Z_2 and the fact that Z is a local maximizer, implies that Z ∈ Z_1. Using this, (20), and Proposition 4.2 yields that Z satisfies (<ref type="formula">7</ref>), (a), (b), and (c).</p><p>(ii) According to (i) in Theorem 3.1, suppose that the k-th block of a local maximizer Z admits the decomposition in (<ref type="formula">7</ref>) satisfying (a), (b), and (c) for all k ∈ [K]. This, together with (42) in the proof of Proposition 4.3, yields that</p><p>where σ̄_k is defined in (<ref type="formula">16</ref>) for each k ∈ [K]. Then, we define a function g:</p><p>One can verify that for all n_1 ≥ n_2, we have g(n_1, x) ≤ g(n_2, x) for each x. Therefore, we have for all m_k ≤ m_l,</p><p>where the second inequality follows from the fact that σ̄²_k is the maximizer of the function g(m_k, x) = h_k(x) according to (43). This, together with (61), yields that Z is a global maximizer if and only if Σ_{k=1}^K r_k = min{m, d} and, for all k ≠ l satisfying m_k < m_l and r_l > 0, we have r_k = min{m_k, d}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2. Proof of Proposition 3.2</head><p>To prove Proposition 3.2, we first need to characterize the global optimal solution set of Problem (4).</p><p>where</p><p>Proof. According to Lemma A.5, we have</p><p>where the inequality becomes equality if and only if Z T k Z l = 0 for all k &#824; = l. To simplify our development, let r k := rank(Z k ) denote the rank of Z k &#8712; R d&#215;m k , where r k &#8804; min{d, m k } for each k &#8712; [K] and K k=1 r k &#8804; min{d, m}, and</p><p>Moreover, let</p><p>&#931;k 0 0 0</p><p>be a singular value decomposition (SVD) of Z k , where</p><p>. Substituting this SVD into (64), together with &#8741;Z k &#8741; 2 F = m k , yields that to maximize H(Z), it suffices to study for each k &#8712;</p><p>Lemma D.2. Suppose that m, K are integers such that m/K is a positive integer, r &#8804; m/K is an integer, and &#945; &gt; 0 is a constant. Consider the following optimization problem</p><p>the optimal solution is</p><p>Proof. If r = 1, it is trivial to see that ( <ref type="formula">69</ref>) is the optimal solution. Therefore, it suffices to study r &#8805; 2. To simplify our development, let f (x) := -log(1 + &#945;x) + log(1 + K&#945;x)/K and F (x) := r i=1 f (x i ). Then, one can verify that for all x &#8805; 0,</p><p>Introducing dual variables &#955; associated with the constraint r i=1 x i = m/K and &#181; i associated with the constraint x i &#8805; 0 for each i &#8712; [r], we write the Lagrangian as follows</p><p>Then, we write the KKT system as follows:</p><p>Now, let S := {i &#8712; [r] : x i &gt; 0} denote the support set of a KKT point x &#8712; R r and s := |S| denote the cardinality of the support set, where 1 &#8804; s &#8804; r. This, together with (72), implies that for each i &#8712; S,</p><p>This is equivalent to the following quadratic equation:</p><p>We compute</p><p>Note that for all i &#8712; S, we have x i &gt; 0, and thus &#181; i = 0. This, together with K &#8805; 2 and the first equation in (73), implies &#955; &gt; 0. Consequently, the quadratic equation ( <ref type="formula">74</ref>) has a positive root if and only if &#951; &#8805; 0 and &#8710; &#8805; 0. This implies</p><p>Then, the solution of Problem ( <ref type="formula">74</ref>) is x i &#8712; {x, x} for each i &#8712; S, where</p><p>Now, we discuss the KKT points that could potentially be optimal solutions. Let x &#8712; R r be a KKT point satisfying x i &#8712; {x, x} for each i &#8712; S, where s &#8712; {1, 2, . . . , r}. In particular, when s = 1, we have x i = m/K for all i &#8712; S. In the following, we consider s &#8712; {2, . . . , r}.</p><p>Case 1. Suppose that x i = x j for all i, j &#8712; S. This, together with i&#8712;S x i = m/K and x i &#8712; {x, x} for each i &#8712; S, yields</p><p>Case 2. Suppose that there exists i &#824; = j &#8712; S such that x i &#824; = x j . This, together with x i &#8712; {x, x}, implies x &gt; x. According to (70), we have f &#8242;&#8242; (x) = 0 at x = 1/(&#945; &#8730; K). Then, we obtain that f &#8242; (x) is strictly decreasing in [0, x] and strictly increasing in [x, &#8734;]. Then, one can further verify that x &lt; x &lt; x. This, together with (70), implies</p><p>For ease of exposition, let l(x) = |{i &#8712; S : x i = x}| be the number of entries of x that equal to x. Then, we claim that any optimal solution x &#8902; satisfies l(x &#8902; ) &#8804; 1. Now, we prove this claim by contradiction. 
Without loss of generality, we assume that x⋆_i = x̄ for all i = 1, …, r − l and x⋆_i = x̲ for all i = r − l + 1, …, r with l ≥ 2. This, together with (79) and l ≥ 2, yields</p><p>Using the second-order necessary condition for constrained optimization problems (see, e.g., <ref type="bibr">(Nocedal &amp; Wright, 1999, Theorem 12.5)</ref>) and</p><p>we take</p><p>which contradicts (80). Therefore, x⋆ cannot be an optimal solution. This proves the claim. In this case, we can write the KKT point that could be an optimal solution as follows: there exists i ∈ S such that</p><p>Now, we compute the function values for the above two cases, i.e., (78) and (82), and compare them to determine which one is the optimal solution. To simplify our analysis, let h(x) := f(x)/x. For (78) in Case 1, we compute</p><p>For (82) in Case 2, we compute</p><p>where the inequality is because x̄ ≤ m/((s − 1)K) and f(x) is strictly decreasing on [0, ∞). For all x ∈ [m/(sK), m/((s − 1)K)], we compute</p><p>where the first inequality follows from log(1 + αx) − log(1 + Kαx)/K ≥ log(1 + α) − log(1 + αK)/K due to x ≥ m/(sK) ≥ 1, the second inequality uses log(1 + α) − log(1 + αK)/K = (K − 1) log(1 + α)/K + log((1 + α)/(1 + αK))/K ≥ ((K − 1) log(1 + α) − log K)/K, and the last inequality is because x ≥ m/(sK).</p><p>According to (68), we have</p><p>Using the mean-value theorem, there exists x ∈ (m/(sK), m/((s − 1)K)) such that</p><p>where the inequality follows from (85). Now, we bound f(x). According to (75) and (76), we have</p><p>This, together with the fact that f(x) is decreasing on (0, ∞), yields</p><p>where the last inequality uses η ≥ 0 and (<ref type="formula">76</ref>). This, together with (83), (84), and (87), yields</p><p>where the last inequality follows from (68). This implies that the optimal solution takes the form of (78) for some s ∈ [r]. Consequently, the function value of (78) for each s ∈ [r] can be computed as</p><p>This, together with Lemma D.3, implies that when the optimal solution takes the form of (78) with s = r, Problem (67) achieves its global minimum. This completes the proof.</p><p>Lemma D.3. Consider the setting in Lemma D.2 and the following function</p><p>where s ∈ [1, r] and α satisfies (68).</p><p>Proof. For ease of exposition, let β := mα and x := 1/s ∈ [1/r, 1]. According to (68), we have</p><p>This implies β ≥ r√K.</p><p>According to the results in Figures 5(a), 5(b), 6, and 7 of Experiments 1 and 2, we observe that when the number of samples is larger than the feature dimension, i.e., m ≥ d, the features learned via the MCR² principle are within-class compressible and between-class discriminative on both balanced and unbalanced data sets. Moreover, the dimension of the space spanned by these features is maximized such that Σ_{k=1}^K r_k = min{m, d}. This directly supports Theorem 3.1. By comparing the function value returned by GD with that computed by the closed form in Theorem 3.1, we found that GD with random initialization always converges to a global maximizer of Problem (5) when the data is balanced, while it always converges to a local maximizer of Problem (5) when the data is unbalanced. 
This directly supports Theorem 3.3.</p><p>According to the results in Figures <ref type="figure">5(c</ref>) and 8 of Experiment 3, we observe that when the number of samples is smaller than the feature dimension, i.e., m ≤ d, the features learned via the MCR² principle are orthogonal to each other and the dimension of each class subspace is equal to the number of samples in that class, i.e., r_k = m_k for each k ∈ [K].</p><p>Network architecture and training setups for MNIST. In the experiments on MNIST, we employ a 4-layer multi-layer perceptron (MLP) network with ReLU activation as the feature mapping, with intermediate dimension 2048 and output dimension 32. In particular, each layer of the MLP consists of a linear layer and a layer norm layer followed by ReLU activation in the implementation. We train the network parameters via Adam by optimizing the MCR² objective. For the Adam settings, we use a momentum of 0.9, a full batch size, and a dynamically adaptive learning rate initialized at 5 × 10⁻³, modulated by a CosineAnnealing learning rate scheduler <ref type="bibr">(Loshchilov &amp; Hutter, 2016)</ref>. We terminate the algorithm when it reaches 3000 epochs.</p><p>More experimental results on MNIST. Besides the numerical results in Table <ref type="table">1</ref>, we also plot the heatmap of the cosine similarities between pairwise columns of the features in Z obtained by training deep networks in Figure <ref type="figure">9</ref>. We observe that the features from different classes are nearly orthogonal to each other, while those from the same class are highly correlated. This supports our results in Theorem 3.1.</p><p>Network architecture and training setups for CIFAR-10. In the experiments on CIFAR-10, we employ a 2-layer multi-layer perceptron (MLP) network with ReLU activation as the feature mapping, with intermediate dimension 4096 and output dimension 128. In particular, each layer of the MLP consists of a linear layer and a layer norm layer followed by ReLU activation in the implementation. We train the network parameters via Adam by optimizing the MCR² objective. For the Adam settings, we use a momentum of 0.9, a full batch size, and a dynamically adaptive learning rate initialized at 5 × 10⁻³, modulated by a CosineAnnealing learning rate scheduler <ref type="bibr">(Loshchilov &amp; Hutter, 2016)</ref>. We terminate the algorithm when it reaches 4000 epochs.</p><p>More experimental results on CIFAR-10. Besides the numerical results in Table <ref type="table">1</ref>, we also plot the heatmap of the cosine similarities between pairwise columns of the features in Z obtained by training deep networks in Figure <ref type="figure">9</ref>. We observe that the features from different classes are nearly orthogonal to each other, while those from the same class are highly correlated. This supports our results in Theorem 3.1.</p><p>Figure 9: Heatmaps of pairwise cosine similarities among the features learned by optimizing Problem (<ref type="formula">5</ref>) on m samples split equally among K classes of MNIST and CIFAR-10. In the figure, the darker pixels represent higher cosine similarity between features. In particular, when the (i, j)-th pixel is close to 0 (very light blue), the features i and j are approximately orthogonal.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Department of Electrical Engineering and Computer Science, University of California, Berkeley</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>Department of Computer Science and Engineering, The Ohio State University, Columbus</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>Institute of Data Science, University of Hong Kong. Correspondence to: Peng Wang &lt;peng8wang@gmail.com&gt;, Yi Ma &lt;mayi@hku.hk&gt;. Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_5"><p>Please refer toChan et al. (2022, Section  </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_6"><p>2.1) for more details on measuring compactness of feature sets via coding rates.2 When performing maximization, we actually mean that we use gradient ascent. However, we write gradient descent to maintain consistency with existing optimization literature.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_7"><p>We are maximizing the MCR 2 objective. Maximizing a concave function is equivalent to minimizing a convex function.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_8"><p>Here, R(Z) is also known as the rate-distortion function in information theory<ref type="bibr">(Cover, 1999)</ref>, which represents the average number of binary bits needed to encode the data Z.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_9"><p>We say that a critical point is a strict saddle point of Problem (5) if it has a direction with strictly positive curvature; see Definition A.2. This includes classical saddle points with strictly positive curvature as well as local minimizers.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_10"><p>We estimate the rank of a matrix by rounding its "stable rank"<ref type="bibr">(Horn &amp; Johnson, 2012)</ref>:r k = round(&#8741;Z k &#8741; 2 F /&#8741;Z k &#8741; 2 ).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_11"><p>Note that Problem (5) is not a minimization problem but a maximization problem.</p></note>
		</body>
		</text>
</TEI>
