<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Case Study of Low Ranked Self-Expressive Structures in Neural Network Representations</title></titleStmt>
			<publicationStmt>
				<publisher>The Second Conference on Parsimony and Learning (Proceedings Track)</publisher>
				<date>03/24/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10588419</idno>
					<idno type="doi"></idno>
					
					<author>Uday Singh Saini</author><author>William Shiao</author><author>Yahya Sattar</author><author>Yogesh Dahiya</author><author>Samet Oymak</author><author>Evangelos E Papalexakis</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Understanding neural networks by studying their underlying geometry can help us understand their embedded inductive priors and representation capacity. Prior representation analysis tools like (Linear) Centered Kernel Alignment (CKA) offer a lens to probe those structures via a kernel similarity framework. In this work we approach the problem of understanding the underlying geometry through the lens of subspace clustering, where each input is represented as a linear combination of other inputs; such structures are called self-expressive structures. We analyze their evolution and gauge their usefulness with the help of linear probes. We also demonstrate a close relationship between subspace clustering and linear CKA, and show that the former can act as a more sensitive similarity measure of representations than linear CKA. We do so by comparing the sensitivities of both measures to changes in representations across their singular value spectrum, by analyzing the evolution of self-expressive structures in networks trained to generalize and memorize, and via a comparison of networks trained with different optimization objectives. This analysis helps us ground the utility of subspace clustering based approaches to analyzing neural representations and motivates future work on enforcing similarity between self-expressive structures as a means of training neural networks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Analysing structures in the representations of trained neural networks has been the subject of interest for many post-hoc interpretability methods <ref type="bibr">[1]</ref>. <ref type="bibr">[2]</ref> propose a Centered Kernel Alignment (CKA) <ref type="bibr">[3]</ref> based similarity measure between linear kernels of network activations (Linear-CKA) that has been used to compare deep and wide neural networks in <ref type="bibr">[4]</ref>, to analyse Vision Transformers <ref type="bibr">[5]</ref> vs ResNets <ref type="bibr">[6]</ref> in <ref type="bibr">[7]</ref>, to compare the effects of loss functions <ref type="bibr">[8]</ref>, and to study differences between self-supervised and supervised methods <ref type="bibr">[9]</ref> and between self-supervised objectives for Vision Transformer representations <ref type="bibr">[10]</ref>. Recently, works like <ref type="bibr">[11]</ref> and <ref type="bibr">[12]</ref> have demonstrated that Linear-CKA [2] similarity is usually dominated by the similarity between the singular vectors of neural activations possessing the largest singular values, rendering it insensitive to differences in singular vectors with smaller singular values. <ref type="bibr">[11]</ref> propose a sensitivity test to rigorously evaluate similarity measures by observing the effects of changes in a network's internal representations on a linear classifier's performance on those representations. Taking into account the observations made in <ref type="bibr">[11]</ref> about the spectral behaviour of Linear-CKA, we motivate a Low Ranked Subspace Clustering (LRSC) <ref type="bibr">[13]</ref> based pairwise affinity measure in conjunction with CKA and show its relationship to Linear-CKA <ref type="bibr">[2]</ref>. 
We demonstrate how this choice ameliorates some issues raised by <ref type="bibr">[11]</ref> regarding Linear-CKA, while also offering a more extensive comparison between the two in Section 5. Since an LRSC kernel over neural activations highlights self-expressive structures <ref type="bibr">[14]</ref> in neural representations, the combination of LRSC with CKA compares the similarity between the self-expressive structures of two neural representations. In Section 6 we demonstrate that self-expressive structures become more class-concentrated, as measured by subspace representation reconstruction (sub. recon.) <ref type="bibr">[15,</ref><ref type="bibr">16]</ref>, as we go deeper into the network's layers. Furthermore, this reconstruction based accuracy strongly correlates with a linear probe's <ref type="bibr">[17]</ref> performance on the same internal representations, thereby serving as a tool to understand intermediate representations of neural networks by computing just their singular vectors. Additionally, in Section 6.2 we analyse networks which generalise well and compare them to networks which memorise parts of their training set, and observe that for most layers of these two networks the learnt representations are similar; dissimilarities appear only in the last few layers, where each network learns markedly different representations. These observations are in alignment with results from <ref type="bibr">[18]</ref><ref type="bibr">[19]</ref><ref type="bibr">[20]</ref>.</p><p>In Appendix D we explore the limits of representation analysis using tools that approximate linear subspaces. In this setup we use rational activation <ref type="bibr">[21]</ref> based ResNets and compare them with ReLU based ResNets under settings of generalisation and memorisation. 
We test the efficacy of LRSC-CKA and Linear-CKA at discerning differences between rational networks with varying generalisation performance and demonstrate deficiencies in their ability to discover meaningful differences between networks trained in different regimes. We then take another prominent approach for representation analysis, Mean-Field Theoretic Manifold Analysis (MFTMA) <ref type="bibr">[19]</ref>, and demonstrate similar deficiencies in its ability to perform the same task. Finally, in Appendix E, to understand the emergence of self-expressive structures in networks trained with cross-entropy loss, we compare these networks with networks trained with the Maximal Coding Rate Reduction (MCRR) <ref type="bibr">[22]</ref> loss. The MCRR loss encourages the model to separate data points from different classes into different subspaces, thereby encouraging the development of self-expressive structures. In making this comparison we find that the final layers of cross-entropy trained networks indeed share similarity with networks trained on the MCRR loss, indicating the formation of self-expressive structures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Understanding neural networks by comparing the similarity of their internal representations has been the subject of various lines of research with many different similarity measures. To begin with, <ref type="bibr">[23]</ref> propose a Canonical Correlation Analysis (CCA) <ref type="bibr">[24]</ref> based tool called SVCCA, which uses an SVD over the representations of the network to remove noise before comparing them using Canonical Correlation Analysis. Building upon SVCCA, <ref type="bibr">[18]</ref> propose a different weighting of canonical correlations, calling their methodology Projection Weighted Canonical Correlation Analysis (PWCCA). Subsequently, [2], which utilises Centered Kernel Alignment (CKA) <ref type="bibr">[3]</ref> to measure similarities between kernels derived from layerwise activations, demonstrates some limitations of CCA, namely its inability to identify architecturally identical layers in networks trained with different initialisations. Similar limitations of CCA based methods are also demonstrated in <ref type="bibr">[11]</ref>. [2] predominantly utilises linear kernels for measuring similarity between networks, and we therefore refer to it as Linear-CKA. Other approaches like <ref type="bibr">[25]</ref> perform representation similarity analysis by computing correlations between representation similarity matrices based on various distance measures. <ref type="bibr">[26]</ref> compares the similarity of representations by considering the distances of positive semi-definite kernels on the Riemannian manifold. AGTIC <ref type="bibr">[27]</ref> proposes an adaptive similarity criterion that ignores extreme values of similarity in the representations. 
<ref type="bibr">[28]</ref> utilises Normalised Bures Similarity <ref type="bibr">[29]</ref> to study the similarity of neural networks with respect to layerwise gradients. Beyond utilising representations directly, works like Representation Topology Divergence <ref type="bibr">[30]</ref> learn a graph based on embeddings and then compute similarity based on the various connected components of the graph. Works like <ref type="bibr">[31]</ref> use cosine information to compute an adjacency matrix and study the modularity <ref type="bibr">[32]</ref> of the resulting graph. A similar approach was taken in <ref type="bibr">[33]</ref>, which computes a graph based on sparse subspace representation <ref type="bibr">[14]</ref> and analyses the modularity of such graphs, along with using CKA <ref type="bibr">[2]</ref> to compute the similarity between graphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background</head><p>In this section we establish the foundation for the tools and procedures adopted in our subspace based analysis of neural network representations. For a discussion of related work, please refer to Section 2. We begin by laying out the background on Low Ranked Subspace Clustering (LRSC) <ref type="bibr">[13]</ref> and provide justification for its use in Section 3.1. Then in Section 3.2 we describe Centered Kernel Alignment (CKA) <ref type="bibr">[3,</ref><ref type="bibr">34]</ref>, a well-known technique for representation similarity comparison. We then combine LRSC with CKA; the resultant approach is described in Section 4.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Low Rank Subspace Clustering</head><p>Given a matrix X &#8712; R d&#215;N of N data points, which in the context of this study will be activations of hidden layers of a neural network, Low Rank Subspace Clustering or LRSC <ref type="bibr">[13]</ref> tries to uncover the underlying union-of-subspaces structure of the data. LRSC accomplishes this by finding a low rank representation of each point subject to a Self-Expressiveness <ref type="bibr">[35]</ref> constraint, where each point is expressed as a linear combination of other points in the subspace. More concretely, given a low rank matrix X = [x 1 , . . . , x N ] where x i &#8712; R d &#8704; i, the goal of LRSC is to learn an affinity matrix C = [c 1 , . . . , c N ] &#8712; R N &#215;N where each column c i &#8712; R N is the representation of x i as a linear combination of the other data points x j &#8704; j. More specifically, each entry C ij of the matrix C denotes the weight of x j in the self-expressive reconstruction of x i . A noiseless version of LRSC <ref type="bibr">[13]</ref>, henceforth called LRSC-Noiseless, aims to solve the objective in Equation <ref type="formula">1</ref>.</p><p>Our goal in utilising LRSC is to analyse and compare internal activations of neural networks over a set of N data points in an architecture agnostic manner. Therefore, we utilise the noise-robust version of LRSC, as proposed in <ref type="bibr">[13]</ref> and shown in Equation 2. Utilising subspace clustering helps us learn a pairwise affinity kernel, or a graph, between the N data points. Doing so lets us represent every layer of a neural network as an R N &#215;N matrix, which is architecture agnostic, thereby facilitating analyses and comparisons of different layers of the same and different networks.</p></div>
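For independent subspaces, the noiseless LRSC objective admits a closed-form solution built from the top right singular vectors of X (the shape interaction matrix, C = V_tau V_tau^T, per [13]). The following NumPy sketch illustrates this, using the 80%-variance rank rule that the paper adopts later for LRSC-CKA; the function name and toy data are our own, not the authors' code.

```python
import numpy as np

def lrsc_affinity(X, var_explained=0.8):
    """Noiseless LRSC affinity C = V_tau V_tau^T for X in R^(d x N).

    tau is the smallest rank explaining `var_explained` of the
    variance of X (the rank rule used for LRSC-CKA in this paper).
    """
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    var = np.cumsum(s ** 2) / np.sum(s ** 2)
    tau = int(np.searchsorted(var, var_explained) + 1)
    V_tau = Vt[:tau].T                # top-tau right singular vectors, (N, tau)
    return V_tau @ V_tau.T            # (N, N) self-expressive affinity

# toy data: 50 points drawn from a 2-dimensional subspace of R^10
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(10, 2)))    # orthonormal basis of the subspace
V, _ = np.linalg.qr(rng.normal(size=(50, 2)))
X = U @ np.diag([3.0, 2.0]) @ V.T

C = lrsc_affinity(X)
print(np.allclose(X, X @ C))   # self-expressiveness X = XC holds for this rank-2 X
```

Each column of C expresses one point as a combination of the others, which is why the residual X − XC vanishes once tau reaches the true subspace dimension.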
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Centered Kernel Alignment</head><p>Starting with [2], Centered Kernel Alignment or CKA ([3], <ref type="bibr">[34]</ref>) has emerged as a key tool for analysing representations of neural networks ( <ref type="bibr">[7]</ref>, <ref type="bibr">[36]</ref>, <ref type="bibr">[37]</ref>, <ref type="bibr">[38]</ref>). Given two neural activation matrices of layers i and j, namely X &#8712; R di&#215;N and Y &#8712; R dj &#215;N , Linear-CKA [2] computes their respective R N &#215;N inner product kernels K = X T X and L = Y T Y. It then utilises CKA to compute a similarity between two general kernels as shown in Equation 3, where the equality on the left computes the CKA similarity between any pair of similarity matrices K and L. Similarly, the equality on the right, also called CKA Lin , is a derived form of CKA for linear kernels X T X and Y T Y, where &#955; i X , &#955; j Y are the i th and j th squared singular values and v i X , v j Y are the i th and j th right singular vectors of the activation matrices X and Y respectively. Note that HSIC, or the Hilbert-Schmidt Independence Criterion <ref type="bibr">[39]</ref>, used in Equation 3, is a way to compute the similarity between two R N &#215;N kernel matrices and serves as the backbone of CKA.</p><p>where HSIC(K, L) = tr(HKHHLH)/(N&#8722;1) 2 and H = I &#8722; (1/N)11 T (3)</p><p>While for the purposes of this work we refer to [2] as Linear-CKA, the authors of [2] also experiment with other kernels, like Radial Basis Functions, and demonstrate their effectiveness. <ref type="bibr">[38]</ref> is another work in this line that studies the application of general non-linear kernels to analyse neural representations with CKA.</p></div>
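A direct NumPy transcription of the HSIC and CKA expressions in Equation 3 may make the computation concrete; this is a sketch of the standard (non-debiased) estimator, with helper names of our choosing.

```python
import numpy as np

def hsic(K, L):
    """HSIC between two (N, N) kernel matrices, as in Equation 3."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N       # centering matrix H = I - (1/N) 11^T
    return np.trace(H @ K @ H @ H @ L @ H) / (N - 1) ** 2

def cka(K, L):
    """CKA(K, L) = HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L))."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

# Linear-CKA: feed the inner-product kernels of two activation matrices
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 100))    # layer i activations: d_i = 32, N = 100
Y = rng.normal(size=(64, 100))    # layer j activations: d_j = 64
print(cka(X.T @ X, X.T @ X))      # a representation compared with itself gives 1.0
print(cka(X.T @ X, Y.T @ Y))      # independent random representations score lower
```

Because both numerator and denominator use the same centering, CKA is invariant to isotropic scaling of X and Y, which is one reason it is preferred for cross-layer comparisons.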
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Method</head><p>We now describe the methodologies used in this study to analyse neural networks. We begin by describing how we use LRSC based affinity matrices to compute LRSC-CKA and a subspace representation based classifier in Section 4.1 and Section 4.2, respectively. Then, as a counterpart to Section 4.2 and analogous to the methodology adopted in <ref type="bibr">[33]</ref>, we define a Linear-CKA based classifier scheme in Section A.1. Lastly, in Section A.2 we describe the configurations and protocols for training followed in subsequent sections.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">LRSC-CKA</head><p>Algorithm 1: All-pairs CKA. Data: activation matrices [X 1 , . . . , X l ]. Result: layerwise affinity matrices and their pairwise CKA scores; use Equation 3 for Linear-CKA.</p><p>Based on the discussion in Section 3, we frame LRSC-CKA as a spectral variant of Linear-CKA; an experimental analysis establishing this is conducted in Section 5.1. Please note that for all results reported throughout the paper we use Equation 2 to compute the LRSC affinity matrices, but for simplicity let us consider the noiseless version of the problem described in Equation 1. Given neural activation matrices for layers i and j as X &#8712; R di&#215;N and Y &#8712; R dj &#215;N , we first compute their respective LRSC affinity matrices, denoted C X and C Y , based on Equation 1. Based on the formula for Linear-CKA utilising the Singular Value Decomposition of the activation matrices X and Y as shown in Equation 3, we write an analogous formula for LRSC-CKA in Equation <ref type="formula">4</ref> for low rank approximations of X with rank &#964; 1 and Y with rank &#964; 2 . Unless otherwise stated, for all LRSC-CKA computations in this study we select the low rank &#964; as the number of components which explain 80% of the variance in the matrix.</p><p>Using the noiseless variant of LRSC from Equation 1 allows us to easily demonstrate that LRSC-CKA is a uniformly weighted sum of pairwise cosine similarities of the top &#964; right singular vectors of X and Y. In contrast to Linear-CKA from Equation 3, this uniformity over a set of &#964; singular vectors ensures that LRSC-CKA is sensitive to changes beyond the dominant singular vectors, an issue that plagues Linear-CKA <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>. Algorithm 1 describes the process for computing LRSC-CKA for a given neural network.</p></div>
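Putting the pieces together, the all-pairs procedure of Algorithm 1 amounts to computing one LRSC affinity per layer and comparing every pair with CKA. The sketch below uses the noiseless LRSC closed form (C = V_tau V_tau^T) with the 80%-variance rank rule; all names are ours and the "layers" are random stand-ins for real activations.

```python
import numpy as np

def lrsc_affinity(X, var_explained=0.8):
    # noiseless LRSC closed form: C = V_tau V_tau^T
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    var = np.cumsum(s ** 2) / np.sum(s ** 2)
    tau = int(np.searchsorted(var, var_explained) + 1)
    return Vt[:tau].T @ Vt[:tau]

def cka(K, L):
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    hsic = lambda A, B: np.trace(H @ A @ H @ H @ B @ H) / (N - 1) ** 2
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

def lrsc_cka(X, Y, var_explained=0.8):
    """LRSC-CKA between activation matrices X (d_i, N) and Y (d_j, N)."""
    return cka(lrsc_affinity(X, var_explained), lrsc_affinity(Y, var_explained))

# all-pairs comparison over a list of layer activations, as in Algorithm 1
rng = np.random.default_rng(1)
layers = [rng.normal(size=(16, 80)), rng.normal(size=(32, 80)), rng.normal(size=(64, 80))]
S = np.array([[lrsc_cka(Xi, Xj) for Xj in layers] for Xi in layers])
print(np.round(S, 2))   # symmetric similarity matrix with ones on the diagonal
```

Because each layer is reduced to an N x N affinity before comparison, layers of different widths (here 16, 32 and 64) become directly comparable, which is the architecture-agnostic property the text emphasises.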
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Subspace representation based classification</head><p>Next, we describe subspace representation reconstruction (sub. recon.) based classification from <ref type="bibr">[15]</ref>, <ref type="bibr">[16]</ref>. Given a point x i &#8712; R d and its self-expressive encoding c i &#8712; R N learned via LRSC, a per-class reconstruction residual r (k) i is defined in Equation <ref type="formula">5</ref>. Once r (k) i has been computed for all classes, x i is assigned to the class c with the smallest residual norm &#8741;r (c) i &#8741; 2 . A higher value of this metric indicates a higher degree of co-planarity of a data point with points of its own class relative to points of other classes. Since LRSC encodes the degree of co-planarity between data points, layerwise LRSC-CKA is essentially a similarity metric based upon the co-planarity of data points x i across the various layers of a network. Computing a subspace reconstruction based class label only requires an SVD of the activations X l of a set of inputs for a given layer, which is obtained as a by-product of computing LRSC-CKA between any two layers. It doesn't require any additional training of linear classifiers on that layer's activations, thus making it a viable probe to evaluate linear structures in the activation space of a network.</p><p>The computation of subspace reconstruction based classification for every layer of the network is performed as follows: (1) Using algorithm 1 for LRSC computation we obtain the set of layerwise LRSC matrices {C l }. (2) Each C l &#8712; R N &#215;N encodes the subspace representations at network layer l for inputs x 1 , . . . , x N ; for each input x i we compute the class-wise subspace residuals r (k) i , and do so for all inputs i over all layers l.</p></div>
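The residual rule above can be sketched as follows: for each class k, keep only the coefficients contributed by points of class k and measure how well they alone reconstruct x_i. On two independent random subspaces the noiseless LRSC affinity is block diagonal, so the rule recovers the labels exactly. The helper names and toy setup are ours, and we pass the rank tau explicitly for clarity rather than using the variance rule.

```python
import numpy as np

def lrsc_affinity(X, tau):
    """Noiseless LRSC affinity C = V_tau V_tau^T with an explicit rank tau."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:tau].T @ Vt[:tau]

def subspace_recon_labels(X, C, labels):
    """Assign each point to the class with the smallest residual ||r_i^(k)||_2.

    C[j, i] is the weight of x_j in the self-expressive reconstruction of x_i,
    so zeroing the rows of C outside class k keeps only class-k contributors.
    """
    classes = np.unique(labels)
    res = np.empty((len(classes), X.shape[1]))
    for a, k in enumerate(classes):
        Ck = np.where((labels == k)[:, None], C, 0.0)
        res[a] = np.linalg.norm(X - X @ Ck, axis=0)   # per-point residual norms
    return classes[np.argmin(res, axis=0)]

# toy data: 40 points each from two independent 3-dimensional subspaces of R^20
rng = np.random.default_rng(0)
B1 = rng.normal(size=(20, 3))
B2 = rng.normal(size=(20, 3))
X = np.hstack([B1 @ rng.normal(size=(3, 40)), B2 @ rng.normal(size=(3, 40))])
labels = np.repeat([0, 1], 40)

C = lrsc_affinity(X, tau=6)          # total rank of the union of subspaces
pred = subspace_recon_labels(X, C, labels)
print((pred == labels).mean())       # reconstruction based accuracy
```

No classifier is trained at any point: the SVD that produces C is the only computation, which is the efficiency argument made in the text.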
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Comparing Low Rank Subspace Clustering based CKA and Linear-CKA</head><p>Our goal is to analyse the role played by the singular value spectrum of the activations of a given neural network and how different functions over the spectrum yield different interpretations. More specifically, as shown in Section 4.1, LRSC imposes a step function, akin to a shrinkage operator, over the singular values: singular values below a certain rank are set to 0 and the rest are given equal weight. By contrast, as shown in Section 3.2, Linear-CKA squares the singular values of the representation matrices, which makes it more sensitive to singular vectors with high singular values, as shown in Section 5.1, <ref type="bibr">[11]</ref> and <ref type="bibr">[12]</ref>. A more formal analysis of this fact is presented in Appendix F.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Spectral analysis of LRSC-CKA and Linear-CKA</head><p>When computing the similarity between neural activation matrices X and Y, Linear-CKA computes a weighted average over the cosine similarities of the right singular vectors of X and Y, as shown in Equation 3, whereas (Noiseless) LRSC-CKA computes a uniformly weighted average of those components up to a certain rank. Recent works ( <ref type="bibr">[11]</ref>, <ref type="bibr">[38]</ref>, <ref type="bibr">[12]</ref>) have shown that Linear-CKA is mostly sensitive to changes in the directions of the topmost principal components and not sensitive to lower principal component deletion. We demonstrate that, by virtue of uniformly weighting the cosine similarities of principal components (PCs), LRSC-CKA is sensitive to changes with greater uniformity. Following a protocol similar to <ref type="bibr">[11]</ref>, we describe the principal component (PC) sensitivity tests and present the results in Table <ref type="table">1</ref>.</p><p>Given the original neural activation matrix X for a given layer and a set of its low rank representations S, we perform a spectral sensitivity analysis comparing LRSC-CKA and Linear-CKA along the lines of <ref type="bibr">[11]</ref>. For the Top PC Addition Test in Table <ref type="table">1</ref>, the set S consists of low rank representations starting with the first PC and going up to a representation that contains the top 50% of PCs. The Bottom PC Deletion Test starts with the top 80% of principal components and removes them down to the top 30% of PCs; the lowest 20% of PCs are not used, to maintain parity for comparison. For experimental validation we perform this analysis on the last 5 layers, as in <ref type="bibr">[11]</ref>, and report the average for each network. 
Given low rank representations S = {X &#964; }, &#964; = &#964; 1 , . . . , &#964; 2 , where &#964; 1 and &#964; 2 denote the start and end number of principal components in the low rank representations, the Principal Component Sensitivity Test for a given layer is performed as follows. 1. Given the layer's neural activation matrix X, compute the linear probe accuracy, denoted f (X), the LRSC affinity matrix based on Equation 2, denoted C X , and the Linear Kernel K X = X T X. 2. For each low rank representation X &#964; &#8712; S:</p><p>&#8226; Compute f (X &#964; ), C X&#964; and K X&#964; , the linear probe accuracy, LRSC affinity and Linear Kernel of the low rank representation. &#8226; Compute |f (X) - f (X &#964; )|, the difference in linear probe accuracies between the original representation and the low rank representation. &#8226; Compute CKA(C X , C X&#964; ) and CKA(K X , K X&#964; ), the LRSC-CKA and the Linear-CKA between the original and low rank representations.</p><p>3. Compute the Pearson's correlation &#961; between the probe accuracy differences and the CKA scores to obtain the sensitivities of LRSC-CKA and Linear-CKA respectively. Please note that as two representations become more similar, their CKA score increases and the linear probe accuracy difference between them decreases; we therefore expect &#961; to be more negative in the case of higher sensitivity.</p><p>We present the results of this procedure over 5 different random seeds of ResNet20 on CIFAR10 and CIFAR100 in Table <ref type="table">1</ref>. For each network we perform the Principal Component Sensitivity Test on the last 5 layers and compute the Pearson's correlation coefficient for LRSC-CKA and Linear-CKA for each layer, and report the mean and standard deviation. We observe that for the Top PC Addition Test both LRSC-CKA and Linear-CKA are sensitive to changes in the topmost principal components. But for changes in lower principal components, as demonstrated by the Bottom PC Deletion Test, we observe that LRSC-CKA is much more sensitive than Linear-CKA. Therefore, LRSC-CKA has a higher sensitivity to change throughout the spectrum of an activation matrix, as opposed to Linear-CKA, which is sensitive only to changes in the topmost PCs <ref type="bibr">[11]</ref>, <ref type="bibr">[38]</ref>, <ref type="bibr">[12]</ref>. A theoretical analysis of this phenomenon is presented in Appendix F.</p></div>
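A toy version of the Bottom PC Deletion comparison (without the linear probe) illustrates the asymmetry: with a slowly decaying spectrum, Linear-CKA barely registers the removal of lower PCs while LRSC-CKA drops noticeably. The synthetic spectrum and the 80%-variance rank rule are our choices for this sketch.

```python
import numpy as np

def cka(K, L):
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    hsic = lambda A, B: np.trace(H @ A @ H @ H @ B @ H) / (N - 1) ** 2
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

def lrsc_affinity(X, var_explained=0.8):
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    var = np.cumsum(s ** 2) / np.sum(s ** 2)
    tau = int(np.searchsorted(var, var_explained) + 1)
    return Vt[:tau].T @ Vt[:tau]

# synthetic activations with a slowly decaying singular value spectrum
rng = np.random.default_rng(0)
d, N = 50, 100
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(N, d)))
s = 1.0 / np.sqrt(1.0 + np.arange(d))
X = U @ np.diag(s) @ V.T

# Bottom PC Deletion: keep only the top 25 of 50 principal components
k = 25
Xk = U[:, :k] @ np.diag(s[:k]) @ V[:, :k].T

linear_sim = cka(X.T @ X, Xk.T @ Xk)                   # stays near 1
lrsc_sim = cka(lrsc_affinity(X), lrsc_affinity(Xk))    # drops visibly
print(round(linear_sim, 3), round(lrsc_sim, 3))
```

The squared singular values in Linear-CKA let the dominant PCs mask the deletion, whereas the uniform weighting inside the LRSC affinity does not, mirroring the Bottom PC Deletion rows of Table 1.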
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Principal Component Sensitivity Test Results</head><table><head>Table 1: Avg. Pearson correlation coefficients &#961; for the Principal Component Addition and Deletion Tests for Linear-CKA and LRSC-CKA. 5 networks with different initialisations are used for each dataset, denoted V1-V5; this notation is reused in subsequent experiments unless otherwise stated.</head><row role="label"><cell>Top PC Addition Test</cell><cell>CIFAR10 R20 V1</cell><cell>V2</cell><cell>V3</cell><cell>V4</cell><cell>V5</cell><cell>CIFAR100 R20 V1</cell><cell>V2</cell><cell>V3</cell><cell>V4</cell><cell>V5</cell></row><row><cell>&#961;-LRSC &#181;</cell><cell>-0.88</cell><cell>-0.9</cell><cell>-0.88</cell><cell>-0.9</cell><cell>-0.89</cell><cell>-0.98</cell><cell>-0.98</cell><cell>-0.99</cell><cell>-0.98</cell><cell>-0.98</cell></row><row><cell>&#961;-LRSC &#963;</cell><cell>0.07</cell><cell>0.05</cell><cell>0.06</cell><cell>0.05</cell><cell>0.07</cell><cell>0.009</cell><cell>0.004</cell><cell>0.005</cell><cell>0.007</cell><cell>0.005</cell></row><row><cell>&#961;-Linear &#181;</cell><cell>-0.96</cell><cell>-0.96</cell><cell>-0.97</cell><cell>-0.97</cell><cell>-0.95</cell><cell>-0.85</cell><cell>-0.85</cell><cell>-0.85</cell><cell>-0.84</cell><cell>-0.85</cell></row><row><cell>&#961;-Linear &#963;</cell><cell>0.04</cell><cell>0.03</cell><cell>0.02</cell><cell>0.04</cell><cell>0.05</cell><cell>0.14</cell><cell>0.13</cell><cell>0.15</cell><cell>0.15</cell><cell>0.14</cell></row><row role="label"><cell>Bottom PC Deletion Test</cell><cell>CIFAR10 R20 V1</cell><cell>V2</cell><cell>V3</cell><cell>V4</cell><cell>V5</cell><cell>CIFAR100 R20 V1</cell><cell>V2</cell><cell>V3</cell><cell>V4</cell><cell>V5</cell></row><row><cell>&#961;-LRSC &#181;</cell><cell>-0.93</cell><cell>-0.95</cell><cell>-0.93</cell><cell>-0.94</cell><cell>-0.93</cell><cell>-0.94</cell><cell>-0.95</cell><cell>-0.96</cell><cell>-0.95</cell><cell>-0.96</cell></row><row><cell>&#961;-LRSC &#963;</cell><cell>0.02</cell><cell>0.01</cell><cell>0.01</cell><cell>0.02</cell><cell>0.02</cell><cell>0.03</cell><cell>0.02</cell><cell>0.01</cell><cell>0.02</cell><cell>0.009</cell></row><row><cell>&#961;-Linear &#181;</cell><cell>-0.51</cell><cell>-0.53</cell><cell>-0.62</cell><cell>-0.44</cell><cell>-0.45</cell><cell>-0.53</cell><cell>-0.55</cell><cell>-0.55</cell><cell>-0.57</cell><cell>-0.56</cell></row><row><cell>&#961;-Linear &#963;</cell><cell>0.74</cell><cell>0.63</cell><cell>0.45</cell><cell>0.75</cell><cell>0.68</cell><cell>0.79</cell><cell>0.8</cell><cell>0.8</cell><cell>0.8</cell><cell>0.79</cell></row></table><p>As shown above, the main advantage of an LRSC framing of CKA over a Linear Kernel framing is that it is more sensitive to changes in representations across a wider spectrum of their singular values, which is the main advantage of LRSC-CKA over Linear-CKA.</p><p>substantially in their later layers as shown in Figure <ref type="figure">2c</ref>, while <ref type="bibr">[43]</ref> demonstrate that memorisation is confined to a set of neurons rather than layers; observations similar to ours were also made in <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref>, <ref type="bibr">[20]</ref>. 
This phenomenon is also highlighted by a decrease in the class-label homogeneity of the self-expressive structures of the two networks, as shown in Figure <ref type="figure">2c</ref>: they offer similar reconstruction based accuracy for all except the last few layers. Figure <ref type="figure">2d</ref> - Figure <ref type="figure">2f</ref> show a similar set of conclusions for networks which generalise and memorise on the CIFAR100 dataset. Figure <ref type="figure">18</ref> shows a similar analysis using Linear-CKA, which also demonstrates that the major changes between the two types of networks appear towards the end of the networks, but a Linear Kernel coefficient based classification methodology, as described in Section A.1, isn't a reliable indicator of the performance shift. A more comprehensive set of results demonstrating the differences between networks offering strong generalisation and memorisation performance, while establishing their independence from network depth, along with experimental setup details, is given in Appendix H. Next, along the lines of Section 6.1, we establish the robustness of subspace reconstruction based classification as defined in Section 4.2 by correlating its performance with that of a linear classifier trained on the intermediate layers of overfitted neural networks. We train different ResNets on CIFAR10 and CIFAR100 with 50% of the data randomly labelled, see Figure <ref type="figure">3</ref> and Figure <ref type="figure">19</ref>, and measure the correlation of our metric with the accuracy of a linear classifier, presenting the results in Table <ref type="table">3</ref>.</p><p>The goal in doing so is to establish that the layer-wise correlations observed earlier in Section 6.1 are not dependent on an inherently well performing model. As shown in Table <ref type="table">3</ref> the subspace reconstruction based label assignment, denoted by LRSC recon. 
acc., performs better than Linear-CKA coefficient based label assignment, indicating that the class cohesiveness of the self-expressive structures offers more insight into generalisation performance than dot products of activations from the same class. This establishes the subspace reconstruction approach as a valuable alternative to learning a linear classifier, which incurs the computational overhead of training a classifier for every layer of the network, whereas the subspace reconstruction based accuracy can be readily computed for any set of input activations. Additional results are presented in Appendix C, Appendix I, Appendix J, Appendix K and Appendix L.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion and Discussion</head><p>In this work we demonstrate the use of self-expressive structures to understand the underlying geometry of the representations in the hidden layers of a neural network, and their relation to previously well-established methods. In doing so we use Low Rank Subspace Clustering (LRSC) on the activations of the hidden layers of neural networks to encode each layer as a self-expressive affinity matrix which is architecture agnostic. We then use Centered Kernel Alignment (CKA) to compare affinity matrices of various layers within a network and across networks, and in doing so demonstrate that:</p><p>&#8226; The combination of LRSC with CKA is an alternate spectral formulation of Linear-CKA which makes the similarity measure more sensitive to changes over a broader spectrum of principal components of the representations. Such a connection was lacking in prior work utilising subspace clustering <ref type="bibr">[33,</ref><ref type="bibr">44]</ref> to analyse representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>the rational network, therefore providing supplementary evidence for increased linear separation in the case of ReLU networks. Next, comparing the same ReLU ResNet (R20) with a noisily trained rational ResNet (R20rn) in Figure <ref type="figure">12b</ref>, we observe that the last layers of both ResNets aren't similar to the same degree as was the case earlier in Figure <ref type="figure">12a</ref>, when both networks were normally trained. This is also indicated by diverging subspace reconstruction and linear probe accuracies, and a much lower &#945; M for the noisy rational ResNet than for the normal ReLU ResNet. Next we analyse the final two combinations and compare a noisily trained ReLU ResNet with a normal and a noisy rational ResNet. 
In Figure <ref type="figure">12c</ref> we compare the noisily trained ReLU ResNet (R20n) to a normally trained Rational ResNet (R20r). Analogous to the observations made in Section 6.2, we find that the last layers of the noisy ReLU ResNet (R20n) are very dissimilar to all layers of the Rational ResNet (R20r), and, as one would expect, the subspace reconstruction and linear probe performance of the noisy ReLU ResNet is lower than that of the normally trained rational network, though the underlying manifold is still more linearly separable. Just as the normally trained ReLU ResNet shared similarities with all layers of a similarly trained rational ResNet, as shown earlier in Figure <ref type="figure">12a</ref>, the noisy ReLU ResNet does the same except for the last layers, thereby also indirectly corroborating the results in Section 6.2, where we demonstrated that normally and noisily trained ReLU networks tend to differ only in the later layers. Finally, comparing the noisily trained versions of both networks in Figure <ref type="figure">12d</ref>, we again observe that the final layers of the noisy ReLU ResNet are not similar to any layer of its noisy rational counterpart. Even though the linear separability (&#945; M ) of the manifolds in the final layer representations differs, both networks exhibit similar linear probe and subspace reconstruction accuracies. The noisy rational network does not show similar behaviour: its final layers are similar to various layers of the noisy ReLU ResNet. Figure <ref type="figure">11</ref> compares 5 noisy ReLU networks one by one and clearly demonstrates that the last few layers of each network are dissimilar from the rest. 
Based on these observations, the set of experiments described in this section clearly establishes that the structures learnt by rational networks trained to fit noisy training data are completely different from those learnt by ReLU networks, and that Linear Probes, Subspace Reconstruction and MFTMA are less effective at discovering the differences between generalising and memorising geometries in rational neural networks. Additional results for the experiments in this section are presented in Appendix M and Appendix N.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Analysis of Networks trained with subspace separation loss vs classification loss</head><p>To better observe the emergence of class homogeneous self-expressive structures in deeper layers of a network we compare networks trained on cross-entropy (CE) with networks trained on Maximum Coding Rate Reduction <ref type="bibr">[22]</ref> (MCRR), which we describe next for completeness. Given a dataset X = [x 1 , . . . , x N ] &#8712; R d&#215;N coming from a disjoint union of manifolds where M = &#8852; k i=1 M i in ambient space R d and a network f (x, &#952;) : R d &#8594; R p , the Maximum Coding Rate Reduction(MCRR) <ref type="bibr">[22]</ref> training framework learns a mapping z = f (x, &#952;) &#8712; R p such that Z = [z 1 , . . . , z N ] &#8712; R p&#215;N belongs to a disjoint union of linear subspaces S = &#8852; k i=1 S i in ambient space R p . The MCRR training framework encourages the following properties -(1) Representations for inputs from different classes are uncorrelated and belong to different linear subspaces. (2) Representations for inputs from the same class are correlated and belong to the same linear subspace. (3) The dimension or volume of the space occupied by inputs from a class should be as large as possible as long as they are uncorrelated with the rest. Works like <ref type="bibr">[50]</ref>, <ref type="bibr">[51]</ref>, <ref type="bibr">[52]</ref>, <ref type="bibr">[53]</ref> try to enforce the self-expressive property in the learned representations but cannot ensure all the 3 previously listed properties in the learned representation. Given data samples X = [x 1 , . . . , x N ] and a network f (x, &#952;) where z i = f (x i , &#952;) is the learned representation for x i , thereby creating a learned representation matrix Z = [z 1 , . . . , z N ] encoded each input data point. 
According to <ref type="bibr">[54]</ref>, the total number of bits needed to encode Z up to a precision &#1013;, on a per-input basis, is defined in Equation <ref type="formula">7</ref>.</p><p>high, indicating the emergence of class coherent self-expressive structures. This analysis establishes the emergence of class coherent self-expressive structures in networks trained with CE loss, and also indicates that, regardless of the training objective, large parts of the network learn representations that are very similar, with meaningful differences emerging only much later in the network. <ref type="bibr">[55]</ref> also showed a similar late divergence of representations in networks trained with different classification losses; <ref type="bibr">[10]</ref>, on the other hand, demonstrates that in the case of self-supervised training of Vision Transformers <ref type="bibr">[5]</ref>, the choice of objective, namely Joint-Embedding <ref type="bibr">[56]</ref>, <ref type="bibr">[57]</ref> vs reconstruction-based learning <ref type="bibr">[58]</ref>, <ref type="bibr">[59]</ref>, leads to dissimilar features that appear quite early in the network. A more complete set of results for LRSC-CKA and Linear-CKA, along with the details of the experimental setup, is provided in Appendix O.</p></div>
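For concreteness, the coding-rate quantity referenced around Equation 7 can be sketched numerically. The log-det rate below follows the standard MCRR formulation of [22] and [54]; the function names, the precision value eps = 0.5, and the toy data are our own illustrative choices, not taken from this paper's experiments.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Approximate number of nats needed to encode the columns of Z up to
    precision eps, using the log-det rate of [54]:
    R(Z, eps) = 1/2 * logdet(I + p / (N * eps^2) * Z @ Z.T)."""
    p, n = Z.shape
    gram = np.eye(p) + (p / (n * eps**2)) * (Z @ Z.T)
    return 0.5 * np.linalg.slogdet(gram)[1]

def rate_reduction(Z, labels, eps=0.5):
    """MCRR objective: rate of the whole dataset minus the
    class-conditional rates, weighted by class proportions."""
    n = Z.shape[1]
    r_all = coding_rate(Z, eps)
    r_classes = sum(
        (np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
        for c in np.unique(labels)
    )
    return r_all - r_classes

# Toy example: two classes in orthogonal 4-dimensional subspaces of R^8,
# the configuration MCRR's three properties encourage.
rng = np.random.default_rng(0)
Z = np.zeros((8, 40))
Z[:4, :20] = rng.standard_normal((4, 20))
Z[4:, 20:] = rng.standard_normal((4, 20))
labels = np.repeat([0, 1], 20)
print(rate_reduction(Z, labels))  # positive: classes occupy distinct subspaces
```

Maximising this difference expands the overall volume of the representation while compressing each class into its own subspace, matching the three properties listed above.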
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F. Connection between LRSC-CKA and Linear-CKA</head><p>Linear-CKA and LRSC-CKA are two versions of weighted sums of cosine similarity between the right-singular vectors of the original representations. Given activation matrices of layer i and j, namely X &#8712; R di&#215;N and Y &#8712; R dj &#215;N , CKA [2] computes their similarity via Equation 3. For given 2 layer wise neural activation matrices X = U X &#931; X V T X and Y = U Y &#931; Y V T Y , which are centred, we first demonstrate why Linear-CKA [2] is more sensitive to first few principal components in Section F.1 and then we demonstrate how Linear-CKA [2] is related to LRSC-CKA in Section F.2 while also showing how LRSC-CKA alleviates some shortcomings of Linear-CKA [2].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.1. Analysis of Linear-CKA</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Given centred neural activation matrices</head><p>, where each column of the matrix is the representation for a data point. Linear-CKA [2] then requires the computation of Linear-Kernel Gram Matrices as shown in Equation <ref type="formula">10</ref>.</p><p>X</p><p>) Re-writing Equation 10 as follows,</p><p>From the first part of Equation 3 and Equation <ref type="formula">11</ref>.</p><p>Where,</p><p>Letting K = X T X and L = Y T Y . As X and Y are already centred, Computing the 3 Hilbert Space Independence Criterion (HSIC) values. Computing the numerator of Equation <ref type="formula">12</ref>,</p><p>Computing the denominator of Equation <ref type="formula">12</ref>,</p><p>A similar compution with matrix Y yields,</p><p>Combining Equation <ref type="formula">15</ref>,Equation <ref type="formula">16</ref>and Equation 17 yields the formula for Linear-CKA [2] in terms of eigen-decomposition of the linear kernels of respective neural activation matrices, as shown in Equation 3 and Equation <ref type="formula">18</ref>.</p><p>Works like <ref type="bibr">[60]</ref> empirically demonstrate that the eigen-values of real world data and kernel matrices tend to decay rapidly. <ref type="bibr">[61]</ref> show that data that can derived from a latent variable model can be approximated by a low rank matrix, the proof of which is detailed in Section F.3. 
<ref type="bibr">[62]</ref> further provide bounds on the Singular Values of matrices with Displacement Structure and demonstrate exponential decay of singular values.</p><p>For the purpose of our analysis of Linear-CKA [2] we adopt a simplified exponential decay model over singular values from <ref type="bibr">[63]</ref>, whereas more involved results exist in <ref type="bibr">[62]</ref>.</p><p>In an exponential decay model <ref type="bibr">[63]</ref>, we assume that given an eigen-decomposition of the linear kernel matrix, its i th eigen-value &#955; i = O(&#961; &#946;i ), where &#961; &lt; 1. More concretely, for linear kernels, Given any activation's linear kernel matrix X T X = V &#931; 2 V T , let</p><p>Computing the sum of square of eigen values of any X T X, n i=1</p><p>As a consequence of Equation <ref type="formula">20</ref>,</p><p>Therefore, substituting the result in Equation 21 into the summation for i = 1 and j = 1 in Equation 18, we obtain -</p><p>Similarly, In a polynomial decay model <ref type="bibr">[63]</ref> model we assume that &#955; 2 i = O(i -&#945; ), where &#945; &gt; 1. Therefore for Linear Kernels &#955; 2 i = &#955; 2 1 i -&#945; . 
Conducting a computation similar to Equation 20-Equation <ref type="formula">22</ref>, the sum of squares of the eigenvalues of X T X is, from Theorem A.4 of <ref type="bibr">[63]</ref>, given in Equation <ref type="formula">23</ref>.</p><p>Using Equation <ref type="formula">23</ref> and computing the ratio of the square of the first kernel eigenvalue to the sum of squares, as in Equation <ref type="formula">21</ref>, yields Equation 24. Analogously to Equation <ref type="formula">22</ref>, substituting the polynomial-decay result of Equation 24 into the summation for i = 1 and j = 1 in Equation <ref type="formula">18</ref> gives the corresponding leading weight.</p><p>This reveals that Linear-CKA assigns a much higher weight to the cosine similarity between the top right singular vectors of the activation matrices, thereby demonstrating why Linear-CKA is insensitive to changes in all but the top singular vectors <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref>.</p></div>
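The effect of the two decay models on the weight that Equation 18 places on the leading (i = 1, j = 1) term can be illustrated numerically. The decay rates below (&#961; = 0.5 with &#946; = 1, and &#945; = 3) are arbitrary illustrative choices, not values estimated from the paper's networks.

```python
import numpy as np

def top_term_weight(kernel_eigs):
    """Weight that Equation 18 places on the (1, 1) term when both
    networks share this kernel spectrum: lambda_1^2 / sum_i lambda_i^2."""
    return kernel_eigs[0] ** 2 / np.sum(kernel_eigs ** 2)

i = np.arange(200)
exp_eigs = 0.5 ** i             # exponential decay: lambda_i = rho^(beta*i)
poly_eigs = (i + 1.0) ** -1.5   # polynomial decay: lambda_i^2 = i^(-alpha), alpha = 3

print(top_term_weight(exp_eigs))   # ~0.75, i.e. 1 - rho^(2*beta)
print(top_term_weight(poly_eigs))  # ~0.83, i.e. 1 / zeta(3)
```

Under either decay model most of the total weight concentrates on the top singular direction, which is the insensitivity of Linear-CKA that the derivation above makes precise.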
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.2. Analysis of LRSC-CKA</head><p>Continuing the analysis further for LRSC-CKA having given the same (assuming centred) X = U X &#931; X V T X and Y = U Y &#931; Y V T Y as in Section F.1. We first compute their respective LRSC Affinity matrices C X = V X V T X and C Y = V Y V T Y by Equation 1, where V X and V Y are rank-r (assumed same for simplicity) truncated right singular vectors of X and Y respectively. Essentially when comparing LRSC-CKA with Linear-CKA we observe that LRSC Affinity is a Linear Kernel with all singular values below a cut-off threshold (rank-r, for simplicity) set to 0 and all singular values above this threshold clamped to 1. Then, the corresponding LRSC-CKA based on Equation 18 is given by Equation <ref type="formula">26</ref>.</p></div></body>
		</text>
</TEI>
