<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>On the impact of activation and normalization in obtaining isometric embeddings at initialization</title></titleStmt>
			<publicationStmt>
				<publisher>Conference on Neural Information Processing Systems</publisher>
				<date>12/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10524174</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in neural information processing systems</title>
<idno type="ISSN">1049-5258</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Amir Joudaki</author><author>Hadi Daneshmand</author><author>Francis R Bach</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures it has been observed that this Gram matrix becomes degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing the rank collapse issue. Despite promising advances, the existing theoretical results do not extend to layer normalization, which is widely used in transformers, and cannot quantitatively characterize the role of non-linear activations. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards the identity matrix at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function. Code is available at: https://github.com/ajoudaki/deepnet-isometry]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Optimization of deep neural networks is a challenging non-convex problem. Various components and optimization techniques have been developed over the last decades to make optimization feasible. Components such as activation functions <ref type="bibr">[Hendrycks and Gimpel, 2016]</ref>, normalization layers <ref type="bibr">[Ioffe and Szegedy, 2015]</ref>, and residual connections <ref type="bibr">[He et al., 2016]</ref> have significantly influenced network training and have thus become the building blocks of neural networks. The practical success of these components has inspired extensive theoretical studies on the intricate roles of weight initialization <ref type="bibr">[Saxe et al., 2013</ref><ref type="bibr">, Daneshmand et al., 2021]</ref>, normalization <ref type="bibr">[Yang et al., 2019</ref><ref type="bibr">, Kohler et al., 2019</ref><ref type="bibr">, Daneshmand et al., 2021</ref><ref type="bibr">, 2023</ref><ref type="bibr">, Joudaki et al., 2023]</ref>, and activation layers <ref type="bibr">[Pennington et al., 2018</ref><ref type="bibr">, Joudaki et al., 2023]</ref> in neural network training. For example, the training of large language models hinges on carefully utilizing residual connections, normalization layers, and tailored activations <ref type="bibr">[Vaswani et al., 2017</ref><ref type="bibr">, Radford et al., 2018]</ref>. <ref type="bibr">Noci et al. [2022a]</ref> highlight that the absence or improper utilization of these components can substantially slow training.</p><p>To delve deeper into the influence of normalization and activation layers on training, one line of research has studied neural networks at initialization <ref type="bibr">[Pennington et al., 2018</ref><ref type="bibr">, de G. Matthews et al., 2018</ref><ref type="bibr">, Jacot et al., 2018</ref><ref type="bibr">, Yang et al., 2019</ref><ref type="bibr">, Li et al., 2022]</ref>. 
Several studies have focused on the Gram matrix, which captures the inner products of intermediate representations for a batch of inputs, revealing that Gram matrices become degenerate as the network depth increases <ref type="bibr">[Saxe et al., 2013</ref><ref type="bibr">, Daneshmand et al., 2021</ref><ref type="bibr">, Joudaki et al., 2023]</ref>. This issue of degeneracy, or rank deficiency, has been observed in multilayer perceptrons (MLPs) <ref type="bibr">[Saxe et al., 2013</ref><ref type="bibr">, Daneshmand et al., 2020]</ref>, convolutional networks <ref type="bibr">[Bjorck et al., 2018]</ref>, and transformers <ref type="bibr">[Dong et al., 2021]</ref>, posing challenges to the training process <ref type="bibr">[Noci et al., 2022b</ref><ref type="bibr">, Pennington et al., 2018</ref><ref type="bibr">, Xiao et al., 2018]</ref>. Research indicates that normalization layers can effectively circumvent such Gram degeneracy, thereby improving training <ref type="bibr">[Yang et al., 2019</ref><ref type="bibr">, Daneshmand et al., 2020</ref><ref type="bibr">, 2021</ref><ref type="bibr">, Bjorck et al., 2018]</ref>.</p><p>Analyses of neural networks in the mean-field, i.e., infinite-width, regime have revealed profound insights about initialization by characterizing the local solutions to the Gram dynamics <ref type="bibr">Yang et al. [2019]</ref>, <ref type="bibr">Pennington et al. [2018]</ref>. However, these results either do not guarantee global convergence towards the mean-field solutions or depend on technical assumptions that are challenging to verify numerically <ref type="bibr">[Daneshmand et al., 2021</ref><ref type="bibr">, 2020</ref><ref type="bibr">, Joudaki et al., 2023]</ref>. Furthermore, all these theories primarily pertain to the network at initialization, where parameters are typically random, and do not necessarily hold during or after training. In this work, our objective is to bridge these existing gaps.</p><p>Contributions. 
Building upon existing literature that elucidates the spectral properties of the Gram matrix, we introduce the concept of isometry, which quantifies the similarity of the Gram matrix to the identity. Our initial theoretical finding demonstrates that isometry does not decrease under batch and layer normalization. This finding illuminates the bias of normalization layers towards isometry at all stages: at initialization, during training, and post-training.</p><p>We subsequently extend our analysis to explore the impact of non-linear activations on the isometry of intermediate representations in MLPs. Within the mean-field regime, we establish that non-linear activations incline the intermediate representations towards isometry at an exponential rate in depth. Our principal contribution is quantifying this rate using the Hermite polynomial expansion of activations. Intriguingly, our empirical experiments unveil a correlation between this rate and the convergence of stochastic gradient descent in MLPs equipped with layer normalization and standard activations used in practice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related works</head><p>A line of research investigates the interplay between signal propagation through the network and training. The existing literature postulates that in order to ensure fast training <ref type="bibr">[Schoenholz et al., 2017</ref><ref type="bibr">, Poole et al., 2016]</ref>, the network output must be sensitive to input changes, as quantified by the spectrum of the input-output Jacobian. This hypothesis is employed by <ref type="bibr">Xiao et al. [2018]</ref> to train a 10,000-layer CNN using proper weight initialization without stabilizing components such as skip connections or normalization layers. <ref type="bibr">He et al. [2023]</ref> demonstrate the critical role of the Jacobian spectrum in large language models. In this paper, we analyze the spectrum of Gram matrices, which connects to the spectral properties of the input-output Jacobian.</p><p>Mean-field theory has been extensively used to characterize Gram matrix dynamics in the limit of infinite width. In this setting, the Gram matrix is a fixed point of a recurrence equation that depends on the network architecture <ref type="bibr">[Schoenholz et al., 2017</ref><ref type="bibr">, Yang et al., 2019</ref><ref type="bibr">, Pennington et al., 2018]</ref>. This fixed-point analysis can provide insights into the structure and spectral properties of Gram matrices in deep neural networks, thereby shedding light on the degeneracy of Gram matrices in networks <ref type="bibr">[Schoenholz et al., 2017</ref><ref type="bibr">, Yang et al., 2019]</ref>. However, fixed points are often not unique, and they can be degenerate or non-degenerate <ref type="bibr">Yang et al. [2019]</ref>. 
In this paper, we establish a convergence rate to a non-degenerate fixed point for a family of MLPs.</p><p>Batch normalization <ref type="bibr">[Ioffe and Szegedy, 2015]</ref> and layer normalization <ref type="bibr">[Ba et al., 2016]</ref> layers are widely used in deep neural networks (DNNs) to improve training. Batch normalization ensures that each feature within a layer across a mini-batch has zero mean and unit variance. In contrast, layer normalization centers the output of each layer and divides it by its standard deviation. There have been numerous theoretical studies on the effects of batch normalization due to its popularity <ref type="bibr">Yang et al. [2019</ref><ref type="bibr">], Daneshmand et al. [2021]</ref>, <ref type="bibr">Joudaki et al. [2023]</ref>. While layer normalization has been the subject of increasing interest due to its application in transformers <ref type="bibr">Xiong et al. [2020]</ref>, there are relatively fewer studies on its theoretical underpinnings. While we primarily focus on layer normalization, we define and characterize a property that is shared between batch and layer normalization.</p><p>A broad spectrum of activation functions, such as ReLU <ref type="bibr">[Fukushima, 1969]</ref>, GeLU <ref type="bibr">[Hendrycks and Gimpel, 2016]</ref>, SeLU <ref type="bibr">[Klambauer et al., 2017]</ref>, hyperbolic tangent, and sigmoid, are used in DNNs. These functions have various computational and statistical consequences in deep learning. Despite this diversity, only the design of the SeLU activation is theoretically motivated <ref type="bibr">[Klambauer et al., 2017]</ref>, while a broader theoretical understanding of activations remains elusive. To address this issue, we develop a theoretical framework to characterize the influence of a broad range of activations on intermediate representations in DNNs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Preliminaries</head><p>Notation. Let ⟨x, y⟩ be the inner product of vectors x and y, and ‖x‖² = ⟨x, x⟩ the squared Euclidean norm of x. For a matrix X, we write X_{i•} and X_{•i} for the i-th row and column of X, respectively. We use W ∼ N(µ, σ²)^{m×n} to indicate that W is an m × n Gaussian matrix with i.i.d. elements from N(µ, σ²). We denote by 0_n the zero vector of size n. Given a vector x ∈ R^n, x̄ denotes the arithmetic mean (1/n) ∑_{i=1}^n x_i. Lastly, I_n is the identity matrix of size n.</p><p>Normalization layers. Let LN : R^d → R^d and BN : R^{d×n} → R^{d×n} denote layer normalization and batch normalization, respectively. Table <ref type="table">1</ref> summarizes the definitions of the normalization layers. In our notation, we separate the centering step from the normalization step in both layer and batch normalization. This allows us to decouple the effect of normalization from that of centering. However, we do not depart from the standard MLP architectures, as we include centering in the network architecture defined below.</p></div>
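To make the decomposition into centering and normalization concrete, the following NumPy sketch implements the two operators in their standard zero-mean, unit-variance form; since Table 1 itself did not survive extraction, the exact scale conventions here (e.g., the absence of a √d factor) are an assumption rather than the paper's definition.

```python
import numpy as np

def layer_norm(x):
    """Layer normalization of one representation x in R^d:
    center across features, then divide by the standard deviation."""
    xc = x - x.mean()
    return xc / xc.std()

def batch_norm(X):
    """Batch normalization of X in R^{d x n} (features x batch):
    each feature (row) is centered and scaled across the mini-batch."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / Xc.std(axis=1, keepdims=True)
```

Separating the centering step (`x - mean`) from the division makes the decoupling discussed above explicit.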
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Table <ref type="table">1</ref>: Building blocks we consider in this work.</p><p>MLP setup. The subject of our analysis is an MLP with constant width d across its L layers, which takes input x ∈ R^d and maps it to output x^L ∈ R^d, with hidden representations given by a^{ℓ+1} = σ(W^{ℓ+1} x^ℓ), x^{ℓ+1} = LN(a^{ℓ+1}), where W^{ℓ+1} ∼ N(0, 1/d)^{d×d} and x^0 := x is the input. (1)</p><p>While the original ordering of layer normalization and activation is different <ref type="bibr">[Ba et al., 2016]</ref>, <ref type="bibr">Xiong et al. [2020]</ref> show that the above ordering is more effective for large language models.</p><p>Gram matrices and isometry. Given n data points {x_i}_{i≤n} ⊂ R^d, the Gram matrix G^ℓ of the feature vectors x^ℓ_1, . . . , x^ℓ_n ∈ R^d at layer ℓ of the network is defined as G^ℓ := [⟨x^ℓ_i, x^ℓ_j⟩]_{i,j≤n}. (2)</p><p>We define the notion of isometry to measure how close G^ℓ is to a scaling of the identity matrix.</p><p>Definition 1. Let G be an n × n positive semi-definite matrix. We define the isometry I(G) of G as the ratio of its normalized determinant to its normalized trace: I(G) := det(G)^{1/n} / ((1/n) tr(G)). (3)</p><p>I(G) is a scale-invariant quantity measuring the parallelepiped volume spanned by the feature vectors x^ℓ_1, . . . , x^ℓ_n. For example, consider two points on a plane with norms a and b and angle θ between them. The isometry is given by 2ab sin(θ)/(a² + b²), which is maximized when a = b and θ = π/2. 
This relationship between volume and isometry is visually clear for n = 2 and n = 3 feature vectors in Figure <ref type="figure">1</ref>.</p><p>Remarkably, I(G) has the following properties (see Lemma A.1 for formal statements and proofs):</p><p>Figure <ref type="figure">1</ref>: A geometric interpretation of isometry: higher volume, in the second row, corresponds to a higher value of isometry.</p><p>(i) Scale-invariance: For all constants c &gt; 0, we have I(G) = I(cG).</p><p>(ii) Range: I(G) ∈ [0, 1], where the boundaries 0 and 1 are achieved for degenerate and identity matrices, respectively.</p><p>We also define the isometry gap as the negative logarithm of isometry, −log I(G). Based on these properties, the isometry gap lies between 0 and ∞, with 0 and ∞ indicating perfect isometry (the identity matrix) and degenerate matrices, respectively. Isometry allows us to establish the inherent bias of normalization layers in the following section.</p></div>
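Definition 1 and the isometry gap translate directly into NumPy; in this sketch we use `slogdet` for numerical stability, which is an implementation choice rather than part of the definition.

```python
import numpy as np

def isometry(G):
    """I(G) = det(G)^(1/n) / (tr(G)/n) for an n x n PSD matrix G (Definition 1)."""
    n = G.shape[0]
    sign, logdet = np.linalg.slogdet(G)
    if sign <= 0:  # degenerate (rank-deficient) Gram matrix
        return 0.0
    return float(np.exp(logdet / n) / (np.trace(G) / n))

def isometry_gap(G):
    """-log I(G): zero for (multiples of) the identity, infinite when degenerate."""
    i = isometry(G)
    return np.inf if i == 0.0 else -np.log(i)
```

Both listed properties are easy to confirm numerically: scaling G leaves I(G) unchanged, and a rank-deficient G yields I(G) = 0.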
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Isometry bias of normalization</head><p>This section is devoted to a remarkable property of isometry in the context of normalization. We present a theorem that formalizes this property, followed by its geometric interpretation and implications.</p><p>Theorem 1. Given n samples {x_i}_{i≤n} ⊂ R^d \ {0_d}, their projections onto the unit sphere x̃_i := x_i/‖x_i‖, and the respective Gram matrices G and G̃, the isometry obeys I(G̃) ≥ (1 + (1/(n ā²)) ∑_{i=1}^n (a_i − ā)²) I(G), where a_i := ‖x_i‖ and ā := (1/n) ∑_i a_i.</p><p>Geometric interpretation. Isometry can be considered a measurement of the "volume" of the parallelepiped formed by the sample vectors, made scale- and dimension-independent. The normalization process effectively equalizes the edge lengths of this parallelepiped, enhancing the overall "volume" or isometry, provided there is variance in the sample norms. Thus, the projection onto the unit sphere x_i → x_i/‖x_i‖ makes the edge lengths of the parallelepiped equal while leaving the angles between its edges intact. From this geometric perspective, Theorem 1 implies that among parallelepipeds with the same angles between their edges and a fixed total squared edge length, the one with equal edge lengths has the highest volume (and thereby isometry).</p><p>The proof of Theorem 1 is intuitive for the special case of vectors forming a cube, where the maximum volume is realized when all edge lengths are equal. That the maximum volume is achieved when edge lengths are equal can be deduced from the arithmetic-geometric mean inequality. Strikingly, the proof for the general case is nearly as simple as this special case. The high-level intuition behind the proof is that the determinant allows us to decouple the roles of angles and edge lengths in the volume formulation. This fact is evident for n = 2 in Figure <ref type="figure">1</ref>. 
Since normalization does not modify the angles between edges, the remainder of the proof falls back onto the case where the edges form a cube.</p><p>Proof of Theorem 1. Writing G = D G̃ D with D := diag(a_1, . . . , a_n), we have det(G) = det(G̃) ∏_i a_i² and tr(G̃) = n, so I(G̃)/I(G) = ((1/n) ∑_i a_i²) / (∏_i a_i²)^{1/n} = (ā² + (1/n) ∑_i (a_i − ā)²) / (∏_i a_i²)^{1/n} ≥ 1 + (1/(n ā²)) ∑_i (a_i − ā)², where the last step uses the arithmetic-geometric mean inequality (∏_i a_i)^{1/n} ≤ ā.</p><p>Theorem 1 further shows a subtle property of normalization: as long as there is some variation in the sample norms, i.e., the ‖x_i‖'s are not all equal, the post-normalization Gram matrix has strictly higher isometry than the pre-normalization Gram matrix. The theorem further quantifies the improvement in isometry as a function of the variation of the norms. Intuitively, the terms ā and (1/n) ∑_i (a_i − ā)² can be interpreted as the average and variance of the sample norms a_1, . . . , a_n. Thus, a higher variation in the norms a_i leads to a larger increase in isometry after normalization.</p></div>
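A quick numerical check of Theorem 1: project a batch of vectors with heterogeneous norms onto the unit sphere and compare isometries. The multiplicative factor 1 + var(a)/ā² below follows our reading of the theorem statement; the qualitative claim is simply that isometry increases.

```python
import numpy as np

def isometry(G):
    # Definition 1: det(G)^(1/n) / (tr(G)/n)
    n = G.shape[0]
    sign, logdet = np.linalg.slogdet(G)
    return 0.0 if sign <= 0 else float(np.exp(logdet / n) / (np.trace(G) / n))

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.normal(size=(n, d)) * rng.uniform(0.5, 3.0, size=(n, 1))  # rows with varied norms
a = np.linalg.norm(X, axis=1)
Xn = X / a[:, None]                  # projection onto the unit sphere
G, Gn = X @ X.T, Xn @ Xn.T
factor = 1 + ((a - a.mean()) ** 2).mean() / a.mean() ** 2
```

With this seed, `isometry(Gn) >= factor * isometry(G)` holds, and `factor > 1` whenever the norms are not all equal.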
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Implications for layer (and batch) normalization</head><p>Theorem 1 reveals insights into the biases introduced by layer and batch normalization in neural networks, highlighting that the improvement in isometry is not limited to initialization but persists through the training process. Corollary 2. Consider n vectors x_1, . . . , x_n ∈ R^d \ {0_d} and their layer-normalized counterparts x̃_i := LN(x_i). Define their respective Gram matrices G := [⟨x_i, x_j⟩]_{i,j≤n} and G̃ := [⟨x̃_i, x̃_j⟩]_{i,j≤n}. We have: I(G̃) ≥ I(G).</p><p>What makes the above result distinct from related studies <ref type="bibr">[Daneshmand et al., 2021</ref><ref type="bibr">, 2020</ref><ref type="bibr">, Yang et al., 2019]</ref> is that the increase in isometry is not limited to random initialization. Thus, layer normalization increases the isometry even during and after training. This calls for future research on the role of this inherent bias in the enhanced optimization and generalization performance observed with batch normalization <ref type="bibr">[Ioffe and Szegedy, 2015</ref><ref type="bibr">, Yang et al., 2019</ref><ref type="bibr">, Lyu et al., 2022</ref><ref type="bibr">, Kohler et al., 2019]</ref>.</p><p>Despite the seemingly vast differences between layer normalization and batch normalization <ref type="bibr">[Lubana et al., 2021]</ref>, the following corollary shows a link between these two normalization techniques.</p><p>Corollary 3. Given n samples in a mini-batch X ∈ R^{d×n} before and X̃ = BN(X) after normalization, define the covariance matrices C := XX^⊤ and C̃ := X̃X̃^⊤. 
We have: I(C̃) ≥ (1 + (1/(d ā²)) ∑_{i=1}^d (a_i − ā)²) I(C), where a_i := ‖X_{i•}‖ is the norm of the i-th row (feature) and ā := (1/d) ∑_i a_i.</p><p>Gram matrices of networks with batch normalization have been the subject of many previous studies at network initialization: it has been postulated that BN prevents the rank collapse issue [Daneshmand et al., 2020], that it orthogonalizes the representations [Daneshmand et al., 2021], and that it imposes isometry <ref type="bibr">[Yang et al., 2019]</ref>. It is straightforward to verify that orthogonal matrices have the maximum isometry. Thus, the increase in isometry links to the orthogonalization of hidden representations characterized by <ref type="bibr">Daneshmand et al. [2021]</ref>. While all previous results heavily rely on Gaussian random weights to establish this inherent bias, Corollary 3 is not limited to random weights.</p><p>Figure 2: Validation of Corollary 2. Isometry (y-axis) vs. layer of an MLP: normalization layers (shaded blue) across all layers and configurations maintain or increase isometry both before (left) and after (right) training, validating Corollary 2. Hyperparameters: activation tanh, depth 10, width 1000, batch size 512, trained with SGD on the CIFAR10 training set with lr = 0.01. Layer names are encoded as type-index, where type can be fc (fully connected), norm (LayerNorm), or act (activation).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Empirical validation of Corollary 2 in an MLP setup</head><p>We can validate Corollary 2 by tracking the isometry of the various layers of an MLP with layer normalization. Figure <ref type="figure">2</ref> shows the isometry of intermediate representations in an MLP with layer normalization and hyperbolic tangent activation on the CIFAR10 dataset. To highlight that the claims of Corollary 2 hold at all times and not only at initialization, Figure <ref type="figure">2</ref> tracks the isometry of the various layers both at initialization (left) and after training (right). In both cases, the isometry at the normalization layers (shaded blue) is either stable or increased, which validates Corollary 2.</p></div>
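The qualitative behavior above is easy to reproduce at initialization with a short simulation; the loop below (fully connected, tanh, then layer normalization, with deliberately correlated inputs) is a minimal stand-in for the paper's CIFAR10 setup, not its exact experiment.

```python
import numpy as np

def isometry_gap(G):
    # -log I(G) with I(G) = det(G)^(1/n) / (tr(G)/n)
    n = G.shape[0]
    sign, logdet = np.linalg.slogdet(G)
    return np.inf if sign <= 0 else float(np.log(np.trace(G) / n) - logdet / n)

rng = np.random.default_rng(0)
d, n, L = 500, 4, 20
base = rng.normal(size=(d, 1))
X = base + 0.3 * rng.normal(size=(d, n))       # correlated inputs: far from isometry
gaps = []
for _ in range(L):
    H = rng.normal(size=(d, d)) @ X / np.sqrt(d)                # fc layer, W ~ N(0, 1/d)
    A = np.tanh(H)                                              # activation
    Ac = A - A.mean(axis=0, keepdims=True)                      # layer norm: center ...
    X = np.sqrt(d) * Ac / np.linalg.norm(Ac, axis=0, keepdims=True)  # ... and rescale
    gaps.append(isometry_gap(X.T @ X / d))
```

The recorded isometry gap `-log I(G)` shrinks with depth, mirroring the non-decreasing isometry observed across normalization layers in Figure 2.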
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Isometry bias of non-linear activation functions</head><p>So far, our focus has been on individual normalization layers. In this section, we extend our analysis to all layers when the weights are Gaussian. Inspired by the isometry bias of normalization, we analyze how other components of neural networks influence the isometry of the Gram matrices, denoted by I(G^ℓ), with a specific focus on non-linear activations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Hermite expansion of activation functions</head><p>Analyzing Gram matrix dynamics for non-linear activations is challenging, since even small modifications in the scale or shape of an activation can lead to significant changes in the representations. A powerful tool for analyzing activations is to express them in the Hermite polynomial basis. Inspired by previous successful applications of Hermite polynomials in neural network analysis <ref type="bibr">[Daniely et al., 2016</ref><ref type="bibr">, Yang, 2019]</ref>, we explore the impact of activations on isometry through this expansion.</p><p>Definition 2. The (normalized probabilist's) Hermite polynomial of degree k, denoted by He_k(x), is defined as He_k(x) := ((−1)^k/√(k!)) e^{x²/2} (d^k/dx^k) e^{−x²/2}, so that E_{z∼N(0,1)}[He_j(z) He_k(z)] = δ_{jk}.</p><p>Every function σ that is square-integrable with respect to the Gaussian kernel, i.e., that obeys ∫_{−∞}^{∞} σ(x)² e^{−x²/2} dx &lt; ∞, can be expressed as a linear combination of Hermite polynomials, σ(x) = ∑_k c_k He_k(x), with coefficients (see Section A for more details) c_k = E_{z∼N(0,1)}[σ(z) He_k(z)].</p><p>The subsequent section discusses how to leverage the Hermite expansion of the activation to analyze the dynamics of Gram matrix isometry.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Non-linear activations bias Gram dynamics towards isometry</head><p>In this section, we analyze how I(G^ℓ) changes with ℓ. We use the mean-field dynamics of Gram matrices, the subject of previous studies <ref type="bibr">[Yang et al., 2019</ref><ref type="bibr">, Schoenholz et al., 2017</ref><ref type="bibr">, Poole et al., 2016]</ref>.</p><p>The mean-field dynamics of the Gram matrices is given by G^{ℓ+1}_* = ϕ(E_{h∼N(0_n, G^ℓ_*)}[σ(h) σ(h)^⊤]), where σ is applied elementwise. This equation gives the expected Gram matrix at layer ℓ + 1, based on the Gram matrix G^ℓ_* from the previous layer, and ϕ is the mean-field counterpart of the layer normalization operator (see Section A for more details). The sequence G^ℓ_* approximates the dynamics of G^ℓ, and this correspondence becomes exact for infinitely wide MLPs. In the rest of this section, we analyze the above dynamical system. Our theory relies on the notion of isometry strength of the activation function, defined next. Definition 3 (Isometry strength). Given an activation σ with Hermite expansion {c_k}_{k≥0}, define its isometry strength β_σ as: β_σ := 1 + (∑_{k=2}^∞ c_k²)/(∑_{k=1}^∞ c_k²). (11)</p><p>We can readily check from the definition that the isometry strength β_σ has the following basic properties: (i) it ranges between 1 and 2, and (ii) it is 1 if and only if the activation is a linear function. Table <ref type="table">2</ref> presents the isometry strength of certain activations in closed form. With this definition, we can finally analyze the Gram matrix mean-field dynamics. Interestingly, the negative log of isometry, −log I(G^ℓ_*), can serve as a Lyapunov function for the above dynamics. The following theorem proves that non-linear activations also impose isometry, similar to normalization layers. Theorem 4. Let σ be an activation function with a Hermite expansion and isometry strength β_σ (see equation (11)). 
Given a non-degenerate input Gram matrix G^0_*, for sufficiently large layer ℓ ≳ β_σ^{−1}(−n log I(G^0_*) + log(4n)), we have −log I(G^ℓ_*) ≤ β_σ^{−ℓ} (−n log I(G^0_*) + log(4n)).</p><p>Note that the condition of a non-degenerate input is essential to reach isometry through depth. For example, if the input batch contains a duplicated sample, the corresponding representations across all layers will remain duplicated, implying that all G^ℓ_* will be degenerate. Theorem 4 reveals the importance of the non-linear Hermite coefficients (c_k, k ≥ 2) of the activation function to ensure β_σ &gt; 1 and obtain isometry in depth. This connection between β_σ and isometry is the rationale for referring to β_σ as the isometry strength. This constant can be computed in closed form for various activations, as shown in Table <ref type="table">2</ref>, and for all other activations, it can be computed numerically by sampling.</p><p>Table 2: Isometry strength β_σ (see Definition 3) for various activation functions.</p><p>Figure <ref type="figure">3</ref> compares the established bound on the isometry gap with the gap observed in practice, i.e., for G^ℓ, for three activations. We observe that β_σ predicts the decay rate of the isometry of the Gram matrices G^ℓ. So far, we have only discussed the direct consequences of our theory for the isometry of Gram matrices; in the next section, we discuss further insights from the above analysis. 
</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Implications of our theory</head><p>In this section, we elucidate the implications of our theory, beginning with insights into layer normalization through the Hermite expansion of activation functions, followed by an examination of its impact on training.</p><p>Layer normalization primarily involves two steps: (i) centering and (ii) normalizing the norms.</p><p>Through the Hermite expansion of activation functions, we unravel the underlying intricacies of these components and propose alternatives based on insights from the Hermite expansion.</p><p>Experimental setup. Our experiments utilize MLPs with layer normalization and various activation functions for the task of image classification on the CIFAR10 dataset <ref type="bibr">[Krizhevsky et al.]</ref>. For training, we use stochastic gradient descent (SGD) with a fixed step size of 0.01. Unless stated otherwise, the MLP has a constant width of d = 1000 across hidden layers and a batch size of n = 10. Throughout this section, a^ℓ and x^ℓ respectively denote the post-activation and post-normalization vectors at layer ℓ.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Centering and Hermite expansion</head><p>One of the key insights of our mean-field theory is that the centering step in layer normalization is crucial in obtaining isometry. In the mean-field regime, pre-activations follow standard Gaussian distributions, and thus the average post-activation ā^ℓ = (1/d) ∑_{i=1}^d a^ℓ_i converges to its expectation E_{z∼N(0,1)}[σ(z)] = c_0. This insight suggests an alternative way of obtaining isometry by explicitly removing the offset term from the activation, i.e., replacing the activation σ(x) by σ(x) − c_0. Strikingly, our experiments presented in Figure <ref type="figure">4</ref> indicate that such a replacement can also impose isometry. This result provides novel insights into the role of centering in layer normalization.</p><p>Figure 4: mean-field centering (blue): x^{ℓ+1} = LN(a^ℓ − c_0).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Normalization and Hermite expansion</head><p>Theorem 4 further reveals the importance of normalizing the norms, in addition to centering, to achieve isometry. Figure <ref type="figure">5</ref> underlines the importance of normalization for different activations, where we observe that the isometry gap may increase without normalization. Similar to our mean-field analysis of centering, the factor (1/d) ∑_{i=1}^d (a^ℓ_i − ā^ℓ)² in layer normalization converges to the variance var_{z∼N(0,1)}(σ(z)) = ∑_{k=1}^∞ c_k². Thus, as the width increases, the layer normalization operator LN(a^ℓ − ā^ℓ) converges to (a^ℓ − c_0)/√(∑_{k=1}^∞ c_k²). Figure <ref type="figure">6</ref> demonstrates that the constant scaling 1/√(∑_{k=1}^∞ c_k²) achieves comparable isometry to layer normalization for hyperbolic tangent, sigmoid, and ReLU, while it is not effective for the exp function. This observation calls for future research on the link between normalization and activation in deep neural networks.</p><p>Figure 6: Comparing layer normalization (blue), x^{ℓ+1} = LN(a^ℓ − c_0), vs. mean-field normalization (orange) in obtaining isometry.</p></div>
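The convergence of the normalization factor to the activation's Gaussian variance is Parseval's identity for the Hermite expansion, var_{z∼N(0,1)}(σ(z)) = ∑_{k≥1} c_k². A quick numerical confirmation for tanh (the quadrature degree and truncation at 40 coefficients are arbitrary choices):

```python
import numpy as np
from numpy.polynomial import hermite_e as He
from math import factorial, sqrt

x, w = He.hermegauss(200)
w = w / np.sqrt(2 * np.pi)            # Gaussian-expectation weights

def coeff(sigma, k):
    # c_k in the normalized probabilist's Hermite basis
    return (w * sigma(x) * He.hermeval(x, [0] * k + [1])).sum() / sqrt(factorial(k))

c = np.array([coeff(np.tanh, k) for k in range(40)])
var = (w * np.tanh(x) ** 2).sum() - c[0] ** 2     # var_{z~N(0,1)} tanh(z)
```

The truncated sum `(c[1:]**2).sum()` matches `var` to high accuracy because the Hermite coefficients of tanh decay rapidly; by symmetry, the even coefficients (including c_0) vanish.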
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Isometry strength correlates with SGD convergence rate in shallow MLPs.</head><p>Besides the direct consequences of our theory, we observe a striking correlation between the convergence of SGD and isometry strength in a specific range of neural network hyper-parameters. Figure <ref type="figure">7</ref> shows that the convergence of SGD is faster for activations with a significantly larger isometry strength β_σ (see Definition 3) in shallow MLPs, e.g., those with 10 layers or fewer. We speculate that this correlation reflects the input-output sensitivity of networks with higher non-linearity. Surprisingly, this correlation does not extend to deeper networks. This discrepancy between shallow and deep networks regarding SGD convergence may be due to the issue of gradient explosion studied by <ref type="bibr">Meterez et al. [2023]</ref>. This finding suggests multiple avenues for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion</head><p>In this study, we explored the influence of layer normalization and non-linear activation functions on the isometry of MLP representations. Our findings open up several avenues for future research.</p><p>Self-normalized activations. It is worth investigating whether we can impose isometry without layer normalization. Our empirical observations suggest that certain activations, such as ReLU, require layer normalization to attain isometry. In contrast, other activations, which can be considered "self-normalizing" (e.g., SeLU <ref type="bibr">[Klambauer et al., 2017]</ref> and hyperbolic tangent), can achieve isometry with only offset and scale adjustments (see Figure <ref type="figure">8</ref>). We experimentally show how centering and normalization can be replaced by leveraging the Hermite expansion of the activation. Thus, we believe the Hermite expansion provides a theoretical grounding to analyze the isometry of SeLU.</p><p>Figure 8: isometry gap −log I(G^ℓ) across depth for relu, tanh, and selu.</p><p>Impact of the ordering of normalization and activation layers on isometry. Theorem 4 highlights that the ordering of activation and normalization layers has a critical impact on isometry. Figure <ref type="figure">8</ref> demonstrates that a different ordering can lead to a non-isotropic Gram matrix. Remarkably, the structure analyzed in this paper is used in transformers <ref type="bibr">[Vaswani et al., 2017]</ref>.</p><p>Normalization's role in stabilizing mean-field accuracy through depth. Numerous theoretical studies conjecture that mean-field predictions may not be reliable for considerably deep neural networks <ref type="bibr">[Li et al., 2021</ref><ref type="bibr">, Joudaki et al., 2023]</ref>. Mean-field analysis incurs an O(1/√width) error per layer when the network width is finite. 
This error may accumulate, making mean-field predictions increasingly inaccurate as depth grows. However, Figure <ref type="figure">9</ref> illustrates that layer normalization controls this error accumulation through depth. This may be attributable to the isometry bias induced by normalization, as proven in Theorem 1. Similarly, batch normalization also prevents error propagation with depth by imposing the same isometry <ref type="bibr">[Joudaki et al., 2023]</ref>. This observation calls for future research on the essential role normalization plays in ensuring the accuracy of mean-field predictions.</p></div>
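The O(1/√width) per-layer error is easy to probe numerically. Below is a minimal sketch (our own illustration, not taken from the paper's released code) using a single linear layer, for which the mean-field prediction of the output Gram matrix is exactly the input Gram matrix; the deviation of the finite-width Gram matrix from it shrinks roughly like 1/√d:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                                     # batch size
X = rng.normal(size=(64, n))              # fixed inputs with 64 features
G0 = X.T @ X / X.shape[0]                 # input Gram matrix = mean-field limit

def gram_error(d, trials=20):
    """Average deviation of the width-d Gram matrix from its mean-field limit."""
    errs = []
    for _ in range(trials):
        # rows of W are i.i.d., so E[(WX)^T (WX)] / d = G0 for a linear layer
        W = rng.normal(size=(d, X.shape[0])) / np.sqrt(X.shape[0])
        Gd = (W @ X).T @ (W @ X) / d
        errs.append(np.linalg.norm(Gd - G0))
    return float(np.mean(errs))

# quadrupling the width should roughly halve the error (1/sqrt(d) scaling)
e_small, e_large = gram_error(256), gram_error(4096)
```

The widths, batch size, and trial count are arbitrary choices for illustration; with an activation in the loop the limit would instead be the dual-activation image of G0.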
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix outline</head><p>The appendix is organized as follows:</p><p>1. Section A details all the proofs, with most of the section dedicated to the proof of Theorem 4, alongside numerical confirmation of the significant steps. • Section A.1 elaborates on the basic properties of isometry.</p><p>• Section A.2 provides a detailed review of the mean-field Gram dynamics.</p><p>• Section A.3 presents the Lyapunov function γ and establishes that this function provides both upper and lower bounds for the isometry, thereby implying that a geometric contraction of γ(G ℓ ) implies a geometric contraction of the isometry gap -log I(G ℓ ). • Section A.4 proves that γ(G ℓ ) exhibits an exponential contraction in depth with rate βσ. 2. Section B presents additional experiments and discussion, prompted by the reviews, that we chose to leave out of the main text.</p><p>• Section B.1 gives additional details concerning the experiments reported in the main text and appendix. • Section B.2 explores the effect of gain on isometry, and links the rate to the associated isometry strength. • Section B.3 explores the effect of varying widths of hidden layers on the isometry. • Section B.4 explores the notion of isometry for representations in language models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Proofs</head><p>A.1 Basic properties of isometry It is straightforward to check that the isometry obeys the following basic properties: Lemma A.1. For a PSD matrix M, the isometry defined in (3) obeys the following properties: 1) scale-invariance, I(cM ) = I(M ) for c > 0; 2) it only takes values in the unit range, I(M ) ∈ [0, 1]; 3) it takes its maximum value if and only if M is a multiple of the identity (equivalently, for unit-diagonal M, I(M ) = 1 ⟺ M = In); and 4) it takes its minimum value if and only if M is degenerate.</p><p>Proof of Lemma A.1. The scale-invariance is trivially true, as scaling M by any positive constant scales det(M )^{1/n} and tr(M )/n by the same amount. The other properties are straightforward consequences of writing the isometry in terms of the eigenvalues, I(M ) = (∏i λi)^{1/n} / ((1/n) ∑i λi), where the λi are the eigenvalues of M. By the arithmetic-geometric mean inequality over the eigenvalues we have (∏i λi)^{1/n} ≤ (1/n) ∑i λi, which proves that I(M ) ∈ [0, 1]. Furthermore, the inequality is tight iff the eigenvalues are all equal, λ1 = ··· = λn, which holds iff M is a multiple of the identity. Finally, the isometry is zero iff at least one eigenvalue is zero, which is exactly the case for a degenerate matrix M.</p></div>
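The properties in Lemma A.1 can be verified numerically. A minimal sketch (the helper name `isometry` is ours), computing I(M) = det(M)^{1/n} / (tr(M)/n) through the eigenvalues:

```python
import numpy as np

def isometry(M):
    """I(M) = det(M)^(1/n) / (tr(M)/n): geometric over arithmetic mean of eigenvalues."""
    lam = np.clip(np.linalg.eigvalsh(M), 0.0, None)   # clip tiny negative round-off
    n = len(lam)
    return float(np.prod(lam) ** (1.0 / n) / lam.mean())

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
M = A @ A.T                                           # a random PSD matrix

assert np.isclose(isometry(3.7 * M), isometry(M))     # 1) scale-invariance
assert 0.0 <= isometry(M) <= 1.0                      # 2) unit range, by AM-GM
assert np.isclose(isometry(np.eye(5)), 1.0)           # 3) maximal at the identity
D = M.copy(); D[-1, :] = D[:, -1] = 0.0               # 4) degenerate matrix ...
assert isometry(D) < 1e-2                              #    ... has (near-)zero isometry
```

The clipping guards against floating-point eigenvalues that are negative only by round-off; it does not change the value for genuinely PSD inputs.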
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Mean-field Gram Dynamics</head><p>Recall the mean-field Gram dynamics stated in equation ( <ref type="formula">10</ref>):</p><p>Assuming that the inputs are encoded as the columns of X, we can restate the MLP dynamics as follows</p><p>where centering and layer normalization are applied column-wise, as defined in the main text. Observe that the Gram matrix of the representations can be written as</p><p>where ⊗ denotes the Hadamard product, and the subscript d emphasizes the dependence of the Gram matrix on the width d. Note that, conditioned on the previous layer, the rows of H ℓ and A ℓ are i.i.d., because the rows of W ℓ are independent. Thus, by the law of large numbers, in the infinitely wide regime µi and si converge to the expected mean and variance, respectively: lim d→∞ µi = E[A^ℓ_{1i}] and lim d→∞ si = Var(A^ℓ_{1i}) for all i = 1, . . . , n. By the construction of ϕ, in the infinitely wide regime we can rewrite the Gram dynamics as</p><p>We can invoke the fact that the rows of A ℓ are i.i.d. to conclude that G d is a sample Gram matrix that converges to its expectation</p><p>where * denotes the mean-field regime d → ∞. This establishes the connection between the mean-field Gram dynamics and the infinitely wide Gram dynamics.</p></div>
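These infinite-width dynamics can be checked against a finite but wide network. Below is a hedged sketch (our own simulation; the width, batch size, depth, and the ReLU choice are arbitrary): for strongly correlated inputs, the isometry gap -log I(G ℓ ) of the column-normalized representations shrinks with depth, consistent with the isometry bias proven in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, depth = 2000, 4, 20                        # width, batch size, depth

def isometry_gap(G):
    """-log I(G) = log of the arithmetic mean minus the mean log of eigenvalues."""
    lam = np.clip(np.linalg.eigvalsh(G), 1e-12, None)
    return float(np.log(lam.mean()) - np.log(lam).mean())

base = rng.normal(size=(d, 1))
X = base + 0.1 * rng.normal(size=(d, n))         # strongly correlated input batch
H = (X - X.mean(axis=0)) / X.std(axis=0)         # column-wise layer norm

gaps = []
for _ in range(depth):
    W = rng.normal(size=(d, d)) / np.sqrt(d)     # mean-field weight scaling
    A = np.maximum(W @ H, 0.0)                   # ReLU activation
    H = (A - A.mean(axis=0)) / A.std(axis=0)     # centering + normalization
    G = H.T @ H / d                              # Gram matrix, unit diagonal
    gaps.append(isometry_gap(G))
```

By construction the normalized columns give G exact unit diagonals, and the recorded `gaps` decrease by orders of magnitude over 20 layers.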
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Introducing a potential</head><p>Here we introduce a Lyapunov function that enables us to precisely quantify the isometry of representations in deep networks: Definition 4. Given a positive semidefinite matrix G ∈ R n×n , we define γ : R n×n → R ≥0 as:</p><p>Remarkably, γ exhibits a geometric contraction under one MLP layer update, as stated in the following theorem: Theorem A.2. Let G be a PSD matrix with unit diagonal, Gii = 1 for i = 1, . . . , n. It holds:</p><p>Thus, we may apply Theorem A.2 iteratively to prove that, in the mean-field regime, the Lyapunov function γ(G ℓ ) decays at an exponential rate βσ. A straightforward induction over layers leads to a decay rate in γ, which is presented in the next corollary. </p><p>where G 0 * denotes the input Gram matrix.</p><p>Interestingly, we can connect the Lyapunov function γ to the isometry by proving upper and lower bounds on the isometry of G in terms of γ(G), when G is PSD and has unit diagonal: Lemma A.4. For a PSD matrix G with unit diagonal, the following holds:</p><p>where the lower bound holds if (n - 1)|Gij| ≤ 1.</p><p>Finally, we have the tools to prove the first main theorem.</p><p>Proof of Theorem A.2. Let a = σ(h) for h ∼ N (0, G). Note that by the definition of the dual activation we have E aa ⊤ = [σ(Gij)] i,j≤n . Since we assumed G has unit diagonal, we have hi ∼ N (0, 1), which implies that E ai = E hi∼N (0,1) σ(hi) = c0, and hence E(ai - E ai)(aj - E aj) = σ(Gij) - c0^2, which is the mean-reduced dual activation σ0(Gij) := σ(Gij) - c0^2. In particular, the variance is Var(ai) = σ0(1) for all i = 1, . . . , n. Thus, we have E ϕ(ai)ϕ(aj) = σ0(Gij)/σ0(1). In matrix form we have</p><p>The remainder of the proof relies on the following contractive property:</p><p>Lemma A.7. 
Consider an activation σ with normalized Hermite coefficients {c k } k≥0 . For all ρ ∈ (0, 1), the mean-reduced dual activation obeys</p><p>where the right-hand side is strictly larger if some nonlinear coefficient is nonzero, i.e., c k ≠ 0 for some k ≥ 2.</p><p>Thus, we can apply Lemma A.7 to each element i ≠ j to conclude that</p><p>Since the inequality holds for every i ≠ j, we can take the maximum over i ≠ j to write:</p><p>which concludes the proof.</p><p>Proof of Lemma A.7. Note that the ratio is invariant to the scaling of ∑ k≥1 c k^2 . Hence, we assume ∑ k≥1 c k^2 = 1 without loss of generality. With this simplification, we have σ(1) = 1. For the positive range ρ ∈ [0, 1] we have</p><p>Thus, for ρ ∈ [0, 1] we have</p><p>By Jensen's inequality for the convex function x ↦ |x| we have</p><p>Code and reproducibility. We implemented our experiments in Python using the PyTorch framework <ref type="bibr">Paszke et al. [2019]</ref>. All the figures are reproducible with the code attached in the supplementary material.</p><p>Training procedure. For all training-related experiments, the isometry or isometry gap is computed per batch on a few randomly sampled batches, and then averaged. Epoch i corresponds to the network after i epochs of training on the CIFAR10 training set (epoch 0 means the network is at initialization).</p><p>Pre-trained large language models. The pre-trained language models and their default configurations were downloaded from the Huggingface library <ref type="bibr">Wolf et al. [2020]</ref>.</p><p>B.2 Quantifying the influence of gain on isometry through non-linearity strength</p><p>The concept of gain in neural networks is important and closely connected with weight initialization. A neural network with properly initialized weights can learn faster, is less likely to get stuck at sub-optimal solutions, and generalizes better. 
The impact of gain can be seen through the lens of weight-initialization strategies such as Xavier initialization <ref type="bibr">[Glorot and Bengio, 2010]</ref>, which has shown significant effectiveness in optimizing neural networks. These initialization strategies apply a gain, i.e., a scaling factor, to the weights to ensure good signal flow through many layers during the forward and backward passes. The gain essentially determines the variance of the weights at initialization.</p><p>As an extension of our prior investigations, we study the influence of gain on isometry through the non-linearity strength βσ as a function of the gain α.</p><p>For certain instances, such as the ReLU, sine, and exponential activations, we can derive βσ(α) in closed form. Table <ref type="table">B</ref>.1 presents a few of these cases.</p><p>Table B.1: Relationship between non-linearity strength βσ(α) and gain α for σ ∈ {exp(αx), sin(αx), max(αx, 0)}.</p><p>Comparison to Xavier gain for initialization. Inspired by the results so far, we can compare mean-field centering and normalization to the Xavier gain for activations. Figure B.3 demonstrates that all mean-field based gains improve the isometry when compared with Xavier initialization. However, the improvement is markedly stronger for ReLU and leaky ReLU. We can explain this starker contrast by the fact that both activations have a significant offset term c0, which is not corrected by the Xavier initialization.</p></div>
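The normalized Hermite coefficients that underlie βσ can be computed numerically. A sketch in the probabilists' convention (the helper `hermite_coeffs` is our own, and we do not restate the paper's closed-form βσ(α) expressions here):

```python
import numpy as np
from math import factorial

def hermite_coeffs(sigma, kmax=5):
    """Normalized Hermite coefficients c_k = E[sigma(g) he_k(g)] for g ~ N(0,1),
    with he_k = He_k / sqrt(k!) the normalized probabilists' Hermite polynomials."""
    x = np.linspace(-8.0, 8.0, 4001)                       # grid for the expectation
    w = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi) * (x[1] - x[0])
    fx = sigma(x)
    return np.array([
        np.sum(w * fx * np.polynomial.hermite_e.hermeval(x, [0.0] * k + [1.0]))
        / np.sqrt(factorial(k))
        for k in range(kmax + 1)
    ])

alpha = 2.0
c = hermite_coeffs(lambda x: np.maximum(alpha * x, 0.0))   # ReLU with gain alpha
# ReLU is 1-homogeneous, so every coefficient scales linearly with the gain:
# c_0 = alpha / sqrt(2*pi), c_1 = alpha / 2
```

For the homogeneous ReLU, changing the gain α rescales all c_k uniformly; for sin(αx) and exp(αx) the coefficients depend on α non-uniformly, which is what Table B.1 captures.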
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Varying width of hidden layers</head><p>While in our theoretical setup we assume the network width is constant across layers, this is only a choice made to streamline our proofs and notation. Since our primary result is derived in the mean-field regime, the only criterion for it to hold is that the width be sufficiently large to approximate the mean-field regime. Our experiments in Figure <ref type="figure">B</ref>.4 substantiate the claim that the specific sizes of the hidden layers, as long as they are large, do not impact our main results on the isometry. We empirically validate this for four different configurations and show that the decay of the isometry gap remains largely consistent across these configurations.</p><p>Figure B.3: Isometry vs. depth for mean-field centering and normalization compared to Xavier initialization.</p><p>Figure B.4: Isometry gap across layers on MNIST and CIFAR10 for hidden dimensions [1000, 1000, 1000, 1000, 1000], [512, 1024, 2048, 4096, 8192], [8192, 4096, 2048, 1024, 512], and [1024, 2048, 1024, 2048, 1024].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 Isometry in pre-trained large language models</head><p>Since our theory for normalization is not limited to initialization, we can expand our search for isometry to other architectures. Figure <ref type="figure">B</ref>.5 shows the important role of normalization in the pre-trained GPT2 network. However, we need to adjust the notion of isometry with the architecture of layer norm in a transformer in mind. In a transformer, the mean and standard deviation are computed over the features, separately for each token. Thus, to adapt the notion of isometry, we can view each token as a sample and define the Gram matrix over different tokens. Isometry here thus quantifies the similarity between the various tokens within one sample. As can be seen in Figure B.5, the LayerNorm layers (shaded in red) in the last six layers of the pre-trained GPT2 increase the isometry between tokens, which is consistent with our theory of layer normalization. Crucially, our theory holds deterministically, and hence extends to the pre-trained model.</p><p>B.5 Tracking isometry at initialization and during optimization for more activations</p><p>There are further numerical experiments related to tracking isometry before and after training.</p><p>Figure B.6: Validation of Corollary 2 and Theorem 4 for multiple activations (Sigmoid, βσ = 1.02; SELU, βσ = 1.03; ELU, βσ = 1.06; Tanh, βσ = 1.07; SiLU, βσ = 1.2; ReLU, βσ = 1.27), at initialization and after epoch 9.</p><p>The following plot shows that the isometry gap remains relatively stable. </p></div></body>
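The token-level adaptation described above can be sketched as follows (on synthetic hidden states rather than actual GPT2 activations; all names are ours). Per-token LayerNorm removes an offset shared across the feature dimension and pushes the token Gram matrix toward the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 4, 512
shared = 3.0 * np.ones(d)                         # large offset shared by all tokens
H = shared + rng.normal(size=(n_tokens, d))       # rows = tokens, columns = features

def isometry(G):
    lam = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    return float(np.prod(lam) ** (1.0 / len(lam)) / lam.mean())

def token_gram(H):
    return H @ H.T / H.shape[1]                   # Gram over tokens, not features

# per-token LayerNorm: statistics over the feature dimension of each token
Hn = (H - H.mean(axis=1, keepdims=True)) / H.std(axis=1, keepdims=True)

before, after = isometry(token_gram(H)), isometry(token_gram(Hn))
# the normalized tokens are far closer to isometric than the raw ones
```

The shared offset and the dimensions are arbitrary choices meant to mimic highly correlated token representations; real transformer LayerNorms additionally apply learned affine parameters, which this sketch omits.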
		</text>
</TEI>
