<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Metric flows with neural networks</title></titleStmt>
			<publicationStmt>
				<publisher>IOP Publishing</publisher>
				<date when="2024-10-23">23 October 2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10555692</idno>
					<idno type="doi">10.1088/2632-2153/ad8533</idno>
					<title level='j'>Machine Learning: Science and Technology</title>
<idno type="issn">2632-2153</idno>
<biblScope unit="volume">5</biblScope>
<biblScope unit="issue">4</biblScope>					

					<author>James Halverson</author><author>Fabian Ruehle</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>We develop a general theory of flows in the space of Riemannian metrics induced by neural network (NN) gradient descent. This is motivated in part by recent advances in approximating Calabi–Yau metrics with NNs and is enabled by recent advances in understanding flows in the space of NNs. We derive the corresponding metric flow equations, which are governed by a metric neural tangent kernel (NTK), a complicated, non-local object that evolves in time. However, many architectures admit an infinite-width limit in which the kernel becomes fixed and the dynamics simplify. Additional assumptions can induce locality in the flow, which allows for the realization of Perelman’s formulation of Ricci flow that was used to resolve the 3d Poincaré conjecture. We demonstrate that such fixed kernel regimes lead to poor learning of numerical Calabi–Yau metrics, as is expected since the associated NNs do not learn features. Conversely, we demonstrate that well-learned numerical metrics at finite-width exhibit an evolving metric-NTK, associated with feature learning. Our theory of NN metric flows therefore explains why NNs are better at learning Calabi–Yau metrics than fixed kernel methods, such as the Ricci flow.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Despite decades of study, there are no explicitly known nontrivial compact Calabi-Yau metrics, objects of central importance in string theory and algebraic geometry.</p><p>The essence of the problem is that theorems by Calabi <ref type="bibr">[1]</ref> and Yau <ref type="bibr">[2,</ref><ref type="bibr">3]</ref> guarantee the existence of a Ricci-flat K&#228;hler metric (Calabi-Yau metric) when certain criteria are satisfied, but Yau's proof is non-constructive. It is not for lack of examples satisfying the criteria, since topological constructions ensure the existence of an exponentially large number of examples <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref>. The problem also does not prevent certain types of progress in string theory, since aspects of Calabi-Yau manifolds can be studied without knowing the metric. For instance, much is known about volumes of calibrated submanifolds <ref type="bibr">[7]</ref>, an artifact of supersymmetry and the existence of BPS objects, as well as metric deformations that preserve Ricci-flatness, the (in)famous moduli spaces <ref type="bibr">[8]</ref>. Nevertheless, the central problem persists: general geometric properties of manifolds, and the associated physics arising from compactification, require knowing the metric.</p><p>For this reason, it is interesting to study approximations of Calabi-Yau metrics. Efforts to do so generally require a sequence of approximations that converge toward the desired metric, which can be thought of as a flow in the space of metrics if the sequence is continuous. A classical example with a discrete sequence is Donaldson's algorithm, which uses a balanced metric to converge to the Calabi-Yau metric <ref type="bibr">[9]</ref>. 
More recently, there has been progress in approximating Calabi-Yau metrics with neural networks (NNs) <ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref>, which is the current state-of-the-art and is a continuous metric flow in the limit of infinitesimal update step size. Thirty minutes on a modern laptop gives an approximate Calabi-Yau metric on par with those that would take a decade to compute with Donaldson's algorithm.</p><p>The success of these NN techniques prompts a number of mathematical questions. What is the mathematical theory underlying flows in the space of metrics induced by continuous NN gradient descent?</p><p>Since empirical results suggest that the flows are converging to the Calabi-Yau metric, how does the flow relate to other flows that have Calabi-Yau metrics as fixed points, such as the Ricci flow?</p><p>The main result of this paper is to develop a theory of metric flows induced by NN gradient descent, utilizing a recent result from ML theory known as the neural tangent kernel (NTK) <ref type="bibr">[16,</ref><ref type="bibr">17]</ref>. We will derive the relevant flow equations and demonstrate that a metric-NTK governs the flow, which in general is non-local and evolves in time. However, many architectures <ref type="bibr">[18]</ref> admit a certain large-parameter limit <ref type="bibr">[16]</ref> in which the dynamics simplify and the metric-NTK becomes constant. Additional choices may be made to induce locality in the flow, and an appropriate choice of loss function reproduces Perelman's formulation of Ricci flow that was utilized to resolve the 3D Poincar&#233; conjecture.</p><p>However, we will see that the assumptions that realize Ricci flow seem both ad hoc and strong from a NN perspective, suggesting that it is very non-generic in the space of NN metric flows. 
We will situate it within the general theory we develop and demonstrate experimentally that related fixed kernel methods lead to suboptimal CY metric learning. Thus, NN metric flows provide a rich generalization of certain flows in the differential geometry literature, and the more general NNs, such as those behind the recent empirical successes, make weaker assumptions and outperform fixed kernel methods.</p><p>We will elaborate on this interplay extensively in the Conclusion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Metric flows with NNs</head><p>In this section we will develop a theory of flows in the space of metrics under the assumption that the metric is represented by a NN g_θ, where θ are the parameters of the NN and a flow in g_θ is induced by a flow in θ.</p><p>For clarity, we summarize the results of this section. We will focus on the case that the parameters θ are updated by gradient descent with respect to a scalar loss functional L, and derive the associated flow equations in a 'time' parameter t. Without making further assumptions, the metric flow is non-local and is given by an integro-differential equation that involves a t-dependent kernel. However, many NN architectures have a hyperparameter N such that the associated kernel becomes deterministic and t-independent in the limit N → ∞. This is a dramatic simplification of the dynamics, and we call such a metric flow an infinite NN metric flow; it is related to more traditional kernel methods, with the fixed kernel determined by the NN architecture. An infinite NN metric flow is still non-local and exhibits a certain type of mixing, but under additional assumptions about the architecture, locality is achieved and mixing is eliminated; we call the result a local NN metric flow. Such a flow is a local gradient flow, which allows metric flows defined as gradient flows to be realized in a NN context. In particular, we show that Perelman's formulation of Ricci flow as a gradient flow <ref type="bibr">[19]</ref> is a local NN metric flow. See figure <ref type="figure">1</ref> for a graphical summary of these various flows. We collect other types of gradient flows tailored towards K&#228;hler metrics in appendix B. 
Being gradient flows of scalar loss functionals, these flows are amenable to an analysis that parallels our discussion of Perelman's Ricci flow; indeed, Perelman's Ricci flow can be written in terms of energy functionals of K&#228;hler-Ricci flow <ref type="bibr">[20]</ref>, see appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">General theory and metric-NTK</head><p>Consider a Riemannian manifold X with metric g, given by g_ij in some local coordinates. In this work we consider the case that g_ij is represented by a deep NN, i.e. it is a function composed of simpler functions, with a set of parameters θ_I, where I = 1, . . . , N; see appendix A. Generally N is large for modern NNs, and in this work we will exploit a different description that emerges in the N → ∞ limit. Throughout, we use lower case Latin indices for the coordinates of X, and capital Latin indices for the network parameters, with sums over repeated parameter indices implied. Written this way, the metric evolves as</p><formula>∂g_ij(x)/∂t = (∂g_ij(x)/∂θ_I) (∂θ_I/∂t).</formula><p>There are many mechanisms for training a NN in the deep learning literature, but one of the most common is gradient descent, in which case the parameter update is</p><formula>∂θ_I/∂t = -∂L[g]/∂θ_I,</formula><p>where L[g] is a scalar loss functional that depends on the metric. The loss sometimes depends on a finite set of points B on X known as the batch, in which case</p><formula>L[g] = Σ_{x′∈B} l[g](x′),</formula><p>where l[g](x′) is a pointwise loss that still depends on the metric. Alternatively, the loss could be over all of X,</p><formula>L[g] = ∫_X dµ(x′) l[g](x′),</formula><p>where dµ(x′) is a chosen measure on X. In typical machine learning applications the training data is fixed and drawn (implicitly) from a fixed data distribution, which in the continuous case corresponds to utilizing a fixed measure; in particular, below we will use a volume form with respect to a fixed reference metric g. Henceforth, we drop the [g] on L and l(x′); it is to be understood that they depend on g. Either type of loss yields additional expressions for the metric flow induced by gradient descent. For the discrete and continuous case, we obtain</p><formula>∂g_ij(x)/∂t = -(∂g_ij(x)/∂θ_I) Σ_{x′∈B} (δl(x′)/δg_kl(x′)) (∂g_kl(x′)/∂θ_I), ∂g_ij(x)/∂t = -(∂g_ij(x)/∂θ_I) ∫_X dµ(x′) (δl(x′)/δg_kl(x′)) (∂g_kl(x′)/∂θ_I),</formula><p>respectively. These expressions are written in a way to emphasize that one of the ∂g/∂θ factors is outside of the sum or integral. 
Pulling them inside the sum or integral, we see the appearance of a distinguished object,</p><formula>Θ_ijkl(x,x′) := (∂g_ij(x)/∂θ_I) (∂g_kl(x′)/∂θ_I),</formula><p>which is a type of NTK that we call the metric-NTK; it is symmetric in the first and last pair of indices.</p><p>In terms of the metric-NTK, the NN metric flows are</p><formula>∂g_ij(x)/∂t = -Σ_{x′∈B} Θ_ijkl(x,x′) δl(x′)/δg_kl(x′), ∂g_ij(x)/∂t = -∫_X dµ(x′) Θ_ijkl(x,x′) δl(x′)/δg_kl(x′), (2.8)</formula><p>for the discrete batch and continuous batch, respectively. We emphasize that the metric-NTK Θ_ijkl is not a 4-tensor, since it depends on both x and x′: the (ij) indices transform with respect to diffeomorphisms of the external point x, and the (kl) indices transform with respect to diffeomorphisms of the integrated point x′, with invariance of metric updates under x′-diffeomorphisms requiring an invariant measure.</p><p>In general, this is a complicated metric flow. Some of the reasons include:</p><p>&#8226; Evolving Kernel. The metric flow is induced by gradient descent, which updates the parameters and therefore the metric-NTK; Θ and δl/δg in (2.8) are time-dependent. &#8226; Non-locality. In general, Θ induces non-local dynamics: it is a smearing function that updates the metric at x according to properties at other points x′, which in principle could be far away from x. &#8226; Component Mixing. Any (k,l) components of the metric may update fixed (i,j) components of the metric via mixing of Θ_ijkl and δl/δg_kl.</p><p>In the continuous case we have an integro-differential equation that is difficult to solve and analyze. We will identify cases in which the situation simplifies.</p></div>
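The structure of the metric-NTK can be probed numerically. The following sketch (our illustration, not code from the paper; all names and sizes are hypothetical) builds a toy metric model that is linear in its parameters, so that ∂g_ij/∂θ_I is available in closed form, and checks the index symmetries of Θ_ijkl stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, P = 3, 50  # manifold dimension and number of parameters (hypothetical sizes)

# toy model linear in its parameters: g_ij(x; theta) = sum_I theta_I cos(w_I . x) S^I_ij,
# so dg_ij(x)/dtheta_I = cos(w_I . x) S^I_ij exactly
w = rng.normal(size=(P, d))
S = rng.normal(size=(P, d, d))
S = (S + np.swapaxes(S, 1, 2)) / 2  # symmetric matrix-valued features

def dg_dtheta(x):
    return np.cos(w @ x)[:, None, None] * S  # shape (P, d, d)

def metric_ntk(x, xp):
    # Theta_ijkl(x, x') = sum_I dg_ij(x)/dtheta_I dg_kl(x')/dtheta_I
    return np.einsum('pij,pkl->ijkl', dg_dtheta(x), dg_dtheta(xp))

x, xp = rng.normal(size=d), rng.normal(size=d)
T = metric_ntk(x, xp)
# symmetric in the first and in the last pair of indices
assert np.allclose(T, np.swapaxes(T, 0, 1))
assert np.allclose(T, np.swapaxes(T, 2, 3))
# exchange symmetry Theta_ijkl(x, x') = Theta_klij(x', x)
assert np.allclose(T, np.transpose(metric_ntk(xp, x), (2, 3, 0, 1)))
```

For a deep NN the gradients would come from automatic differentiation rather than a closed form, but the kernel has the same index structure.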
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Infinite NN metric flows</head><p>In certain limits the metric-NTK enjoys properties that simplify the NN metric flow. We utilize the two central observations from the NTK literature <ref type="bibr">[16,</ref><ref type="bibr">17]</ref> and refer the reader to section 2.3.2 for details in an example; we review only the essentials here. First, for appropriately chosen architecture (including normalization), the metric-NTK in the N → ∞ limit may be interpreted as an expectation value by the law of large numbers, in which case the associated integral over parameters renders it parameter-independent. Schematically, we have</p><formula>Θ̄_ijkl(x,x′) = E_θ[α_ijkl(x,x′)],</formula><p>where the bar over Θ reminds us that the metric-NTK in this limit is an average (over parameters) of a tensor α_ijkl(x,x′) that may be computed in examples. While Θ is stochastic, due to its dependence on the parameters associated to some initial NN draw, Θ̄ is deterministic, depending only on the network architecture and parameter distribution.</p><p>The second simplification occurs for linearized models (linearized in the parameters, not the input x), defined to be</p><formula>g^L_ij(x) = g_ij(x)|_{θ₀} + (θ - θ₀)_I (∂g_ij(x)/∂θ_I)|_{θ₀},</formula><p>i.e. it is just the Taylor expansion in parameters, truncated at linear order. The metric-NTK associated to the linear model is</p><formula>Θ^L_ijkl(x,x′) = Θ_ijkl(x,x′)|_{θ₀}.</formula><p>It is the metric-NTK Θ associated to g, evaluated at initialization θ₀. That is, though Θ evolves in t, Θ^L does not. Taking the N → ∞ limit of the linearized model, its metric-NTK Θ̄^L is both t-independent and θ-independent. Θ̄^L is the so-called 'frozen' NTK. Though linearization may seem like a violent truncation, it has been shown that deep NNs evolve as linear models <ref type="bibr">[17]</ref> in the N → ∞ limit, i.e. Θ̄^L governs their gradient descent dynamics to a controllable approximation. This regime is known as 'lazy learning.' 
For this reason, we henceforth drop the superscript L and write the frozen NTK as Θ̄. In general, any quantity with a bar is frozen, i.e. deterministic and t-independent.</p><p>In summary, in the frozen-NTK limit the general NN metric flow (2.8) becomes</p><formula>∂g_ij(x)/∂t = -Σ_{x′∈B} Θ̄_ijkl(x,x′) δl(x′)/δg_kl(x′), ∂g_ij(x)/∂t = -∫_X dµ(x′) Θ̄_ijkl(x,x′) δl(x′)/δg_kl(x′), (2.12)</formula><p>in the discrete and continuous case; the only difference is that the dynamics are governed by the deterministic t-independent kernel Θ̄ rather than the generally stochastic t-dependent kernel Θ. The dynamics still exhibit non-locality and component mixing, but no longer involve an evolving kernel. We refer to such a flow as an infinite NN metric flow.</p><p>Though the equations have only changed by the introduction of the bar, this is actually a dramatic simplification of the dynamics! Naively, one might think that although the kernel is better behaved, one cannot train an infinite NN because one cannot put it on a computer. This is in fact not true: while initializing an infinite NN and running parameter space gradient descent is indeed impossible, the large-N limit gives a new description of the system, a different duality frame, that allows one to analyze the same dynamics in a different way that does not require explicitly updating an infinite number of parameters. For instance, for mean-squared-error loss the dynamics may be solved exactly, with computable mean and covariance across different initializations <ref type="bibr">[17]</ref>. This is the sort of behavior expected of duality: by describing the same system in a different way, here as kernel regression on function space rather than NN gradient descent in parameter space, new calculations become possible that would be impossible in the original description. In the mentioned MSE case, the calculations amount to the average prediction and covariance of an infinite number of infinitely wide NNs trained to infinite time; clearly this is infeasible in the infinite dimensional parameter space description!</p></div>
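The kernel-regression duality frame can be made concrete in a scalar toy example. The sketch below (our illustration; the Gaussian kernel is merely a stand-in for a frozen NTK, and all data are invented) computes the infinite-time mean prediction f(x) = Θ̄(x, X) Θ̄(X, X)⁻¹ Y of the lazy dynamics under MSE loss, with no explicit parameters anywhere:

```python
import numpy as np

def kbar(X, Xp, sigma=0.3):
    # illustrative frozen (deterministic, t-independent) kernel; Gaussian for concreteness
    return np.exp(-(X[:, None] - Xp[None, :]) ** 2 / (2 * sigma**2))

# training data: samples of a target function on [-1, 1]
Xtr = np.linspace(-1, 1, 10)
Ytr = np.sin(3 * Xtr)

# infinite-time mean prediction of the linearized dynamics under MSE loss:
# f(x) = kbar(x, Xtr) kbar(Xtr, Xtr)^{-1} Ytr  -- pure kernel regression,
# no parameter updates anywhere
alpha = np.linalg.solve(kbar(Xtr, Xtr), Ytr)

def predict(X):
    return kbar(X, Xtr) @ alpha

assert np.allclose(predict(Xtr), Ytr, atol=1e-5)          # interpolates the train set
Xte = np.linspace(-0.95, 0.95, 7)
assert np.max(np.abs(predict(Xte) - np.sin(3 * Xte))) < 0.1  # generalizes nearby
```

This is the "different duality frame" in miniature: the same infinite-time, infinite-width dynamics, answered by a single linear solve.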
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Local metric flows</head><p>Let us discuss how to simplify the flows even further by getting rid of non-locality and component mixing. While such flows are less general, they are of interest because they are more analytically tractable, and we will also see that they recover some famous metric flows.</p><p>To simplify matters, we will consider local metric flows that do not exhibit component mixing, though the latter could be relaxed. Specifically, if the discrete and continuous frozen metric-NTKs satisfy</p><formula>Θ̄_ijkl(x,x′) = δ_ik δ_jl δ_{x,x′} Ω(x), Θ̄_ijkl(x,x′) = δ_ik δ_jl δ(x - x′) Ω(x), (2.13)</formula><p>respectively, for some deterministic function Ω(x), the dynamics becomes</p><formula>∂g_ij(x)/∂t = -Ω(x) δl(x)/δg_ij(x) (no sum over i, j). (2.14)</formula><p>Here δ_{x,x′} is the discrete Kronecker delta and δ(x - x′) is the Dirac delta function. We call such a flow a local metric flow. However, there is a pathology that must be discussed. In deep learning, one typically splits the inputs into disjoint sets containing train inputs x′ and test inputs x, respectively. NN updates are trained on train inputs x′ only, and their generalization to unseen points is tested with the test set inputs x. In the continuous case, there simply is no disjoint test set since we integrate over the entire manifold X. In the discrete case, enforcing delta-functions in the frozen metric-NTK in (2.13) means that only train points are updated and there would be no learning at any point outside the train set. Hence, a non-local metric-NTK is essential to ensuring that the metric gets updated at all manifold points. Thus, the local case (2.14) only makes sense for continuous flows.</p><p>Finally, there is a simple trick for defining a new architecture that gets rid of Ω(x) or reshapes it, if desired. Such a choice would remove the Ω(x) dependence of the architecture, but the locality and component mixing choices of the architecture remain intact. 
We will phrase the trick in terms of a general NN, but apply it in the context of metrics.</p><p>Let ϕ(x) be any network function with associated NTK Θ_ϕ(x,x′). Now multiply ϕ(x) by a deterministic function h(x) to obtain a new network</p><formula>ϕ̃(x) = h(x) ϕ(x). (2.15)</formula><p>Since h(x) is deterministic, ϕ̃ and ϕ have the same parameters and</p><formula>Θ_ϕ̃(x,x′) = h(x) h(x′) Θ_ϕ(x,x′). (2.16)</formula><p>This also gives the same result for frozen NTKs, Θ̄_ϕ̃(x,x′) = h(x) h(x′) Θ̄_ϕ(x,x′). In the case of a local flow, we have</p><formula>Θ̄_ϕ̃(x,x′) = h(x) h(x′) δ(x - x′) Ω(x). (2.17)</formula><p>The δ-function imposes that the two h's are evaluated at the same point, yielding</p><formula>Θ̄_ϕ̃(x,x′) = h(x)² δ(x - x′) Ω(x).</formula><p>This trick is useful because we may use the freedom of h(x) to define a new architecture ϕ̃ whose NTK we can shape, and in particular cancel out potentially unwanted local factors such as Ω(x) in the NTK or g(x′) in the volume form.</p></div>
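The rescaling trick is easy to verify numerically for a model that is linear in its trained parameters, where the NTK is an explicit feature-space inner product. A minimal sketch (our illustration, with hypothetical names and sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
d, P = 3, 40  # input dimension and number of trained parameters (hypothetical)

w = rng.normal(size=(P, d))
b = rng.uniform(0, 2 * np.pi, size=P)

def grad_phi(x):
    # d phi / d a_I for phi(x) = (1/sqrt(P)) sum_I a_I cos(w_I . x + b_I),
    # with only the last-layer weights a_I trained
    return np.cos(w @ x + b) / np.sqrt(P)

def h(x):
    return 1.0 + np.dot(x, x)  # an arbitrary deterministic rescaling function

x, xp = rng.normal(size=d), rng.normal(size=d)
ntk = grad_phi(x) @ grad_phi(xp)                              # Theta_phi(x, x')
ntk_rescaled = (h(x) * grad_phi(x)) @ (h(xp) * grad_phi(xp))  # Theta for h*phi
# rescaling trick: Theta_{h phi}(x, x') = h(x) h(x') Theta_phi(x, x')
assert np.isclose(ntk_rescaled, h(x) * h(xp) * ntk)
```

Since h multiplies every parameter gradient, it simply factors out of the kernel, which is all the trick uses.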
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1.">Architecture design for local metric flow</head><p>We have seen that a metric-NTK satisfying (2.13) induces a local metric flow (2.14) that evolves with a deterministic kernel, a local evolution equation, and without mixing induced by Θ_ijkl and δl/δg_kl. In this section, we study whether NN architectures that satisfy (2.13) actually exist. Specifically, we investigate how the different delta functions may arise in the metric-NTK.</p><p>There is a simple sufficient condition for obtaining the Kronecker deltas δ_ik δ_jl<ref type="foot">foot_0</ref>. Choosing each independent component of the metric g_ij(x) to be a separate NN with parameters θ^(ij), the set of parameters θ for the entire metric is partitioned as</p><formula>θ = ⊔_{i≤j} θ^(ij),</formula><p>and the metric-NTK is</p><formula>Θ_ijkl(x,x′) = δ_ik δ_jl (∂g_ij(x)/∂θ^(ij)_I)(∂g_ij(x′)/∂θ^(ij)_I),</formula><p>with no Einstein summation over lower-case Latin indices on the RHS. The last pair of derivatives is the NTK</p><formula>Θ^(ij)(x,x′) = (∂g_ij(x)/∂θ^(ij)_I)(∂g_ij(x′)/∂θ^(ij)_I)</formula><p>of the component network. Thus, the metric-NTK for a metric with components given by independent NNs evolves according to the NTK of the individual components, as expected. By symmetry, we should choose the independent NNs associated to each component to have identical architecture. Since the architectures for the components are the same, they have the same frozen NTK,</p><formula>Θ̄^(ij)(x,x′) = Θ̄(x,x′) for all i ≤ j,</formula><p>and we have that the frozen metric-NTK is</p><formula>Θ̄_ijkl(x,x′) = δ_ik δ_jl Θ̄(x,x′). (2.23)</formula><p>This gets us part of the way to (2.13); we have the Kronecker deltas that prevent component-mixing.</p><p>We still need the Dirac delta function that induces locality, however. This is a non-trivial step. Given (2.23), we must find an architecture such that</p><formula>Θ̄(x,x′) = δ(x - x′) Ω(x) (2.24)</formula><p>for some Ω(x). 
The δ-function is defined with respect to the measure dµ(x′), which for simplicity we take to be the volume form with respect to a fixed reference metric g on X,</p><formula>dµ(x′) = √|g(x′)| d^d x′,</formula><p>although the same argument applies for a more general probability density; this gives the usual identity</p><formula>∫_X dµ(x′) δ(x - x′) f(x′) = f(x).</formula><p>We take a two-step process to obtain (2.24): we must figure out how to obtain δ_{R^d} inside an NTK, and then how to account for the factor √|g| in the volume measure. Let ϕ_σ(x) with σ ∈ R₊ be a network with frozen NTK Θ̄_σ(x,x′) given by</p><formula>Θ̄_σ(x,x′) = (2πσ²)^{-d/2} e^{-|x-x′|²/(2σ²)} ᾱ(x,x′) (2.27)</formula><p>for some deterministic function ᾱ(x,x′). We emphasize that the semi-locality induced by the Gaussian suppression is not general, and is another assumption that must be made to push towards local flows and (eventually) Perelman's formulation of Ricci flow. Clearly, we get the δ_{R^d} inside the NTK for σ → 0. Using the rescaling trick introduced in the previous section, we may define a new network function that is ϕ_σ(x) multiplied by |g(x)|^{-1/4}. This gives a new NTK that by abuse of notation we again call Θ̄_σ, given by</p><formula>Θ̄_σ(x,x′) = |g(x)|^{-1/4} |g(x′)|^{-1/4} (2πσ²)^{-d/2} e^{-|x-x′|²/(2σ²)} ᾱ(x,x′), (2.28)</formula><p>where the only difference is the g-factors, by design. This new NTK satisfies</p><formula>lim_{σ→0} Θ̄_σ(x,x′) = δ(x - x′) ᾱ(x,x),</formula><p>i.e. it satisfies (2.24) with Ω(x) = ᾱ(x,x), where this δ-function is with respect to the non-trivial volume measure. The parameter σ ∈ R₊ clearly sets a scale of non-locality in the network evolution, which is an interesting object to study. The family has an NTK given by (2.28), which induces a metric flow</p><formula>∂g_ij(x)/∂t = -∫_X dµ(x′) Θ̄_σ(x,x′) δl(x′)/δg_ij(x′). (2.29)</formula><p>Here we see that the width σ of the Gaussian is a parameter controlling the amount of non-locality in the dynamics, since it affects how strongly updates at x′ affect the evolution of the metric at x. 
In the limit σ → 0, the normalized Gaussian becomes the δ-function and the dynamics is</p><formula>∂g_ij(x)/∂t = -Ω(x) δl(x)/δg_ij(x),</formula><p>where Ω(x) = ᾱ(x,x), and we have recovered a local metric flow. Summarizing, we obtain a network with the desired properties as follows:</p><p>&#8226; No component mixing between Θ_ijkl and δl/δg_kl is obtained by taking each metric component to be a NN of the same architecture, but with independent parameters. &#8226; Locality arises from a δ-function as in (2.24), which we achieve by first obtaining a δ_{R^d} in the NTK as the limit of a normalized Gaussian, and then passing to the δ-function on the Riemannian manifold by using the rescaling trick to obtain the geometric factor in the volume form.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.2.">Concrete architectures for local and non-local metric flows</head><p>We have just shown that any family of architectures with parameter σ satisfying (2.27) may be rescaled to give a local metric flow as σ → 0. Since we are considering cases where each metric component is independent, it suffices to consider scalar network functions. Consider a network architecture based on cosine activations, with</p><formula>ϕ(x) = (A/√N) Σ_{i=1}^N a_i cos(w_i · x + b_i),</formula><p>where N is the width of the network and A is a normalization factor that will be fixed later. To simplify the picture, we freeze all weights to their initialization values except the a_i, in which case the NTK is</p><formula>Θ(x,x′) = (A²/N) Σ_{i=1}^N cos(w_i · x + b_i) cos(w_i · x′ + b_i).</formula><p>As we take the width N → ∞, the NTK Θ becomes an expectation value, by the law of large numbers,</p><formula>Θ̄(x,x′) = A² E_{w,b}[cos(w · x + b) cos(w · x′ + b)].</formula><p>Evaluating the expectation, for weights w drawn from a Gaussian of variance σ_w² and biases b drawn uniformly from [0, 2π), gives</p><formula>Θ̄(x,x′) = (A²/2) e^{-σ_w² |x-x′|²/2}.</formula><p>We see that our NTK is a Gaussian, as desired. Normalizing it, so that in an appropriate limit it is a δ-function, we have (with A² = 2 (σ_w²/2π)^{d/2})</p><formula>Θ̄(x,x′) = (2πσ²)^{-d/2} e^{-|x-x′|²/(2σ²)}.</formula><p>With this normalization factor fixed, the NTK is a normalized Gaussian of width σ = 1/σ_w. We obtain a local metric flow by taking the limit σ_w → ∞, which sends σ → 0. The trick of freezing all weights but those of the last (linear) layer is common in the literature. In such a case the NTK is determined (up to a constant) by the so-called NNGP kernel E[ϕ(x)ϕ(x′)], also known as the two-point correlation function. The architecture that we have presented is a simple modification (via so-called NTK parameterization, which gives the 1/√N factor) of an architecture in <ref type="bibr">[21]</ref> that is known to have a Gaussian NNGP kernel. Similarly, the architecture called Gauss-net in <ref type="bibr">[22]</ref> has a Gaussian NNGP kernel, and can be appropriately modified to give a different architecture (based on exponentials in the network function, rather than cosines) that yields a local metric flow.</p></div>
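The large-width computation above can be checked by Monte Carlo. The sketch below (our illustration, one-dimensional inputs and A = 1 for simplicity) samples a width-N cosine network with only the last-layer weights a_i trained and compares its empirical NTK with the Gaussian N → ∞ limit:

```python
import numpy as np

rng = np.random.default_rng(3)
N, sigma_w = 200_000, 2.0  # width and frequency scale (hypothetical values)

w = rng.normal(0.0, sigma_w, size=N)       # frozen first-layer weights
b = rng.uniform(0.0, 2 * np.pi, size=N)    # frozen biases

def ntk(x, xp):
    # finite-width NTK with only the last-layer weights a_i trained:
    # Theta(x, x') = (1/N) sum_i cos(w_i x + b_i) cos(w_i x' + b_i)
    return np.mean(np.cos(w * x + b) * np.cos(w * xp + b))

def ntk_frozen(x, xp):
    # N -> infinity limit: a Gaussian of width sigma = 1/sigma_w
    return 0.5 * np.exp(-sigma_w**2 * (x - xp) ** 2 / 2)

for x, xp in [(0.0, 0.0), (0.3, -0.2), (1.0, 0.5)]:
    assert abs(ntk(x, xp) - ntk_frozen(x, xp)) < 1e-2
```

The agreement follows from cos(A)cos(B) = ½[cos(A-B) + cos(A+B)]: the uniform bias kills the second term, and the Gaussian frequencies give E[cos(w(x-x′))] = exp(-σ_w²(x-x′)²/2).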
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Ricci flow and more general flows as NN metric flows</head><p>Having developed a theory of NN metric flows, we ask: are any canonical metric flows from differential geometry realized as NN metric flows?</p><p>Many well-studied metric flows are local: they are not an integral PDE and take the form</p><formula>∂g_ij(x,t)/∂t = u_ij(x), (2.38)</formula><p>with metric updates depending on the local properties of some rank two symmetric tensor u_ij(x). A particularly well-studied example is Ricci flow,</p><formula>∂g_ij/∂t = -2R_ij. (2.39)</formula><p>In general, flows of the form (2.38) are not necessarily gradient flows. To make a direct connection, one could study NN metric flows that are not gradient flows either, but following a standard assumption in deep learning we studied those NN metric flows that are induced by the gradient of a scalar loss functional. Famously, Perelman showed <ref type="bibr">[19]</ref> that Ricci flow is a gradient flow after applying a t-dependent diffeomorphism (see <ref type="bibr">[23]</ref> for a review). Specifically,</p><formula>∂g_ij/∂t = -2(R_ij + ∇_i∇_j ϕ), (2.40)</formula><p>where</p><formula>F[g,ϕ] = ∫_X dµ e^{-ϕ} (R + |∇ϕ|²) (2.41)</formula><p>is the Perelman functional, (2.40) is equivalent to (2.39) by a t-dependent diffeomorphism, and ϕ is a scalar function that is known as the dilaton in string theory. Using our formulation of local metric flows induced by NN gradient descent, e.g. via the concrete architecture of section 2.3.2, we obtain Perelman's formulation of Ricci flow by simply choosing the loss functional appropriately, L ∝ -F[g,ϕ].</p><p>Similarly, we obtain a parametrically non-local generalization of Ricci flow by backing off of the σ → 0 limit. Doing so replaces the δ-function in the NTK by a Gaussian, and allows nearby points x′ to influence metric updates at x, in a way controllable by tuning σ.</p><p>More generally, any other local metric flow that is a gradient flow may be achieved, together with its non-local generalization for finite σ, by the choice of loss function and a tuning of σ.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Numerical implementation</head><p>In this section we describe the use of these (kernel) methods to approximate CY metrics numerically. We will use the standard test case of the Fermat Quintic to benchmark the methods. The Fermat Quintic is a Calabi-Yau threefold that is given as the anti-canonical hypersurface inside a complex projective space P^4 with homogeneous coordinates z_i by the equation</p><formula>z_0^5 + z_1^5 + z_2^5 + z_3^5 + z_4^5 = 0.</formula><p>A simple NN with three layers, 64 nodes, and ReLU activation function can improve the overall validation loss by a factor of 10-100 when trained for around 50 epochs on 100k points on the Fermat Quintic <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>. This factor serves as a natural benchmark for comparison with our kernel methods.</p><p>The main takeaway is that kernel methods work excellently on the train set, but generalization to test points is challenging. One reason is that there are multiple numerical challenges associated with the implementation. We will describe how we overcome these by implementing feature clustering. We then evaluate performance using ReLU, Gaussian, and semi-local kernels. In all cases, test set performance lags behind that of simple (finite width) NNs. We attribute this lack of performance of frozen metric-NTK methods to the importance of feature learning, which is impossible for frozen NTKs. Indeed, feature learning necessitates that the kernel corresponding to the NN adapts to features, i.e. evolves during training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Clustering and preprocessing</head><p>We study the discrete NTK Θ_ijkl(x_a, x′_b), where i, j, k, l ∈ {1, 2, 3}. The index b always runs over the N′_pts train points, b ∈ {1, . . . , N′_pts}, whereas a runs over the same set during clustering and training, but labels the N_pts test points during prediction / inference, a ∈ {1, . . . , N_pts}. During training, when we use all N′_pts points, this is a 3 × 3 × 3 × 3 × N′_pts × N′_pts tensor, which has almost a trillion entries for N′_pts = 100 000. This is a problem generic to NTK implementations <ref type="bibr">[24]</ref><ref type="bibr">[25]</ref><ref type="bibr">[26]</ref> and means that hardware far beyond academic grade is required for the full computation.</p><p>Instead of computing the full trillion-entry tensor, we cluster the input features. For the problem at hand, it might be reasonable to assume that the metric at some test point x is most strongly correlated with the metric at nearby train points x′. Typically, there is no canonical metric on feature space, so the notion of distance is somewhat arbitrary. In our case, however, the features are actual points on a CY manifold, so a good notion of distance would be the shortest geodesic distance between these points with respect to the canonical (unique, Ricci-flat) metric. In practice, this is unfeasible for multiple reasons: first, we do not yet have the Ricci-flat CY metric, since we are computing it with the NTK. Second, even if we had the metric, computing the geodesic distance between two points requires solving an initial value PDE numerically (which can be done using e.g. shooting methods), which is very computationally costly.</p><p>Hence, we resort to a much simpler proxy which is less accurate but cheap to compute: the shortest geodesic distance between two points in the ambient space Fubini-Study metric. 
The Fubini-Study metric is a canonical K&#228;hler metric on complex projective spaces, see e.g. <ref type="bibr">[27]</ref> for a mathematical introduction to the topic. For two points z and z′ in P^n given in terms of homogeneous coordinates, the geodesic distance with respect to the Fubini-Study metric is</p><formula>d(z, z′) = arccos( |z · z̄′| / (|z| |z′|) ),</formula><p>where the dot product is with respect to the flat metric.</p><p>In order to perform clustering based on distances for N′_pts = 100 000 points, we would need to fit a 100 000 × 100 000 distance matrix into memory, which is also unfeasible. Instead, we implement the following procedure:</p><p>1. Divide the points into mini batches. The size of these should be adapted to the available hardware. We used a size of M = 10 000. 2. Cluster each mini batch independently based on the FS geodesic distance. We use agglomerative clustering with average linkage, as implemented by sklearn <ref type="bibr">[28]</ref>. Again, the total cluster size depends on the available hardware. We aimed for O(5000) points in each cluster, which allowed us to compute the 3 × 3 × 3 × 3 × 5000 × 5000 NTK tensor, so we chose C = N′_pts/5000 clusters.</p><p>At this point, we have B = N′_pts/M batches (which we label by I = 1, 2, . . . , B), each of which has C clusters c^I_α, α = 1, 2, . . . , C. Next, we need to merge the clusters across the different mini batches. To do so, we proceed as follows: 3. Use the clusters of the first mini batch as a starting point. 4. For each cluster in mini batch 2, compute the geodesic distance between all points in this mini batch and the first cluster. The cluster from mini batch 2 whose mean distance over all points in cluster 1 of mini batch 1 is smallest gets merged into cluster 1 of mini batch 1.</p></div>
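The distance computation and the cross-batch merge rule can be sketched as follows (our illustration, not the authors' code; sklearn's agglomerative clustering is replaced here by a hypothetical `merge_batch` helper implementing step 4):

```python
import numpy as np

def fs_distance(z, zp):
    # geodesic distance in the Fubini-Study metric on P^n between two points
    # in homogeneous coordinates; the inner product uses the flat metric
    val = np.abs(np.vdot(z, zp)) / (np.linalg.norm(z) * np.linalg.norm(zp))
    return np.arccos(np.clip(val, 0.0, 1.0))  # clip guards against rounding

def merge_batch(base_clusters, new_clusters):
    # step 4: merge each cluster of a new mini batch into the base cluster
    # with the smallest mean point-to-point FS distance (hypothetical helper)
    for c in new_clusters:
        means = [np.mean([fs_distance(p, q) for p in c for q in bc])
                 for bc in base_clusters]
        base_clusters[int(np.argmin(means))].extend(c)
    return base_clusters

# scale invariance: z and lambda*z are the same point of projective space
z = np.array([1.0, 1.0j, 0.5])
assert np.isclose(fs_distance(z, (2.0 - 1.0j) * z), 0.0)

# orthogonal homogeneous coordinates are maximally distant (pi/2)
e0 = np.array([1.0, 0.0, 0.0], dtype=complex)
e1 = np.array([0.0, 1.0, 0.0], dtype=complex)
assert np.isclose(fs_distance(e0, e1), np.pi / 2)

# a new-batch cluster near e0 merges into the base cluster containing e0
base = [[e0], [e1]]
merged = merge_batch(base, [[np.array([1.0, 0.1, 0.0], dtype=complex)]])
assert len(merged[0]) == 2 and len(merged[1]) == 1
```

The absolute value in the numerator makes the distance independent of the overall complex scaling of the homogeneous coordinates, as it must be on P^n.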
<div xmlns="http://www.tei-c.org/ns/1.0"><p>5. Continue in this way for all clusters in all mini batches.</p><p>This means that the final clusters c_α are given by</p><p>c_α = ⋃_{I=1,...,B} c^I_α. (3.3)</p><p>Based on the clusters, we can compute the NTK updates for each cluster separately, assuming that the contribution from other clusters can be neglected in the updates. While this algorithm is not perfect and depends on the random order of mini batches and clusters, it produces decent results, as can be checked by comparing its output to the result of clustering without batching, which is feasible if the number of points is small enough. To illustrate the performance of the algorithm, we present a heatmap of the distance matrix for 1000 points in figure <ref type="figure">2</ref>. On the left, we see the distance matrix without any clustering.</p><p>In the middle, we show the distance matrix with the points ordered by the clusters of (3.3), using 5 batches. On the right, we show the distance matrix when running the clustering algorithm on all 1000 points (i.e. without batching). To quantify the quality of the clustering algorithm, we also computed the silhouette score <ref type="bibr">[29]</ref> using scikit-learn <ref type="bibr">[28]</ref>. The silhouette score is a number between −1 and 1, with 1 indicating perfect clustering and values around 0 indicating overlapping clusters. We compute the silhouette score for the batched clustering with 5 batches, for the full clustering without batching, and, as a baseline, for Birch clustering based on the Euclidean (rather than Fubini-Study) distance, whose result looks like no clustering to the naked eye. We obtain silhouette scores of 0.016, −0.011, and −0.032 for the three cases, respectively. There are further numerical subtleties that need to be taken into account. First, the input features are the real and imaginary parts of the 5 homogeneous coordinates z_i, i = 0, . . . , 4, of P^4. 
In contrast, the CY metric is naturally expressed as a Hermitian 3 × 3 matrix, g_{ab̄} dx^a(z) ⊗ dx̄^{b̄}(z), a, b = 1, 2, 3. Here, the three complex coordinates x^a are functions of the five homogeneous ambient space coordinates z_i. Typically, one chooses a coordinate system on the CY by going to affine coordinates in a patch of the ambient space (this removes one of the five z_i coordinates through scaling) and eliminating one of the affine coordinates via the hypersurface equation that defines the CY in that patch. While physics of course does not depend on the choice of coordinate system, the metric g_{ab̄} will get modified by the Jacobian of the transformation from one coordinate system to another.</p><p>Since computing the relevant data like the holomorphic 3-form Ω, the pullbacks, and the integration weights is prone to numerical errors, we use this freedom to choose patches and scalings that minimize these error sources. In practice, this amounts to transforming to the patch where the coordinate z_i with the largest absolute value is scaled to one, and then using the CY equation to eliminate the coordinate which results in the smallest value for |Ω|^2. However, this means that there are 20 possible coordinate choices, and the metric at different points will be in different coordinate systems. Hence, updating the matrix g_{ab̄}(y) using the matrix entries of g_{ij̄}(x) does not make sense. One way around this issue is to transform the metric at each point into the same coordinate system. Instead of doing this, we simply sample about 20 times more points than needed and then use rejection sampling to obtain a set of points whose metrics are in the same coordinate system under the above procedure. Another problem is a numerical instability related to the geodesic distance. Note that</p><p>arccos(1 − ε) ≈ √(2ε)</p><p>when 0 &lt; ε ≪ 1. 
This means that points x and y which are numerically identical (meaning ε ∼ 10^−7) would still have a sizable distance (∼0.0003). Since our kernel exponentially suppresses updates with the distance d(x, y), this leads to unwanted behavior during training. To ameliorate this, we set the geodesic distance to zero once it falls below a certain threshold, since we would otherwise only be amplifying numerical noise.</p></div>
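The instability and the cutoff fit in a couple of lines; the threshold value below is an illustrative assumption (the text only says "a certain threshold"):

```python
import math

# For numerically identical points, eps ~ 1e-7, yet
# arccos(1 - eps) ~ sqrt(2 * eps) ~ 4.5e-4 is still sizable.
DIST_THRESHOLD = 5e-4  # assumed cutoff, tuned to the numerical noise floor

def stabilized_distance(d, threshold=DIST_THRESHOLD):
    """Set FS geodesic distances below the threshold to zero, so the
    exponential kernel does not amplify numerical noise."""
    return 0.0 if d < threshold else d
```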
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Metrics from kernels</head><p>In this subsection we summarize a number of techniques that we used to approximate metrics at test points using kernels. We compute the metric updates for a discretized flow of (2.8), where we use first-order backward differentiation to compute the (n + 1)st approximation for the metric using</p><p>g^(n+1)_ij(x′) = g^(n)_ij(x′) − Δt Σ_a Θ_ijkl(x′, x_a) ∂ℓ(x_a)/∂g_kl(x_a), (3.5)</p><p>where we treat the step size Δt as a hyperparameter. For the NTK Θ_ijkl(x, x′), we use different kernels, either computed with the neural-tangents package <ref type="bibr">[24]</ref> for a NN with ReLU activation function, or taken to be a fixed kernel (we study a Gaussian kernel and a 'delta function' or 'nearest neighbor' kernel) that is not necessarily induced by a (simple or obvious) NN architecture. For the loss ℓ(x′), we use the sigma loss defined in (B.11). To determine the hyperparameters (such as the learning rate Δt, the width of the Gaussian for the Gaussian kernel, etc), we use a Python implementation <ref type="bibr">[30]</ref> of a Bayesian optimization scheme <ref type="bibr">[31]</ref>.</p><p>We need to generate a new set of points for each Bayesian optimization run, since we observed that otherwise the optimizer tunes the hyperparameters to the specific point set and shows poor performance on new data.</p><p>For the optimal set of hyperparameters identified by the optimizer, we iterate (3.5) for 50 steps. After each iteration of (3.5), we compute the sigma loss on the test set for the new metric g^(n+1)_ij.</p></div>
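A minimal numpy sketch of one such discretized flow step; the array shapes and the function name are our own choices, but the index contraction and the sum over train points follow the update rule described above:

```python
import numpy as np

def flow_step(g_test, theta, grad_loss, dt):
    """One discrete metric-flow step at the test points:
    g_ij(x') <- g_ij(x') - dt * sum_a theta_ijkl(x', x_a) * dl/dg_kl(x_a).
    g_test:    (3, 3, Np)           metric at Np test points
    theta:     (3, 3, 3, 3, Np, Nt) metric-NTK between test and train points
    grad_loss: (3, 3, Nt)           pointwise loss gradient at Nt train points
    """
    update = np.einsum('ijklab,klb->ija', theta, grad_loss)
    return g_test - dt * update
```

With a trivial "identity" kernel (theta_ijkl(x′, x_a) = δ_ik δ_jl δ_{x′ x_a}), this reduces to plain gradient descent on each metric component separately.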
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">Metrics from a ReLU NTK</head><p>A simple single-layer network with ReLU activation can pick up a factor of 10 to 100 in the total loss. We use this as a baseline for comparison with kernel methods, and specifically study the NTK associated with this ReLU network as computed by the Python package neural-tangents <ref type="bibr">[24]</ref>, which can compute the infinite-width NTK of many architectures.</p></div>
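For a single hidden ReLU layer, the infinite-width NTK has a closed form built from the standard arc-cosine kernels; a self-contained numpy sketch (normalization conventions differ between references, so treat overall factors as indicative rather than as the exact output of neural-tangents):

```python
import numpy as np

def relu_ntk(x, y):
    """Infinite-width NTK of a network with one hidden ReLU layer:
    theta(x, y) = |x||y| k1(u) + (x . y) k0(u), u = cos(angle(x, y)),
    k0(u) = (pi - arccos u) / pi              (derivative arc-cosine kernel),
    k1(u) = (sqrt(1 - u^2) + u (pi - arccos u)) / pi   (arc-cosine kernel)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    u = np.clip(np.dot(x, y) / (nx * ny), -1.0, 1.0)
    t = np.arccos(u)
    k0 = (np.pi - t) / np.pi
    k1 = (np.sqrt(max(0.0, 1.0 - u * u)) + u * (np.pi - t)) / np.pi
    return nx * ny * k1 + np.dot(x, y) * k0
```

At coincident points u = 1, both k0 and k1 equal 1, so theta(x, x) = 2|x|^2 in this normalization.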
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Metrics from a Gaussian kernel</head><p>We also use a simple Gaussian kernel, i.e. we set</p><p>Θ_ijkl(x, x′) = δ_ik δ_jl exp(−d(x, x′)^2 / (2σ^2)),</p><p>where the width σ is a hyperparameter and d(x, x′) is the shortest geodesic distance w.r.t. the ambient space FS metric. We recover the local update limit for σ → 0. In the finite-σ regime, the kernel update at a point x′ is just the sum of the contributions from all training points x, weighted by a Gaussian in their distance to x′. This suppression means that if the distance d is larger than σ, the updates are tiny. For this reason, we include the possibility of normalizing the updates such that they are sizable at least at some points. Overall, we found that the optimizer prefers narrow Gaussians. However, the performance is very sensitive to the hyperparameters and can easily lead to exploding gradients or to metric updates that produce non-positive-definite matrices. When restricting the updates akin to gradient clipping, the performance suffers substantially.</p></div>
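A small numpy sketch of the Gaussian-weighted update at a single test point, including the optional normalization discussed above (shapes and names are our choices):

```python
import numpy as np

def gaussian_kernel_update(d, corrections, sigma, normalize=True):
    """Combine per-train-point metric corrections into an update at one
    test point, weighted by exp(-d^2 / (2 sigma^2)).
    d:           (N,) FS geodesic distances from the test point
    corrections: (N, 3, 3) metric corrections computed at the train points
    normalize:   rescale the weights to sum to one, so that test points
                 far from all train points still receive sizable updates."""
    w = np.exp(-d ** 2 / (2.0 * sigma ** 2))
    if normalize:
        w = w / np.sum(w)
    return np.einsum('n,nij->ij', w, corrections)
```

For a narrow Gaussian, the normalized update degenerates toward the correction of the single nearest train point, which is why the optimizer's preference for small σ motivates the Delta kernel below.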
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3.">Metrics from a Delta kernel</head><p>Given that the Bayesian optimizer prefers narrow Gaussians, we finally study a kernel which simply updates the metric at a test point x &#8242; with the metric correction computed for the closest point x in the training set. We call this the Delta kernel.</p></div>
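The Delta kernel reduces to a nearest-neighbor lookup; a two-line numpy sketch (names ours):

```python
import numpy as np

def delta_kernel_update(d, corrections):
    """Update the metric at a test point with the correction computed
    for the geodesically closest train point.
    d: (N,) distances from the test point; corrections: (N, 3, 3)."""
    return corrections[int(np.argmin(d))]
```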
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.4.">Results and discussion</head><p>With the techniques that we have described, we performed many experiments with different values of the various hyperparameters. Though we were able to obtain significant gains in the train loss, our experiments failed to achieve a gain of more than a factor of &#8764;5 in the test loss. We would like to streamline the presentation of our results to help understand this failure to generalize to test points. To do so, we need to compute the drop in the loss as a function of the geodesic distance between the test and train points. In practice, it requires sophisticated sampling techniques to produce a sample of test points at some fixed geodesic distance from the training set. For our purposes, instead of implementing such sampling, we approximate the same effect by adding Gaussian noise with varying variance to each training point. Hence the test points will only be 'approximately' on the CY. The analysis effectively allows us to interpolate between testing on train points and testing on test points as a function of the variance of the Gaussian noise. The factor gained in the &#963;-loss versus average shortest geodesic distance between noised and sampled points is presented in the left plot of figure <ref type="figure">3</ref>; in the right plot, we show how the maximum, mean, and minimum geodesic distance changes with the total number of sampled points. The experiments were run with hyperparameters given in table <ref type="table">1</ref>.</p><p>From figure <ref type="figure">3</ref> we see a number of results. First, as the geodesic distance approaches zero, i.e. when using the train set as the test set, we see that the Gaussian and Delta kernels achieve a factor of &#8764;10 6 in the &#963;-loss, even though there are only &#8764;1000 train points; the ReLU kernel achieves a factor of &#8764;5. 
However, as noise is added, the test points move farther away from the train points, corresponding to a growing geodesic distance. As the geodesic distance increases, the gain factor falls off, until it is only ∼1 at distance 1. This indicates that when the test points are too far from the train points, the kernel methods fail to generalize. From the right plot in figure <ref type="figure">3</ref>, we see that the mean geodesic distance μ between points (and therefore between train and test points, since both are sampled from the Shiffman-Zelditch measure) is about 0.3 and 0.1 for 10^3 and 10^5 points, respectively. Comparing these distances to the plot on the left in figure <ref type="figure">3</ref>, we see that with geodesic distances in this range we expect the gain factor to be about 5, regardless of whether one trains on 10^3 or 10^5 points. We confirmed this with direct experimentation: at 10^3 and 10^5 points, the best our experiments did on the test loss was about a factor of 5.</p><p>The advantage of the noise variance analysis is that it helps us understand what gains we might hope for in the test loss as we scale up the number of points. From the right plot in figure <ref type="figure">3</ref> we see a linear dependence of the mean geodesic distance on the number of points on a log-log scale. From the best linear fit to the data we obtain</p><p>μ = 0.968 × N_pts^(−0.184).</p><p>Note that on theoretical grounds, one could expect μ to be proportional to (N_pts)^(−1/d) for a point sample on a (real) d-dimensional manifold. In our case, this would lead to an exponent of −0.167, which is close to the fitted value. The prefactor is more complicated, and determining it analytically would require solving a sphere-packing-type problem on the CY with respect to the geodesic distance of the FS metric and the Shiffman-Zelditch distribution. 
We just note that the mean line segment length between points in a 6d unit hypercube is 0.969, which is numerically very close to the prefactor of our fit. Extrapolating, obtaining the gain factor of ∼100 achieved with finite NNs requires a mean geodesic distance of 5 × 10^−3 for the Gaussian kernel, which in turn requires 10^12 points. For the Delta kernel, a gain factor of ∼100 requires a mean geodesic distance of 5 × 10^−2, which in turn requires 10^8 points. Since the finite NN achieves this accuracy by learning from 10^5 points, it is significantly more data efficient than the kernel methods. Obtaining for the test points the gain factor of ∼10^6 that we observe on the training set would require ∼10^35 points.</p><p>Discussion of the Importance of Feature Learning. The fact that the kernel methods fail to generalize well for larger distances is perhaps not too surprising. The NTK is fixed purely by the architecture and the parameter initialization. It is set in stone before training even starts and never changes. This means that, unlike a finite NN, which can dynamically adapt to its inputs and learn useful embeddings for regression, the fixed kernel dictates how results at training points x are combined to form a prediction at test points x′.</p><p>The fixed kernel methods seem at odds with the fact that the Calabi-Yau metric is unique in a fixed Kähler class. If a kernel method is to be used to predict the metric at x′ given nearby metric information, it probably has to be a quite special kernel. Its analytic expression is likely a very complicated function, not a simple Gaussian or the NTK of a ReLU network. In the only case where we know the CY metric, which is the trivial case of a two-torus, we can compute this function. 
For the Fermat Cubic, which is the 1d analog of the Fermat Quintic in (3.1), this function is given in terms of inverse Weierstrass ℘-functions <ref type="bibr">[32]</ref>; it is certainly not reproduced by adding metrics at nearby points weighted by a Gaussian or any other natural guess for a kernel function. This makes the fact that the finite NN performs so well even more impressive.</p></div>
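The extrapolation in this subsection just inverts the fitted scaling law μ = 0.968 × N_pts^(−0.184); a tiny sketch (function names are ours):

```python
# Fitted scaling law for the mean geodesic distance: mu(N) = a * N**(-b)
A, B = 0.968, 0.184

def mean_distance(n_pts, a=A, b=B):
    """Mean FS geodesic distance expected for n_pts sampled points."""
    return a * n_pts ** (-b)

def points_needed(mu_target, a=A, b=B):
    """Invert mu = a * N**(-b): estimated number of sample points needed
    to reach a target mean geodesic distance."""
    return (mu_target / a) ** (-1.0 / b)
```

With these numbers, a target of μ = 5 × 10^−3 (the Gaussian-kernel requirement quoted above) lands at roughly 10^12 points, consistent with the extrapolation in the text.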
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Calabi-Yau metric learning and NTK evolution</head><p>Our numerical results demonstrate that the kernel learning associated with the frozen-NTK infinite-width limit is not sufficient to learn the Calabi-Yau metric with a reasonable number of train points. This is in contrast to the finite-width NNs of previous works, which can learn the metric efficiently, suggesting that feature learning is crucial for learning the metric.</p><p>We can quantify this phenomenon by studying the evolution of the empirical metric-NTK during training: if the metric-NTK is frozen, features are not being learned. To quantify how much the NTK evolves during training, we study the so-called multiplicative model of the cymetric package <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>, which approximates the metric components g_ij using an MLP. We choose an architecture with three hidden layers of 128 neurons each and GELU activation functions, and train the model for 50 epochs. We compute the empirical NTK Θ_ijkl(x, x′) of this model after each epoch for 10 randomly selected (but fixed) train set points x and 10 randomly selected (but fixed) test set points x′. Since a Hermitian d × d matrix has d^2 real degrees of freedom, we combine the indices (ij) of the independent components into a multi-index I and (kl) into a multi-index J, with 1 ⩽ I, J ⩽ 9. The NTK is thus a 9 × 9 × 10 × 10 tensor. As a summary statistic, we compute the eigenvalues of the 10 × 10 matrix for each of the 9 × 9 entries (I, J) and track their evolution (combined over all (I, J)) over the course of training. The results are presented in figure <ref type="figure">4</ref>. We see that the eigenvalue spectrum of the metric-NTK evolves significantly during training, with a clear distribution shift between the spectra at epoch 0 and epoch 50. 
To quantify the shift in the distributions, we track the Wasserstein distance between the spectrum during training and the initial spectrum, which shows significant evolution. Together, these plots demonstrate that the metric-NTK evolves significantly during training, and that feature learning is crucial for the success of the NN in learning the metric.</p></div>
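The summary statistic can be sketched in a few lines of numpy; for equal-size samples, the 1d Wasserstein-1 distance reduces to the mean absolute difference of sorted values (in practice one could equally use scipy.stats.wasserstein_distance). Shapes and names are our choices:

```python
import numpy as np

def spectrum(ntk):
    """Combined eigenvalue spectrum of a metric-NTK of shape (9, 9, P, P):
    eigenvalues of the P x P matrix for each entry (I, J), pooled.
    The blocks need not be symmetric, so we keep the real parts."""
    eigs = []
    for I in range(ntk.shape[0]):
        for J in range(ntk.shape[1]):
            eigs.extend(np.linalg.eigvals(ntk[I, J]).real)
    return np.array(eigs)

def w1_distance(a, b):
    """Wasserstein-1 distance between two equal-size 1d samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))
```

Tracking w1_distance(spectrum(ntk_epoch_n), spectrum(ntk_epoch_0)) over epochs then measures how far the kernel has drifted from its initialization.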
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions</head><p>We develop a theory for metric flows induced by NN gradient descent. In general, the flow of a metric described by a NN is a complicated function of the hyperparameters and of the parameters at initialization, and it evolves over the course of the training process. In the infinite-width limit, this complicated training process becomes tractable and can even be described analytically in some cases.</p><p>We illustrated how NNs can be engineered to reproduce flows with specific properties, such as Perelman's Ricci flow. The concept generalizes to other kinds of flows that are gradient flows of some energy functional, as reviewed in appendix B. This allows one to engineer metric flows that are well studied in the mathematics literature in terms of NNs. Conversely, one can ask which type of flow a NN with a specific choice of architecture induces. Beyond that, one can study flows from kernels that are not the NTK of any known NN, leading to more general kernel methods.</p><p>Numerically, we encounter multiple challenges. The most important one is that kernel methods require huge matrices that exceed the computational resources available to most users. To ameliorate this, we develop a batched feature clustering algorithm based on the geodesic distance in feature space. In general, there is no canonical metric on feature space, or even a natural choice of coordinates for the input data manifold. However, for the case at hand, the input data manifold is just the manifold for which we want to study the metric flow, so there is a canonical choice of coordinates and a reference metric in this coordinate system, with respect to which geodesic distances can be computed.</p><p>For the test case of the Fermat Quintic, we observe that the metric flow on the training set leads to almost Ricci-flat metrics. However, these results fail to generalize well for the kernels we tested. 
These include both NTKs and other kernels that are not necessarily derived from a NN. We attribute the fact that the results are worse than those obtained from a simple finite NN to the importance of feature learning. Kernel methods where the kernel is fixed during training cannot learn features. Conversely, we showed that finite NNs that learn metrics well have metric-NTKs that evolve significantly in time.</p><p>The failure of infinite-width kernel methods compared to the performance of their finite counterparts is likely a consequence of the Calabi-Yau theorem: since the Calabi-Yau metric is unique, kernels that make correct predictions for the metric at x′ given nearby metric information must be very special. A priori, such kernels would have nothing to do with the kernels associated with randomly initialized infinitely wide NNs. Instead, for NN kernel methods to work, the kernel should be learned so that it falls into this distinguished subclass. This is precisely what the finite NNs achieve.</p><p>We end with some discussion of the main points. We have developed a theory of metric flows induced by NN gradient descent, which is the main result.</p><p>However, the reader may find that the assumptions that lead to a realization of Perelman's Ricci flow, namely the infinite-width limit, locality, and elimination of component mixing, are both ad hoc and strong. Indeed, this is the point: this rather famous metric flow arises as one instance in the much broader context afforded by NN metric flows. Thus, the analytic tractability that was crucial for proving the Poincaré conjecture may correlate with specificity that is suboptimal for learning Calabi-Yau metrics efficiently. Conversely, the recent empirical successes of NN approximations of Calabi-Yau metrics do not make any of the assumptions that led to Perelman's Ricci flow (or other fixed kernel methods). 
The relationship between these good CY metric flows and Ricci flow is depicted in figure <ref type="figure">1</ref>, and raises the related question: are the infinite-width techniques also good for learning Calabi-Yau metrics?</p><p>Our results show that the answer is no, as might also have been expected on ML theory grounds (though there are some regimes in which kernel methods compete well with NNs <ref type="bibr">[33]</ref>). Specifically, we conduct experiments to approximate Calabi-Yau metrics with kernel methods of a similar type to those required for Ricci flow, and find that with a fixed number of points sampled on the Calabi-Yau, finite NNs lead to far better performance. The key point is that fixed kernels do not evolve and hence do not learn features. A direct numerical analysis <ref type="bibr">[12]</ref> with Ricci flow also does not perform as well (beyond the torus) as finite NN approaches. To make the converse point, we perform a finite NN CY metric experiment that demonstrates the evolution of its metric-NTK.</p><p>In summary: we develop a general theory of metric flows induced by NN gradient descent, and show that Ricci flow can arise in this context, but only after making a number of very strong assumptions that lead to learning with a fixed kernel. This suggests that the power of NNs to learn Calabi-Yau metrics rests on their ability to learn features, explaining recent successes in the field.</p><p>Gradient descent (GD) evolves the parameters θ of a NN f according to</p><p>θ̇(t) = −∇_θ L,</p><p>where L is a scalar loss functional that depends on f, and therefore on the parameters θ. We have written the continuum time equation, but in practice discrete time steps are taken on a computer. Typically L is a sum of an element-wise loss ℓ,</p><p>L = Σ_{x ∈ B} ℓ(x),</p><p>where B is a set of data points known as the batch. A simple variant of GD is stochastic GD (SGD), which at a fixed time step t during training applies GD to a random (hence, stochastic) proper subset B_t ⊂ B known as the mini-batch. 
While GD gets stuck when &#952; is at a critical point of the loss function, SGD often escapes the critical point due to the stochastic nature of the mini-batch.</p><p>Clearly this brief introduction only scratches the surface, but it should aid in reading this paper. For the reader interested in a thorough introduction to NNs, we recommend <ref type="bibr">[40,</ref><ref type="bibr">41]</ref>.</p></div>
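The GD/SGD step described above can be sketched in a few lines of numpy (the function names and the toy element-wise loss in the usage note are our own):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(theta, grad_l, batch, lr, mini_batch_size):
    """One SGD step: apply GD to a random mini batch B_t of the batch B,
    using the mean of the element-wise loss gradients."""
    idx = rng.choice(len(batch), size=mini_batch_size, replace=False)
    grad = np.mean([grad_l(theta, batch[i]) for i in idx], axis=0)
    return theta - lr * grad
```

For the element-wise loss ℓ(θ, x) = (θ − x)^2 the gradient is 2(θ − x), and repeated steps drive θ toward the data; the random choice of mini batch is what lets SGD escape critical points where full-batch GD would stall.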
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix B. K&#228;hler-Ricci flows for string theory</head><p>It was recognized early on that a Ricci-flat manifold with external Minkowski space and a constant dilaton provides a solution to the supergravity equations of motion of string theory <ref type="bibr">[42]</ref>. Hence, Perelman's or Hamilton's Ricci flow can be used to find consistent background metrics for string compactifications. If there is no additional structure on the compactification space, and we are just interested in finding a Ricci-flat metric in a given topological class (e.g. for the case of M-theory compactifications on G_2 manifolds), this seems like a promising avenue.</p><p>In the case of compactifications on Calabi-Yau manifolds, we can make use of the fact that the metric is K&#228;hler, which means it can be obtained by taking a holomorphic and an anti-holomorphic derivative of a real quantity K called the K&#228;hler potential<ref type="foot">5</ref>,</p><p>g_{ab̄} = ∂_a ∂̄_b̄ K,</p><p>subject to the constraint that the metric is positive definite. We can use this metric to define a closed (1,1)-form, the so-called K&#228;hler form J, via</p><p>J = i g_{ab̄} dx^a ∧ dx̄^b̄.</p><p>Since the Ricci tensor of a K&#228;hler metric satisfies R_{ab̄} = −∂_a ∂̄_b̄ log det g, we immediately see that the metric update for K&#228;hler-Ricci flow is ∂- and ∂̄-exact, and therefore the K&#228;hler class is fixed under K&#228;hler-Ricci flow. We can further exploit the fact that the metric and all quantities derived from it are controlled by K, and formulate the metric flow at the level of the K&#228;hler potential, i.e. on the space of K&#228;hler potential corrections φ for which J_0 + i∂∂̄φ is positive.<ref type="foot">foot_2</ref></p><p>Here J_0 is a reference K&#228;hler form (derived from a reference K&#228;hler metric) which specifies the K&#228;hler class of the metric we are interested in and is assumed to be constant throughout the flow. This means that the metric flow of</p><p>g_{ab̄}(t) = g^0_{ab̄} + ∂_a ∂̄_b̄ φ(t)</p><p>is induced by the flow of the K&#228;hler potential correction φ(t). 
This allows us to recast the problem of finding a Ricci-flat metric into the problem of solving a partial differential equation of Monge-Amp&#232;re type, which is what Yau used to prove Calabi's conjecture <ref type="bibr">[2,</ref><ref type="bibr">3]</ref>. On the level of the flow, this leads to a parabolic flow equation in φ.</p><p>Just as in the case of the Perelman functional, whose functional variation gives rise to Hamilton's flow (after a time-dependent diffeomorphism), there are several related functionals for flows in φ: the Mabuchi energy functional <ref type="bibr">[43]</ref>, the J-functional <ref type="bibr">[44,</ref><ref type="bibr">45]</ref>, and the Calabi functional <ref type="bibr">[46]</ref>. The J-functional is the second term of the Mabuchi energy functional, and the Calabi functional is the square of the derivative of the J-functional <ref type="bibr">[47]</ref>. We refer the reader to <ref type="bibr">[48]</ref><ref type="bibr">[49]</ref><ref type="bibr">[50]</ref> for more in-depth discussions of these quantities, and to <ref type="bibr">[20]</ref> for a nice overview and application to CY metrics.</p><p>The Calabi functional reads</p><p>Ca = ∫_X R^2 J^n,</p><p>where R is the scalar curvature and J^n is the integral measure. The φ-flow relevant for us has already been discussed in <ref type="bibr">[20]</ref>, so we merely review it here following their exposition. As relevant background for utilizing their results, recall that a trivial anti-canonical bundle means that there exists a nowhere-vanishing holomorphic top form Ω. In contrast to the Ricci-flat metric, analytic expressions for this (n, 0)-form are known and can be computed straightforwardly for Calabi-Yau manifolds that are given as complete intersections in toric ambient spaces <ref type="bibr">[51,</ref><ref type="bibr">52]</ref>. From Ω, one can construct a volume form μ_CY = (−i)^n Ω ∧ Ω̄. 
Since h^{n,n} = 1 on a Calabi-Yau, this volume form is uniquely defined up to a constant. This means that μ_J = J^n/n!, which is also an (n, n)-form, must be proportional to |Ω|^2,</p><p>J^n = κ |Ω|^2. (B.10)</p><p>In the literature, it is customary to define the sigma loss</p><p>σ = (1/Vol_CY) ∫_X |1 − η| μ_CY, with η = J^n/(κ |Ω|^2), (B.11)</p><p>and to flow φ so as to minimize it. This flow ends on the CY metric, for which η = 1.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0"><p>These Kronecker deltas amount to a choice of gauge that is not preserved under diffeomorphisms in x and x′ on the metric-NTK.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1"><p>The K&#228;hler potential is only determined up to K&#228;hler transformations, which means that K is a section of a line bundle.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>We want to point out some potential points of confusion that arise from using standard symbols for several quantities. The quantity φ is not the same as the dilaton ϕ. Likewise, the holomorphic top form Ω that will appear later is not related to the Ω appearing in the NTK kernel, and the loss function σ (which is unrelated to a NN activation) is not related to any Gaussian width or noise variance.</p></note>
		</body>
		</text>
</TEI>
