<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Operator Learning Using Random Features: A Tool for Scientific Computing</title></titleStmt>
			<publicationStmt>
				<publisher>Society for Industrial and Applied Mathematics</publisher>
				<date>08/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10548358</idno>
					<idno type="doi">10.1137/24M1648703</idno>
					<title level='j'>SIAM Review</title>
<idno>0036-1445</idno>
<biblScope unit="volume">66</biblScope>
<biblScope unit="issue">3</biblScope>					

					<author>Nicholas H Nelsen</author><author>Andrew M Stuart</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Supervised operator learning centers on the use of training data, in the form of input-output pairs, to estimate maps between infinite-dimensional spaces. It is emerging as apowerful tool to complement traditional scientific computing, which may often be framedin terms of operators mapping between spaces of functions. Building on the classical ran-dom features methodology for scalar regression, this paper introduces the function-valuedrandom features method. This leads to a supervised operator learning architecture thatis practical for nonlinear problems yet is structured enough to facilitate efficient trainingthrough the optimization of a convex, quadratic cost. Due to the quadratic structure, thetrained model is equipped with convergence guarantees and error and complexity bounds,properties that are not readily available for most other operator learning architectures. Atits core, the proposed approach builds a linear combination of random operators. Thisturns out to be a low-rank approximation of an operator-valued kernel ridge regression al-gorithm, and hence the method also has strong connections to Gaussian process regression.The paper designs function-valued random features that are tailored to the structure oftwo nonlinear operator learning benchmark problems arising from parametric partial differ-ential equations. Numerical results demonstrate the scalability, discretization invariance,and transferability of the function-valued random features method.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>for the ROMs. Other popular surrogate modeling techniques include Gaussian processes <ref type="bibr">[148]</ref>, polynomial chaos expansions <ref type="bibr">[135]</ref>, and radial basis functions <ref type="bibr">[146]</ref>, yet these are only practically suitable for scalar-valued maps with input space of low to moderate dimension, unless strong assumptions are placed on the problem. Classical numerical methods for PDEs may also represent the discretized forward model as a map R K \rightar R K , where K is the resolution, albeit implicitly in the form of a computer code (e.g., finite element, finite difference, finite volume methods). However, the approximation error is sensitive to K and repeated evaluations of this forward model often become cost prohibitive due to poor scaling with input dimension K.</p><p>Operator Learning. Many earlier attempts to build cheap-to-evaluate surrogate models for PDEs display sensitivity to discretization. There is a suite of work on data-driven discretizations of PDEs that allows for identification of the governing system <ref type="bibr">[8,</ref><ref type="bibr">20,</ref><ref type="bibr">100,</ref><ref type="bibr">119,</ref><ref type="bibr">137,</ref><ref type="bibr">142]</ref>. However, we note that only the operators appearing in the underlying equation itself are approximated with these approaches, not the solution operator of the PDE; the focus in these works is mostly on model discovery rather than model acceleration. More in line with the theme of the present paper, architectures based on deep convolutional NNs have proven to be quite successful for learning elliptic PDE solution operators. For example, see <ref type="bibr">[55,</ref><ref type="bibr">143,</ref><ref type="bibr">149,</ref><ref type="bibr">152]</ref>, which take an image-to-image regression approach. Other NNs have been used in similar elliptic problems for quantity of interest prediction <ref type="bibr">[72,</ref><ref type="bibr">81]</ref>, error estimation <ref type="bibr">[29]</ref>, or unsupervised learning <ref type="bibr">[95]</ref>, and for parametric PDEs more generally <ref type="bibr">[60,</ref><ref type="bibr">88,</ref><ref type="bibr">117,</ref><ref type="bibr">133</ref>]. Yet in most of the preceding approaches, the architectures and resulting error are dependent on the mesh resolution. To circumvent this issue, the surrogate map must be well defined on function space and independent of any finite-dimensional realization of the map that arises from discretization. This is not a new idea (see <ref type="bibr">[31,</ref><ref type="bibr">106,</ref><ref type="bibr">128]</ref> or, for functional data analysis, <ref type="bibr">[77,</ref><ref type="bibr">107,</ref><ref type="bibr">126]</ref>). The aforementioned reduced basis method is an example, as is the method of <ref type="bibr">[33,</ref><ref type="bibr">34]</ref>, which approximates the solution map with sparse Taylor polynomials and achieves optimal convergence rates in idealized settings. Early work in the use of NNs to learn the solution operator, or vector field, defining ODEs and time-dependent PDEs may be found from the 1990s <ref type="bibr">[31,</ref><ref type="bibr">64,</ref><ref type="bibr">106,</ref><ref type="bibr">127]</ref>. However, only recently have practical machine learning methods been designed to directly operate in infinite dimensions.</p><p>Several implementable operator learning architectures were developed concurrently <ref type="bibr">[1,</ref><ref type="bibr">19,</ref><ref type="bibr">97,</ref><ref type="bibr">101,</ref><ref type="bibr">111,</ref><ref type="bibr">113,</ref><ref type="bibr">150]</ref>. These include the DeepONet <ref type="bibr">[101]</ref>, which generalizes and makes practical the main idea in <ref type="bibr">[31]</ref>, PCA-Net <ref type="bibr">[19]</ref>, and the RFM from the original version of the present paper <ref type="bibr">[111]</ref>. These were followed by neural operators <ref type="bibr">[87,</ref><ref type="bibr">97]</ref> and, in particular, the Fourier neural operator <ref type="bibr">[96]</ref>. Details for and comparisons among these architectures are given in <ref type="bibr">[86, sect. 3]</ref>. Apart from the RFM, what these methods---which we collectively call ``neural operators""---share is a deep learning backbone. The approximation theory of such neural operators is fairly well developed <ref type="bibr">[69,</ref><ref type="bibr">72,</ref><ref type="bibr">83,</ref><ref type="bibr">85,</ref><ref type="bibr">87,</ref><ref type="bibr">89,</ref><ref type="bibr">90,</ref><ref type="bibr">91,</ref><ref type="bibr">93]</ref>. It includes qualitative universal approximation, i.e., density, results as well as quantitative parameter complexity bounds, that is, the number of NN parameters required to achieve accuracy \varepsi . The paper <ref type="bibr">[93]</ref> reveals a ``curse of parametric complexity"" in which the parameter complexity is shown to be exponentially large in powers of \varepsi - 1 to approximate general Lipschitz continuous operators. This aligns with the findings of older work <ref type="bibr">[106]</ref> and suggests that efficient neural operator learning is not possible without further assumptions. It turns out that the curse is lifted if enough regularity is assumed. For example, for linear or holomorphic target operators, efficient algebraic approximation rates may be Downloaded 10/11/24 to 131.215. <ref type="bibr">72.225</ref> . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> established <ref type="bibr">[2,</ref><ref type="bibr">69]</ref>. However, what rates are possible for sets of operators ``in between"" holomorphic and Lipschitz operators is still an open question.</p><p>One of the simplest classes of operators is that of linear operators. There is a substantial body of work in this setting ranging from the learning of general linear operators <ref type="bibr">[39,</ref><ref type="bibr">76,</ref><ref type="bibr">109,</ref><ref type="bibr">136]</ref> to estimating the Green's function of specific linear PDEs <ref type="bibr">[62,</ref><ref type="bibr">21,</ref><ref type="bibr">23,</ref><ref type="bibr">132]</ref> and Koopman operators <ref type="bibr">[84]</ref>. The linear setting allows for very thorough and sharp statistical analysis that leads to deep insights about the data efficiency of operator learning in terms of problem structure <ref type="bibr">[21,</ref><ref type="bibr">39,</ref><ref type="bibr">72]</ref>. Some sample complexity results have been obtained for nonlinear functionals and operators, which give the training dataset size needed to obtain \varepsi accuracy. Most of this theory depends on kernels, either in an RKHS framework <ref type="bibr">[27,</ref><ref type="bibr">92]</ref> or via local averaging (e.g., kernel smoothers) <ref type="bibr">[53,</ref><ref type="bibr">114]</ref>. Error bounds are obtained for encoder--decoder neural operators such as DeepONet and PCA-Net in <ref type="bibr">[99]</ref>. These results imply a ``curse of sample complexity,"" i.e., exponentially large sample sizes, for learning Lipschitz operators. Similar to the parameter complexity case, with enough regularity assumed on the operators of interest, as expressed through weighted tensor product structure, operator holomorphy, or analyticity, for example, minimax lower bounds can return to better behaved algebraic rates in the sample size <ref type="bibr">[3,</ref><ref type="bibr">27,</ref><ref type="bibr">73,</ref><ref type="bibr">74]</ref>. Moreover, there exist both constructive and nonconstructive estimators that achieve these algebraic convergence rates for operator learning <ref type="bibr">[2,</ref><ref type="bibr">27,</ref><ref type="bibr">92]</ref>.</p><p>Random Features. The RFM as a mapping between finite-dimensional spaces was formalized in the series of papers <ref type="bibr">[122,</ref><ref type="bibr">123,</ref><ref type="bibr">124]</ref>, building on earlier work in <ref type="bibr">[11,</ref><ref type="bibr">110,</ref><ref type="bibr">147]</ref>. The RFM is in some sense the simplest possible machine learning model; it may be viewed as an ensemble average of randomly parametrized functions: an expansion in a randomized basis with trainable coefficients. The method of random Fourier features is the most mainstream instantiation of the approach <ref type="bibr">[122]</ref>. Here, the RFM is used to approximate popular translation-invariant kernels by averages of sinusoidal functions with random frequencies. This approximation is then used downstream for kernel regression tasks <ref type="bibr">[67]</ref>. An equivalent viewpoint is that the RFM approximates the Gaussian process prior distribution in a Gaussian process regression method <ref type="bibr">[148]</ref>. However, the choice of random feature map can be much more general than just random sines and cosines. These random features could be defined, for example, by randomizing the internal parameters of an NN. Many papers take this viewpoint <ref type="bibr">[63,</ref><ref type="bibr">105,</ref><ref type="bibr">123,</ref><ref type="bibr">124]</ref>. Compared to NN emulators with enormous learnable parameter counts (e.g., O(10 5 ) to O(10 7 ); see <ref type="bibr">[51,</ref><ref type="bibr">52,</ref><ref type="bibr">95]</ref>) and methods that are intrusive or lead to nontrivial implementations <ref type="bibr">[33,</ref><ref type="bibr">94,</ref><ref type="bibr">131]</ref>, the RFM is one of the simplest models to formulate and train. Often O(10 4 ) or fewer linear expansion coefficients---which are the only free parameters in the RFM---suffice.</p><p>The theory of the RFM for real-valued outputs is well developed, partly due to its close connection to kernel methods <ref type="bibr">[7,</ref><ref type="bibr">26,</ref><ref type="bibr">75,</ref><ref type="bibr">122,</ref><ref type="bibr">146]</ref> and Gaussian processes <ref type="bibr">[110,</ref><ref type="bibr">147]</ref>, and it includes generalization bounds and dimension-free rates <ref type="bibr">[92,</ref><ref type="bibr">98,</ref><ref type="bibr">48,</ref><ref type="bibr">123,</ref><ref type="bibr">129,</ref><ref type="bibr">139,</ref><ref type="bibr">140]</ref>. A quadrature viewpoint on the RFM provides further insight and leads to Monte Carlo sampling ideas <ref type="bibr">[7]</ref>. As in modern deep learning practice, for some problem classes the RFM has been shown to perform well even when overparametrized <ref type="bibr">[14,</ref><ref type="bibr">48,</ref><ref type="bibr">105]</ref>. However, overparametrization is not necessary for good performance; state-of-the-art fast rates are established in the underparametrized regime by <ref type="bibr">[98,</ref><ref type="bibr">129]</ref>. The paper <ref type="bibr">[63]</ref> derives similar bounds for random NN approximation of functionals with a random feature-based training strategy.</p><p>For the supervised operator learning setting in which inputs and outputs are both Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> infinite-dimensional, kernel <ref type="bibr">[27,</ref><ref type="bibr">77]</ref> and Gaussian process methods <ref type="bibr">[12]</ref>---and hence random features---are less explored. The paper <ref type="bibr">[115]</ref> performs nonlinear operator learning in the encoder--decoder paradigm, where the input and output spaces are represented by truncated orthonormal bases and the finite-dimensional coefficient-tocoefficient mapping is performed with a kernel smoother. The kernel smoother is then approximated with random Fourier features <ref type="bibr">[122]</ref>. A similar idea uses a Gaussian process perspective <ref type="bibr">[12]</ref>. For high-dimensional input parameter spaces, the authors of <ref type="bibr">[65,</ref><ref type="bibr">80]</ref> analyze nonparametric kernel regression for parametric PDEs with real-valued solution map outputs. However, the preceding methods have poor computational scalability w.r.t. data dimension and sample size. The RFM alleviates these issues with randomization and efficient convex optimization. The specific random Fourier feature approach of Rahimi and Recht <ref type="bibr">[122]</ref> was generalized in <ref type="bibr">[24]</ref> to the finite-dimensional matrix-valued kernel setting with vector-valued random Fourier features and to the operator-valued kernel setting in <ref type="bibr">[108]</ref>. However, canonical operatorvalued kernels are hard to define and the preceding works require explicit knowledge of these kernels. Our viewpoint in the current paper is to develop function-valued random features and work directly with them as a standalone supervised learning method, choosing them for their properties and noting that they implicitly define a kernel, but not working directly with that kernel. An additional benefit of our approach is that it avoids the nonconvex training routines that plague more sophisticated neural operator architectures and, in particular, hinder the development of uncertainty quantification and comprehensive complexity bounds. The key idea underlying our methodology is to formulate the proposed operator random features algorithm on infinite-dimensional space and only discretize it at implementation time. This philosophy in algorithm development has been instructive in a number of areas in scientific computing, as we describe next.</p><p>Other Continuum Algorithms. The general philosophy of designing algorithms at the continuum level has been hugely successful across disciplines. In PDE-constrained optimization, there is the ``optimize-then-discretize"" principle <ref type="bibr">[71]</ref>. In applied probability, there are Markov chain Monte Carlo algorithms for sampling probability distributions supported on function spaces <ref type="bibr">[36]</ref>. The Bayesian formulation of inverse problems on Banach spaces provides another example <ref type="bibr">[138]</ref>. Work along similar lines extends numerical linear algebra routines for finite-dimensional vectors and matrices to new ones for infinite-dimensional functions and linear operators <ref type="bibr">[144,</ref><ref type="bibr">145]</ref>. All such methods inherit certain dimension-independent properties that make them more robust and possibly more accurate. Operator learning brings this powerful perspective to machine learning, where it has been promoted as a way of designing and analyzing learning algorithms <ref type="bibr">[66,</ref><ref type="bibr">47,</ref><ref type="bibr">130,</ref><ref type="bibr">44,</ref><ref type="bibr">45]</ref>. Our work may be understood within this general framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Contributions.</head><p>Our primary contributions in this paper are now listed. (C1) We develop the RFM, directly formulated on the function space level, for learning nonlinear operators between Banach spaces purely from data. As a method for parametric PDEs, the methodology is nonintrusive but also has the additional advantage that it may be used in settings where only data is available and no model is known. (C2) We show that our proposed method is more computationally tractable to both train and evaluate than standard kernel methods in infinite dimensions. Furthermore, we show that the method is equivalent to kernel ridge regression performed in a finite-dimensional space spanned by random features and Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> comes equipped with a full convergence theory. (C3) We apply our operator learning methodology to learn the semigroup defined by the solution operator for the viscous Burgers equation and the coefficientto-solution operator for the Darcy flow equation. (C4) We perform numerical experiments that demonstrate two mesh-independent approximation properties that are built into the proposed methodology: invariance of relative error to mesh resolution and evaluation ability on any mesh resolution. The remainder of this paper is structured as follows. In section 2, we communicate the mathematical framework required to work with the RFM in infinite dimensions, identify an appropriate approximation space, explain the training procedure, and review recent error bounds for the method. We introduce two instantiations of random feature maps that target physical science applications in section 3 and detail the corresponding numerical results for these applications in section 4. We conclude in section 5 with a summary and directions for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology.</head><p>In this work, the overarching problem of interest is the approximation of a map F \dagger : \scrX \rightar \scrY , where \scrX and \scrY are infinite-dimensional spaces of real-valued functions defined on some bounded open subset of R d , and F \dagger is defined by a \mapsto \rightar F \dagger (a) := u, where u \in \scrY is the solution of a (possibly time-dependent) PDE and a \in \scrX is an input function required to make the problem well-posed. Our proposed approach for this approximation, constructing a surrogate map F for the true map F \dagger , is data-driven, nonintrusive, and based on least squares. Least squares-based methods are integral to the random feature methodology as proposed in low dimensions <ref type="bibr">[122,</ref><ref type="bibr">123]</ref> and generalized here to the infinite-dimensional setting. They have also been shown to work well in other algorithms for high-dimensional numerical approximation <ref type="bibr">[18,</ref><ref type="bibr">35,</ref><ref type="bibr">42]</ref>. Within the broader scope of reduced order modeling techniques <ref type="bibr">[15]</ref>, the approach we adopt in this paper falls within the class of data-fit emulators. Essentially, our method approximates the solution manifold \scrM = \bigl\{ u \in \scrY : u = F \dagger (a) and a \in \scrX \bigr\} (2.1) on average. The solution map F \dagger , often being the inverse of a differential operator, is usually smoothing and admits some notion of compactness. Then, the idea is that \scrM should have some compact, low-dimensional structure or intrinsic dimension. However, actually finding a model F that exploits this structure despite the high dimensionality of the truth map F \dagger is quite difficult. Further, the effectiveness of many model reduction techniques, such as those based on the reduced basis method, are dependent on inherent properties of the map F \dagger itself (e.g., analyticity), which in turn may influence the decay rate of the Kolmogorov width of the manifold \scrM <ref type="bibr">[34]</ref>. While such subtleties of approximation theory are crucial to developing rigorous theory and provably convergent algorithms <ref type="bibr">[86]</ref>, we choose to work in the nonintrusive setting where knowledge of the map F \dagger and its associated PDE are only obtained through measurement data, and hence detailed characterizations such as those mentioned previously are essentially unavailable. Thus, we emphasize that our proposed operator learning methodology is applicable to general continuum problems with function space data, not just to PDEs.</p><p>The remainder of this section introduces the mathematical preliminaries for our methodology. With the goal of operator approximation in mind, in subsection 2.1 we formulate a supervised learning problem in an infinite-dimensional setting. We provide the necessary background on RKHSs in subsection 2.2 and then define the Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> RFM in subsection 2.3. In subsection 2.4, we describe the optimization principle that leads to implementable algorithms for the RFM and an example problem in which \scrX and \scrY are one-dimensional vector spaces. We finish by providing two convergence results for trained function-valued RFMs in subsection 2.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Problem Formulation.</head><p>Let \scrX and \scrY be real Banach spaces and F \dagger : \scrX \rightar \scrY be a (possibly nonlinear) map. It is natural to frame the approximation of F \dagger as a supervised learning problem. Suppose we are given training data in the form of input-output pairs \{ (a i , y i )\} n i=1 \subset \scrX \times \scrY , where a i \sim \nu are independent and identically distributed (i.i.d.), \nu is a probability measure supported on \scrX , and y i is given by F \dagger (a i ) \sim F \dagger \sharp \nu plus, potentially, noise. In the examples in this paper, the noise is viewed as resulting from model error (the PDE does not perfectly represent the physics) or from discretization error (in approximating the PDE); situations in which the data acquisition process is inherently noisy can also be envisioned <ref type="bibr">[92]</ref> but are not explicitly studied here. We aim to build a parametric reconstruction of the true map F \dagger from the data; that is, construct a model F : \scrX \times \scrP \rightar \scrY and find \alpha \dagger \in \scrP \subsete R m such that F ( \cdot , \alpha \dagger ) \approx F \dagger are close as maps from \scrX to \scrY in some suitable sense. The natural number m here denotes the total number of model parameters.</p><p>The standard approach to determining parameters in supervised learning is to define a loss functional \ell : \scrY \times \scrY \rightar R \geq 0 and minimize the expected risk, min \alpha \in \scrP</p><p>E a\sim \nu \Bigl[ \ell \bigl( F \dagger (a), F (a, \alpha ) \bigr) \Bigr] . (2.2) With only the data \{ (a i , y i )\} n i=1 at our disposal, we approximate problem (2.2) by replacing \nu with the empirical measure \nu (n) := 1 n \sum n i=1 \delta ai , which leads to the empirical risk minimization problem min \alpha \in \scrP 1 n n \sum i=1 \ell \bigl( y i , F (a i , \alpha ) \bigr) . (2.3)</p><p>The hope is that given minimizer \alpha (n) of (2.3) and \alpha \dagger of (2.2), F ( \cdot , \alpha (n) ) well approximates F ( \cdot , \alpha \dagger ), that is, the learned model generalizes well; these ideas may be made rigorous with results from statistical learning theory <ref type="bibr">[68]</ref>. Solving problem (2.3) is called training the model F . Once trained, the model is then validated on a new set of i.i.d. input-output pairs previously unseen during the training process. This testing phase indicates how well F approximates F \dagger . In what follows, we assume that (\scrY , \langle \cdot , \cdot \rangle \scrY , \| \cdot \| \scrY ) is a real separable Hilbert space and focus on the squared loss \ell (y, y \prime ) := 1 2 \| yy \prime \| 2 \scrY . (2.4)</p><p>We stress that our entire formulation is in an infinite-dimensional setting and we will remain in this setting throughout the paper; as such, the random feature methodology we propose will inherit desirable discretization-invariant properties, to be observed in the numerical experiments of section 4.</p><p>Notation 2.1 (expectation). For a Borel measurable map G : \scrU \rightar \scrV between two Banach spaces \scrU and \scrV and a probability measure \pi supported on \scrU , we denote the expectation of G under \pi by</p><p>in the sense of Bochner integration (see, e.g., <ref type="bibr">[38,</ref> sect. A.2]). Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> 2.2. Operator-Valued Reproducing Kernels.</p><p>The RFM is naturally formulated in an RKHS setting, as our exposition will demonstrate in subsection 2.3. However, the usual RKHS theory is concerned with real-valued functions <ref type="bibr">[5,</ref><ref type="bibr">16,</ref><ref type="bibr">37,</ref><ref type="bibr">146]</ref>. Our setting, with the output space \scrY a separable Hilbert space, requires several ideas that generalize the real-valued case. We now outline these ideas with a review of operator-valued kernels; parts of the presentation that follow may be found in the references <ref type="bibr">[7,</ref><ref type="bibr">28,</ref><ref type="bibr">107,</ref><ref type="bibr">112]</ref>.</p><p>We first consider the special case \scrY := R for ease of exposition. A real RKHS is a Hilbert space (\scrH , \langle \cdot , \cdot \rangle \scrH , \| \cdot \| \scrH ) comprising real-valued functions f : \scrX \rightar R such that the pointwise evaluation functional f \mapsto \rightar f (a) is bounded for every a \in \scrX . It then follows that there exists a unique, symmetric, positive definite kernel function k : \scrX \times \scrX \rightar R such that for every a \in \scrX , we have k(\cdot , a) \in \scrH and the reproducing kernel property f (a) = \langle k(\cdot , a), f \rangle \scrH holds. These two properties are often taken as the definition of an RKHS. The converse direction is also true: every symmetric, positive definite kernel defines a unique RKHS <ref type="bibr">[5]</ref>.</p><p>We now introduce the necessary generalization of the reproducing property to the case of arbitrary real Hilbert spaces \scrY , as this result will motivate the construction of the RFM. Kernels in this setting are now operator-valued. Definition 2.2 (operator-valued kernel). Let \scrX be a real Banach space and \scrY a real separable Hilbert space. An operator-valued kernel is a map k : \scrX \times \scrX \rightar \scrL (\scrY ) , <ref type="bibr">(2.6)</ref> where \scrL (\scrY ) denotes the Banach space of all bounded linear operators on \scrY , such that its adjoint satisfies k(a, a \prime ) \ast = k(a \prime , a) for all a and a \prime in \scrX and, for every N \in N, for all pairs \{ (a i , y i )\} N i=1 \subset \scrX \times \scrY . Paralleling the development for the real-valued case, an operator-valued kernel k also uniquely (up to isomorphism) determines an associated real RKHS \scrH k = \scrH k (\scrX ; \scrY ) of operators mapping \scrX to \scrY . Now, choosing a probability measure \nu supported on \scrX , we define a kernel integral operator T k associated to k by</p><p>which is nonnegative, self-adjoint, and compact (provided k(a, a) \in \scrL (\scrY ) is compact for all a \in \scrX <ref type="bibr">[28]</ref>). Let us further assume that all conditions needed for T ). Generalizing the standard Mercer theory (see, e.g., <ref type="bibr">[7,</ref><ref type="bibr">16]</ref>), we may write the RKHS inner product as</p><p>Note that while (2.9) appears to depend on the measure \nu on \scrX , the set \scrH k itself is determined by the kernel without any reference to a measure <ref type="bibr">[37,</ref><ref type="bibr">Chap. 3,</ref><ref type="bibr">Thm. 4]</ref>. With the inner product now explicit, it is possible to deduce the following reproducing property of the operator-valued kernel k [111, sect. The identity (2.10), paired with a special choice of operator-valued kernel k, is the basis of the RFM in our abstract infinite-dimensional setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Random Feature Model.</head><p>One could approach the approximation of target map F \dagger : \scrX \rightar \scrY from the perspective of kernel methods. However, it is generally a difficult task to explicitly design operator-valued kernels of the form (2.6) since the spaces \scrX and \scrY may be of different regularity, for example. Example constructions of operator-valued kernels studied in the literature include those taking value as diagonal operators, multiplication operators, or composition operators <ref type="bibr">[77,</ref><ref type="bibr">107,</ref><ref type="bibr">116]</ref>, but these all involve some simple generalization of scalar-valued kernels or strong assumptions about \scrY . Instead, the RFM allows one to implicitly work with fully general operator-valued kernels through the use of a random feature map \varphi : \scrX \times \Theta \rightar \scrY and a probability measure \mu supported on Banach space \Theta . The map \varphi is assumed to be square integrable w.r.t. the product measure \nu \times \mu , i.e., \varphi \in L 2 \nu \times \mu (\scrX \times \Theta ; \scrY ), where \nu is the (sometimes a modeling choice at our discretion, sometimes unknown) data distribution on \scrX . Together, (\varphi , \mu ) form a random feature pair. With this setup in place, we now describe the connection between random features and kernels. To this end, recall the following standard notation. Notation 2.4 (outer product). Given a Hilbert space (H, \langle \cdot , \cdot \rangle , \| \cdot \| ), the outer product a \otimes b \in \scrL (H) is defined by (a \otimes b)c = \langle b, c\rangle a for any a, b, and c \in H. Such representations need not be unique; different pairs (\varphi , \mu ) may induce the same kernel k = k \mu in <ref type="bibr">(2.11)</ref>. Since k \mu may readily be shown to be an operator-valued kernel via Definition 2.2, it defines a unique real RKHS \scrH k\mu \subset L 2 \nu (\scrX ; \scrY ). Our methodology will be based on this space and, in particular, finite-dimensional approximations thereof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1.">An Intractable Nonparametric Model</head><p>We now perform a purely formal but instructive calculation, following from application of the reproducing property (2.10) to operator-valued kernels of the form <ref type="bibr">(2.11)</ref>. Doing so leads to an integral representation of any F \in \scrH k\mu . For all a \in \scrX and y \in \scrY , it holds that</p><p>\sim \mu \bigl[ \langle \varphi (a; \theta ), y\rangle \scrY \varphi ( \cdot ; \theta ) \bigr] , F \Bigr\ra \scrH k\mu = E \theta \sim \mu \Bigl[ \langle \varphi (a; \theta ), y\rangle \scrY \langle \varphi ( \cdot ; \theta ), F \rangle \scrH k\mu \Bigr] = E \theta \sim \mu \bigl[ c F (\theta )\langle y, \varphi (a; \theta )\rangle \scrY \bigr] = \Bigl\la y, E \theta \sim \mu \bigl[ c F (\theta )\varphi (a; \theta ) \bigr] \Bigr\ra \scrY , where the coefficient function c F : \Theta \rightar R is defined by \theta \mapsto \rightar c F (\theta ) := \langle \varphi ( \cdot ; \theta ), F \rangle \scrH k\mu . (2.12) Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>org/terms-privacy</head><p>Since \scrY is Hilbert, the above holding for all y \in \scrY implies the integral representation</p><p>The formal expression (2.12) for c F (\theta ) needs careful interpretation, which is provided in <ref type="bibr">[111,</ref><ref type="bibr">App. B]</ref> of the original version of this paper. For instance, if \varphi ( \cdot ; \theta ) is chosen to be a realization of a Gaussian process (as seen later in Example 2.9), then \varphi ( \cdot ; \theta ) / \in \scrH k\mu with probability 1; indeed, in this case c F is defined only as an L 2 \mu (\Theta ; R) limit. Nonetheless, the RKHS may be completely characterized by this integral representation. Define the map</p><p>The map \scrA may be shown to be a bounded linear operator that is a particular square root of T k\mu from ( <ref type="formula">2</ref>.8) [111, App. B]. We have the following result whose proof, provided in the original version of this paper [111, App. A], is a straightforward generalization of the real-valued case given in [7, sect. 2.2]. Result 2.5 (infinite-dimensional RKHS). Under the assumption that the feature map \varphi satisfies \varphi \in L 2 \nu \times \mu (\scrX \times \Theta ; \scrY ), the RKHS defined by the kernel k \mu in (2.11) is</p><p>We stress that the integral representation of mappings in RKHS (2.15) is not unique, since \scrA is not injective in general. However, the particular choice c = c F (2.12) in representation (2.13) does enjoy a sense of uniqueness as described in <ref type="bibr">[111,</ref><ref type="bibr">App. B]</ref>. In particular, the L 2 \mu (\Theta ; R) norm of c F equals the \scrH k\mu norm of F . The formula (2.15) suggests that \scrH k\mu , which is built from (\varphi , \mu ) and completely determined by coefficient functionals c \in L 2 \mu (\Theta ; R), is a natural nonparametric class of operators with which to perform approximation. However, the actual implementation of estimators based on the model class \scrH k\mu is known to incur enormous computational cost without further assumptions on the structure of (\varphi , \mu ), as we discuss later in this section. Instead, we next adopt a parametric approximation to this full RKHS approach. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.2.">A Tractable Parametric Model</head><p>\varphi (a; \theta j ) \otimes \varphi (a \prime ; \theta j ) . (2.17)</p><p>We then let \scrH k (m) be the unique RKHS induced by the kernel k (m) ; note that k (m) and hence \scrH k (m) are themselves random. The following characterization of \scrH k (m) is proved in the original version of this paper <ref type="bibr">[111,</ref> App. A]. Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> Result 2.6 (finite-dimensional RKHS). Assume that \varphi \in L 2 \nu \times \mu (\scrX \times \Theta ; \scrY ) and that the random features \{ \varphi ( \cdot ; \theta j )\} m j=1 are linearly independent in L 2 \nu (\scrX ; \scrY ). Then the RKHS \scrH k (m) is equal to the linear span of \{ \varphi ( \cdot ; \theta j )\} m j=1 . Applying a simple Monte Carlo sampling approach to elements in RKHS (2.15) by replacing probability measure \mu by empirical measure \mu (m) gives the intuition that 1 m m \sum j=1 c(\theta j )\varphi ( \cdot ; \theta j ) \approx E \theta \sim \mu \bigl[ c(\theta )\varphi ( \cdot ; \theta ) \bigr] for c \in L 2 \mu (\Theta ; R) . (2.18)</p><p>This low-rank approximation achieves the Monte Carlo rate O(m - 1/2 ) in expectation and, by virtue of Result 2.6, is in \scrH k (m) . However, in the setting of this work, the Monte Carlo approach does not give rise to a practical method for learning a target map F \dagger \in \scrH k\mu because F \dagger , k \mu , and \scrH k\mu are all unknown; only the random feature pair (\varphi , \mu ) is assumed to be given. Hence one cannot apply (2.12) or [111, eq. (B.2), p. A3239] to evaluate c = c F \dagger in (2.18). Furthermore, in realistic settings it may be that F \dagger \not \in \scrH k\mu , which leads to an additional smoothness misspecification gap not accounted for by the Monte Carlo method. To sidestep these difficulties, the RFM adopts a data-driven optimization approach to determine a different estimator of F \dagger , also from the space \scrH k (m) . We now define the RFM. Definition 2.7 (RFM). Given probability spaces (\scrX , \scrB (\scrX ), \nu ) and (\Theta , \scrB (\Theta ), \mu ), with \scrX and \Theta being real finite-or infinite-dimensional Banach spaces, a real separable Hilbert space \scrY , and \varphi \in L 2 \nu \times \mu (\scrX \times \Theta ; \scrY ), the RFM is the parametric map We use the Borel \sigma -algebras \scrB (\scrX ) and \scrB (\Theta ) to define the probability spaces in the preceding definition. Our goal with the RFM is to choose parameters \alpha \in R m to approximate mappings F \dagger \in \scrH k\mu (in the well-specified setting) by mappings F m (\cdot ; \alpha ) \in \scrH k (m) . The RFM is itself random and may be viewed as a spectral method because the randomized family \{ \varphi ( \cdot ; \theta )\} in the linear expansion <ref type="bibr">(2.19</ref>) is defined \nualmost everywhere on \scrX . Determining the coefficient vector \alpha from data obviates the difficulties associated with the oracle Monte Carlo approach because the datadriven method only requires knowledge of the pair (\varphi , \mu ) and knowledge of sample input-output pairs from target operator F \dagger .</p><p>As written, (2. <ref type="bibr">19</ref>) is incredibly simple. The operator F m is nonlinear in its input a but linear in its coefficient parameters \alpha . In practice, the linearity w.r.t. the RFM parameters is broken by also learning hyperparameters that appear in the pair (\varphi , \mu ). Moreover, similar to operator learning architectures such as neural operators <ref type="bibr">[87]</ref> and Fourier neural operators <ref type="bibr">[96]</ref>, the RFM is a nonlinear approximation. This means that the output F m (a; \alpha ) of the RFM belongs to a nonlinear manifold in \scrY (cf. (2.1)) instead of a fixed linear subspace of \scrY . In contrast, methods such as PCA-Net <ref type="bibr">[19]</ref> and DeepONet <ref type="bibr">[101]</ref> are restricted to such fixed linear spaces, which may limit their approximation power for specific classes of problems. More theory is required to quantitatively separate these two classes of approximation method.</p><p>Overall, it is clear that the choice of random feature map and measure pair (\varphi , \mu ) will determine the quality of approximation. Most papers deploying these methods, Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> including <ref type="bibr">[24,</ref><ref type="bibr">122,</ref><ref type="bibr">123]</ref>, take a kernel-oriented perspective by first choosing a kernel and then finding a random feature map to estimate this kernel. Our perspective, more aligned with <ref type="bibr">[124,</ref><ref type="bibr">140]</ref>, is the opposite in that we allow the choice of random feature map \varphi and distribution \mu to implicitly define the kernel via the formula (2.11), instead of picking the kernel first. This viewpoint also has implications for numerics: the kernel never explicitly appears in any computations, which leads to memory and other cost savings. It does, however, leave open the question of characterizing the universality <ref type="bibr">[140]</ref> of such kernels and the RKHS \scrH k\mu of mappings from \scrX to \scrY that underlies the approximation method; this is an important avenue for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.3.">Connection to Neural Networks and Neural</head><p>Operators. The close connection to kernels explains the origins of the RFM in the machine learning literature. Moreover, the RFM may also be interpreted in the context of NNs <ref type="bibr">[110,</ref><ref type="bibr">140,</ref><ref type="bibr">147,</ref><ref type="bibr">151]</ref>. To see this explicitly, consider the setting in which \scrX and \scrY are both equal to the Euclidean space R and choose \varphi to be a family of hidden neurons of the form \varphi NN (a; \theta ) := \sigma (\theta (1) \cdot a + \theta (2) ), where \sigma ( \cdot ) is a nonlinear activation function. A single hidden layer NN would seek to find \{ (\alpha j , \theta</p><p>matches the given training data \{ (a i , y i )\} n i=1 \subset \scrX \times \scrY . More generally, and in arbitrary Euclidean spaces, one may allow \varphi NN ( \cdot ; \theta ) to be any deep NN. However, while the RFM has the same form as (2.20), there is a difference in the training: the \theta j are drawn i.i.d. from a probability measure and then fixed, and only the \alpha j are chosen to fit the training data. This idea immediately transfers to the operator learning setting in which \scrX and \scrY are function spaces and the maps \varphi NN ( \cdot ; \theta ) are themselves randomly initialized deep neural operators or DeepONets. Given any deep NN with randomly initialized parameters, studies of the lazy training regime and neural tangent kernel <ref type="bibr">[26,</ref><ref type="bibr">75]</ref> suggest that adopting an RFM approach and only optimizing over the last layer weights \alpha is quite natural. Indeed, it is observed that in this regime the internal NN parameters do not stray far from their random initialization during gradient descent, while the last layer of parameters \{ \alpha j \} m j=1 adapts considerably. Once the feature parameters \{ \theta j \} m j=1 are sampled at random and fixed, training the RFM F m only requires optimizing over \alpha \in R m . Due to the linearity of F m in \alpha , this is a straightforward task that we now describe.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Optimization.</head><p>One of the most attractive characteristics of the RFM is its training procedure. With the L 2 -type loss (2.4) as in standard regression settings, optimizing the coefficients of the RFM w.r.t. the empirical risk (2.3) is a convex optimization problem, requiring only the solution of a finite-dimensional system of linear equations; the convexity also suggests the possibility of appending convex constraints (such as linear inequalities), although we do not pursue this here. Further, the kernels k \mu or k (m) are not required anywhere in the algorithm. We emphasize the simplicity of the underlying optimization tasks as they suggest the possibility of numerical implementation of the RFM in complicated black-box computer codes. This is in contrast with other methods such as deep neural operators, which are trained with variants of stochastic gradient descent. Such a training strategy leads to nonconvexity that is notoriously difficult to study, both computationally and theoretically.</p><p>We now show that a regularized version of the quadratic optimization problem (2.3)--(2.4) arises naturally from approximation of a nonparametric regression Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> problem defined over the RKHS \scrH k\mu . To this end, recall the supervised learning formulation in subsection 2.1. Given n i.i.d. input-output pairs \{ (a i , y i )\} n i=1 \subset \scrX \times \scrY as data, with the a i drawn from (possibly unknown) probability measure \nu on \scrX and y i = F \dagger (a i ), the objective is to find an approximation \widehat F to the map F \dagger . Let \scrH k\mu be the hypothesis space and k \mu its operator-valued reproducing kernel of the form <ref type="bibr">(2.11)</ref>. The most straightforward learning algorithm in this RKHS setting is kernel ridge regression, also known as penalized least squares. This method produces a nonparametric model by finding a minimizer \widehat F of min</p><p>where \lambda \geq 0 is a penalty parameter. By the representer theorem for operator-valued kernels [107, Thms. 2 and 4], the minimizer has the form</p><p>for some functions \{ \beta i \} n i=1 \subset \scrY . In practice, finding these n functions in the output space requires solving a block linear operator equation. For the high-dimensional PDE problems we consider in this work, solving such an equation may become prohibitively expensive due to both operation count and memory required. A few workarounds were proposed in <ref type="bibr">[77]</ref> such as certain diagonalizations, but these rely on simplifying assumptions that are somewhat limiting. More fundamentally, the representation of the solution in (2.22) requires knowledge of the kernel k \mu ; in our setting we assume access only to the random feature pair (\varphi , \mu ), which defines k \mu and not k \mu itself.</p><p>We thus explain how to make progress with this problem given knowledge only of random features. Recall the empirical kernel given by (2.17), the RKHS \scrH k (m) , and Result 2.6. The following result, proved in <ref type="bibr">[111,</ref> App. A], shows that an RFM hypothesis class with a penalized least squares empirical loss function in optimization problem (2.3)--(2.4) is equivalent to kernel ridge regression (2.21) restricted to \scrH k (m) . Result 2.8 (random feature ridge regression is equivalent to a kernel method). Assume that \varphi \in L 2 \nu \times \mu (\scrX \times \Theta ; \scrY ) and that the random features \{ \varphi ( \cdot ; \theta j )\} m j=1 are linearly independent in L 2 \nu (\scrX ; \scrY ). Fix \lambda \geq 0. Let \widehat \alpha \in R m be the unique minimum norm solution of min \alpha \in R m \Biggl\{ n \sum i=1 1 2 \bigm\| \bigm\| \bigm\| \bigm\| y i -1 m m \sum l=1 \alpha l \varphi (a i ; \theta l ) \bigm\| \bigm\| \bigm\| \bigm\| 2 \scrY + \lambda 2m \| \alpha \| 2 2 \Biggr\} . (2.23) Then the RFM defined by this choice of \alpha = \widehat \alpha satisfies F m ( \cdot ; \widehat \alpha ) = argmin F \in \scrH k (m) \Biggl\{ n \sum i=1 1 2 \bigm\| \bigm\| y i -F (a i ) \bigm\| \bigm\| 2 \scrY + \lambda 2 \bigm\| \bigm\| F \bigm\| \bigm\| 2 \scrH k (m) \Biggr\} . (2.24) Solving the convex problem (2.23) trains the RFM. The first order condition for a global minimizer leads to the normal equations m \sum j=1 \Biggl( 1 m n \sum i=1 \bigl\la \varphi (a i ; \theta l ), \varphi (a i ; \theta j ) \bigr\ra \scrY + \lambda \delta lj \Biggr) \alpha j = n \sum i=1 \bigl\la \varphi (a i ; \theta l ), y i \bigr\ra \scrY (2.25) Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> </p><p>for each l \in \{ 1, . . . , m\} , where \delta lj = 1 if l = j, and equals zero otherwise. This is an m-by-m linear system of equations for \alpha \in R m that is standard to solve. In the case \lambda = 0, the minimum norm solution of (2.25) may be written in terms of a pseudoinverse operator (see <ref type="bibr">[102, sect. 6.11]</ref>).</p><p>Equation (2.25) reveals that the trained RFM F m ( \cdot ; \widehat \alpha ) is a linear function of the labeled output data \{ y i \} n i=1 . This property is undesirable from the perspective of statistical optimality. Indeed, it is known that any estimator that is linear in the output training data is minimax suboptimal for certain classes of problems [141, Thm. 1, sect. 4.1, p. 6]. However, any adaptation of the feature pair (\varphi , \mu ) to the training data will break this property and potentially restore optimality. For example, choosing \lambda or hyperparameters appearing in (\varphi , \mu ) based on a cross-validation procedure would make the RF pair data-dependent as desired. This is typically done in practice.</p><p>Example 2.9 (Brownian bridge). We now provide a one-dimensional instantiation of the RFM to illustrate the methodology. Take the input space as \scrX := (0, 1), output space as \scrY := R, input space measure \nu := \sansU \sansn \sansi \sansf (0, 1) to be uniform, and random parameter space as \Theta := R \infty . Denote the input by a = x \in \scrX . Then, consider the random feature map \varphi : (0, 1) \times R \infty \rightar R defined by the Brownian bridge \varphi (x; \theta ) := \infty \sum j=1 \theta (j) (j\pi ) - 1 \surd 2 sin(j\pi x) , where \theta (j) iid \sim N (0, 1) , <ref type="bibr">(2.26)</ref> where \theta := \{ \theta (j) \} j\in N and \mu := N (0, 1) \times N (0, 1) \times \cdot \cdot \cdot . For any realization of \theta \sim \mu , the function \varphi ( \cdot ; \theta ) is a Brownian motion constrained to zero at x = 0 and x = 1. The induced kernel k \mu : (0, 1) \times (0, 1) \rightar R is then simply the covariance function of this stochastic process:</p><p>Note that k \mu is the Green's function for the negative Laplacian on (0, 1) with Dirichlet boundary conditions. Using this fact, we may explicitly characterize the associated RKHS \scrH k\mu as follows. First, we have</p><p>where the negative Laplacian has domain H 2 ((0, 1); R) \cap H 1 0 ((0, 1); R). Viewing T k\mu as an operator from L 2 ((0, 1); R) into itself, from (2.9) we conclude, upon integration by parts, that for any elements f and g of \scrH k\mu , it holds that \langle f, g\rangle</p><p>Note that the last identity does indeed define an inner product on H 1 0 . By this formal argument we identify the RKHS \scrH k\mu as the Sobolev space H 1 0 ((0, 1); R). Furthermore, the Brownian bridge may be viewed as the Gaussian measure N (0, T k\mu ). Approximation using the RFM with the Brownian bridge random features is illustrated in Figure <ref type="figure">1</ref>. Since k \mu (\cdot , x) is a piecewise linear function, a kernel interpolation or regression method will produce a piecewise linear approximation. Indeed, the figure indicates that the RFM with n training points fixed approaches the optimal piecewise linear kernel interpolant as m \rightar \infty . The Brownian bridge in Example 2.9 illuminates a more fundamental idea. For this low-dimensional problem, an expansion in a deterministic Fourier sine basis would of course be more natural. However, if we do not have a natural, computable orthonormal basis, then randomness provides a useful alternative representation; notice that the random features each include random combinations of the deterministic Fourier sine basis in this example. For the more complex problems that we study numerically in the next two sections, we lack knowledge of good, computable bases for general maps in infinite dimensions. The RFM approach exploits randomness to explore, implicitly discover the structure of, and represent such maps. Thus we now turn away from this example of real-valued maps defined on a subset of the real line and instead consider the use of random features to represent maps between spaces of functions. It turns out that theoretical guarantees are still possible to obtain in this setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Error Bounds.</head><p>In this subsection, we review a recent comprehensive error analysis <ref type="bibr">[92]</ref> of the random feature ridge regression problem (2.24) in the general infinite-dimensional input and output space setting. This is the sharpest available theory for misspecified problems. Owing to its tractable optimization, the RFM Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> is one of the first guaranteed convergent operator learning algorithms for nonlinear problems that is actually implementable on a computer with controlled complexity. To see this more concretely, we require the following technical assumptions.</p><p>Assumption 2.10 (data and features). The following hold true: (i) The ground truth operator F \dagger satisfies F \dagger \in L \infty \nu (\scrX ; \scrY ). (ii) The noise-free training data are given by a i iid \sim \nu and y i = F \dagger (a i ) for each i. (iii) The random feature map \varphi \in L \infty \nu \times \mu (\scrX \times \Theta ; \scrY ) is measurable and bounded. (iv) The RKHS \scrH k\mu corresponding to the pair (\varphi , \mu ) is separable.</p><p>Our first convergence result is qualitative and follows from <ref type="bibr">[92,</ref><ref type="bibr">Thm. 3</ref>.10, p. 6], which itself is a consequence of a more general error estimate [92, Thm. 3.4, pp. 4--5].<ref type="foot">foot_0</ref> Theorem 2.11 (almost sure convergence of trained RFM). Let Assumption 2.10 hold. Suppose that the integral operator T k\mu \in \scrL (L 2 \nu (\scrX ; \scrY )) in (2.8) is injective. Let \{ \delta l \} l\in N \subset (0, 1) be any positive sequence with the property that \sum \infty l=1 \delta l &lt; \infty . For l \in N, denote by \widehat \alpha (l) \in R m l the trained RFM coefficients corresponding to (2.23) with m = m l random features, n = n l training samples, and regularization parameter \lambda = \lambda l . If m l \simeq \delta - 1 l log(2/\delta l ) , n l \simeq \delta - 2 l log(2/\delta l ) , and \lambda l \simeq m l , (2.30) then the trained RFM satisfies P \biggl\{ lim l\rightar\infty E a\sim \nu \bigm\| \bigm\| F \dagger (a) -F m l (a; \widehat \alpha (l) ) \bigm\| \bigm\| 2 \scrY</p><p>The probability in (2.31) is w.r.t. the joint law of the data \{ a i \} n i=1 \sim \nu \otimes n and the feature parameters \{ \theta j \} M j=1 \sim \mu \otimes m . Going beyond the existence of an accurate approximation to F \dagger , Theorem 2.11 shows that the random feature ridge regression algorithm delivers a strongly consistent statistical estimator of F \dagger in the limit of large m, n, and \lambda . That is, the trained RFM that one actually obtains in practice converges (along a subsequence w.r.t. n) to the true underlying operator F \dagger with probability 1. The three quantities m, n, and \lambda are linked via a summable sequence \{ \delta l \} , which determines how they are simultaneously sent to infinity. The conditions of the theorem are satisfied with \delta l = l - 2 \rightar 0, for example.</p><p>The next theorem delivers a high probability nonasymptotic error bound that includes both parameter and sample complexity contributions that depend only algebraically on the reciprocal of the error, instead of exponentially [86, sect. 5]. It is a consequence of [92, Thm. 3.7, p. 5] and controls sources of error due to regularization, finite parametrization, finite data, and optimization. Theorem 2.12 (complexity bounds for trained RFM). Let \varepsi \in (0, 1) be an arbitrary error tolerance. Let \widehat \alpha \in R m denote the trained RFM coefficients from (2.23) with training sample size n \in N and regularization parameter \lambda \in (0, n). Suppose that F \dagger belongs to the RKHS \scrH k\mu corresponding to the random feature pair (\varphi , \mu ). Under Assumption 2.10, there exists an absolute constant C &gt; 0 such that if m \geq 11\varepsi - 2 , n \geq 10\varepsi - 4 , and \lambda \leq 10\varepsi</p><p>- 2 , (2.32) then the trained RFM F m ( \cdot ; \widehat \alpha ) satisfies the high probability L 2 \nu (\scrX ; \scrY ) error bound P \biggl\{ \sqrt{} E a\sim \nu \bigm\| \bigm\| F \dagger (a) -F m (a; \widehat \alpha ) \bigm\| \bigm\| 2 \scrY \leq \bigl( C\| F \dagger \| \scrH k\mu \bigr) \varepsi \biggr\} \geq 0.999 . (2.33)</p><p>The takeaway from Theorem 2.12 is that, up to constant factors, an appropriately tuned regularization parameter \lambda \simeq \surd n and number of random features m \simeq \surd n are enough to guarantee a trained RFM generalization error of size n - 1/4 \simeq m - 1/2 with high probability. However, this quantitative result is dependent on the wellspecified condition F \dagger \in \scrH k\mu , which is quite difficult to verify in practice. It would be interesting to identify concrete operators of interest that actually belong to such RKHSs. Similar questions are also open for the <ref type="bibr">Barron [46]</ref> and operator Barron spaces <ref type="bibr">[83]</ref> that correspond to NN models instead of RFMs.</p><p>The parameter complexity bound m \gtrsim \varepsi - 2 in (2.32) corresponds to the standard `Monte Carlo"" parametric rate of estimation. Due to the i.i.d. sampling in Definition 2.7 of the RFM, we expect this parametric rate to be sharp. However, the sample complexity bound n \gtrsim \varepsi - 4 from (2.32) is likely not sharp for fixed F \dagger . Indeed, it is a worst case bound <ref type="bibr">[27]</ref> that presumably can be improved to n \gtrsim \varepsi - (2+\delta ) for some small \delta &gt; 0 under stronger assumptions; see, e.g., <ref type="bibr">[129]</ref> in the \scrY = R setting. Such `fast rates"" are empirically observable in numerical experiments. We remark that the constants in Theorem 2.12 were not optimized and could be improved. Additional refinements to Theorems 2.11 and 2.12 that account for discretization error, noisy output data, and smoothness misspecification may be found in <ref type="bibr">[92, sect. 3]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Application to PDE Solution Operators.</head><p>In this section, we design the random feature maps \varphi : \scrX \times \Theta \rightar \scrY and measures \mu for the RFM approximation of two particular PDE parameter-to-solution maps: the evolution semigroup of the viscous Burgers equation in subsection 3.1 and the coefficient-to-solution operator for the Darcy problem in subsection 3.2. It is well known to kernel method practitioners that the choice of kernel (which in this work follows from the choice of (\varphi , \mu )) plays a central role in the quality of the function reconstruction. While our method is purely data-driven and requires no knowledge of the governing PDE, we take the view that any prior knowledge can, and should, be introduced into the design of (\varphi , \mu ). However, the question of how to automatically determine good random feature pairs for a particular problem or dataset, inducing data-adapted kernels, is open. The maps \varphi that we choose to employ are nonlinear in both arguments. We also detail the probability measure \nu on the input space \scrX for both of the PDE applications; this choice is crucial because while we desire the trained RFM to transfer to arbitrary out-of-distribution inputs from \scrX , we can in general only expect the learned map to perform well when restricted to inputs statistically similar to those sampled from \nu .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Burgers' Equation:</head><p>Formulation. The viscous Burgers equation in one spatial dimension is representative of the advection-dominated PDE problem class in some regimes; these time-dependent equations are not conservation laws due to the presence of small dissipative terms, but nonlinear transport still plays a central role in the evolution of solutions. The initial value problem we consider is where \varepsi &gt; 0 is the viscosity (i.e., diffusion coefficient) and we have imposed periodic boundary conditions. The initial condition a serves as the input and is drawn according to a Gaussian measure defined by a \sim \nu := N (0, \scrC ) , (3.2) with Mat\' ern-like covariance operator <ref type="bibr">[43,</ref><ref type="bibr">104]</ref> \scrC := \tau 2\alpha - d ( - \Delta + \tau 2 Id) - \alpha , <ref type="bibr">(3.3)</ref> where d = 1 and the negative Laplacian - \Delta is defined over the torus T 1 = [0, 1] per and restricted to functions which integrate to zero over T 1 . The hyperparameter \tau \geq 0 is an inverse length scale and \alpha &gt; 1/2 controls the regularity of the draw. Such a are almost surely H\" older and Sobolev regular with exponent up to \alpha -1/2 [38, Thm. 12, p. 338], so in particular a \in \scrX := L 2 (T 1 ; R). Then, for all \varepsi &gt; 0, the unique global solution u(t, \cdot ) to (3.1) is real analytic for all t &gt; 0 [82, Thm. 1.1]. Hence, setting the output space to be \scrY := H s (T 1 ; R) for any s &gt; 0, we may define the solution map</p><p>where \{ \Psi t \} t&gt;0 forms the solution operator semigroup for (3.1) and we fix the final time t = T &gt; 0. The map F \dagger is smoothing and nonlinear.</p><p>We now describe a random feature map for use in the RFM (2.19) that we call Fourier space random features. Let \scrF denote the Fourier transform over spatial domain T 1 and define \varphi : \scrX \times \Theta \rightar \scrY by \varphi (a; \theta ) := \sigma \bigl( \scrF - 1 (\chi \scrF a\scrF \theta ) \bigr) , <ref type="bibr">(3.5)</ref> where \sigma ( \cdot ), the ELU function defined below, is defined as a mapping on R and applied pointwise to functions. Considering \Theta \subsete L 2 (T 1 ; R), the randomness enters through \theta \sim \mu := N (0, \scrC \prime ) with \scrC \prime the same covariance operator as in (3.3) but with potentially different inverse length scale and regularity, and the wavenumber filter function \chi : Z \rightar R \geq 0 is given for k \in Z by \chi (k) := \sigma \chi (2\pi | k| \delta ) , where \sigma \chi (r) := max \Bigl( 0, min \bigl( 2r, (r + 1/2) - \beta \bigr) \Bigr) , (3.6) \delta &gt; 0, and \beta &gt; 0. The map \varphi ( \cdot ; \theta ) essentially performs a filtered random convolution with the initial condition. Figure <ref type="figure">2</ref>(a) illustrates a sample input and output from \varphi . Although simply hand-tuned for performance and not optimized, the filter \chi is designed to shuffle energy in low to medium wavenumbers and cut off high wavenumbers (see Figure <ref type="figure">2(b)</ref>), reflecting our prior knowledge of solutions to <ref type="bibr">(3.1)</ref>.</p><p>We choose the activation function \sigma in (3.5) to be the exponential linear unit</p><p>The ELU function has been used successfully as activation in other machine learning frameworks for related nonlinear PDE problems <ref type="bibr">[94,</ref><ref type="bibr">118,</ref><ref type="bibr">119]</ref>. We also find ELU( \cdot ) to perform better in the RFM framework than several other choices including ReLU( \cdot ), tanh( \cdot ), sigmoid( \cdot ), sin( \cdot ), SELU( \cdot ), and softplus( \cdot ). Note that the pointwise evaluation of the ELU function in (3.5) will be well defined, by Sobolev embedding, for s &gt; 1/2 sufficiently large in the definition of \scrY = H s . Since the solution operator maps into H s for any s &gt; 0, this does not constrain the method.   <ref type="bibr">[61]</ref> arise in a variety of applications, in particular, the groundwater flow in a porous medium governed by Darcy's law <ref type="bibr">[13]</ref>. This linear elliptic boundary value problem reads</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Darcy Flow: Formulation. Divergence form elliptic equations</head><p>where D is a bounded open subset in R d , f represents sources and sinks of fluid, a the permeability of the porous medium, and u is the piezometric head; all three functions map D into R and, in addition, a is strictly positive almost everywhere in D. We work in a setting where f is fixed and consider the input-output map defined by a \mapsto \rightar u. The measure \nu on a is a high contrast level set prior constructed as the pushforward of a Gaussian measure:</p><p>Here \psi : R \rightar R is a threshold function defined for r \in R by \psi (r) := a + 1 (0,\infty ) (r) + a -1 ( - \infty ,0) (r) , where 0 &lt; a -\leq a + &lt; \infty , (3.10) applied pointwise to functions, and the covariance operator \scrC is given in (3.3) with d = 2 and homogeneous Neumann boundary conditions on - \Delta . That is, the resulting coefficient a almost surely takes only two values (a + or a -) and, as the zero level set of a Gaussian random field, exhibits random geometry in the physical domain D. It follows that a \in L \infty (D; R \geq 0 ) almost surely. Further, the size of the contrast ratio a + /a -measures the scale separation of this elliptic problem and hence controls the difficulty of reconstruction <ref type="bibr">[17]</ref>. See Figure <ref type="figure">3</ref>(a) for a representative sample.</p><p>Given f \in L 2 (D; R), the standard Lax--Milgram theory may be applied to show that for coefficient a \in \scrX := L \infty (D; R \geq 0 ), there exists a unique weak solution u \in \scrY := H 1 0 (D; R) for (3.8) (see, e.g., Evans <ref type="bibr">[50]</ref>). Thus, we define the ground truth Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> solution map F \dagger : L \infty \rightar H 1 0 , a \mapsto \rightar F \dagger (a) := u . <ref type="bibr">(3.11)</ref> Although the PDE (3.8) is linear, the solution map F \dagger is nonlinear.</p><p>We now describe the chosen random feature map for this problem, which we call predictor-corrector random features. Define \varphi : \scrX \times \Theta \rightar \scrY by \varphi (a; \theta ) := p 1 such that</p><p>where the boundary conditions are homogeneous Dirichlet, \theta = (\theta 1 , \theta 2 ) \sim \mu := \mu \prime \times \mu \prime are two Gaussian random fields each drawn from \mu \prime := N (0, \scrC \prime ), f is the source term in <ref type="bibr">(3.8)</ref>, and \gamma = (s + , s -, \delta ) are parameters for a thresholded sigmoid \sigma \gamma : R \rightar R,</p><p>and extended as a Nemytskii operator when applied to \theta 1 ( \cdot ) or \theta 2 ( \cdot ). We consider \Theta \subsete L 2 (D; R) \times L 2 (D; R). In practice, since \nabla a is not well defined when drawn from the level set measure, we replace a with a \varepsi , where a \varepsi := v(1, \cdot ) is a smoothed version of a obtained by evolving the following linear heat equation for one time unit:</p><p>where \sansn is the outward unit normal vector to \partialD. An example of the response \varphi (a; \theta ) to a piecewise constant input a \sim \nu is shown in Figure <ref type="figure">3</ref> for some \theta \sim \mu .</p><p>We remark that by removing the two random terms involving \theta 1 and \theta 2 in (3.12), we obtain a remarkably accurate surrogate model for the PDE. This observation is representative of a more general iterative method, a predictor-corrector type iteration, for solving the Darcy equation (3.8), whose convergence depends on the size of a. The map \varphi is essentially a random perturbation of a single step of this iterative method: (3.12a) makes a coarse prediction of the output, then (3.12b) improves this prediction with a correction term derived from expanding the original PDE. This choice of \varphi falls within an ensemble viewpoint that the RFM may be used to improve preexisting surrogate models by taking \varphi ( \cdot ; \theta ) to be an existing emulator, but randomized in a principled way through \theta \sim \mu .</p><p>For this particular example, we are cognizant of the facts that the random feature map \varphi requires full knowledge of the Darcy equation and a naive evaluation of \varphi may be as expensive as solving the original PDE, which is itself linear; however, we believe that the ideas underlying the random features used here are intuitive and suggestive of what is possible in other application areas. For example, RFMs may be applied on larger domains with simple geometries, viewed as supersets of the physical domain of interest, enabling the use of efficient algorithms such as the fast Fourier transform (FFT) even though these may not be available for the original problem, either because the operator to be inverted is spatially inhomogeneous or because of the complicated geometry of the physical domain.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Numerical Experiments.</head><p>We now assess the performance of our proposed methodology on the approximation of operators F \dagger : \scrX \rightar \scrY presented in section 3. Practical implementation of the approach on a computer necessitates discretization of the input-output function spaces \scrX and \scrY . Hence, in the numerical experiments that follow, all infinite-dimensional objects such as the training data, evaluations of random feature maps, and random fields are discretized on an equispaced mesh with K grid points to take advantage of the O(K log K) computational speed of the FFT. The simple choice of equispaced points does not limit the proposed approach, as our formulation of the RFM on function space allows the method to be implemented numerically with any choice of spatial discretization. Such a numerical discretization procedure leads to the problem of high-but finite-dimensional approximation of discretized target operators mapping R K to R K by similarly discretized RFMs. However, we emphasize the fact that K is allowed to vary, and we study the properties of the discretized RFM as K varies, noting that since the RFM is defined conceptually on function space in section 2 without reference to discretization, its discretized numerical realization has approximation quality consistent with the infinite-dimensional limit K \rightar \infty . This implies that the same trained model can be deployed across the entire hierarchy of finite-dimensional spaces R K parametrized by K \in N without the need to be retrained, provided K is sufficiently large. Thus, in this section, our notation does not make explicit the dependence of the discretized RFM or target operators on mesh size K. We demonstrate these claimed properties numerically.</p><p>The input functions and our chosen random feature maps (3.5) and (3.12) require i.i.d. draws of Gaussian random fields to be fully defined. We efficiently sample these fields by truncating a Karhunen--Lo\' eve expansion and employing fast summation of the eigenfunctions with FFTs. More precisely, on a mesh of size K, denote by g(\cdot ) a numerical approximation of a Gaussian random field on domain D = (0, 1) d , d = 1, 2:</p><p>\xi k \prime \sqrt{} \lambda k \prime \phi k \prime \sim N (0, \scrC ) , (4.1) Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> where \xi j \sim N (0, 1) i.i.d. for each j and Z K \subset Z \geq 0 is a truncated one-dimensional lattice of cardinality K ordered such that \{ \lambda j \} is nonincreasing. The pairs (\lambda k \prime , \phi k \prime ) are found by solving the eigenvalue problem \scrC \phi k \prime = \lambda k \prime \phi k \prime for nonnegative, symmetric, trace-class operator \scrC (3.3). Concretely, these solutions are given by</p><p>for homogeneous Neumann boundary conditions when d = 2, k \prime = (k \prime 1 , k \prime 2 ) \in Z 2 \geq 0 \setminu \{ 0\} , x = (x 1 , x 2 ) \in (0, 1) 2 . They are given by \phi 2j (x) = \surd 2 cos(2\pi jx) , \phi 2j - 1 (x) = \surd 2 sin(2\pi jx) , \phi 0 (x) = 1 , (4.3a)</p><p>for periodic boundary conditions when d = 1, j \in Z &gt;0 , and x \in (0, 1). In both cases, we enforce that g integrates to zero over D by manually setting to zero the Fourier coefficient corresponding to multi-index k \prime = 0. We use such a g in all experiments that follow. Additionally, the k and k \prime used in this section to denote wavenumber indices should not be confused with our previous notation for kernels.</p><p>With the discretization and data generation setup now well defined, and the pairs (\varphi , \mu ) given in section 3, the last algorithmic step is to train the RFM by solving (2.25) and then test its performance. For a fixed number of random features m, we only train and test a single realization of the RFM, viewed as a random variable itself. In each instance m is varied in the experiments that follow, and the draws \{ \theta j \} m j=1 are resampled i.i.d. from \mu . To measure the distance between the trained RFM F m (\cdot ; \widehat \alpha ) and the ground truth F \dagger , we employ the approximate expected relative test error</p><p>e n \prime ,m := 1 n \prime n \prime \sum j=1 \| F \dagger (a \prime j ) -F m (a \prime j ; \widehat \alpha )\| L 2 \| F \dagger (a \prime j )\| L 2 \approx E a \prime \sim \nu \biggl[ \| F \dagger (a \prime ) -F m (a \prime ; \widehat \alpha )\| L 2 \| F \dagger (a \prime )\| L 2 \biggr] , (4.4)</p><p>where the \{ a \prime j \} n \prime j=1 are drawn i.i.d. from \nu and n \prime denotes the number of input-output pairs used for testing. All L 2 (D; R) norms on the physical domain are numerically approximated by composite trapezoid rule quadrature. Since \scrY \subset L 2 for both the PDE solution operators (3.4) and <ref type="bibr">(3.11)</ref>, we also perform all required inner products during training in L 2 rather than in \scrY ; this results in smaller relative test error e n \prime ,m .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Burgers' Equation:</head><p>Experiment. We generate a high resolution dataset of input-output pairs by solving Burgers' equation (3.1) on an equispaced periodic mesh of size K = 1025 (identifying the first mesh point with the last) with random initial conditions sampled from \nu = N (0, \scrC ) using (4.1), where \scrC is given by (3.3) with parameter choices \tau = 7 and \alpha = 2.5. The full order solver is an FFT-based pseudospectral method for spatial discretization <ref type="bibr">[54]</ref> and a fourth order Runge--Kutta integrating factor time-stepping scheme for time discretization <ref type="bibr">[79]</ref>. All data represented on mesh sizes K &lt; 1025 used in both training and testing phases are subsampled from this original dataset, and hence we consider numerical realizations of F \dagger (3.4) up to R 1025 \rightar R 1025 . We fix n = 512 training and n \prime = 4000 testing pairs unless otherwise noted and also fix the viscosity to \varepsi = 10 - 2 in all experiments. Lowering \varepsi leads to smaller length scale solutions and more difficult reconstruction; more data (higher n)</p><p>Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> 0.0 0.2 0.4 0.6 0.8 1.0 x -0.4 -0.2 0.0 0.2 0.4 Input Test Truth (a) Input and output (b) Pointwise error and features (higher m) or a more expressive choice of (\varphi , \mu ) would be required to achieve comparable error levels due to the slow decaying Kolmogorov width of the solution map. For simplicity, we set the forcing f \equiv 0, although nonzero forcing could lead to other interesting solution maps such as f \mapsto \rightar u(T, \cdot ). It is easy to check that the solution will have zero mean for all time and a steady state of zero. Hence, we choose T \leq 2 to ensure that the solution is far enough away from steady state. For the random feature map (3.5), we fix the hyperparameters \alpha \prime = 2, \tau \prime = 5, \delta = 0.0025, and \beta = 4. The map itself is evaluated efficiently with the FFT and requires no other tools to be discretized. RFM hyperparameters are hand-tuned but not optimized. We find that regularization during training has a negligible effect for this problem, so the RFM is trained with \lambda = 0 by solving the normal equations (2.25) with the pseudoinverse to deliver the minimum norm least squares solution; we use the truncated SVD implementation in Python's scipy.linalg.pinv2 for this purpose.</p><p>Our experiments study the RFM approximation to the viscous Burgers equation evolution operator semigroup <ref type="bibr">(3.4)</ref>. As a visual aid for the high-dimensional problem at hand, Figure <ref type="figure">4</ref> shows a representative sample input and output along with a trained RFM test prediction. To determine whether the RFM has actually learned the correct evolution operator, we test the semigroup property of the map; <ref type="bibr">[150]</ref> pursues closely related work also in a Fourier space setting. Denote the (j -1)-fold composition of a function G with itself by G j . Then, with u(0, \cdot ) = a, we have (\Psi T \circ \cdot \cdot \cdot \circ \Psi T )(a) = \Psi j T (a) = \Psi jT (a) = u(jT, \cdot ) (4.5) by definition. We train the RFM on input-output pairs from the map \Psi T with T := 0.5 to obtain \widehat F := F m ( \cdot ; \widehat \alpha ). Then, it should follow from (4.5) that \widehat F j \approx \Psi jT ; that is, each application of \widehat F should evolve the solution T time units. We test this semigroup approximation by learning the map \widehat F and then comparing \widehat F j on n \prime = 4000 fixed inputs to outputs from each of the operators \Psi jT , with j \in \{ 1, 2, 3, 4\} (the solutions at time T , 2T , 3T , 4T ). The results are presented in Table <ref type="table">1</ref> for a fixed mesh size Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> Table <ref type="table">1</ref> Expected relative error e n \prime ,m for time upscaling with the learned RFM operator semigroup for Burgers' equation. Here, n \prime = 4000, m = 1024, n = 512, and K = 129. The RFM is trained on data from the evolution operator \Psi T =0.5 and then tested on input-output samples generated from \Psi jT , where j = 2, 3, 4, by repeated composition of the learned model. The increase in error is small even after three compositions, reflecting excellent out-of-distribution performance.</p><p>Train on: T = 0.5 Test on: 2T = 1.0 3T = 1.5 4T = 2.0 0.0360 0.0407 0.0528 0.0788 K = 129. We observe that the composed RFM map \widehat F j accurately captures \Psi jT , though this accuracy deteriorates as j increases due to error propagation in time, as is common with any traditional integrator. However, even after three compositions corresponding to 1.5 time units past the training time T = 0.5, the relative error only increases by around 0.04. It is remarkable that the RFM learns time evolution without explicitly time-stepping the PDE (3.1) itself. Such a procedure is coined time upscaling in the PDE context and in some sense breaks the CFL stability barrier <ref type="bibr">[40]</ref>. Table <ref type="table">1</ref> is evidence that the RFM has excellent out-of-distribution performance: although only trained on inputs a \sim \nu , the model outputs accurate predictions given new input samples \Psi jT (a) \sim (\Psi jT ) \sharp \nu .</p><p>We next study the ability of the RFM to transfer its learned coefficients \widehat \alpha obtained from training on mesh size K to different mesh resolutions K \prime in Figure <ref type="figure">5(a)</ref>. We fix T := 1 in what follows and observe that the lowest test error occurs when K = K \prime , that is, when the train and test resolutions are identical; this behavior was also observed in the contemporaneous work <ref type="bibr">[97]</ref>. At very low resolutions, such as K = 17 here, the test error is dominated by discretization error which can become quite large; for example, resolving conceptually infinite-dimensional objects such as the Fourier space based feature map in <ref type="bibr">(3.5)</ref> or the L 2 norms in (4.4) with only 17 grid points gives bad accuracy. However, outside this regime, the errors are essentially constant across resolution regardless of the training resolution K, indicating that the RFM learns its optimal coefficients independently of the resolution and hence generalizes well to any desired mesh size. In fact, the trained model could be deployed on different discretizations of the domain D (e.g., various choices of finite elements, graph-based/particle methods), not just with different mesh sizes. Practically speaking, this means that high resolution training sets can be subsampled to mesh sizes K that are smaller (yet still large enough to avoid large discretization error) for faster training, leading to a trained model with nearly the same accuracy at all higher resolutions.</p><p>The smallest expected relative test error achieved by the RFM is 0.0303 for the configuration in Figure <ref type="figure">5(b)</ref>. This excellent performance is encouraging because the error we report is of the same order of magnitude as that reported in [96, sect. 5.1] for the same Burgers solution operator that we study, but with slightly different problem parameter choices. We emphasize that the neural operator methods in that work are based on deep learning, which involves training NNs by solving a nonconvex optimization problem with stochastic gradient descent, while our random feature methods have orders of magnitude fewer trainable parameters that are easily optimized through convex optimization. In Figure <ref type="figure">5</ref>(b), we see that for large enough n, the error empirically follows the O(m - 1/2 ) parameter complexity bound that is suggested by Theorem 2.12. This theorem does not directly apply here because it requires the regularization parameter \lambda to be strictly positive and F \dagger to be in the RKHS of (\varphi , \mu ) 0 200 400 600 800 1000 Resolution 0.030 0.035 0.040 0.045 0.050 0.055 0.060 Expected Relative Test Error m = 256 m = 512 m = 1024 (a) Test error vs. resolution 0 100 200 300 400 500 Resolution 10 -3 10 -2 10 -1 &#945; (K) -&#945; (1025) 2 / &#945; (1025) 2 m = 256 m = 512 m = 1024 (b) Minimizer error vs. resolution Fig. 6. Results of a trained RFM for the Burgers equation evolution operator F \dagger = \Psi 1 : (a) shows resolution-invariant test error for various m. (b) displays the relative error of the learned coefficient \alpha w.r.t. the coefficient learned on the highest mesh size (K = 1025). Here, n = 512 training and n \prime = 4000 testing pairs were used.</p><p>from subsection 3.1, which we do not verify. Nonetheless, Figure <ref type="figure">5</ref>(b) indicates that the error bounds for the trained RFM hold for a larger class of problems than the stated assumptions suggest.</p><p>Finally, Figure <ref type="figure">6</ref> demonstrates the invariance of the expected relative test error to the mesh resolution used for training and testing. This result is a consequence of framing the RFM on function space; other machine learning--based surrogate methods defined in finite dimensions exhibit an increase in test error as mesh resolution is Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> increased (see <ref type="bibr">[19, sect. 4]</ref> for a numerical account of this phenomenon). Figure <ref type="figure">6(a)</ref> shows the error as a function of mesh resolution for three values of m. For very low resolution, the error varies slightly but then flattens out to a constant value as K \rightar \infty . Figure <ref type="figure">6</ref>(b) indicates that the learned coefficient \alpha (K) for each K converges to some \alpha (\infty ) as K \rightar \infty , again reflecting the design of the RFM as a mapping between infinite-dimensional spaces. 4.2. Darcy Flow: Experiment. In this section, we consider Darcy flow on the physical domain D := (0, 1) 2 , the unit square. We generate a high resolution dataset of input-output pairs for F \dagger (3.11) by solving (3.8) on an equispaced 257\times 257 mesh (size K = 257 2 ) using a second order finite difference scheme. All mesh sizes K &lt; 257 2 are subsampled from this original dataset and hence we consider numerical realizations of F \dagger up to R 66049 \rightar R 66049 . We denote resolution by r such that K = r 2 . We fix n = 128 training and n \prime = 1000 testing pairs unless otherwise noted. The input data are drawn from the level set measure \nu (3.9) with \tau = 3 and \alpha = 2 fixed. We choose a + = 12 and a -= 3 in all experiments that follow and hence the contrast ratio a + /a -= 4 is fixed. The source is fixed to f \equiv 1, the constant function. We evaluate the predictor-corrector random features \varphi (3.12) using an FFT-based fast Poisson solver corresponding to an underlying second order finite difference stencil at a cost of O(K log K) per solve. The smoothed coefficient a \varepsi in the definition of \varphi is obtained by solving (3.14) with time step 0.03 and diffusion constant \eta = 10 - 4 ; with centered second order finite differences, this incurs 34 time steps and hence a cost O(34K). We fix the hyperparameters \alpha \prime = 2, \tau \prime = 7.5, s + = 1/12, s -= - 1/3, and \delta = 0.15 for the map \varphi . Unlike in subsection 4.1, we find via grid search on \lambda that regularization during training does improve the reconstruction of the Darcy flow solution operator and hence we train with \lambda := 10 - 8 fixed. We remark that, for simplicity, the above hyperparameters were not systematically and jointly optimized; as a consequence the RFM performance has room to improve beyond the results in this section.</p><p>Darcy flow is characterized by the geometry of the high contrast coefficients a \sim \nu . As seen in Figure <ref type="figure">7</ref>, the solution inherits the steep interfaces of the input. However, we see that a trained RFM with predictor-corrector random features (3.12) captures these interfaces well, albeit with slight smoothing; the error concentrates on the location of the interface. The effect of increasing m and n on the test error is shown in Figure <ref type="figure">8(b)</ref>. Here, the error appears to saturate more than was observed for the Burgers equation problem (Figure <ref type="figure">5</ref>(b)) and does not follow the O(m - 1/2 ) rate. This is likely due to our fixing \lambda to be constant instead of scaling it with m as suggested by Theorem 2.12. It is also possible that the Darcy flow solution map does not belong to the RKHS \scrH k\mu , leading to an additional misspecification error. However, the smallest test error achieved for the best performing RFM configuration is 0.0381, which is on the same scale as the error reported in competing neural operator--based methods <ref type="bibr">[19,</ref><ref type="bibr">97]</ref> for the same setup.</p><p>The RFM is able to be successfully trained and tested on different resolutions for Darcy flow. Figure <ref type="figure">8</ref>(a) shows that, again, for low resolutions, the smallest relative test error is achieved when the train and test resolutions are identical (here, for r = 17). However, when the resolution is increased away from this low resolution regime, the relative test error slightly increases then approaches a constant value, reflecting the function space design of the method. Training the RFM on a high resolution mesh poses no issues when transferring to lower or higher resolutions for model evaluation, and it achieves consistent error for test resolutions sufficiently large (i.e., r \geq 33, the Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> </p><p>(a) Truth (b) Approximation (c) Input (d) Pointwise error regime where discretization error starts to become negligible). Additionally, the RFM basis functions \{ \varphi ( \cdot ; \theta j )\} m j=1 are defined without any dependence on the training data, unlike in other competing approaches based on similar shallow linear approximations, such as the reduced basis method or the PCA-Net method in <ref type="bibr">[19]</ref>. Consequently, our RFM may be directly evaluated on any desired mesh resolution once trained (``superresolution""), whereas those aforementioned approaches require some form of interpolation to transfer between different mesh sizes (see <ref type="bibr">[19, sect. 4.3]</ref>).</p><p>In Figure <ref type="figure">9</ref>, we again confirm that our method is invariant to the refinement of the mesh and improves with more random features. While the difference at low resolutions is more pronounced than that observed for the Burgers equation, our results for Darcy flow still suggest that the expected relative test error converges to a constant value as resolution increases; an estimate of this rate of convergence is seen in Figure <ref type="figure">9</ref> 0 25 50 75 100 125 Resolution 0.040 0.045 0.050 0.055 0.060 0.065 0.070 Expected Relative Test Error m = 64 m = 128 m = 256 (a) Test error vs. resolution 0 20 40 60 Resolution 10 -2 10 -1 10 0 &#945; (r) -&#945; (129) 2 / &#945; (129) 2 m = 64 m = 128 m = 256 (b) Minimizer error vs. resolution </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion.</head><p>This paper introduces a random feature methodology for the data-driven estimation of operators mapping between infinite-dimensional Banach spaces. It may be interpreted as a low-rank approximation to operator-valued kernel ridge regression. Training the function-valued random features only requires solving a quadratic optimization problem for an m-dimensional coefficient vector. The conceptually infinite-dimensional algorithm is nonintrusive and results in a scalable method that is consistent with the continuum limit, robust to discretization, and Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see <ref type="url">https://epubs.siam.org/terms-privacy</ref> highly flexible in practical use cases. Numerical experiments confirm these benefits in scientific machine learning applications involving two nonlinear forward operators arising from PDEs. Backed by tractable training routines and theoretical guarantees, operator learning with the function-valued random features method displays considerable potential for accelerating many-query computational tasks and for discovering new models from high-dimensional experimental data in science and engineering.</p><p>Going beyond this paper, several directions for future research remain open. Some of the first theoretical results for function-valued random features are summarized in subsection 2.5. However, it is not yet known what conditions on the problem and the feature pair allow for faster rates of convergence. In addition, it is of interest to characterize the quality of the operator RKHS spaces induced by random feature pairs and whether practical problem classes actually belong to these spaces. Also of importance is the question of how to automatically adapt function-valued random features to data instead of manually constructing them. Some possibilities along this line of work include the Bayesian estimation of hyperparameters, as is frequently used in Gaussian process regression, or more general hierarchical learning of the random feature pair (\varphi , \mu ) itself. In tandem, there is a need for a mature function-valued random features software library that includes efficient linear solvers and GPU implementations, benchmark problems, and robust hyperparameter optimizers. These advances will further enable the random features method to learn from real-world data and solve challenging forward and inverse problems from the physical sciences, such as climate modeling and material modeling, with controlled computational complexity.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The regularization parameter \lambda in Theorem</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>2.11 and subsection 2.4 is equal to n times the regularization parameter that is discussed in<ref type="bibr">[92]</ref>, which is also denoted by the same symbol. Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>Downloaded 10/11/24 to 131.215.72.225 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy</p></note>
		</body>
		</text>
</TEI>
