<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Data-driven multiscale modeling of subgrid parameterizations in climate models</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/24/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10455950</idno>
					<idno type="doi"></idno>
					<title level='j'>ICLR Workshop on ML for Climate</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Karl Otness</author><author>Laure and Zanna</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Subgrid parameterizations, which represent physical processes occurring below the resolu- tion of current climate models, are an important component in producing accurate, long-term predictions for the climate. A variety of approaches have been tested to design these com- ponents, including deep learning methods. In this work, we evaluate a proof of concept illustrating a multiscale approach to this prediction problem. We train neural networks to predict subgrid forcing values on a testbed model and examine improvements in prediction accuracy that can be obtained by using additional information in both fine-to-coarse and coarse-to-fine directions.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Climate models, which simulate the long-term evolution of the Earth's atmosphere, oceans, and terrestrial weather, are critical tools for projecting the impacts of climate change around the globe. Due to limits on available computational resources, these models must be run at a coarsened spatial resolution which cannot resolve all physical processes relevant to the climate system <ref type="bibr">[4]</ref>. To reflect the contribution of these subgrid-scale processes, closure models are added to climate models to provide the needed subgrid-scale forcing. These parameterizations model the contribution of these fine-scale dynamics and are critical to high quality and accurate long term predictions <ref type="bibr">[14,</ref><ref type="bibr">5]</ref>. A variety of approaches to designing these parameterizations have been tested, ranging from hand-designed formulations <ref type="bibr">[16]</ref>, to modern machine learning with genetic algorithms <ref type="bibr">[14]</ref>, or neural networks trained on collected snapshots <ref type="bibr">[17,</ref><ref type="bibr">7,</ref><ref type="bibr">12,</ref><ref type="bibr">13]</ref>, or in an online fashion through the target simulation <ref type="bibr">[6]</ref>.</p><p>In this work we examine the impact of decomposing the problem of predicting subgrid forcings into prediction problems across scales. The problem of learning subgrid forcing is inherently multiscale; the subgrid dynamics which must be restored represent the impact of the subgrid and resolved scales on each other. Closure models for climate are designed to be resolution-aware <ref type="bibr">[9]</ref>, but even so existing deep learning subgrid models do not explicitly leverage the interactions between scales, leaving it to the neural networks to implicitly learn these relationships. Explicitly including this structure as part of a deep learning approach may help regularize the learned closure models and support learning in regimes with limited training data or in the presence of underlying uncertainty. We explore the impact of this decomposition, providing proof of concept results illustrating the potential of imposing this prediction structure on a simple fullyconvolutional neural network closure model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Approach</head><p>We consider the problem of learning subgrid forcings for an idealized fluid model. In particular we study a two layer quasi-geostrophic model as implemented in PyQG <ref type="bibr">[1]</ref> which we have ported to JAX<ref type="foot">foot_0</ref>  <ref type="bibr">[3]</ref>. In this model the variable of interest is the potential vorticity q which evolves in two layers each with two dimensions and periodic boundary conditions. The model can be evaluated with a configurable grid resolution and states can be ported to lower resolutions by coarse-graining and filtering. Further details of this model are included in Appendix A.</p><p>To generate ground truth data, we run the model at a very high ("true") resolution. This produces trajectories q true (t ) and time derivatives &#8706;q true (t )/&#8706;t . Next we generate training data at a high resolution by applying a coarsening and filtering operator C giving variables q C (q). Given nonlinearities in the model, this coarsening does not commute with the dynamics of the model. To correct for this we must apply a subgrid forcing term S:</p><p>Note that formally the forcing S is a function of the state q true . In a climate modeling application we do not have access to this variable and so we train a model f &#952; ( q) &#8776; S which may be stochastic.</p><p>We continue this process, introducing another downscaling<ref type="foot">foot_1</ref> operator D and upscaling D + . Taking q hr q as our high resolution samples, we produce low resolution samples q lr D(q hr ) and S lr D(S). For any of these quantities v we have a decomposition v = D + D v + details(v), where details(v) are the details removed by D. Our experiments thus involve three resolutions, from fine to coarse: a "true" resolution; a high resolution, hr; and a low resolution, lr. The closures S try to update hr to match the "true" resolution.</p><p>Just as predicting S from q true is fully deterministic, while predicting it from q hr involves uncertainty, we anticipate a similar trend to hold for D(S). In other words, predicting D(S) from q hr should be easier than predicting D(S) directly from q lr . Then, using this coarse-grained prediction D(S) as a foundation, we can learn to predict only the missing details and add them. This process splits the problem of predicting S into two phases: (1) a "downscale" prediction to form D(S), and (2) a "buildup" prediction combining q hr and D(S) to predict S, adding the missing details. This decomposition takes advantage of self-similarity in the closure problem to pass information between the coarse and fine scales and improve predictions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiments</head><p>To test this approach to predicting subgrid forcings we compare the "downscale" and "buildup" processes discussed above against baselines without the additional information. In our experiments on the quasi-geostrophic model, data is generated at a "true" resolution of 256 &#215; 256, and high and low resolutions are selected from 128 &#215; 128, 96 &#215; 96, and 64 &#215; 64. When describing the neural network tasks below, subscripts hr and lr stand in for one of these three resolutions. These scales were chosen so that the system requires closure (there are sufficient dynamics below the grid-scale cutoff), but does not diverge <ref type="bibr">[14]</ref>. In the experiments that follow we test all combinations of distinct high and low resolutions. See Appendix A for model parameters and further details on the quasi-geostrophic data generation.  Our experiments are divided into two categories: a first set of separated experiments, where the predictions are in one direction across scales (either down or up scale alone), the neural networks are trained and evaluated separately, and all required inputs are provided by an oracle backed by a ground truth data set; and a second set of combined experiments where these networks are composed, eliminating the need for the data set oracle.</p><p>For all experiments we train a feedforward convolutional neural network to perform the prediction task, three copies of each network. These neural networks have one of two architectures, a "small" architecture used in related research <ref type="bibr">[7]</ref> and a "large" architecture with larger convolution kernels. Details of these experiments are provided below, results are included in Section 4, and information on the network architectures and the training procedure are included in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Separated Experiments</head><p>In these experiments, we train neural networks separately to predict quantities between different scales. In particular we train "downscale" networks which predict only the low-resolution components of the target forcing quantity while observing a high resolution state, and "buildup" networks which work in the opposite direction, predicting higher-resolution forcing details with access to the low-resolution forcing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Downscale Prediction</head><p>We compare the task of predicting S lr D(S hr ) with access to high resolution information q hr or restricted to low resolution q lr . This provides an estimate of the advantage gained by predicting the target forcing with access to details at a scale finer than that of the network's output. We train two networks f &#952; with the same architecture to perform one of two prediction tasks:</p><p>To ensure that the convolution kernels process information at the same spatial size, and differ only in the spectral scales included, we first upsample all inputs to the same fixed size using a spectral upscaling operator D + described in Appendix C. The full prediction process including the re-sampling operators is illustrated in Figure <ref type="figure">1</ref> and experimental results are included in Table <ref type="table">1</ref> discussed below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Buildup Prediction</head><p>We also test a prediction problem in the opposite direction, predicting finer-scale details with access to lower-resolution predictions, similar to a learned super-resolution process used in recent generative modeling works <ref type="bibr">[15,</ref><ref type="bibr">8]</ref>. We train neural networks:</p><p>and</p><p>where S hr -D + (S lr ) are the details of S hr which are not reflected in S lr . The additional input S lr is given by an oracle using ground truth data in the training and evaluation sets.</p><p>This experiment estimates the value in having a high-quality, higher-confidence prediction S lr , in addition to q hr , when predicting the details of S hr . That is, the experiment estimates the value in starting the prediction of S hr by first locking in a coarse-grained version of the target, and separately enhancing it with finer-scale features. The two prediction tasks are illustrated in Figure <ref type="figure">2</ref> and results are included in Table <ref type="table">2</ref>, discussed below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Combined Experiments</head><p>In these experiments, we combine the networks trained in the "downscale" and "buildup" experiments, passing the downscale prediction as an input to the buildup network. This removes the oracle providing lower resolution predictions used to train the separate networks. In each test, we choose at least two scale levels and first predict a coarsened version of the subgrid forcing at the lowest resolution, then gradually enhance it with missing scales using the buildup process discussed above. We test all valid subsets of our three scale levels. Results for these experiments are included in Table <ref type="table">3</ref>.</p><p>For these experiments we retrain new neural networks building out the training pipeline sequentially. That is, we first train the first downscale network, and then use that trained network to provide needed inputs for the subsequent buildup network. In this way, later networks see realistic inputs during training rather than unrealistically clean data from a training set oracle.</p><p>The training process is otherwise unmodified; each network is trained separately following the same training procedure as in the separated experiments. The only change is that some of the training inputs are provided by fixed neural networks earlier in the full pipeline.</p><p>For a combined prediction across two scales lr and hr, we predict S hr from only q hr following the procedure below:</p><p>The quantities S are approximate neural network outputs used in intermediate predictions. We can extend this to perform a prediction using all three of our scale levels. For concreteness we discuss this using the relevant dimension numbers, but these could of course be generalized for other settings. The three level cascaded procedure predicts S 128 from only q 128 by first performing a downscale prediction to a resolution of 64 &#215; 64 followed by two buildup steps:</p><p>In the above, the different scaling operators D, D + are distinguished by subscripts with arrows toward the scale of the result. Downscaling across two levels is denoted D 2 . Equation <ref type="formula">4</ref>and Equation 5 compose the prediction tasks described in Equation 2 and Equation <ref type="formula">3</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results</head><p>For each of the prediction tasks described in Section 3, we train three neural networks. Once trained, we evaluate their performance on a held-out evaluation set measuring performance with three metrics: a mean squared error (MSE), a relative &#8467; 2 loss, and a relative &#8467; 2 of the spectra of the predictions.</p><p>The MSE is a standard mean squared error evaluated over each sample and averaged. The other two metrics are derived from previous work evaluating neural network parameterizations <ref type="bibr">[13]</ref> (where they were called L rmse and L S ). These were originally designed to measure performance for stochastic subgrid forcings. Here we use the two metrics from that work which do not collapse to trivial results for deterministic models. These are defined as:</p><p>Rel &#8467; 2 S -S 2 S and Rel Spec &#8467; 2 sp(S) -sp( S) 2 sp(S) 2 <ref type="bibr">(6)</ref> where S is the true target forcing, S is a neural network prediction being evaluated, and sp is the isotropic power spectrum. See calc_ispec in PyQG for calculation details <ref type="bibr">[1]</ref>. Each of these three metrics is averaged across the same batch of 1024 samples selected at random from the set of held out trajectories in the evaluation set.</p><p>Table <ref type="table">1</ref> shows the results for the downscale experiments, comparing against "across" prediction which accesses only coarse-scale information. In these results we observe an advantage to performing the predictions with access to higher-resolution data (the "downscale" columns), suggesting potential advantages and a decrease in uncertainty in such predictions.</p><p>Results for experiments examining prediction in the opposite direction-predicting a highresolution forcing with access to a low-resolution copy of the target from an oracle-are included in Table <ref type="table">2</ref>. We also observe an advantage in this task from having access to the additional information. The low resolution input in the buildup experiments yields lower errors on average at evaluation. This advantage is greater when the additional input is closer in scale to the target output. The predictions building up from 96 &#215; 96 to 128 &#215; 128 have lower errors than those which access an additional 64 &#215; 64 input. This is not unexpected given that the input with nearer resolution resolves more of the target value, leaving fewer details which need to be predicted by the network. The results for both separated experiments (those reported in Table <ref type="table">1</ref> and Table <ref type="table">2</ref>) for the MSE metric are illustrated in Figure <ref type="figure">3</ref>.   <ref type="table">3</ref>. The bars behind each cluster of points show the mean (horizontal bar in the center) along with a 2&#963; range on either side, computed from the empirical standard deviations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results for the combined experiments, reported in</head><p>performance, particularly in cases with a smaller scale step down. The improvement is particularly pronounced for the small size networks, suggesting there may be potential to improve architectural efficiency using a multiscale prediction process.</p><p>We find that adding additional buildup steps slightly hurt performance relative to a single lower scale level. We expect this may be due to accumulated errors propagated between the networks. Adding a small amount of noise during the sequential training process may help improve robustness. The points contributing to the MSE averages in Table <ref type="table">3</ref> are illustrated in Figure <ref type="figure">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>Our proof of concept experiments in this work illustrate the potential advantages from decomposing the subgrid forcing problem into one across scales. We see this as an approach which may have regularization advantages, explicitly representing multiscale aspects of this prediction problem, and supporting learning in scarce data regimes and better handling underlying uncertainty in this task.</p><p>The results reported here represent an initial step in an ongoing project. In our continuing work we will further investigate combining these prediction tasks to increase robustness to errors in intermediate predictions. We anticipate that adding noise and other perturbations during training may improve the performance of the combined experiments. We will further work to quantify the regularization benefits of this approach in limited-data regimes, and investigate other ways to structure the multiscale prediction task, in particular tailoring the neural network architecture to minimize computational cost and taking advantage of the multiscale prediction process to maintain prediction quality. We also plan to expand evaluation to include online tests, using the learned parameterizations across simulation time steps, and test generalization to other quasi-geostrophic simulation parameter settings, and on other climate-modeling tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Quasi-Geostrophic Model</head><p>For our experiments we target the two-layer quasi-geostrophic model implemented in PyQG which is a simplified approximation of fluid dynamics <ref type="bibr">[1]</ref>. This model follows the evolution of a potential vorticity q, divided into two layers q = [q 1 , q 2 ]. This system is pseudo-spectral and has periodic boundaries along the edges of each layer. The evolution of the quantities in Fourier space (indicated by a hat) is:</p><p>&#8706; q2 &#8706;t = -J (&#968; 2 , q 2 ) -i k&#946; 2 &#968;2 + r ek &#954; 2 &#968;2 + ssd <ref type="bibr">(8)</ref> where J (A, B ) A x B y -A y B x , "ssd" is a small scale dissipation, and the quantity &#968; is related to q by: -</p><p>The values &#954; are the radial wavenumbers k 2 + l 2 while k and l are wavenumbers in the zonal and meridional directions (the axis-aligned directions in our grid), respectively <ref type="bibr">[14]</ref>.</p><p>We use the "eddy" configuration from <ref type="bibr">[14]</ref> which sets the following values for model constants:</p><p>where H 1 , H 2 are the heights of each of the two layers of q and r d is a deformation radius. For more information on the model configuration, consult <ref type="bibr">[14]</ref> and the documentation for the PyQG package.</p><p>We generate our data at a "true" resolution on a grid of dimension 256 &#215; 256 using the PyQG default third order Adams-Bashforth method for time stepping. We use a time step of &#8710;t = 3 600 generating 86 400 steps from which we keep every eighth leaving 10 800 per trajectory. Our training set consists of 100 such trajectories, and our evaluation set contains 10.</p><p>Each step produces a ground truth potential vorticity q true along with a spectral time derivative &#8706; qtrue /&#8706;t . From these we apply our family of coarsening operators C (described in Appendix C) to produce filtered and coarsened values q lr C lr (q true ) at resolutions of 128 &#215; 128, 96 &#215; 96, and 64 &#215; 64.</p><p>For each of these, we recompute spectral time derivatives in a coarsened PyQG model &#8706; qlr /&#8706;t , and we pass each time derivative to spatial variables and compute the target forcing for this scale:</p><p>These forcings-at each of the three scales-along with the high resolution variables are stored in the training and evaluation sets for each step.</p><p>networks for 96 epochs. We store the network weights which produced the lowest training set loss and use these for evaluation.</p><p>For all input and target data, we compute empirical means and standard deviations and standardize the overall distributions by these values before passing them to the network. The means and standard deviations from the training set are used in evaluation as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Coarsening Operators</head><p>In this work we make use of two families of coarsening operators to transform system states across scales. The first, denoted C , is used when generating our data. This operator is applied to the "true" resolution system outputs q true and &#8706;q true /&#8706;t to produce training and evaluation set samples as well as target forcings S. The second operator D (with associated upscaling D + ) is applied as a part of each prediction task to adjust scales around the neural networks as needed. These are the operators referenced in Figure <ref type="figure">1</ref> and Figure <ref type="figure">2</ref>.</p><p>Each of these operators is built around a core spectral truncation operation, D. For an input resolution hr and an output resolution lr, this operator truncates the 2D-Fourier spectrum to the wavenumbers which are resolved at the output resolution, then spatially resamples the resulting signal for the target size lr. These operators also apply a scalar multiplication to adjust the range of the coarsened values. We define a ratio &#961; hr/lr.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Data Filtering</head><p>The data filtering operator C is "Operator 1" as described in <ref type="bibr">[14]</ref>. It is a combination of the truncation operator D with a spectral filter</p><p>where the filter F acts on the 2D-Fourier spectrum of the truncated value. F is defined in terms of the radial wavenumber &#954; = k 2 + l 2 where k and l are the wavenumbers in each of the two dimensions of the input. For an input v&#954; at radial wavenumber &#954; we define:</p><p>where &#8710;x lr L/lr (L is a system parameter; see Appendix A for details), and &#954; c (0.65&#960;)/&#8710;x lr is a cutoff wavenumber where decay begins.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Rescaling Operator</head><p>For scale manipulations as part of our learned model we make use of a scaled spectral truncation operator. We define a downscaling operator D as well as an upscaling operator D + : because DD T = I . This operator omits the filtering F performed as part of coarsening operator C avoid numerical issues when inverting the additional spectral filtering.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The ported QG model is available at https://github.com/karlotness/pyqg-jax/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>We use "downscale" and "downscaling" to refer to coarsening a target variable, removing finer scales.</p></note>
		</body>
		</text>
</TEI>
