<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>FlowPM: Distributed TensorFlow implementation of the FastPM cosmological N-body solver</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/01/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10392015</idno>
					<idno type="doi">10.1016/j.ascom.2021.100505</idno>
					<title level='j'>Astronomy and Computing</title>
<idno type="issn">2213-1337</idno>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">C</biblScope>					

					<author>C. Modi</author><author>F. Lanusse</author><author>U. Seljak</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We present FlowPM, a Particle-Mesh (PM) cosmological N-body code implemented in Mesh-TensorFlow for GPU-accelerated, distributed, and differentiable simulations. We implement and validate the accuracy of a novel multi-grid scheme based on multiresolution pyramids to compute large scale forces efficiently on distributed platforms. We explore the scaling of the simulation on large-scale supercomputers and compare it with the corresponding python based PM code, finding on average a 10x speed-up in terms of wallclock time. We also demonstrate how this novel tool can be used for efficiently solving large scale cosmological inference problems, in particular reconstruction of cosmological fields in a forward model Bayesian framework with a hybrid PM and neural network forward model. We provide skeleton code for these examples and the entire code is publicly available at https://github.com/modichirag/flowpm.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>N-body simulations evolving billions of dark matter particles under gravitational forces have become an essential tool for modern day cosmology data analysis and interpretation. They are required for accurate predictions of the large scale structure (LSS) of the Universe, probed by tracers such as galaxies, quasars, gas, weak gravitational lensing etc. While full N-body simulations can provide very precise predictions, they are extremely slow and computationally expensive. As a result they are usually complemented by fast approximate numerical schemes such as Particle-Mesh (PM) solvers <ref type="bibr">(Merz et al., 2005;</ref><ref type="bibr">Tassev et al., 2013;</ref><ref type="bibr">White et al., 2014;</ref><ref type="bibr">Feng et al., 2016;</ref><ref type="bibr">Izard et al., 2016)</ref>. In these approaches, the gravitational forces experienced by dark matter particles in the simulation volume are computed by Fast Fourier Transforms on a 3D grid. This makes PM simulations very computationally efficient and scalable, at the cost of approximating the interactions on small scales close to the resolution of the grid and ignoring the force calculation on sub-resolution scales.</p><p>The next generation of cosmological surveys, such as the Dark Energy Spectroscopic Instrument (DESI) <ref type="bibr">(DESI Collaboration et al., 2016)</ref>, the Rubin Observatory Legacy Survey of Space and Time (LSST) <ref type="bibr">(LSST Science Collaboration et al., 2009)</ref>, and others, will probe LSS with multiple tracers over unprecedented scales and dynamic range. This will make both the modeling and the analysis challenging even with current PM simulations. 
To improve the modeling in terms of dynamic range and speed, many recent works have explored applications of machine learning, with techniques based on deep convolutional networks and generative models <ref type="bibr">(He et al., 2019;</ref><ref type="bibr">Kodi Ramanah et al., 2020)</ref>. However, these approaches often learn only the mapping between input features (generally the initial density field) and final observables, and completely replace the correct underlying physical dynamics of the evolution with end-to-end modeling. As a result, the generalization of these models across redshifts (time scales), length scales (cosmological volume, resolution etc.), cosmologies and observables of interest is limited by the training dataset and needs to be explored exhaustively to account for all failure modes.</p><p>At the same time, there has been considerable interest in exploring novel techniques to push simulations to increasingly smaller and non-Gaussian regimes for the purpose of cosmological analyses. Recent work has demonstrated that one of the most promising approaches to do this is with forward modeling frameworks, using techniques like simulation-based inference or reconstruction of cosmological fields for analysis <ref type="bibr">(Seljak et al., 2017;</ref><ref type="bibr">Alsing et al., 2018;</ref><ref type="bibr">Cranmer et al., 2019)</ref>. However, given the highly non-linear forward dynamics and multi-million dimensionality of cosmological simulations, such approaches are only tractable if one has access to differentiable forward simulations <ref type="bibr">(Jasche and Wandelt, 2013;</ref><ref type="bibr">Wang et al., 2014;</ref><ref type="bibr">Seljak et al., 2017;</ref><ref type="bibr">Modi et al., 2018)</ref>.</p><p>To address both of these challenges, in this work we present FlowPM, a TensorFlow implementation of an N-body PM solver. 
Written entirely in TensorFlow, these simulations are GPU-accelerated and hence faster than CPU-based PM simulations, while also being entirely differentiable end-to-end. At the same time, they encode the exact underlying dynamics for gravitational evolution (within a PM scheme), while still leaving room for natural synthesis with machine learning components to develop hybrid simulations. These hybrids could use machine learning to model sub-grid and non-gravitational dynamics otherwise not included in a PM gravity solver.</p><p>As they evolve billions of particles in order to simulate today's cosmological surveys, N-body simulations are quite memory intensive and necessarily require distributed computing. In FlowPM, we develop such distributed computation with a novel model parallelism framework, Mesh-TensorFlow <ref type="bibr">(Shazeer et al., 2018)</ref>. This allows us to control distribution strategies and split tensors and computational graphs over multiple processors.</p><p>The primary bottleneck for large scale distributed PM simulations is the distributed 3D Fast Fourier Transform (FFT), which incurs large amounts of communication and accounts for more than half of the wall-clock time <ref type="bibr">(Modi et al., 2019a)</ref>. One way to overcome these issues is by using a multi-grid scheme for computing gravitational forces <ref type="bibr">(Suisalu and Saar, 1995;</ref><ref type="bibr">Merz et al., 2005;</ref><ref type="bibr">Harnois-D&#233;raps et al., 2013)</ref>. In these schemes, large scale forces are estimated on a coarse distributed grid and then stitched together with small scale forces estimated locally. 
Following the same idea, we also propose a novel multi-grid scheme based on Multiresolution Pyramids <ref type="bibr">(Burt and Adelson, 1983;</ref><ref type="bibr">Anderson et al., 1984)</ref>, which does not require any fine-tuned stitching of scales and benefits from the highly optimized 3-D convolutions of TensorFlow.</p><p>In this first work, we have implemented the FastPM scheme of <ref type="bibr">Feng et al. (2016)</ref> for PM evolution with force resolution B = 1. Different PM schemes, as well as other approximate simulations like Lagrangian perturbation theory fields <ref type="bibr">(Tassev et al., 2013;</ref><ref type="bibr">Kitaura et al., 2014;</ref><ref type="bibr">Stein et al., 2019)</ref>, are built upon the same underlying components, with interpolations to and from the PM grid and efficient FFTs to estimate gravitational forces being the key elements. Thus modifying FlowPM to include other PM schemes is a matter of implementation rather than methodology development. Therefore in this work, we focus in detail on these underlying components while only giving an outline of the full algorithm. For further technical details on the choices and accuracy of the implemented FastPM scheme, we refer the reader to the original paper of <ref type="bibr">Feng et al. (2016)</ref>.</p><p>The structure of this paper is as follows: in Section 2 we outline the basic FastPM algorithm, and we discuss our multi-grid scheme for force estimation in Section 3. Then in Section 4, we introduce Mesh-TensorFlow. In Section 5, we build upon these and introduce FlowPM. We compare the scaling and accuracy of FlowPM with the python FastPM code in Section 6. Finally, in Section 7, we demonstrate the efficacy of FlowPM with a toy example by combining it with neural network forward models and reconstructing the initial conditions of the Universe. 
We end with a discussion in Section 8.</p><p>Throughout the paper, we use "grid" to refer to the particle-mesh grid and "mesh" to refer to the computational mesh of Mesh-TF. Unless explicitly specified, every FFT referred to throughout the paper is a 3D FFT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The FastPM particle mesh N-body solver</head><p>Schematically, FastPM implements a symplectic series of Kick-Drift-Kick operations <ref type="bibr">(Quinn et al., 1997)</ref> to perform the time integration for gravitational evolution following leapfrog integration, starting from the Gaussian initial conditions of the Universe to the observed large scale structures at the final time-step. In the Kick stage, we estimate the gravitational force on every particle and update its momentum p. Next, in the staggered Drift stage, we displace the particle to update its position (x) with the current velocity estimate. The drift and kick operators can thus be defined as <ref type="bibr">(Quinn et al., 1997)</ref> x(t_1) = x(t_0) + p(t_0) D_PM and p(t_1) = p(t_0) + f(t_r) K_PM,</p><p>where D_PM and K_PM are scalar drift and kick factors and f(t_r) is the gravitational force. For the leapfrog integration implemented in FlowPM (and FastPM), these operators are modified to include the position and velocity at staggered half time-steps instead of the initial time (t_0) in the kick and drift steps respectively.</p><p>Estimating the gravitational force is the most computationally intensive part of the simulation. In a PM solver, the gravitational force is calculated via 3D Fourier transforms. First, the particles are interpolated to a spatially uniform grid with a kernel W(r) to estimate the mass overdensity field at every point in space. The most common choice for the kernel, and the one we implement, is a linear window of unit size corresponding to the cloud-in-cell (CIC) interpolation scheme <ref type="bibr">(Hockney and Eastwood, 1988)</ref>. We then apply a Fourier transform to obtain the overdensity field δ(k) in Fourier space. 
This field is related to the force field via a transfer function (∇∇⁻²).</p><p>Once the force field is estimated on the spatial grid, it is interpolated back to the position of every particle with the same kernel as was used to generate the density field in the first place.</p><p>There are various ways to write down the transfer function in a discrete Fourier space, as explored in detail in <ref type="bibr">Hockney and Eastwood (1988)</ref>. The simplest way of doing this is with the naive Green's function kernel (∇² = −k²) and differentiation kernel (∇ = ik). In FlowPM however, we implement a finite differentiation kernel as used in FastPM, with</p><p>∇ = i (8 sin ω − sin 2ω) / (6 x_0),</p><p>where x_0 is the grid cell size and ω = k x_0 is the circular frequency that goes from (−π, π].</p></div>
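The force evaluation just described (CIC painting, a forward FFT, the transfer function, inverse FFTs, and CIC readout) can be illustrated with a minimal one-dimensional NumPy sketch. This is our own illustration, not FlowPM code: it uses the naive Green's function and differentiation kernels rather than the finite differentiation kernel, and all function names are ours.

```python
import numpy as np

def cic_paint_1d(pos, n):
    """Cloud-in-cell: deposit unit-mass particles onto a periodic 1D grid."""
    rho = np.zeros(n)
    left = np.floor(pos).astype(int)
    frac = pos - left
    np.add.at(rho, left % n, 1.0 - frac)       # share mass between the two
    np.add.at(rho, (left + 1) % n, frac)       # nearest grid points
    return rho

def cic_readout_1d(field, pos):
    """Interpolate a grid field back to particle positions, same kernel as paint."""
    n = field.size
    left = np.floor(pos).astype(int)
    frac = pos - left
    return field[left % n] * (1.0 - frac) + field[(left + 1) % n] * frac

def pm_force_1d(pos, n):
    """One PM force evaluation: paint -> FFT -> apply i k / k^2 -> inverse FFT -> readout."""
    rho = cic_paint_1d(pos, n)
    delta = rho - rho.mean()                   # overdensity field
    delta_k = np.fft.rfft(delta)
    k = 2 * np.pi * np.fft.rfftfreq(n)         # circular grid frequencies
    kernel = np.zeros_like(delta_k)
    kernel[1:] = 1j * k[1:] / k[1:] ** 2       # naive Green's function + differentiation
    force_grid = np.fft.irfft(kernel * delta_k, n)
    return cic_readout_1d(force_grid, pos)
```

Note that painting and readout share the same kernel W, exactly as the text requires; only the transfer function would change for the finite differentiation kernel.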
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Multi-grid PM Scheme</head><p>Every step in PM evolution requires one forward 3D FFT to estimate the overdensity field in Fourier space, and three inverse FFTs to estimate the force component in each direction. The simplest way to implement 3D FFTs is as 3 individual 1D Fourier transforms along each direction <ref type="bibr">(Pippig, 2013)</ref>. However in a distributed implementation, where the tensor is distributed over different processes, this involves transpose operations and expensive all-to-all communications in order to get the dimension being transformed onto the same process. This makes these FFTs the most time-intensive step in the simulation <ref type="bibr">(Modi et al., 2019a)</ref>. There are several ways to make this force estimation more efficient, such as fast multipole methods <ref type="bibr">(Greengard and Rokhlin, 1987)</ref> and multi-grid schemes <ref type="bibr">(Merz et al., 2005;</ref><ref type="bibr">Harnois-D&#233;raps et al., 2013)</ref>. In FlowPM, we implement a novel version of the latter.</p><p>The basic idea of a multi-grid scheme is to use grids at different resolutions, with each of them estimating the force on different scales, which are then stitched together. Here we discuss this approach for a 2-level multi-grid scheme, since that is what is implemented in FlowPM; the extension to more levels is similar. The long range (large-scale) forces are computed on a global coarse grid that is distributed across processes. On the other hand, the small scale forces are computed on a fine, higher resolution grid that is local, i.e. it estimates short range forces on smaller independent sections that are hosted locally by individual processes. 
Thus, the distributed 3D FFTs are computed only on the coarse mesh, while the higher resolution mesh computes highly efficient local FFTs.</p><p>This force splitting massively reduces the communication data-volume, but at the cost of increasing the total number of operations performed. Thus while it scales better for large grids and highly distributed meshes, it can be excessive for small simulations. Therefore in FlowPM we implement both a multi-grid scheme for large simulations and the usual single-grid force scheme for small grid sizes, with the user having the freedom to choose between the two.</p></div>
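The decomposition of a 3D FFT into batches of 1D transforms separated by transposes, which is what drives the all-to-all communication cost discussed above, can be sketched as follows. In this single-process NumPy illustration (our own, not FlowPM code) the transposes are free local operations; in a distributed implementation each transpose becomes an all-to-all exchange that brings the axis being transformed onto the local process.

```python
import numpy as np

def fft3d_by_1d(x):
    """3D FFT as three batches of 1D FFTs with transposes in between.

    Each np.transpose stands in for the all-to-all communication step of a
    distributed implementation; the 1D FFT along the last axis is always local.
    """
    x = np.fft.fft(x, axis=-1)         # transform z (local axis)
    x = np.transpose(x, (0, 2, 1))     # bring y last: axes (x, z, y)
    x = np.fft.fft(x, axis=-1)         # transform y
    x = np.transpose(x, (2, 1, 0))     # bring x last: axes (y, z, x)
    x = np.fft.fft(x, axis=-1)         # transform x
    return np.transpose(x, (2, 0, 1))  # restore axis order (x, y, z)
```

Because the 1D transforms along different axes commute, the result matches a direct 3D FFT; only the transposes (communications) differ between layouts.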
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Multiresolution Pyramids</head><p>In this section we briefly discuss the multiresolution pyramids that will later form the basis of our multi-grid implementation for force estimation. As used in the image processing and signal processing communities, image pyramids refer to multi-scale representations of signals and images built through recursive reduction operations on the original image <ref type="bibr">(Burt and Adelson, 1983;</ref><ref type="bibr">Anderson et al., 1984)</ref>. The reduction operation is a combination of: i) a smoothing operation, often implemented by convolving the original image with a low-pass filter (G) to avoid aliasing effects by reducing the Nyquist frequency of the signal, and ii) a subsampling or downsampling operation (S), usually by a factor of 2 along each dimension, to compress the data. This operation can be repeatedly applied at every 'level' to generate a 'pyramid' structure of low-pass filtered copies of the original image. Given a level g_l of the original image g_0, the next pyramid level is generated by g_{l+1} = S(G ⋆ g_l),</p><p>where ⋆ represents a convolution. If the filter G used is a Gaussian filter, the pyramid is called a Gaussian pyramid.</p><p>However, in our multi-grid force estimation, we are interested in decoupling the large and small scale forces. 
This decoupling can be achieved by building upon the multi-scale representation of Gaussian pyramids to construct independent band-pass filters with a 'Laplacian' pyramid<ref type="foot">1</ref>. In this case, one reverses the reduce operation by: i) upsampling (U) the image at level g_{l+1}, generally by inserting zeros between pixels, and then ii) interpolating the missing values by convolving it with the same filter G, to create the expanded low-pass image g′_l = G ⋆ U(g_{l+1}).</p><p>A Laplacian representation at any level can then be constructed by taking the difference of the original and the expanded low-pass filtered image at that level, L_l = g_l − g′_l.</p><p>In Figure <ref type="figure">1</ref>, we show the Gaussian and Laplacian pyramids constructed for an example image, along with the flow-chart for these operations. Lastly, given the image at the highest level of the pyramid (g_N) and the Laplacian representation at every level (L_0 ... L_{N−1}), we can exactly recover the original image by recursively applying the inverse transform g_l = L_l + G ⋆ U(g_{l+1}).</p><p>In Figure <ref type="figure">2</ref>, we show the resulting fields when these operations are applied to the matter density field. We show the result of upsampling the low-pass filtered image, and the high pass filtered image with the low-pass filtered component removed, as is done in the Laplacian pyramid. (Since we will not use the Gaussian kernel for smoothing, our pyramid structure is not a Laplacian pyramid in the strictest sense, but the underlying idea is the same.) The negligible residuals show that the reconstruction of the fields is exact. In Figure <ref type="figure">3</ref>, we show the power spectra of the corresponding fields to demonstrate which scales are contributed at every level. We will apply these operations in Section 5.2 to estimate forces on a two-level grid.</p></div>
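The reduce, expand, and band-pass operations above can be sketched in one dimension as follows. This is an illustrative NumPy version under our own assumptions: a small binomial filter stands in for the low-pass filter G, boundaries are periodic, and the function names are ours. The reconstruction is exact by construction, mirroring the inverse transform above.

```python
import numpy as np

G = np.array([1, 4, 6, 4, 1]) / 16.0  # binomial low-pass filter (our choice of G)

def conv_periodic(x, g):
    """Centered convolution with periodic boundaries, done via FFT."""
    n, half = x.size, len(g) // 2
    kern = np.zeros(n)
    for i, gi in enumerate(g):
        kern[(i - half) % n] = gi
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kern)))

def reduce_(x):
    """REDUCE: smooth with G, then subsample by a factor of 2."""
    return conv_periodic(x, G)[::2]

def expand(x):
    """EXPAND: upsample by inserting zeros, then interpolate with 2*G
    (the factor 2 compensates for the inserted zeros)."""
    up = np.zeros(2 * x.size)
    up[::2] = x
    return conv_periodic(up, 2 * G)

def laplacian_level(x):
    """Band-pass residual L_l = g_l - expand(reduce(g_l))."""
    return x - expand(reduce_(x))
```

Recursively applying `laplacian_level(x) + expand(reduce_(x))` recovers the original signal exactly at every level, which is the inverse transform used to reconstruct the image from (g_N, L_0 ... L_{N−1}).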
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Mesh-TensorFlow</head><p>The last ingredient we need to build FlowPM is a model parallelism framework that is capable of distributing the N-body simulation over multiple processes. To this end, we adopt Mesh-TensorFlow <ref type="bibr">(Shazeer et al., 2018)</ref> as the framework in which to implement our PM solver. Mesh-TensorFlow is a framework for distributed deep learning capable of specifying a general class of distributed tensor computations. In that spirit, it goes beyond the simple batch splitting of data parallelism and generalizes to be able to split any tensor and associated computations across any dimension on a mesh of processors.</p><p>We only summarize the key elements of Mesh-TensorFlow here and refer the reader to <ref type="bibr">Shazeer et al. (2018)</ref> for more details. These are:</p><p>&#8226; a mesh, which is an n-dimensional array of processors with named dimensions.</p><p>&#8226; the dimensions of every tensor are also named. Depending on their names, these dimensions are either distributed/split across or replicated on all processors in a mesh. Hence every processor holds a slice of every tensor. The dimensions can thus be seen as 'logical' dimensions which will be split in the same manner for all tensors and operations. &#8226; a user specified computational layout mapping which tensor dimension is split along which mesh dimension. The unspecified tensor dimensions are replicated on all processes. While the layout does not affect the results, it can affect the performance of the code.</p><p>The user builds a Mesh-TensorFlow graph of computations in Python and this graph is then lowered to generate the TensorFlow graphs. It is at this stage that the defined tensor dimensions and operations are split amongst the different processes on the mesh based on the computational layout. Multi-CPU/GPU meshes are implemented with PlacementMeshImpl, i.e. 
a separate TensorFlow operation is placed on each device, all in one big TensorFlow graph. In this case, the TensorFlow graph grows with the number of processes. On the other hand, TPU meshes are implemented with SimdMeshImpl, wherein TensorFlow operations (and communication collectives) are emitted from the perspective of a single core, and the same program runs on every core, relying on the fact that each core performs the same operations. With the placement mesh implementation, the time for lowering also increases as the number of processes increases. Thus it is currently not feasible to extend Mesh-TensorFlow code to a very large number of GPUs for very large computations. We expect such bottlenecks to be overcome in the near future. Returning to Mesh-TF computations, lastly, the tensors to be evaluated are exported to TF tensors, in which case they are gathered on a single process.</p><p>Figure <ref type="figure">4</ref> shows example code illustrating these operations, which form the starting elements of any Mesh-TensorFlow program. The code also sets up context for FlowPM, naming the three directions of the PM simulation grid and explicitly specifying which direction is split along which mesh-dimension. Note that for a batch of simulations, computationally the most efficient splitting will almost always maximize the splits along the batch dimension. 
While FlowPM supports that splitting, we will henceforth fix the batch size to 1, since our focus is on model parallelism of large scale simulations.</p><p>
import mesh_tensorflow as mtf
import tensorflow.compat.v1 as tf

# Setup graph and associated mesh
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Define named dimensions for defining a 3D volume
x_dim, y_dim, z_dim = (mtf.Dimension("nx", 128),
                       mtf.Dimension("ny", 128),
                       mtf.Dimension("nz", 128))
batch_dim = mtf.Dimension("batch", 1)

# Sample a batched random normal 3D volume
field = mtf.random_normal(mesh, [batch_dim, x_dim, y_dim, z_dim])
# Other mtf operations can be added here

# Define a concrete implementation as a 2D mesh on 8 GPUs,
# splitting nx and ny dimensions along rows and cols
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape=[("row", 4), ("col", 2)],
    layout_rules=[("nx", "row"), ("ny", "col")],
    devices=["device:GPU:%d" % i for i in range(8)])

# Lower the graph onto computational mesh
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
# Retrieve mtf tensor as TensorFlow tensor
tf_field = lowering.export_to_tf_tensor(field)

# Evaluate graph
with tf.Session() as sess:
    sess.run(tf_field)
</p><p>Figure <ref type="figure">4</ref>: Sample code to generate a random normal field of grid size 128 on a 2D computational mesh of 8 GPUs, with the x and y directions split across 4 and 2 processors respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">FlowPM: FastPM implementation in Mesh-Tensorflow</head><p>Finally, in this section, we introduce FlowPM and discuss technical details about its implementation in Mesh-TensorFlow. We will not discuss the underlying algorithm of a PM simulation since it is similar in spirit to any PM code, and is exactly the same as FastPM <ref type="bibr">(Feng et al., 2016)</ref>. Rather we will focus on domain decomposition and the multi-grid scheme for force estimation which are novel to FlowPM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Domain decomposition and Halo Exchange</head><p>In Mesh-TensorFlow, every process has a slice of every tensor. Thus for our PM grid, where we split the underlying grid spatially along different dimensions, every process will hold only the physical region and the particles corresponding to that physical region. However at every PM step, there is a possibility of particles moving in and out of any given slice of the physical region. At the same time, convolution operations on any slice need access to the field outside the slice when operating on the boundary regions. The strategy for communication between neighboring slices to facilitate these operations is called a halo-exchange. We implement this exchange by padding every slice with buffer regions of halo size (h) that are shared between the slices on adjacent processes along the mesh. The choice of halo size depends on two factors<ref type="foot">foot_1</ref>, apart from the increased memory cost. Firstly, the halo size should be large enough to ensure an accurate computation of smoothing operations, i.e. larger than half the smoothing kernel size. Secondly, to keep communications and bookkeeping to a minimum, in the current implementation particles are not transferred to neighbouring processes if they travel outside of the domain of a given process. This means that we require halo regions to be larger than the expected maximum displacement of a particle over the entire period of the simulation. For large scale cosmological simulations, this displacement is ∼20 Mpc/h in physical size.</p><p>Our halo-exchange algorithm is outlined in Algorithm 1 and illustrated for a 1-dimensional grid in Figure <ref type="figure">5</ref>. It is primarily motivated by the fact that Mesh-TensorFlow allows one to implement independent, local operations simply and efficiently as slicewise operations along the dimensions that are not distributed. 
Thus we can implement all of the operations (convolutions, CIC interpolations, position and velocity updates for particles, and local FFTs) on these un-split dimensions as one would for a typical single-process implementation, without worrying about distribution strategies. To take advantage of this, we begin by reshaping our tensors from 3 dimensions (excluding the batch dimension) to 6 dimensions, such that the first three indices split the tensor according to the computational layout and the last three indices are not split, but are replicated across all the processes and correspond to the local slice of the grid on each process. This is shown in Figure <ref type="figure">5</ref>, where an original 1-D tensor of size N = 12 is split over n_x = 2 processes. In the second panel, this is reshaped to a 2-D tensor with the first dimension n_x = 2 split over processes, while s_x is the un-split dimension. The un-split dimension is then zero-padded with the halo size, so that the size along the un-split dimension is s_x = N//n_x + 2h. This is followed by an exchange with the neighboring slices over the physical volume common to the two adjacent processes.</p><p>After the slicewise operation updates the local slices, the same steps are followed in reverse. The padded halo regions are exchanged again to combine updates in the overlapping volume from adjacent processes. Then the padded regions are dropped and the tensor is reshaped to the original shape.</p></div>
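The pad/exchange/combine steps described above can be sketched in one dimension with NumPy, where a `(nproc, local)` array stands in for the reshaped distributed tensor and boundaries are periodic as in Figure 5. This is our own illustration, not the Mesh-TensorFlow implementation; in the real code the copies between rows are network exchanges between processes.

```python
import numpy as np

def halo_pad(slices, h):
    """Zero-pad each local slice with halos of size h, then fill the halos
    from the periodic neighbours (the forward exchange).
    slices has shape (nproc, local) with h <= local."""
    nproc, local = slices.shape
    padded = np.zeros((nproc, local + 2 * h))
    padded[:, h:h + local] = slices
    for p in range(nproc):
        padded[p, :h] = slices[(p - 1) % nproc, -h:]   # left halo from left neighbour
        padded[p, -h:] = slices[(p + 1) % nproc, :h]   # right halo from right neighbour
    return padded

def halo_unpad(padded, h):
    """Reverse exchange: fold updates accumulated in the halos back into the
    neighbours' interiors, then drop the padding."""
    nproc = padded.shape[0]
    local = padded.shape[1] - 2 * h
    out = padded[:, h:h + local].copy()
    for p in range(nproc):
        out[(p - 1) % nproc, -h:] += padded[p, :h]     # left halo belongs to left neighbour
        out[(p + 1) % nproc, :h] += padded[p, -h:]     # right halo belongs to right neighbour
    return out
```

A slicewise operation (e.g. a CIC paint that deposits mass into the halo region) would run between `halo_pad` and `halo_unpad`; the second exchange adds those halo deposits back into the owning slice, so no contribution is lost.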
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Multi-grid Scheme with Multiresolution Pyramids</head><p>To implement the multi-grid scheme with multiresolution pyramids, we need a smoothing and a subsampling operation. While traditionally smoothing operations are performed in Fourier space in cosmology, here we are restricted to these operations in local, pixel space. However this leads to a trade-off between the size of the smoothing kernel in pixel space and how compact the corresponding low-pass filter is in Fourier space.</p><p>Algorithm 1 (HaloExchangeOp(G, SlicewiseOp)): strategy for applying a slicewise op with halo exchange. The slices of G are padded and exchanged, the slicewise op is applied, and the updates in the overlapping padded regions from neighboring processes are added back before the padding is dropped.</p><p>Figure 5: The red and blue regions are to be split on two processes, with light regions indicating the overlapping volume that will be exchanged. In the second panel, the tensor is reshaped with the split along the first, n_x dimension and the second, un-split s_x dimension that is zero-padded. In the third panel, the regions are exchanged (here under the assumption of periodic boundary conditions).</p><p>Our main consideration is to ensure that the low-resolution component of the pyramid is sufficiently suppressed at half the Nyquist scale of the original field, so as to be able to downsample this low-frequency field with minimal aliasing. Using larger kernels improves the Fourier drop-off, but using larger convolution kernels in pixel space implies an increased computational and memory cost of the halo-exchanges to be performed at the boundaries.</p><p>In our experiments, we find that a 3-D b-spline kernel of order 6 balances these trade-offs well and allows us to achieve sub-percent accuracy, as we show later. 
Furthermore in FlowPM, we take advantage of highly efficient TensorFlow ops and implement the smoothing and subsampling operations together, by constructing a 3D convolution filter with this b-spline kernel and convolving the underlying field with stride-2 convolutions. In our default implementation, we repeat these operations twice, so our global mesh is 4x coarser than the higher resolution mesh. The full algorithm for this REDUCE operation is outlined in Algorithm 2. Based on the discussion in Section 3.1, we also outline the reverse operation in the EXPAND method.</p><p>Algorithm 2 (methods for the pyramid scheme): procedure REDUCE(H, f = 2, n = 6, S = 2) downsamples the field H by convolving it with a b-spline kernel K ← BSpline(n) of order n, f consecutive times with stride S, and returns H. Procedure EXPAND(D, f = 2, n = 6, S = 2) upsamples the field D with the rescaled kernel K ← BSpline(n) × 8: setting H ← D, for i ∈ 0 . . . f it applies H ← TransposedConv3D(H, K, S) and returns H.</p><p>Finally we are in a position to outline the multi-grid scheme for force computation, given in Algorithm 3. Schematically, we begin by reducing the high resolution grid to generate the coarse grid. Then we estimate the force components in all three directions on both grids, with the key difference that the coarse grid performs a distributed FFT on a global mesh while, for the high resolution grid, every process performs local FFTs in parallel on the locally available grids. To recover the correct total force, we need to combine these long and short range forces together. For this, we expand the long-range force grid back to the high resolution. At the same time, we remove the long-range component from the high-resolution grid with a reduce and expand operation, similar to the band-pass level generated in the Laplacian pyramid in Section 3.1. 
In the end, we combine these long and short range forces to recover the original force.</p><p>Note that in implementing a multi-grid scheme for PM evolution, only the force calculation in the kick step is modified, while the rest of the PM scheme remains the same and operates on the original high-resolution grid.</p></div>
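The REDUCE/EXPAND pair and the force split can be sketched in one dimension with NumPy as follows. This is our own stand-in, not the FlowPM implementation: the b-spline kernel coefficients, the periodic FFT-based convolution, and the function names are all assumptions, the 3D stride-2 convolutions are reduced to 1D (so the EXPAND rescaling is 2 here rather than the 2³ = 8 of the 3D case), and only a single reduce/expand level is shown.

```python
import numpy as np

def bspline_kernel(n=6):
    """1D B-spline kernel of order n: n-fold convolution of a unit box,
    normalized to sum to 1 (a stand-in for the order-6 kernel in the text)."""
    k = np.ones(2)
    for _ in range(n - 1):
        k = np.convolve(k, np.ones(2))
    return k / k.sum()

def _conv_periodic(x, g):
    """Centered convolution with periodic boundaries via FFT."""
    n, half = x.size, len(g) // 2
    kern = np.zeros(n)
    for i, gi in enumerate(g):
        kern[(i - half) % n] = gi
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(kern)))

def reduce_strided(x, k):
    """REDUCE: smoothing + 2x subsampling, i.e. a stride-2 convolution."""
    return _conv_periodic(x, k)[::2]

def expand_transposed(x, k):
    """EXPAND: transposed stride-2 convolution, i.e. zero-stuff then smooth
    with the rescaled kernel (factor 2 in 1D, 8 in 3D)."""
    up = np.zeros(2 * x.size)
    up[::2] = x
    return _conv_periodic(up, 2 * k)

# Two-level force split, conceptually:
#   long-range  = expand(force on coarse grid)
#   short-range = fine-grid force - expand(reduce(fine-grid force))
# and the total force is their sum, as in the Laplacian pyramid identity.
```

By construction, adding the short-range band-pass residual back to the expanded long-range component recovers the fine-grid field exactly, which is what makes the stitching-free combination of the two levels possible.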
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Scaling and Numerical Accuracy</head><p>In this section, we discuss the numerical accuracy and scaling of these simulations. Since the novel aspects of FlowPM are a differentiable Mesh-TensorFlow implementation and the multi-grid PM scheme, our discussion will center around a comparison with the corresponding differentiable python implementation of FastPM, which shares the same underlying PM scheme. We do not discuss the accuracy with respect to high resolution N-body simulations with correct dynamics, and refer the reader to <ref type="bibr">Feng et al. (2016)</ref> for details on such a comparison. Our tests are performed primarily on GPU nodes of the Cori supercomputer at NERSC, with each node hosting 8 NVIDIA V100 ('Volta') GPUs, each with 16 GB HBM2 memory and connected with NVLink interconnects. We compare our results and scaling with the differentiable python code run on Cori-Haswell nodes. This consists of the FastPM code for forward modeling, vmad<ref type="foot">foot_2</ref> for automated differentiation and abopt<ref type="foot">foot_3</ref> for optimization.</p><p>Algorithm 3 (force computation in the multi-grid pyramid method): procedure ForceGrids(G, mode) estimates the component force grids with 3D FFTs, G ← FFT3D(G, mode), looping over the directions i ∈ {x, y, z} and using the pyramid scheme to estimate the force meshes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Accuracy of the simulations</head><p>We begin by establishing the accuracy of FlowPM, for both the mesh and the pyramid schemes, against the corresponding FastPM simulation. We run both simulations in a 400 Mpc/h box on a 128³ grid for 10 steps from z = 9 to z = 0 with a force-resolution of 1. The initial conditions are generated at 1st order in LPT. The FastPM simulation is run on a single process while the FlowPM simulations are run with different mesh-splits to validate both of the multi-process implementations. We use a halo size of 16 grid cells for all configurations, and find that increasing the halo size increases the accuracy only marginally for all configurations. Moreover, given the more stringent memory constraints of GPUs, we run the FastPM simulation with 64-bit precision while the FlowPM simulations are run with 32-bit precision.</p><p>Figure <ref type="figure">6</ref> compares the two simulations at the level of the fields. For the FlowPM simulations, we show the configuration run on 4 GPUs, with the grid split in 2 along each of the x and y directions. The pyramid implementation shows higher residuals than the single mesh implementation, but they are sub-percent in either case.</p><p>In Figure <ref type="figure">7</ref>, we compare the two simulations more quantitatively by measuring their clustering. To compare their 2-point functions, we measure the transfer function, which is the ratio of the power spectra (P(k)) of the two fields, T(k) = (P_a(k)/P_b(k))^{1/2},</p><p>and their cross correlation, which compares the phases and hence is a measure of higher order clustering, r_c(k) = P_ab(k)/(P_a(k) P_b(k))^{1/2},</p><p>with P_ab being the cross power spectrum of the two fields.</p><p>Both the transfer function and cross correlation are within 0.01% across all scales, with the latter starting to deviate marginally on small scales. 
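The two clustering statistics can be computed as in the following one-dimensional NumPy sketch (our own helper names; the power spectrum is taken per mode here rather than binned in shells of |k| as one would in 3D): the transfer function T(k) = (P_a/P_b)^{1/2} and the cross correlation r_c(k) = P_ab/(P_a P_b)^{1/2}.

```python
import numpy as np

def powerspec_1d(a, b=None):
    """Auto- (b=None) or cross-power spectrum of periodic 1D fields, per mode."""
    ak = np.fft.rfft(a)
    bk = ak if b is None else np.fft.rfft(b)
    return np.real(ak * np.conj(bk)) / a.size

def transfer_and_crosscorr(a, b):
    """T(k) = sqrt(P_a/P_b) compares amplitudes; r_c(k) = P_ab/sqrt(P_a P_b)
    compares phases and is hence sensitive to higher order clustering."""
    pa, pb, pab = powerspec_1d(a), powerspec_1d(b), powerspec_1d(a, b)
    return np.sqrt(pa / pb), pab / np.sqrt(pa * pb)
```

For two identical fields both statistics are exactly 1 at every k; deviations of T(k) from unity measure amplitude differences, while r_c(k) < 1 signals decorrelated phases.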
Thus the choices made in terms of the convolution filters for up- and down-sampling, the halo size and other parameters of the multi-grid scheme, though not extensively explored, are adequate to reach the requisite accuracy. We anticipate this to be a grid-resolution dependent statement, but the resolution we have chosen is fairly typical of the cosmological simulations that will be run for the analysis of future cosmological surveys, and of the analytic methods that will most likely use these differentiable simulations. </p></div>
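The two summary statistics used above can be computed directly from FFTs of the fields. A minimal single-process NumPy sketch, with T(k) and r(k) as defined in the text (the helper names are ours, not FlowPM's):

```python
import numpy as np

def binned_power(f1, f2, boxsize, nbins=8):
    """Cross power spectrum of two 3D fields, binned in |k|.
    With f1 = f2 this is the auto power spectrum P(k)."""
    n = f1.shape[0]
    kf = 2 * np.pi / boxsize                          # fundamental mode of the box
    kx = np.fft.fftfreq(n, d=1.0 / n) * kf
    kz = np.fft.rfftfreq(n, d=1.0 / n) * kf
    kmag = np.sqrt(kx[:, None, None]**2 + kx[None, :, None]**2
                   + kz[None, None, :]**2)
    pk3d = (np.fft.rfftn(f1) * np.conj(np.fft.rfftn(f2))).real * boxsize**3 / n**6
    bins = np.linspace(kf, kmag.max(), nbins + 1)
    idx = np.digitize(kmag.ravel(), bins)
    pk = np.array([pk3d.ravel()[idx == i].mean() for i in range(1, nbins + 1)])
    return 0.5 * (bins[1:] + bins[:-1]), pk

def transfer_and_cross(fa, fb, boxsize):
    """T(k) = sqrt(P_b/P_a) and r(k) = P_ab / sqrt(P_a * P_b)."""
    k, paa = binned_power(fa, fa, boxsize)
    _, pbb = binned_power(fb, fb, boxsize)
    _, pab = binned_power(fa, fb, boxsize)
    return k, np.sqrt(pbb / paa), pab / np.sqrt(paa * pbb)
```

For two copies of the same field this returns T = r = 1 in every bin, and for two independent fields r fluctuates around zero, which is the sense in which r probes the phases rather than the amplitudes.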
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Scaling on Cori GPU</head><p>We perform scaling tests on the Cori GPUs for varying grid and mesh sizes, and compare against the scaling of the python implementation of FastPM. As mentioned previously, though the accuracy of the simulation is independent of the mesh layout, the performance can depend on it. Thus for the FlowPM implementation, we report timings for different layouts of our mesh. Different symbols indicate different layouts of the computational mesh, with legends indicating the number of splits along the x-direction. For a given number of processors, we use a divergent color scheme, showing configurations with more splits along the x-axis than the y-axis in red, and blue otherwise. We use the same gradient for identical overall configurations.</p><p>Since 3D FFTs are typically the most computationally expensive part of a PM simulation, we begin with their time scaling. For the Mesh-TF implementation, where we have implemented these operations as low-level Mesh-TensorFlow operators, the time scaling is shown in Figure <ref type="figure">8</ref>. For small grids, the timing for the FFT is almost completely dominated by the time spent in all-to-all communications for the transpose operations. As a result, the scaling is poor, with the timing increasing as we increase the number of processes. However, for very large grids, i.e. N ≥ 512, the compute cost approaches the communication cost for a small number of processes, and we see a slight dip up to 8 processes, which is the number of GPUs on a single node. After that, a further increase in the number of processes leads to a significant increase in time due to inter-node communications. At the same time, as can be seen from the different symbols, unbalanced splits of the mesh layout in the x and y directions lead to worse performance than a balanced split, with the difference becoming noticeable for large grid sizes. 
In Figure <ref type="figure">9</ref>, we compare the timing of the FFT with that of FastPM, which implements the PFFT algorithm for 3D FFTs <ref type="bibr">(Pippig, 2013)</ref>. The scaling of PFFT is close to linear with the increasing number of processes. However, due to the GPU accelerators, the Mesh-TF FFTs are still orders of magnitude faster in most cases, only becoming comparable for small grid sizes and large mesh sizes.</p><p>While the scaling of FFTs is sub-optimal, what is more relevant for us is the scaling of one PM step, which involves other operations in addition to FFTs. Thus, we turn next to the scaling of generating initial conditions (ICs) for the cosmological simulation. Our ICs are generated as first-order LPT displacements <ref type="bibr">(Zeldovich displacements, Zel'Dovich (1970)</ref>). Schematically, the operations involved are outlined in Algorithm 4. This provides a natural test since it involves all the operations that enter a single PM step, i.e. kick, drift, force evaluation and interpolation.</p><p>[Algorithm 4: Generating ICs. Procedure Gen-IC(N, pk): g ← sample a 3D normal random field of size N; L ← scale g by the power spectrum pk; F ← estimate the force at grid points from L (Force); d ← scale F to obtain the displacement; X ← displace particles with d (Drift); V ← scale d to obtain the particle velocities (Kick); interpolate the particles at X to obtain the density; return.]</p><p>We establish the scaling of this step in Figure <ref type="figure">10</ref> for both implementations in FlowPM, the single-mesh as well as the pyramid scheme. Firstly, note that since this step combines computationally intensive operations (like interpolation) with communication intensive operations (like FFTs), the scaling of a PM step generally has the desirable slope, with the timings improving as we increase the number of processes. 
There is, however, still a communication overhead as we move from GPUs on a single node to multiple nodes (see, for instance, the grid size of 128). Secondly, as expected given the increased number of operations in the pyramid scheme as compared to the single-mesh implementation, the single-mesh implementation is more efficient for small grids. However, its scaling is poorer than that of the pyramid scheme, which is ∼20% faster for a grid size of N = 1024.</p><p>In Figure <ref type="figure">11</ref>, we compare this implementation with the python implementation. We find that FlowPM is at least 10x faster than FastPM for the same number of processes, across all sizes (except the combination of the smallest grid and largest mesh size). Moreover, since both FlowPM and FastPM benefit from increasing the number of processes, we expect that it will be hard to match the performance of FlowPM by simply increasing the number of processes in FastPM.</p></div>
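The transpose-based structure of the distributed FFTs discussed above can be demonstrated in a few lines: FFT over the locally held axes, a global transpose (the all-to-all step that dominates the cost for small grids), then an FFT over the remaining axis. Below is a toy single-process NumPy sketch of a slab decomposition, with list entries standing in for per-process data and a concatenation standing in for the network communication; the names are ours:

```python
import numpy as np

def slab_fft3d(slabs):
    """3D FFT of a field distributed as slabs along the x axis.
    Each entry of `slabs` plays the role of one process's local data."""
    # Step 1: FFT over the two locally complete axes (y, z) on every slab.
    slabs = [np.fft.fftn(s, axes=(1, 2)) for s in slabs]
    # Step 2: the transpose; this concatenate stands in for the all-to-all
    # communication that makes the x axis locally complete.
    full = np.concatenate(slabs, axis=0)
    # Step 3: FFT over the formerly distributed x axis.
    return np.fft.fft(full, axis=0)
```

In real pencil decompositions (as in PFFT) two such transposes are needed, and it is precisely this communication volume that the multi-grid scheme reduces by evaluating the long-range force on a coarsened mesh.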
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Example application: Reconstruction of initial conditions</head><p>With the turn of the decade, the next generation of cosmological surveys will probe increasingly smaller scales, over the largest volumes, with different cosmological probes. As a result, there is renewed interest in developing forward-modeling approaches and simulation-based inference techniques to optimally extract and combine information from these surveys. Differentiable simulators such as FlowPM will make these approaches tractable in the high-dimensional data spaces of cosmology and hence play an important role in the analysis of these future surveys.</p><p>Here we demonstrate the efficacy of FlowPM with one such analytic approach that has recently received a lot of attention. We are interested in reconstructing the initial conditions of the Universe (s) from late-time observations (d) <ref type="bibr">(Wang et al., 2014;</ref><ref type="bibr">Jasche and Wandelt, 2013;</ref><ref type="bibr">Seljak et al., 2017;</ref><ref type="bibr">Modi et al., 2018;</ref><ref type="bibr">Schmittfull et al., 2018)</ref>. The late-time matter density field is highly non-Gaussian, which makes it hard to model and to use for inference. Hence, current analysis methods extract only a fraction of the available information from cosmological surveys. On the other hand, the initial conditions are known to be Gaussian to a very high degree, and hence their power spectrum (S) is a sufficient statistic. Thus, reconstructing this initial density field provides a way of developing an optimal approach for inference <ref type="bibr">(Seljak et al., 2017)</ref>. 
At the same time, the reconstructed initial field can be forward modeled to generate other latent cosmological fields of interest that can be used for supplementary analyses <ref type="bibr">(Modi et al., 2019b;</ref><ref type="bibr">Horowitz et al., 2019a)</ref>.</p><p>We approach this reconstruction in a forward-model Bayesian framework based most directly on <ref type="bibr">Seljak et al. (2017)</ref>; <ref type="bibr">Modi et al. (2018, 2019b)</ref>. We refer the reader to these for technical details, but briefly: we forward model (F) the observations of interest from the initial conditions of the Universe (s) and compare them with the observed data (d) under a likelihood model L(d|s). This can be combined with the Gaussian prior on the initial modes to write the corresponding posterior P(s|d). This posterior is either optimized to get a maximum a posteriori (MAP) estimate of the initial conditions, or it can alternatively be explored with sampling methods as in <ref type="bibr">Jasche and Wandelt (2013)</ref>. In either approach, given the multi-million dimensionality of the observations, it is necessary to use gradient-based methods for optimization and sampling, and hence a differentiable forward model is required. Particle-mesh simulations evolving the initial conditions to the final matter field will form the first part of these forward models for most cosmological observables of interest. Here we demonstrate in action, with two examples, how the inbuilt differentiability of FlowPM and its interfacing with TensorFlow make developing such approaches natural. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Reconstruction from Dark Matter</head><p>In the first toy problem, consider the reconstruction of the initial matter density field (s) from the final Eulerian dark matter density field (d = δ_e) as the observable. While the 3D Eulerian dark matter field is not observed directly in any survey, and weak lensing surveys only allow us to probe the projected matter density field, this problem is illustrative in that the forward model is only the PM simulation, i.e. F = FlowPM, with the modeled density field δ_m = F(s). We will include a more realistic observation and more complex modeling in the next example.</p><p>We assume Gaussian, uncorrelated data noise with constant variance σ² in configuration space. Then, the negative log-posterior of the initial conditions is</p><p>−log P(s|d) = ‖d − F(s)‖²/(2σ²) + s†S⁻¹s/2,   (12)</p><p>where the first term is the negative Gaussian log-likelihood and the second term is the negative Gaussian log-prior on the initial conditions s, assuming a fiducial power spectrum S.</p><p>To reconstruct the initial conditions, we minimize the negative log-posterior of Eq. 12. A snippet of Mesh-TensorFlow code for such a reconstruction using the TF Estimator API <ref type="bibr">(Cheng et al., 2017)</ref> is outlined in Listing 7.1. In the interest of brevity, we have skipped variable name and dimension declarations, data I/O and other setup code, and included only the logic directly relevant to the reconstruction. Using Estimators allows us to naturally reuse the entire machinery of TensorFlow, including but not limited to monitoring the optimization, choosing from various inbuilt optimization algorithms, and restarting the optimization from checkpoints. 
As with other Estimator APIs, the reconstruction code can be split into three components with minimal functions:</p><p>&#8226; recon_model: generates the graph for the forward model using FlowPM from the variable linear field (initial conditions), together with the metrics to optimize for reconstruction</p><p>&#8226; model_fn: creates the train and predict specs for the TF Estimator API, with the train function performing the gradient-based updates</p><p>&#8226; main: calls the estimator to perform and evaluate the reconstruction</p><p>We show the results of this reconstruction in Figures <ref type="figure">13</ref> and 14.</p><p>Here, the forward model is a 5-step PM simulation in a 400 Mpc/h box on a 128³ grid with a force resolution of 1. In addition to the gradient-based optimization outlined in Listing 7.1, we implement adiabatic methods as described in <ref type="bibr">Feng et al. (2018)</ref>; <ref type="bibr">Modi et al. (2018, 2019b)</ref> to assist the reconstruction of large scales. We skip those details in the code for the sake of simplicity, but they are included straightforwardly in this API as part of recon_model and the training spec. We compare our results with the previous reconstruction code in python based on FastPM, vmad and abopt (henceforth referred to as FastPM reconstruction). Both reconstructions follow the same adiabatic optimization schedule to promote convergence on large scales, as described in <ref type="bibr">Modi et al. (2018);</ref><ref type="bibr">Feng et al. (2018)</ref>. The FastPM reconstruction uses a gradient descent algorithm with a single-step line search. Every iteration (including gradient evaluation and line search) takes roughly ∼4.6 seconds on 4 nodes, i.e. 128 cores, on Cori Haswell at NERSC. We reconstructed for 500 steps, until the large scales converged, in a total wall-clock time of ∼2300 seconds. 
In comparison, for the FlowPM reconstruction we take advantage of the different algorithms built into TensorFlow and use Adam <ref type="bibr">(Kingma and Ba, 2014)</ref> optimization with an initial learning rate of 0.01, without any early stopping or tolerance. Every adiabatic optimization step is run for 100 iterations (see <ref type="bibr">Modi et al. (2018)</ref> for details). With every iteration taking ∼1.1 seconds on a single Cori GPU, the 500 iterations of the total optimization clocked in at ∼550 seconds of wall-clock time. These numbers, except the time per iteration, will likely change with a different learning rate and optimization algorithm, and we have not explored these in detail for this toy example.</p><p>In Figure <ref type="figure">13</ref>, we show the true initial and data fields along with the initial field reconstructed by both codes. In Figure <ref type="figure">14</ref>, we show the cross correlation and transfer function between the reconstructed and true initial fields (solid lines) as well as the final data field (dashed lines) for FlowPM. FastPM reconstructs the fields comparably, even though we use a different optimization algorithm. abopt also provides the flexibility to use the L-BFGS algorithm, which improves the FastPM reconstruction slightly with a small increase in wall-clock time, up to 4.8 seconds per iteration.</p></div>
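The structure of this MAP optimization can be made concrete without TensorFlow or a PM solver. The sketch below minimizes the same kind of Gaussian negative log-posterior for a toy 1D problem, with a linear smoothing operator standing in for FlowPM and finite differences standing in for automatic differentiation; every name, size and learning rate here is our own illustrative choice, not the paper's configuration:

```python
import numpy as np

n, sigma = 64, 0.1
k = np.fft.rfftfreq(n, d=1.0 / n)                 # modes 0..n/2
S = 1.0 / (1.0 + k**2)                            # fiducial prior power spectrum S
rng = np.random.default_rng(1)
s_true = rng.normal(size=n)                       # toy "initial conditions"

def forward(s):
    # Stand-in for F = FlowPM: a linear Gaussian smoothing that damps
    # small scales, mimicking the loss of information in the data.
    return np.fft.irfft(np.fft.rfft(s) * np.exp(-0.5 * (k / 10.0)**2), n=n)

d = forward(s_true) + sigma * rng.normal(size=n)  # noisy observed "data"

def neg_log_posterior(s):
    resid = d - forward(s)
    prior = np.sum(np.abs(np.fft.rfft(s))**2 / (S * n))  # Gaussian prior on modes
    return 0.5 * np.sum(resid**2) / sigma**2 + 0.5 * prior

def gradient(s, eps=1e-5):
    # Finite differences stand in for TensorFlow's automatic differentiation.
    g = np.zeros_like(s)
    for i in range(n):
        sp, sm = s.copy(), s.copy()
        sp[i] += eps
        sm[i] -= eps
        g[i] = (neg_log_posterior(sp) - neg_log_posterior(sm)) / (2 * eps)
    return g

s_map = np.zeros(n)                               # start from a zero initial guess
for _ in range(100):                              # plain gradient descent
    s_map -= 1e-3 * gradient(s_map)
```

The large-scale modes, which the smoothing leaves intact, converge quickly, while the prior keeps the unconstrained small-scale modes near zero; this is the same qualitative behavior the adiabatic schedule exploits in the full reconstruction.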
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Reconstruction from Halos with a Neural Network model</head><p>In the next example, we consider the more realistic case of reconstructing the initial conditions from dark matter halos under a more complex forward model. Dark matter halos and their masses are not observed directly. However, they are a good proxy for galaxies, which are the primary tracers observed in large-scale structure surveys, since they capture the challenges posed by galaxies as a discrete, sparse and biased tracer of the underlying matter field. Traditionally, halos are modeled as a biased version of the Eulerian or Lagrangian matter field and associated density fields <ref type="bibr">(Sheth and Tormen, 1999;</ref><ref type="bibr">Matsubara, 2008;</ref><ref type="bibr">Schmittfull et al., 2018)</ref>. To better capture the discrete nature of the observables at the field level, <ref type="bibr">Modi et al. (2018)</ref> proposed using neural networks to model the halo masses and positions from the Eulerian matter density field. Here we replicate their formalism and reconstruct the initial conditions from the halo mass field as the observable.</p><p>In <ref type="bibr">Modi et al. (2018)</ref>, the output of the PM simulation is supplemented with two pre-trained fully connected networks, NNp and NNm, to predict a mask of halo positions and estimates of halo masses at every grid point. These are then combined to predict a halo-mass field (M_NN) that is compared with the observed halo-mass field (M_d) up to a pre-chosen number density. They propose a heuristic likelihood for the data inspired by the log-normal scatter of halo masses with respect to the observed stellar luminosity. 
Specifically,</p><p>−log L(d|s) = Σ [log(M_R^NN + M_0) − log(M_R^d + M_0) − μ]² / (2σ²),</p><p>where M_0 is a constant varied over iterations to assist the optimization, the M_R are mass fields smoothed with a Gaussian of scale R, and μ and σ are the mass-dependent mean and standard deviation of the log-normal error model, estimated from simulations.</p><p>The code snippet in Listing 7.2 modifies the code in 7.1 by including these pre-trained neural networks NNp and NNm in the forward model to supplement FlowPM. It also demonstrates how to include associated variables like M_0 in the recon_model function, which can be updated with ease over iterations to modify the model and optimization. We again compare the results of the FastPM and FlowPM reconstructions, with the same configuration as before, in Figures <ref type="figure">16</ref> and 17. Here, the data is a halo mass field with number density n = 10⁻³ (Mpc/h)⁻³ on a 128³ grid for a 400 Mpc/h box. At this resolution and number density, only 17.5% of the grid points are non-zero, indicating the sparsity of our data. The forward model is a 5-step FastPM simulation followed by two-layer fully connected networks with 5000 and 600 weights in total, respectively (see <ref type="bibr">Modi et al. (2018)</ref> for details). The position network takes in non-local features as a flattened array, which can also be implemented as a convolution kernel of size 3 in TensorFlow. Both codes reconstruct the initial conditions equally well at the level of the cross-correlation. However, the transfer functions of the two reconstructions are quite different. In addition to the ease of implementing the reconstruction, the most significant gain of FlowPM is in terms of the speed of iteration. The time taken for one FlowPM iteration is ∼1.7 seconds while the time taken for one iteration of the FastPM reconstruction is ∼15 seconds. 
The huge increase in time for the FastPM iterations as compared to FlowPM is due to the python implementation of the single convolution kernel and its gradients required by the neural network bias model, which is implemented very efficiently in TensorFlow. Due to the increased complexity of the forward model, gradient descent with line search does not converge at all. Thus we use L-BFGS optimization for the FastPM reconstruction, which takes roughly 450 iterations. The FlowPM implementation, with the Adam algorithm as before, runs 100 iterations at every step, totaling 1500 iterations since we do not include early stopping. As a result, the FlowPM implementation takes roughly 30 minutes on a single GPU while the FastPM reconstruction takes roughly 2 hours on 128 processes.</p></div>
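The smoothed log-normal error model of Section 7.2 can be sketched in plain NumPy. This is our reading of the form described in the text: in practice μ and σ are mass-dependent tables calibrated on simulations and the comparison is made at several smoothing scales R, whereas here both are scalars and a single scale is used; the function names are ours:

```python
import numpy as np

def gaussian_smooth(field, R, boxsize):
    """Smooth a 3D field with a Gaussian of scale R via FFT: exp(-k^2 R^2 / 2)."""
    n = field.shape[0]
    kf = 2 * np.pi / boxsize
    kx = np.fft.fftfreq(n, d=1.0 / n) * kf
    kz = np.fft.rfftfreq(n, d=1.0 / n) * kf
    k2 = kx[:, None, None]**2 + kx[None, :, None]**2 + kz[None, None, :]**2
    return np.fft.irfftn(np.fft.rfftn(field) * np.exp(-0.5 * k2 * R**2),
                         s=field.shape)

def halo_neg_log_like(M_nn, M_d, M0, R, boxsize, mu=0.0, sigma=1.0):
    """Log-normal error between smoothed, offset halo-mass fields:
    sum over the grid of [log(M_R^NN + M0) - log(M_R^d + M0) - mu]^2 / (2 sigma^2).
    The offset M0 keeps the logarithm well defined on empty cells."""
    x = np.log(gaussian_smooth(M_nn, R, boxsize) + M0)
    y = np.log(gaussian_smooth(M_d, R, boxsize) + M0)
    return np.sum((x - y - mu)**2) / (2 * sigma**2)
```

Because M_0 enters only through the logarithms, annealing it from large to small values over iterations gradually sharpens the likelihood around the sparse non-zero cells, which is why it is kept as an updatable variable in recon_model.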
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion</head><p>In this work, we present FlowPM, a cosmological N-body code implemented in Mesh-TensorFlow for GPU-accelerated, distributed, and differentiable simulations, to tackle the unprecedented modeling and analytic challenges posed by the next generation of cosmological surveys.</p><p>FlowPM implements the FastPM scheme as the underlying particle-mesh gravity solver, with the gravitational force estimated with 3D Fourier transforms. For distributed computation, we use Mesh-TensorFlow as our model-parallelism framework, which gives us full flexibility in determining the computational layout and distribution of our simulation. To overcome the bottleneck of large-scale distributed 3D FFTs, we propose and implement a novel multi-grid force computation scheme based on image pyramids. We demonstrate that, without any fine tuning, this method is able to achieve sub-percent accuracy while reducing the communication data-volume by a factor of 64x. At the same time, given the GPU acceleration, FlowPM is 4-20x faster than the corresponding python FastPM simulation, depending on the resolution and compute distribution.</p><p>However, the main advantage of FlowPM is the differentiability of the simulation and its natural interfacing with deep learning frameworks. Built entirely in TensorFlow, the simulations are differentiable with respect to every component. This allows for the development of novel analytic methods, such as simulation-based inference and the reconstruction of cosmological fields, that were hitherto intractable due to the multi-million dimensionality of these problems. We demonstrate with two examples, providing the logical code snippets, how FlowPM makes the latter straightforward by using the TensorFlow Estimator API to optimize in a 128³-dimensional space. It is also able to naturally interface with machine learning frameworks as part of the forward model in this reconstruction. 
Lastly, due to its speed, it is able to achieve comparable accuracy to the FastPM-based reconstruction in 5 times less wall-clock time, and 640 times less computing time.</p><p>FlowPM is open-source and publicly available on our Github repo at <ref type="url">https://github.com/modichirag/flowpm</ref>. We provide example code to do forward PM evolution with single-grid and multi-grid force computation. We also provide code for the reconstruction for both examples demonstrated in the paper. We hope that this will encourage the community to use this novel tool and develop scientific methods to tackle the next generation of cosmological surveys.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0"><p>https://github.com/tensorflow/mesh</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1"><p>In case of the pyramid scheme, the halo size is defined on the high resolution grid and we reduce it by the corresponding downsampling factor for the coarse grid.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_2"><p>https://github.com/rainwoodman/vmad</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_3"><p>https://github.com/bccp/abopt</p></note>
		</body>
		</text>
</TEI>
