<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10229897</idno>
					<idno type="doi">10.1109/ICPADS51040.2020.00035</idno>
					<title level='j'>2020 IEEE 26th International Conference on Parallel and Distributed Systems(ICVPADS)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Wassapon Watanakeesuntorn</author><author>Keichi Takahashi</author><author>Kohei Ichikawa</author><author>Joseph Park</author><author>George Sugihara</author><author>Ryousei Takano</author><author>Jason Haga</author><author>Gerald M. Pao</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Empirical Dynamic Modeling (EDM) is a nonlinear time series causal inference framework. The latest implementation of EDM, cppEDM, has only been used for small datasets due to computational cost. With the growth of data collection capabilities, there is a great need to identify causal relationships in large datasets. We present mpEDM, a parallel distributed implementation of EDM optimized for modern GPU-centric supercomputers. We improve the original algorithm to reduce redundant computation and optimize the implementation to fully utilize hardware resources such as GPUs and SIMD units. As a use case, we run mpEDM on AI Bridging Cloud Infrastructure (ABCI) using datasets of an entire animal brain sampled at single neuron resolution to identify dynamical causation patterns across the brain. mpEDM is 1,530× faster than cppEDM and a dataset containing 101,729 neuron was analyzed in 199 seconds on 512 nodes. This is the largest EDM causal inference achieved to date.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Reverse-engineering and building a digital reconstruction of the brain is one of the greatest scientific challenges of today. A recent study on the mouse cortex <ref type="bibr">[1]</ref> showed that 97% of the possible connections between neurons exist. This result suggests that it is likely more informative to investigate the dynamic interactions between neurons rather than the static connectivity between them to fully understand the function of the brain. Based on this insight, we are building mathematical and computational tools to analyze the dynamic interactions between neurons based on Empirical Dynamic Modeling (EDM).</p><p>EDM is a nonlinear time series causal inference framework based on the generalized Takens' embedding theorem on state space reconstruction <ref type="bibr">[2]</ref>. EDM is used to study and predict the behavior of nonlinear dynamical systems. Convergent Cross Mapping (CCM) is one of the EDM algorithms that allows to estimate the existence and strength of the causal strength between two time series in a dynamical system <ref type="bibr">[3]</ref>.</p><p>In this study, we utilize CCM to infer the causal relationships between every neuron in an entire brain and construct a causal map that describes the dynamic interactions among neurons. For this purpose, we have recorded the neural activity (i.e. firing rate) of an entire larval zebrafish brain at singe-neuron resolution by using light sheet fluorescence microscopy. The original implementation of EDM, cppEDM, has mostly been used for individual time series of relatively short length and and mostly small numbers of variables for its computational cost. Since a larval zebrafish brain contains approximately 10 5 neurons, a staggering number of 10 10 cross mappings need to be performed in total. CCM of this enormous scale has never been achieved so far because of the sheer amount of computation required.</p><p>The goal of this paper is to develop a highly scalable and optimized implementation of EDM that is able to analyze the whole zebrafish brain dataset within a reasonable time. We present mpEDM 1 , a parallel distributed implementation of EDM optimized for execution on modern GPU-centric supercomputers. We improve the original algorithm in cppEDM to reduce redundant computation and optimize the implementation to fully utilize hardware resources such as GPUs and SIMD units.</p><p>Our evaluation on AI Bridging Cloud Infrastructure (ABCI), Japan's most high performance supercomputer as of today, demonstrated the unprecedented performance of mpEDM. mpEDM was used to analyze a dataset containing the activity of 53,053 neurons in only 20 seconds using 512 ABCI nodes. In contrast, cppEDM took 8.5 hours to analyze the same dataset using the same number of nodes <ref type="bibr">[4]</ref>. Furthermore, mpEDM analyzed a larger dataset containing 101,729 neurons in 199 seconds on 512 nodes. To our knowledge, this is the largest CCM calculation achieved to date. This result shows the potential for mpEDM and ABCI to analyze even larger datasets in the future.</p><p>The rest of this paper is structured as follows. Section II describes the background of this research and EDM algorithm. Section III explains our proposal to improve the algorithm of the EDM for parallelization and to support GPU architecture. Section IV evaluates the performance of mpEDM and presents the scientific outcomes obtained with mpEDM. Finally, section V concludes this paper and discusses future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND A. Causal Map of the Zebrafish Brain at Single Neuron Resolution</head><p>To understand the human brain activity dynamics with a complexity of 10 11 neurons and 10 15 synapses at single neuron resolution is currently a technically impossible task. Similarly a mouse brain with 7.6&#215;10 7 neurons is not tractable because mammalian brains are opaque and it is impossible to image a complete mouse brain. With this in mind, the zebrafish embryo is an attractive model system with 120,000 neurons and transgenic technology as well as natural brain transparency. The zebrafish embryo is sufficiently complex to exhibit interesting behaviors and is technologically feasible to study to infer basic principles of systems neuroscience. Even in the case of the larval zebrafish with about 120,000 neurons we do not have the physical connectivity map, that is the connectome of the larval zebrafish, nor do we have the synaptic strengths which are pieces of information required to understand the brain starting from the physical connectivity.</p><p>Complicating this notion, recent work from the mouse brain shows that 97% of possible physical connections exist within the mouse cortex thus making it difficult to analyze. Given this difficulty, using an analogy of a city; to understand how a city works it will be easier to understand the city from the traffic patterns than from the street map. Thus, we wished to analyze the fish brain at single neuron resolution from a network activity dynamics perspective. Although imperfect, we used neural activity imaging data of an entire brain at single cell resolution in a behaving larval zebrafish (a transparent vertebrate) to extract all relationships in an intact vertebrate brain.</p><p>To achieve this, we recorded whole brain neural activity patterns in multiple animals experiencing hypoxia using a Selective Plane Illumination Microscope (SPIM) <ref type="bibr">[5]</ref>. We obtained data from the entire 5-day-old larval brain (120,000 neurons) at 2 Hz in response to hypoxia for varying amounts of time typically ranging from 1,500 time steps to up to 8,000+ <ref type="bibr">[6]</ref>.</p><p>CCM allows the inference of causation from nonlinear time series even with substantial noise and complete absence of Fig. <ref type="figure">1</ref>: Basic Idea Behind Empirical Dynamic Modeling correlation <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>. We used CCM and other tools from the EDM framework for the inference of existence, strength and sign of causal relationships within the neural activity network of the transparent larval fish brain <ref type="bibr">[5]</ref>. CCM determines whether and how much causality exists between individual neurons. The adjacency in the network is determined by time delay cross mapping <ref type="bibr">[8]</ref>. Predictive accuracy values give the interaction strength allow us to infer relationships within the neural network without observing the physical connectivity. As a test case, we have collected multiple data sets of lengths around 1600 time steps at 2 Hz which contain 50,000-80,000 active neurons in most cases. We have analyzed this data and show that the generated time series are suitable for causal network inference using the EDM framework and thus demonstrated a proof of principle of computational tractability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Empirical Dynamic Modeling</head><p>EDM is a mathematical framework designed for studying nonlinear dynamical systems. EDM is based upon the concept of state space reconstruction (SSR) <ref type="bibr">[9]</ref>. Takens' theorem states that the attractor manifold of a multivariate dynamical system can be reconstructed from time lagged coordinates of a single time series variable <ref type="bibr">[10]</ref>. Figure <ref type="figure">1</ref> illustrates the concept of state spaces reconstruction. In this example, three causally related time series variables x(t), y(t) and z(t) that constitute a dynamical system form an attractor manifold M in the state space. A shadow manifold M x can be reconstructed using the time delayed embeddings of x (x(t), x(t&#964; ), x(t -2&#964; ), . . . ), where &#964; denotes the time lag. In the same manner, lags of y form a shadow manifold M y . Takens' theorem states that the reconstructed manifolds M x and M y preserve essential mathematical properties (such as the topology) of the true manifold M . In particular, there exist smooth mappings between M , M x , and M y , suggesting that neighbors in M x are neighbors in M y as well.</p><p>Simplex projection is a nonlinear forecasting algorithm often used for estimating the dimensionality of a dynamical system. In simplex projection, the input time series is split into two halves: library x and target y. Both halves are embedded into E-dimensional state space by using delayed embeddings. Given a point y(t p ) = (y(t p ), y(t p -1), . . . , y(t p -E + 1)) in the target state space, its E + 1 nearest neighbors (i.e. vertices of the simplex enclosing y(t p )) are searched from the embedded library. Suppose those neighbors are x 1 (t 1 ), x 2 (t 2 ), . . . , x E+1 (t E+1 ). A forecast y(t p + 1) can be made by averaging the future of the neighbors in the library: x 1 (t 1 + 1), x 2 (t 2 + 1), . . . , x E+1 (t E+1 + 1). This prediction is performed for every point in y and the results are compared with the true y to evaluate the prediction accuracy. This entire procedure is repeated for different E values and the E that achieves the highest prediction accuracy is determined as the optimal embedding dimension of the dynamical system. CCM determines the existence and strength of causality between two time series variables <ref type="bibr">[11]</ref>. It works similar to simplex projection, but instead of predicting within a single time series, CCM predicts one time series from another. If y can be predicted from x with significant accuracy, we conclude that y CCM causes x.</p><p>There have been extensive studies on causal inference. Structural Causal Model (SCM) is one of the most popular causal models <ref type="bibr">[12]</ref> based on statistical modeling of equilibrium systems. In contrast to SCM, EDM is based on the principle of state-space reconstruction shown in Takens' theorem of non-equilibrium systems. Granger causality is another causal inference technique based on statistical modeling <ref type="bibr">[13]</ref>. Granger causality however as stated by Granger himself, only works with linear and stochastic systems and cannot be applied to a nonlinear dynamical system. Compared to these alternatives, EDM is better suited to find the causal relationships in a nonlinear dynamical system such as the brain. Tajima et al. <ref type="bibr">[14]</ref> also applied embedding theorems in nonlinear state-space reconstruction to analyze a dynamic system. They also built on the causality inference method from Sugihara et al. <ref type="bibr">[3]</ref> in their work.</p><p>EDM has been successfully applied to diverse research fields <ref type="bibr">[15]</ref>. In neuroscience, CCM was applied to identify the effective connectivity between brain areas from magnetoencephalography (MEG) data <ref type="bibr">[16]</ref>. In ecology, Grziwotz et al. found the causal relationships between the environment and mosquito abundance by using CCM <ref type="bibr">[17]</ref>. Environmental factors, such as temperature, precipitation, dew point, air pressure, and mean tide level were identified to causally affect mosquito abundance. Ma et al. applied simplex projection to forecast wind generation <ref type="bibr">[18]</ref>. In <ref type="bibr">[19]</ref>, an EDM algorithm called S-Map <ref type="bibr">[20]</ref> was used to find the relationship between harvested and unharvested fish in terms of size, age, and others. Luo et al. applied CCM to estimate the causal relationships of user behavior in an online social network <ref type="bibr">[21]</ref>. These use cases demonstrate the wide applicability of EDM to analyze nonlinear dynamical systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. cppEDM</head><p>cppEDM <ref type="bibr">[22]</ref> is the latest implementation of the EDM framework. cppEDM is a general purpose C++ library used as a backend by rEDM <ref type="bibr">[23]</ref> and pyEDM <ref type="bibr">[24]</ref>, which are EDM implementations for the R and Python language, respectively.</p><p>We have identified two major issues in cppEDM that hinder large-scale analysis on HPC systems: redundant computation and lack of GPU support. Since cppEDM is a general purpose library, it provides a one-to-one cross mapping function to identify the causality between a selected combinations of time series variables. The all-to-all cross mapping function is implemented by reusing the one-to-one cross mapping function. This results in redundant computation. Additionally, cppEDM is a reference implementation of EDM; therefore, it is not optimized for a specific hardware architecture such as GPUs. Furthermore, cppEDM suffers from significant load imbalance among workers because it performs static decomposition of the problem. In fact, a performance evaluation in a previous work showed that the runtime of workers varied greatly from 5 hours to 8.5 hours <ref type="bibr">[4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. MPEDM</head><p>In this section, we first outline the original causal inference algorithm in cppEDM. Then, we describe the algorithmic improvement and the design of the inter-node and intra-node parallelization in mpEDM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Original Algorithm</head><p>Algorithm 1 outlines the causal inference algorithm in cppEDM. The input to the algorithm is an L &#215; N array ts, where L is the number of time steps within a time series and N is the number of time series. In addition to the input dataset, maximum embedding dimension E max and time lag &#964; need to be supplied. The output is an N &#215; N casual map &#961;. The algorithm consists of two phases: (1) simplex projection and (2) CCM. Simplex projection finds the optimal embedding dimension for each time series. CCM estimates the causal relationship between two time series using the optimal embedded dimension obtained in the first phase. Note that in the original definition of CCM, predictions are made multiple times using randomly subsampled library sets of different sizes and it is tested whether increasing the library set size improves the prediction accuracy. In this research, we excluded this step since the convergence test passes in most cases if the prediction using the full library set achieves high accuracy.</p><p>In the first phase, simplex projection (line 1-11) takes a time series in the dataset and splits into into library, the first half, and target, the second half (line 3-4). Next, both library and target are embedded into E-dimensional space using time delayed embeddings. A k-nearest neighbors (kNN) search is performed in the state space to find the E + 1 nearest target points from each library point (line 5). The search results are stored in two lookup tables indices and distances, both of which are two-dimensional arrays of shape L &#215; (E + 1). Element (i, j) in the indices array is the index of the j-th nearest target point from library point i, whereas element (i, j) in the distances array is the Euclidean distance between the library point i and its j-th nearest target point. The distances array is then converted to exponential scale and each row is normalized (line 6). A one step ahead prediction of a target point is made by <ref type="bibr">(1)</ref> obtaining the indices of its E + 1 library neighbors from indices, (2) obtaining the one step ahead values of those library points from library and (3) computing a weighted average of the future library points using distances (line 7). Finally, Pearson's correlation coefficient is computed to evaluate the predictive skill of the simplex projection using the prediction results and real observed withheld values (line 8). This is repeated for every E ranging from 1 to E max (&#8804;20 in practice). The E value that achieves the highest accuracy is determined to be the optimal embedding dimension for the time series and stored in optE (line 10).</p><p>In the second phase, CCM (line 12-19) works similar to simplex projection but predicts between two different time series. A given library time series is used to cross predict another target time series in the dataset to evaluate whether the latter is the cause of the former. It computes and normalizes the kNN tables from the library time series (line 14-15) and uses the tables to predict the target time series (line 16). Note that simplex projection predicts within the same time series while CCM predicts across two different time series. Therefore, the kNN tables computed in the simplex projection phase cannot be reused in the CCM phase. The correlation between the predicted values and the actual values represents strength of causality (line 17). In this manner, causal inference is performed for all combinations of time series in the dataset.</p><p>We have profiled cppEDM and found out that over 97% of the total runtime is spent in the kNN search. In addition, we have discovered that the time delayed embedding in cppEDM replicates the time series E + 1 times and causes significant memory overhead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Improved Algorithm</head><p>The key observation behind our algorithmic improvement is that the kNN lookup table for CCM is constructed from the library time series only, and the target time series is not used. This suggests that once the kNN lookup table is computed for a particular library time series, we can reuse the precomputed table to make predictions for every target time series. This improvement is trivial if N is in the same order as E max , which was the case in previous use cases of EDM. However, in our use case N is equal to the number of active neurons in a zebrafish brain, which is roughly 10 5 . Therefore, the potential speedup becomes significantly large.</p><p>Algorithm 2 shows the pseudocode of the improved causal inference algorithm in mpEDM. The simplex projection algorithm is unchanged from cppEDM but its kNN and lookup functions are parallelized and optimized. The CCM algorithm in mpEDM is improved in the following manner. For each library time series, we first compute the kNN lookup tables for every embedding dimension ranging from 1 to E max (line 4-7). Then, we iterate through all target time series and use the precomputed lookup table for the optimal embedding dimension of the target time series to predict the target time series (line 9-10). Finally, we compute the correlation between the prediction and the actual target to estimate the causality (line 11).</p><p>Algorithms 3 outlines the kNN function for CPU. We first calculate the all-to-all distances between every library and target point in the state space. Note that we do not explicitly create the time series embeddings on memory but we compute them on-the-fly to reduce memory footprint and increase cache hit. In addition, both indices and distances are stored in row-major format to match the access pattern. Then, each row in the distances and indices arrays is partially sorted in descending order using the distances as sort keys. We use heap sort to implement partial sort. After the sorting, both arrays are trimmed from L &#215; L to L &#215; (E + 1) and returned. Algorithm 4 shows the kNN function for GPU. In the GPU version, we create time series embeddings on the host and transfer them to the device. The kNN search is executed on the GPU and the resulting kNN tables are returned to the host.</p><p>Algorithm </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Inter-Node Parallelism</head><p>To distribute the work across multiple compute nodes, we naturally choose the loops with the highest granularity. That is, the two outermost loops that iterate over the time series (line 1-2 and 3-13 in Algorithm 2). We implement a simple master-worker framework based on MPI to distribute these loops. To dynamically distribute work and mitigate load imbalance among workers, we adopt self-scheduling in our masterworker framework. In self-scheduling, the master accounts and dispatches tasks to workers. Each worker performs assigned tasks, and once it completes, the worker asks the master for a new task.</p><p>The high-level organization of the inter-node parallelism is as follows. First, the workers execute the embedding dimension phase. The optimal embedding dimension for each time series is reported back to the master. Once the first Algorithm 3: kNN for CPU Input: library and target time series, embedding dimension E, time lag &#964; Output: Arrays diatances and indices for lookup // All-to-all distance calculation Copy indices and distances to host phase is complete, the master broadcasts optE to all workers. Subsequently, the workers execute the all-to-all CCM phase. The final results are written to the file system by each worker to alleviate the load on the master.</p><p>Both the input dataset and the inferred causal map are stored as HDF5 <ref type="bibr">[25]</ref> files for easy integration with the pre/post processing workflow. The workers read the input HDF5 file in parallel and keep the entire dataset on memory during the execution. Every time a worker completes a cross map, the worker writes an element of the causal map asynchronously to the output HDF5 file. This small random write pattern, however, is known to be slow on parallel file systems. In fact,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 5: Lookup</head><p>Input: Array of indices and distances, target time series, embedding dimension E of target Output: Prediction of the time series prediction</p><p>we observed that write I/O becomes a significant bottleneck of the application on GPFS. We therefore take advantage of BeeOND (BeeGFS On Demand) <ref type="bibr">[26]</ref>, the burst buffer deployed on ABCI. BeeOND combines local SSDs installed on the compute nodes and provides an on-demand parallel file system to a job. The workers write the results to BeeOND to minimize I/O overhead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Intra-Node Parallelism</head><p>We focus our efforts to parallelize and optimize the kNN kernel since it is the primary bottleneck in cppEDM as discussed in section III-A. We design and implement kNN kernels for both CPU and GPU architecture to ensure that mpEDM can efficiently run on a wide variety of computing platforms. In the kNN kernel for CPU shown in Algorithm 3, the two loops that iterate over the time steps within a time series are parallelized using OpenMP (line 1-9 and 10-13 in Algorithm 3). We also utilize OpenMP 4.0 SIMD directives to vectorize the innermost loop explicitly. Note that the nested loops are ordered such that the memory accesses in the innermost loop are contiguous.</p><p>In the kNN kernel for GPU shown in Algorithm 4, we take advantage of ArrayFire <ref type="bibr">[27]</ref>, a highly optimized library for GPU-accelerated computing. ArrayFire provides backends for CUDA, OpenCL and CPU, but in this paper we only use the CUDA backend since ABCI is installed with Tesla V100 GPUs. The kNN algorithm implemented in ArrayFire is essentially the same as our CPU implementation. ArrayFire uses a block-wide parallel radix sort implementation in the CUDA UnBound (CUB) template library. Since each ABCI compute node is equipped with four GPUs, we also distribute the work across multiple GPUs. To achieve this, the loop that iterates over E (line 4-7 in Algorithm 2) is parallelized such that each GPU computes lookup tables for one or more E. We dynamically schedule this loop to ensure load balancing across GPUs because the runtime of the kNN kernel depends on E as discussed in section III-B.</p><p>For the lookup kernel shown in Algorithm 5, we currently only have a CPU version of this kernel. The time step loop is parallelized using OpenMP (line 1-8 in Algorithm 5). This kernel is heavily memory bandwidth bound since it requires random memory access.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EVALUATION</head><p>The computational performance of mpEDM was evaluated on ABCI. Furthermore, we present the scientific outcomes obtained using mpEDM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Evaluation Environment</head><p>ABCI <ref type="bibr">[28]</ref> is the world's first large-scale Open AI Computing Infrastructure, which is constructed and operated by the National Institute of Advanced Industrial Science and Technology (AIST). According to the latest TOP500 list published in November 2019 <ref type="bibr">[29]</ref>, ABCI is the most powerful supercomputer in Japan and the 8th in the world. ABCI has 1,088 compute nodes, each equipped with two 20-core Intel Xeon Gold 6148 CPUs, four NVIDIA Tesla V100 SXM2 (16GB) GPUs, 384GB of RAM and 1.6TB of local NVMe SSD. The parallel file system is based on GPFS with a total capacity of 22PB.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Performance Evaluation</head><p>We compared mpEDM with cppEDM from the following three aspects: total runtime, parallel scalability and impact of dataset size on the runtime. We used three real-world datasets recorded from larval zebrafish under different conditions. Table <ref type="table">I</ref> shows the list of datasets used in the evaluation. 1) Total Runtime: mpEDM shows significantly higher performance compared to cppEDM. Table <ref type="table">II</ref> shows the performance comparison between cppEDM and mpEDM. cppEDM took 8.5 hours to analyze the Fish1 Normo dataset using 512 ABCI nodes <ref type="bibr">[4]</ref>, whereas mpEDM took only 20 seconds to analyze the same dataset using 512 ABCI nodes with GPU architecture. The result shows that mpEDM is 1,530&#215; faster than cppEDM. Moreover, mpEDM finished the causal inference of two larger datasets: Subject6 in 101 seconds and Subject11 <ref type="bibr">[6]</ref> in 199 seconds.  2) Parallel Scalability: We measured the parallel scalability of mpEDM by varying the number of workers and measuring the runtime of mpEDM with and without GPU. We used the largest Subject11 in this Figure <ref type="figure">2</ref> shows the strong scaling performance of mpEDM. In the Single Node setup, mpEDM is executed on a single node without MPI. In the X Workers setup, mpEDM is executed with MPI using the specified number of workers. We measured up to 511 workers since ABCI allows a maximum of 512 nodes per job (except for jobs running under the ABCI grand challenge program, which can use the full 1,088 nodes). The result shows that the GPU version runs as twice as fast as the CPU version in every case. We noticed that the CPU version ran in the single worker setup 10% slower than the single node setup. We believe this slowdown is caused from the interference between the background tasks performed by the BeeOND daemon and the computation in mpEDM. This does not happen with the GPU version because the average CPU utilization is lower than the CPU version.</p><p>Figure <ref type="figure">3</ref> shows the relative speedup of the multi-node setup in relation to the single node setup. It reveals that the speedup is nearly linear with both GPU and CPU. However, the speedup of the GPU version drops when the number of nodes is 64 or more.</p><p>We measured the breakdown of each phase to investigate the cause behind the scalability decline. We compared 32 workers and 128 workers since the GPU version declines beyond 64 nodes. Figures <ref type="figure">4</ref> and<ref type="figure">5</ref> show the breakdown of average runtime for processing a single time series in simplex projection and CCM. The two figures clearly indicate that memory copy, MPI communication and I/O are not bottlenecks and do not significantly increase with the number of workers. However, the kNN function becomes slower when the number of workers increases. We found out that the kNN search for the first time series processed on a worker is significantly slower (ranging from 3.3 seconds to 16.4 seconds) than the subsequent ones. We believe this is caused by the initialization process of the GPUs. To verify this, we created a simple program that initializes the GPUs and allocates some GPU memory on a single node. We submitted a job that run this program 100 times and measured the initialization time. The result revealed that the initialization time follows a long-tailed distribution: the median was 4.6 seconds while the maximum was 22.9 seconds. This suggests that a few stragglers impact the total runtime and degrade the scalability as the number of workers increases.</p><p>3) Impact of Dataset Size: We evaluated how the size of the dataset impacts the runtime of mpEDM using dummy datasets with different sizes. Furthermore, we measured the time spent in each function. We also measured the speedup of the GPU version over the CPU version with varying number of time steps.</p><p>Figures <ref type="figure">6</ref> and<ref type="figure">7</ref> show the runtime of mpEDM when increasing the number of time series and time steps, respectively. We confirmed that the increase of runtime is not bigger than the increase predicted from the time complexity. We also confirmed that CCM consumed the majority of the total runtime and other tasks including I/O and MPI communication Figures <ref type="figure">8a</ref> and<ref type="figure">8b</ref> show the runtime breakdown of each function in CCM when increasing the number of time series and time steps, respectively. Figure <ref type="figure">8a</ref> shows that the runtime of the lookup function becomes dominant when increasing the number of time series. On the other hand, Fig. <ref type="figure">8b</ref> shows that the runtime of the kNN function becomes dominant when increasing the number of time steps. These trends can be explained from the time complexity analysis of each algorithm described in section III-B.</p><p>Figure <ref type="figure">9</ref> shows the speedup of the GPU version over the CPU version when varying number of time steps. We compared the performance between a single CPU socket and one or more GPUs to evaluate the GPU speedup. Evidently, the GPU speedup increases with the number of time steps. Single GPU is slower than the CPU if the number of time steps is 2,000 or less. This is because of the overhead inherent to offloading computation to the GPU. However, single GPU consistently surpasses the CPU if the number of time steps if 5,000 or more. If the number of time steps is 40,000, the speedup of a single GPU is 3.5 times compared to CPU. When four GPUs are used, the speed up is 13.4 times.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Scientific Outcomes</head><p>Figure <ref type="figure">10</ref> shows the scientific outcomes obtained using mpEDM. Our results showed that we could determine the causal connectivity across the entire brain across two behaviors. This shows that depending on task, the network of relationships between individual neurons change and become more connected, homogeneous and simplified with a goal directed task. In the resulting network connectivity increased and became simpler. Furthermore, we were able identify individual neurons that integrate signals from multiple other neurons that contain decision making information. These neurons allow the prediction of fish turn behaviors while swimming and generate low dimensional manifold models based on data geometry that are able to predict the fish's behavior at least 0.5 seconds (a single time step) ahead. A three dimensional projection of one of these manifolds is shown in Fig. <ref type="figure">10 (G)</ref>, where entering the loop predicts turn behavior. Based on the combined activity of two neurons and information on prior states we are able to predict when the fish will turn. Beyond this, this is the first map of causal connectivity of any vertebrate animal at single neuron resolution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION &amp; FUTURE WORK</head><p>EDM is a nonlinear time series analysis framework proven its applicability in various fields. However, EDM has only been applied to small datasets due to its computational cost. In this paper, we designed and implemented mpEDM, a parallel distributed implementation of EDM optimized for execution on modern GPU-centric supercomputers. mpEDM improves the EDM algorithm to reduce redundant computation and optimizes the implementation to fully utilize hardware resources such as GPUs and SIMD units. mpEDM took only 20 seconds to finish the causal inference of a dataset containing the activity of 53,053 zebrafish neurons on 512 ABCI nodes. This is 1,530&#215; faster than cppEDM, the current standard implementation of EDM. Moreover, mpEDM could analyze a 13&#215; larger dataset in 199 seconds. This is the largest EDM causal inference achieved to date.</p><p>We will continue to optimize the performance of mpEDM. As discussed in section IV-B, we need to improve the performance of the lookup as it becomes the primary bottleneck when we scale up the number of time series further. We will also explore other efficient implementations of nearest neighbor search on GPUs. Currently, mpEDM uses the exact kNN search implementation provided by ArrayFire. There exist many studies on efficient Approximate Nearest Neighbor (ANN) search <ref type="bibr">[30]</ref>, <ref type="bibr">[31]</ref>. However, it is unclear how ANN affects the accuracy of EDM predictions. Another well-known approach is to use spatial indices such as KD-trees and Balltrees to accelerate kNN search <ref type="bibr">[32]</ref>, <ref type="bibr">[33]</ref>.</p><p>Additionally, EDM algorithms other than simplex projection and CCM will be implemented in mpEDM to expand mpEDM to a standard implementation of EDM on HPC systems. We will make this EDM library widely available to the community with a hope to assist scientists in need to analyze large-scale time series datasets of nonlinear dynamical systems.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Authorized licensed use limited to: BOSTON UNIVERSITY. Downloaded on May 19,2021 at 00:11:38 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
