- Publication Date:
- NSF-PAR ID:
- Journal Name:
- Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
- Page Range or eLocation-ID:
- 231 to 242
- Sponsoring Org:
- National Science Foundation
More Like this
Deep neural network (DNN) accelerators as an example of domain-specific architecture have demonstrated great success in DNN inference. However, the architecture acceleration for equally important DNN training has not yet been fully studied. With data forward, error backward and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because the recent research demonstrates a diminishing specialization return, namely, “accelerator wall”, we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations on multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remaining challenging. We present ACCPAR, a principled and systematic method of determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, ACCPAR considers the complete tensor partition space and can reveal previously unknown new parallelism configurations. ACCPAR optimizes the performance based on a cost model that takes into account both computation and communication costs of a heterogeneous execution environment. Hence, our method can avoid the drawbacks of existing approaches that use communication as a proxy of the performance. The enhanced flexibility of tensor partitioning in ACCPAR allows the flexible ratio of computations to be distributed amongmore »
Obeid, I. ; Selesnik, I. ; Picone, J. (Ed.)The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data . This heterogeneous cluster uses innovative scheduling technology, Slurm , that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus . We use TensorFlow  as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should producemore »
Accurate synthetic seismic wavefields can now be computed in 3-D earth models using the spectral element method (SEM), which helps improve resolution in full waveform global tomography. However, computational costs are still a challenge. These costs can be reduced by implementing a source stacking method, in which multiple earthquake sources are simultaneously triggered in only one teleseismic SEM simulation. One drawback of this approach is the perceived loss of resolution at depth, in particular because high-amplitude fundamental mode surface waves dominate the summed waveforms, without the possibility of windowing and weighting as in conventional waveform tomography.
This can be addressed by redefining the cost-function and computing the cross-correlation wavefield between pairs of stations before each inversion iteration. While the Green’s function between the two stations is not reconstructed as well as in the case of ambient noise tomography, where sources are distributed more uniformly around the globe, this is not a drawback, since the same processing is applied to the 3-D synthetics and to the data, and the source parameters are known to a good approximation. By doing so, we can separate time windows with large energy arrivals corresponding to fundamental mode surface waves. This opens the possibility ofmore »
Here we present the results of proof of concept testing of such an approach for a synthetic 3-component long period waveform data set (periods longer than 60 s), computed for 273 globally distributed events in a simple toy 3-D radially anisotropic upper mantle model which contains shear wave anomalies at different scales. We compare the results of inversion of 10 000 s long stacked time-series, starting from a 1-D model, using source stacked waveforms and station-pair cross-correlations of these stacked waveforms in the definition of the cost function. We compute the gradient and the Hessian using normal mode perturbation theory, which avoids the problem of cross-talk encountered when forming the gradient using an adjoint approach. We perform inversions with and without realistic noise added and show that the model can be recovered equally well using one or the other cost function.
The proposed approach is computationally very efficient. While application to more realistic synthetic data sets is beyond the scope of this paper, as well as to real data, since that requires additional steps to account for such issues as missing data, we illustrate how this methodology can help inform first order questions such as model resolution in the presence of noise, and trade-offs between different physical parameters (anisotropy, attenuation, crustal structure, etc.) that would be computationally very costly to address adequately, when using conventional full waveform tomography based on single-event wavefield computations.
We present the design and implementation details of a geometric multigrid method on adaptively refined meshes for massively parallel computations. The method uses local smoothing on the refined part of the mesh. Partitioning is achieved by using a space filling curve for the leaf mesh and distributing ancestors in the hierarchy based on the leaves. We present a model of the efficiency of mesh hierarchy distribution and compare its predictions to runtime measurements. The algorithm is implemented as part of the deal.II finite-element library and as such available to the public.
We address the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over those, including retrieving full or partial versions, and evolution histories for specific keys. We motivate the increasing need for such a system in a variety of application domains, carefully explore the design space for building such a system and the various storage-computation-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. We propose a novel system architecture that satisfies the key desiderata for such a system, and offers simple tuning knobs that allow adapting to a specific data and query workload. Our system is intended to act as a layer on top of a distributed key-value store that houses the raw data as well as any indexes. We design novel off-line storage layout algorithms for efficiently partitioning the data to minimize the storage costs while keeping the retrieval costs low. We also present an online algorithm to handle new versions being added to system. Using extensive experiments on large datasets, we demonstrate that our system operates at the scale required in most practical scenarios andmore »