Deep neural network (DNN) accelerators, as an example of domain-specific architecture, have demonstrated great success in DNN inference. However, architectural acceleration for the equally important DNN training has not yet been fully studied. With forward data propagation, error backpropagation, and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because recent research demonstrates diminishing returns from specialization, the so-called “accelerator wall”, we believe a promising approach is to exploit coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations across multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remains challenging. We present ACCPAR, a principled and systematic method for determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, ACCPAR considers the complete tensor partition space and can reveal previously unknown parallelism configurations. ACCPAR optimizes performance based on a cost model that accounts for both the computation and communication costs of a heterogeneous execution environment; our method thus avoids the drawbacks of existing approaches that use communication as a proxy for performance. The enhanced flexibility of tensor partitioning in ACCPAR allows a flexible ratio of computations to be distributed among accelerators with different performance. The proposed search algorithm is also applicable to the emerging multi-path patterns in modern DNNs such as ResNet. We simulate ACCPAR on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as AlexNet, the VGG series, and the ResNet series. The average performance improvements of the state-of-the-art “one weird trick” (OWT), HYPAR, and ACCPAR, normalized to the baseline data-parallelism scheme in which each accelerator replicates the model and processes different input data in parallel, are 2.98×, 3.78×, and 6.30×, respectively.
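As a rough illustration of how a heterogeneity-aware cost model can drive partitioning decisions, the sketch below splits a training batch between two accelerators so that their estimated per-step times (compute plus communication) are balanced. It is not ACCPAR's actual cost model or search algorithm; the linear per-sample model and the throughput/bandwidth figures are illustrative assumptions only.

```python
# Minimal sketch of cost-model-driven batch splitting between two
# heterogeneous accelerators. This is NOT ACCPAR's cost model; the
# simple linear per-sample model and the numbers below are assumptions.

def split_batch(total_batch, flops_per_sample, bytes_per_sample,
                peak_flops, link_bw):
    """Choose per-accelerator batch sizes that roughly equalize
    estimated step time = compute time + communication time."""
    # Estimated time per sample on each accelerator (seconds).
    per_sample_time = [flops_per_sample / f + bytes_per_sample / b
                       for f, b in zip(peak_flops, link_bw)]
    # Give each device work inversely proportional to its per-sample
    # time, then round to integer batch sizes.
    inv = [1.0 / t for t in per_sample_time]
    shares = [total_batch * x / sum(inv) for x in inv]
    sizes = [int(round(s)) for s in shares]
    sizes[-1] = total_batch - sum(sizes[:-1])  # absorb rounding drift
    return sizes

# Example: a slower device paired with a faster one (assumed figures).
print(split_batch(total_batch=1024,
                  flops_per_sample=2e9, bytes_per_sample=5e6,
                  peak_flops=[180e12, 420e12],   # assumed peak FLOP/s
                  link_bw=[300e9, 600e9]))       # assumed bytes/s
```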
Machine and Application Aware Partitioning for Adaptive Mesh Refinement Applications
Load balancing and partitioning are critical for parallel computations. Popular partitioning strategies based on space-filling curves focus on dividing work equally, and the partitions produced are independent of the architecture and the application. Given the ever-increasing relative cost of data movement and the increasing heterogeneity of our architectures, it is no longer sufficient to consider only an equal partitioning of work; minimizing communication costs is equally important, if not more so. Our hypothesis is that an unequal partitioning that significantly reduces communication costs can scale and perform better than conventional equal-work partitioning schemes, and that this tradeoff depends on the architecture as well as the application. We validate our hypothesis in the context of a finite-element computation utilizing adaptive mesh refinement. Our central contribution is a new partitioning scheme that minimizes the overall runtime of subsequent computations by performing architecture- and application-aware non-uniform work assignment, decreasing time to solution primarily by minimizing data movement. We evaluate our algorithm by comparing it against standard space-filling-curve-based partitioning algorithms, measuring time-to-solution as well as energy-to-solution for finite-element computations on adaptively refined meshes. We demonstrate excellent scalability of our new partitioning algorithm up to 262,144 cores on ORNL's Titan and show that the proposed partitioning scheme reduces both overall energy and time-to-solution for application codes by up to 22.0%.
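To make the idea of non-uniform, cost-aware partitioning concrete, the sketch below cuts a space-filling-curve-ordered sequence of elements into contiguous chunks of roughly equal total cost, where the per-element cost could fold in architecture- or application-dependent penalties; equal-work partitioning is recovered when every element costs the same. This illustrates the concept only and is not the partitioning algorithm proposed in the paper.

```python
# Minimal sketch of non-uniform partitioning along a space-filling curve:
# elements stay in SFC order, but chunk boundaries are chosen by a
# per-element cost instead of by equal element counts.

def partition_sfc(costs, num_parts):
    """Greedily cut an SFC-ordered cost array into contiguous parts
    of roughly equal total cost. Returns a list of (start, end) ranges."""
    target = sum(costs) / num_parts
    parts, start, acc = [], 0, 0.0
    for i, c in enumerate(costs):
        acc += c
        # Only close a part if enough elements remain for the rest.
        enough_left = len(costs) - (i + 1) >= num_parts - len(parts) - 1
        if acc >= target and len(parts) < num_parts - 1 and enough_left:
            parts.append((start, i + 1))
            start, acc = i + 1, 0.0
    parts.append((start, len(costs)))
    return parts

# Equal-work partitioning is the special case where every element costs 1.
costs = [1, 1, 1, 4, 4, 1, 1, 1]   # e.g. refined elements cost more
print(partition_sfc(costs, num_parts=3))
```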
- PAR ID: 10066623
- Date Published:
- Journal Name: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
- Volume: 2017
- Page Range / eLocation ID: 231 to 242
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
We present the design and implementation details of a geometric multigrid method on adaptively refined meshes for massively parallel computations. The method uses local smoothing on the refined part of the mesh. Partitioning is achieved by using a space filling curve for the leaf mesh and distributing ancestors in the hierarchy based on the leaves. We present a model of the efficiency of mesh hierarchy distribution and compare its predictions to runtime measurements. The algorithm is implemented as part of the deal.II finite-element library and as such available to the public.
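A toy sketch of the hierarchy-distribution idea, not the deal.II implementation: leaves are assigned to ranks by contiguous ranges of their space-filling-curve index, and each ancestor cell follows the owner of its first descendant leaf.

```python
# Toy sketch: leaves are partitioned by contiguous SFC index ranges, and
# each ancestor is assigned to the owner of its first descendant leaf.
# This only illustrates the concept; the actual deal.II scheme differs.

def owner_of_leaf(leaf_index, num_leaves, num_ranks):
    """Contiguous, equal-size SFC ranges of leaves per rank."""
    per_rank = (num_leaves + num_ranks - 1) // num_ranks
    return leaf_index // per_rank

def owner_of_ancestor(first_descendant_leaf, num_leaves, num_ranks):
    """Ancestors follow their first descendant leaf."""
    return owner_of_leaf(first_descendant_leaf, num_leaves, num_ranks)

# 8 leaves on 3 ranks; an ancestor covering leaves [4..7] goes to rank 1.
print([owner_of_leaf(i, 8, 3) for i in range(8)])
print(owner_of_ancestor(4, 8, 3))
```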
We propose a capacity-achieving scheme for private information retrieval (PIR) from databases (DBs) with heterogeneous storage constraints. In the PIR setting, a user queries a set of DBs to privately download a message, where privacy implies that no one DB can infer which message the user desires. Our PIR scheme uses an uncoded storage placement and we derive sufficient conditions to meet capacity in this design architecture. We translate the storage placement design to a "filling problem" where messages are partitioned into sub-messages and stored at subsets of DBs. We prove a set of necessary and sufficient conditions for the existence of the filling problem solution and design an iterative algorithm to find a filling problem solution. Our proposed algorithm requires at most a number of iterations equal to the number of DBs. Furthermore, we significantly reduce the number of sub-messages compared to the state-of-the-art PIR scheme, as our proposed PIR scheme requires that each message is split into a polynomial number of sub-messages with respect to the number of DBs.
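Purely as an illustration of the "filling problem" framing, and not of the paper's capacity-achieving scheme or its iterative algorithm, the toy sketch below splits unit-size messages into fractions stored across databases with heterogeneous capacities, always giving the next fraction to the database with the most remaining capacity.

```python
# Toy "filling"-style placement: messages are split into fractions stored
# across DBs with heterogeneous capacities, each chunk going to the DB
# with the most remaining capacity. NOT the paper's scheme; illustration only.
import heapq

def fill(capacities, num_messages, granularity=100):
    """Place unit-size messages into DBs in chunks of size 1/granularity,
    always choosing the DB with the most remaining capacity."""
    heap = [(-c, db) for db, c in enumerate(capacities)]  # max-heap via negation
    heapq.heapify(heap)
    placement = [[0.0] * len(capacities) for _ in range(num_messages)]
    chunk = 1.0 / granularity
    for m in range(num_messages):
        for _ in range(granularity):
            r, db = heapq.heappop(heap)        # most remaining capacity
            placement[m][db] += chunk
            heapq.heappush(heap, (r + chunk, db))
    return placement

# Three DBs storing 1.0, 0.6, and 0.4 messages' worth of data, two messages.
for row in fill([1.0, 0.6, 0.4], num_messages=2):
    print([round(x, 2) for x in row])
```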
Due to developments in topographic techniques, clear satellite imagery, and various means of collecting information, geospatial datasets are growing in volume, complexity, and heterogeneity. For efficient execution of spatial computations and analytics on large spatial datasets, parallel processing is required. To exploit fine-grained parallel processing in large-scale compute clusters, load-balanced partitioning is necessary for skewed datasets. In this work, we focus on the spatial join operation, where the inputs are two layers of geospatial data. Our partitioning method for spatial join uses an Adaptive Partitioning (ADP) technique based on quadtree partitioning. Unlike existing partitioning techniques, ADP partitions the spatial join workload instead of partitioning the individual datasets separately, providing better load balancing. In our experimental evaluation, ADP partitions spatial data in a more balanced way than quadtree partitioning and uniform grid partitioning. ADP uses an output-sensitive duplication-avoidance technique that minimizes duplication of geometries that are not part of the spatial join output; in a distributed-memory environment, this can reduce data communication and storage requirements compared to traditional methods. To improve the performance of ADP, an MPI+Threads-based parallelization, ParADP, is presented. With ParADP, a pair of real-world datasets, one with 717 million polylines and another with 10 million polygons, is partitioned into 65,536 grid cells within 7 seconds. ParADP exhibits both good weak scaling and good strong scaling up to 4,032 CPU cores.
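The sketch below conveys the general flavor of adaptive, quadtree-style partitioning of a joint workload: a cell is split when the combined object count from both input layers exceeds a threshold, so partitions reflect the join work rather than either dataset alone. It uses points in place of polylines and polygons and omits ADP's output-sensitive duplication avoidance; it is not the ADP algorithm itself.

```python
# Simplified quadtree-style partitioning of a spatial-join workload:
# split a cell when BOTH layers together hold too many objects.
# Points stand in for polylines/polygons; illustration only.

def adaptive_partition(layer_a, layer_b, box, threshold, depth=0, max_depth=8):
    """Return leaf cells as (box, items_a, items_b) triples.
    box = (xmin, ymin, xmax, ymax); layers are lists of (x, y) points."""
    inside = lambda p: box[0] <= p[0] < box[2] and box[1] <= p[1] < box[3]
    a = [p for p in layer_a if inside(p)]
    b = [p for p in layer_b if inside(p)]
    if len(a) + len(b) <= threshold or depth >= max_depth:
        return [(box, a, b)]
    xmid, ymid = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    children = [(box[0], box[1], xmid, ymid), (xmid, box[1], box[2], ymid),
                (box[0], ymid, xmid, box[3]), (xmid, ymid, box[2], box[3])]
    cells = []
    for child in children:
        cells += adaptive_partition(a, b, child, threshold, depth + 1, max_depth)
    return cells

layer_a = [(0.1, 0.1), (0.2, 0.1), (0.8, 0.9)]
layer_b = [(0.15, 0.12), (0.85, 0.88), (0.86, 0.91)]
for box, a, b in adaptive_partition(layer_a, layer_b, (0, 0, 1, 1), threshold=3):
    print(box, len(a), len(b))
```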
In recent years, there has been significant interest in the development of finite element methods defined on meshes that include rather general polytopes and curvilinear polygons. In the present work, we provide tools necessary to employ multiply connected mesh cells in planar domains, i.e., cells with holes, in finite element computations. Our focus is efficient evaluation of the \(H^1\) semi-inner product and \(L^2\) inner product of implicitly defined finite element functions of the types arising in boundary element based finite element methods and virtual element methods. Such functions are defined as solutions of Poisson problems having a polynomial source term and continuous boundary data. We show that the integrals of interest can be reduced to integrals along the boundaries of mesh cells, thereby avoiding the need to perform any computations in cell interiors. The dominating cost of this reduction is solving a relatively small Nyström system to obtain a Dirichlet-to-Neumann map, as well as the solution of two more Nyström systems to obtain an "anti-Laplacian" of a harmonic function, which is used for computing the \(L^2\) inner product. Several numerical examples demonstrate the high-order accuracy of this approach. Reproducibility of computational results: this paper has been awarded the "SIAM Reproducibility Badge: code and data available" as a recognition that the authors have followed reproducibility principles valued by SISC and the scientific computing community. Code and data that allow readers to reproduce the results in this paper are available at both https://github.com/samreynoldsmath/PuncturedFEM and the supplementary materials (PuncturedFEM_v0_2_5.zip [1.75MB]).
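The boundary reduction described above ultimately rests on Green's first identity, stated here for orientation rather than taken from the paper: when \(v\) solves a Poisson problem with a polynomial source on a cell \(K\), the volume term involving \(\Delta v\) is explicitly integrable, and for harmonic \(v\) it vanishes entirely, leaving only a boundary integral whose normal-derivative data is exactly what the Dirichlet-to-Neumann map provides.

```latex
% Green's first identity on a mesh cell K with boundary \partial K:
\int_K \nabla v \cdot \nabla w \, dx
  \;=\; \int_{\partial K} \frac{\partial v}{\partial n}\, w \, ds
  \;-\; \int_K (\Delta v)\, w \, dx
```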