Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing the performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and the cost of acquiring the skills required for programming at that level. In recent years, Python, with the support of linear algebra libraries like NumPy, has gained popularity despite limitations that prevent such code from running on distributed resources. Here we present a solution that maintains both high-level programming abstractions and parallel and distributed efficiency. Phylanx is an asynchronous array-processing toolkit that transforms Python and NumPy operations into code which can be executed in parallel on HPC resources, by mapping Python and NumPy functions and variables into a dependency tree executed by HPX, a general-purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementation of widely used machine learning algorithms to accepted NumPy standards.
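As a concrete illustration of this programming model, the sketch below uses the @Phylanx decorator from the project's Python frontend: the decorated function is written in ordinary NumPy, and Phylanx maps it onto an HPX execution tree. The import path and the logistic-regression-style example follow the project's published examples, but treat the details as illustrative rather than authoritative.

```python
# Illustrative sketch only: assumes the phylanx package exposes a
# @Phylanx decorator that compiles the function body into an HPX
# dependency tree instead of executing it eagerly in the interpreter.
import numpy as np
from phylanx import Phylanx  # assumed import path

@Phylanx
def lra(x, y, alpha, iterations):
    # Logistic-regression-style loop written in plain NumPy; Phylanx
    # maps these array operations onto parallel HPX tasks.
    weights = np.zeros(x.shape[1])
    for _ in range(iterations):
        pred = 1.0 / (1.0 + np.exp(-np.dot(x, weights)))
        weights = weights - alpha * np.dot(x.T, pred - y)
    return weights
```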
Massively Parallel Automated Software Tuning
This article presents an implementation of a distributed autotuning engine developed as part of the Bench-testing OpeN Software Autotuning Infrastructure (BONSAI) project. The system is geared towards performance optimization of computational kernels for graphics processing units, and allows for the deployment of vast autotuning sweeps to massively parallel machines. The software implements dynamic work scheduling across distributed-memory resources, takes advantage of multithreading for parallel compilation, and dispatches kernel launches to multiple accelerators. This paper lays out the main design principles of the system and discusses the basic mechanics of the initial implementation. Preliminary performance results are presented, encountered challenges are discussed, and future directions are outlined.
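BONSAI's own interfaces are not reproduced here. As a minimal sketch of the dynamic work-scheduling idea, the Python below hands kernel configurations to a worker pool as workers become free; benchmark_kernel and its placeholder cost model are hypothetical stand-ins for compiling and timing one GPU kernel variant.

```python
# Minimal sketch of dynamic work scheduling for an autotuning sweep.
# benchmark_kernel is a hypothetical stand-in for compiling one kernel
# variant and timing it on an accelerator; it is NOT part of BONSAI.
from itertools import product
from multiprocessing import Pool

def simulated_time(tx, ty, unroll):
    # Placeholder cost model so the sketch runs without a GPU.
    return abs(tx - 32) + abs(ty - 8) + 0.1 * unroll

def benchmark_kernel(config):
    tile_x, tile_y, unroll = config
    return config, simulated_time(tile_x, tile_y, unroll)

if __name__ == "__main__":
    sweep = list(product([8, 16, 32, 64],   # tile_x
                         [4, 8, 16],        # tile_y
                         [1, 2, 4]))        # unroll factor
    with Pool() as pool:
        # imap_unordered hands out configurations dynamically, so fast
        # workers are not held back by slow ones -- the same principle
        # the paper applies across distributed-memory nodes.
        results = list(pool.imap_unordered(benchmark_kernel, sweep))
    best = min(results, key=lambda r: r[1])
    print("best configuration:", best)
```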
- Award ID(s): 1642441
- PAR ID: 10136174
- Date Published:
- Journal Name: International Conference on Parallel Processing
- Issue: 92
- ISSN: 0190-3918
- Page Range / eLocation ID: 1 - 10
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
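The key modeling move, reasoning about the maximum interval length across ranks rather than the full joint distribution of per-process runtimes, can be illustrated numerically. The NumPy sketch below is only a toy: the lognormal noise model and its parameters are assumptions, not the paper's.

```python
# Illustration of why the maximum matters: in a bulk-synchronous step,
# every rank waits for the slowest one, so the step time is the max of
# the per-rank interval lengths. Distribution choice is illustrative.
import numpy as np

rng = np.random.default_rng(0)
base = 1.0                      # nominal interval length (seconds)
for nprocs in (16, 256, 4096):
    # Per-rank interval lengths with multiplicative lognormal noise.
    intervals = base * rng.lognormal(mean=0.0, sigma=0.05,
                                     size=(1000, nprocs))
    step_times = intervals.max(axis=1)   # synchronization cost per step
    print(f"{nprocs:5d} ranks: mean step time "
          f"{step_times.mean():.4f}s vs. mean rank time "
          f"{intervals.mean():.4f}s")
```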
There are nearly one hundred parallel and distributed graph processing packages. Selecting the best package for a given problem is difficult; some packages require GPUs, some are optimized for distributed or shared memory, and some require proprietary compilers or perform better on different hardware. Furthermore, performance may vary wildly depending on the graph itself. This complexity makes selecting the optimal implementation manually infeasible. We develop an approach to predict the performance of parallel graph processing using both regression models and binary classification by labeling configurations as either well-performing or not. We demonstrate our approach on six graph processing packages: GraphMat, the Graph500, the Graph Algorithm Platform Benchmark Suite, GraphBIG, Galois, and PowerGraph, and on four algorithms: PageRank, single-source shortest paths, triangle counting, and breadth-first search. Given a graph, our method can estimate execution time or suggest an implementation and thread count expected to perform well. Our method correctly identifies well-performing configurations in 97% of test cases.
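As a sketch of the binary-classification half of this approach (not the authors' actual features, data, or models), the snippet below labels synthetic configurations as well-performing or not and trains a scikit-learn classifier; every feature name is hypothetical.

```python
# Hedged sketch: classify (package, algorithm, thread count, graph
# statistics) configurations as well-performing or not. Feature names
# and the synthetic data are illustrative, not the paper's dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([
    rng.integers(0, 6, n),          # package id (6 packages)
    rng.integers(0, 4, n),          # algorithm id (4 algorithms)
    rng.integers(1, 65, n),         # thread count
    rng.lognormal(12, 2, n),        # graph edge count
    rng.uniform(0, 1, n),           # degree skew (illustrative)
])
# Synthetic ground truth: a configuration counts as "well-performing"
# by some threshold on measured runtime; here a made-up labeling rule.
y = (X[:, 2] >= 8) & (X[:, 3] < np.median(X[:, 3]))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```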
Parallel Discrete Event Simulation (PDES) using distributed synchronization supports the concurrent execution of discrete event simulation models on parallel processing hardware platforms. The multi-core/many-core era has provided a low latency "cluster on a chip" architecture for high-performance simulation and modeling of complex systems. A research many-core processor named the Single-Chip Cloud Computer (SCC) has been created by Intel Labs that offers some interesting opportunities for PDES research and development. The features of most interest in the SCC system are: low-latency messaging hardware, software managed cache coherence, and (user controllable) core independent dynamic frequency and voltage regulation capability. Ideally, each of these features provides interesting opportunities that can be exploited for improving the performance of PDES. This paper reports some preliminary efforts to migrate an optimistically synchronized parallel simulation kernel called WARPED to an SCC emulation system called Rock Creek Communication Environment (RCCE). The WARPED simulation kernel and several test simulation models have been ported to the RCCE environment. Based on these initial efforts, some preliminary insights on how to exploit some of the exotic features of SCC for increasing the performance of PDES applications are noted.
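For readers unfamiliar with optimistic synchronization, the sketch below shows, in miniature, the Time Warp mechanism that kernels like WARPED implement: events are processed speculatively with state checkpointing, and a straggler event arriving in the logical past triggers a rollback. All names and the event model are illustrative and unrelated to the WARPED codebase.

```python
# Miniature Time Warp logical process: speculative execution with
# state saving and rollback on straggler events. Purely illustrative.
import heapq

class LogicalProcess:
    def __init__(self):
        self.lvt = 0.0            # local virtual time
        self.state = 0
        self.saved = []           # (lvt, state, event) checkpoints
        self.queue = []           # pending events: (timestamp, delta)

    def schedule(self, ts, delta):
        if ts < self.lvt:
            self.rollback(ts)     # straggler: undo optimistic work
        heapq.heappush(self.queue, (ts, delta))

    def rollback(self, ts):
        # Undo speculatively processed events newer than the straggler,
        # restoring checkpoints and re-enqueueing the undone events.
        while self.saved and self.lvt > ts:
            self.lvt, self.state, ev = self.saved.pop()
            heapq.heappush(self.queue, ev)

    def run(self):
        while self.queue:
            ts, delta = heapq.heappop(self.queue)
            self.saved.append((self.lvt, self.state, (ts, delta)))
            self.lvt, self.state = ts, self.state + delta

lp = LogicalProcess()
lp.schedule(1.0, +5)
lp.schedule(3.0, +2)
lp.run()
lp.schedule(2.0, -1)   # straggler: rolls back past t=2, then replays
lp.run()
print(lp.lvt, lp.state)  # 3.0 6 (same result as in-order execution)
```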
The Message Passing Interface (MPI) is a software platform that can utilize the parallel capabilities of most multi-processors, making it useful for teaching students about parallel and distributed computing (PDC). MPI provides language bindings for Fortran and C/C++, but many university instructors lack expertise in these languages, preventing them from using MPI in their courses. OpenMPI is a free implementation of MPI that also provides Java bindings, allowing instructors who know Java but not C/C++ or Fortran to teach PDC. However, Java has a reputation as a "slow" language, so some say it is unsuitable for teaching PDC. This paper gives a head-to-head comparison of the performance of OpenMPI's Java and C bindings. Our study shows that by default, Java can be faster than C unless one takes special measures, and it exhibits similar speedup, efficiency, and scalability. We conclude that Java is a suitable language for teaching PDC.
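The paper's measurements use OpenMPI's Java and C bindings; purely to illustrate the style of measurement involved (and keeping to Python for the examples on this page), here is a hedged mpi4py sketch that times a trivially parallel reduction. The workload is an assumption, not the paper's benchmark code.

```python
# Hedged mpi4py sketch of a speedup-style measurement: each rank does
# a slice of work, and a reduction gathers the result. Run with e.g.
#   mpiexec -n 4 python bench.py
# The workload shape is illustrative, not the paper's benchmark.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 10_000_000
start = MPI.Wtime()
# Each rank sums its strided slice of 1..N.
local = sum(range(rank + 1, N + 1, size))
total = comm.reduce(local, op=MPI.SUM, root=0)
# Report the slowest rank's time, since that bounds the speedup.
elapsed = comm.reduce(MPI.Wtime() - start, op=MPI.MAX, root=0)

if rank == 0:
    print(f"sum={total} ranks={size} time={elapsed:.3f}s")
```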