skip to main content

This content will become publicly available on March 25, 2024

Title: SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development
Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.  more » « less
Award ID(s):
2118708 2107042 1908488
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
International Conference on Architectural Support for Programming Languages and Operating Systems
Page Range / eLocation ID:
679 to 693
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system. 
    more » « less
  2. Computer scientists and programmers face the difficultly of improving the scalability of their applications while using conventional programming techniques only. As a base-line hypothesis of this paper we assume that an advanced runtime system can be used to take full advantage of the available parallel resources of a machine in order to achieve the highest parallelism possible. In this paper we present the capabilities of HPX - a distributed runtime system for parallel applications of any scale - to achieve the best possible scalability through asynchronous task execution [1]. OP2 is an active library which provides a framework for the parallel execution for unstructured grid applications on different multi-core/many-core hardware architectures [2]. OP2 generates code which uses OpenMP for loop parallelization within an application code for both single-threaded and multi-threaded machines. In this work we modify the OP2 code generator to target HPX instead of OpenMP, i.e. port the parallel simulation backend of OP2 to utilize HPX. We compare the performance results of the different parallelization methods using HPX and OpenMP for loop parallelization within the Airfoil application. The results of strong scaling and weak scaling tests for the Airfoil application on one node with up to 32 threads are presented. Using HPX for parallelization of OP2 gives an improvement in performance by 5%-21%. By modifying the OP2 code generator to use HPX's parallel algorithms, we observe scaling improvements by about 5% as compared to OpenMP. To fully exploit the potential of HPX, we adapted the OP2 API to expose a future and dataflow based programming model and applied this technique for parallelizing the same Airfoil application. We show that the dataflow oriented programming model, which automatically creates an execution tree representing the algorithmic data dependencies of our application, improves the overall scaling results by about 21% compared to OpenMP. Our results show the advantage of using the asynchronous programming model implemented by HPX. 
    more » « less
  3. Derivatives are key to numerous science, engineering, and machine learning applications. While existing tools generate derivatives of programs in a single language, modern parallel applications combine a set of frameworks and languages to leverage available performance and function in an evolving hardware landscape. We propose a scheme for differentiating arbitrary DAG-based parallelism that preserves scalability and efficiency, implemented into the LLVM-based Enzyme automatic differentiation framework. By integrating with a full-fledged compiler backend, Enzyme can differentiate numerous parallel frameworks and directly control code generation. Combined with its ability to differentiate any LLVM-based language, this flexibility permits Enzyme to leverage the compiler tool chain for parallel and differentiation-specific optimizations. We differentiate nine distinct versions of the LULESH and miniBUDE applications, written in different programming languages (C++, Julia) and parallel frameworks (OpenMP, MPI, RAJA, Julia tasks, MPI.jl), demonstrating similar scalability to the original program. On benchmarks with 64 threads or nodes, we find a differentiation overhead of 3.4 - 6.8× on C++ and 5.4 - 12.5× on Julia. 
    more » « less
  4. Machine learning (ML) training is commonly parallelized using data parallelism. A fundamental limitation of data parallelism is that conflicting (concurrent) parameter accesses during ML training usually diminishes or even negates the benefits provided by additional parallel compute resources. Although it is possible to avoid conflicting parameter accesses by carefully scheduling the computation, existing systems rely on programmer manual parallelization and it remains a question when such parallelization is possible. We present Orion, a system that automatically parallelizes serial imperative ML programs on distributed shared memory. The core of Orion is a static dependence analysis mechanism that determines when dependence-preserving parallelization is effective and maps a loop computation to an optimized distributed computation schedule. Our evaluation shows that for a number of ML applications, Orion can parallelize a serial program while preserving critical dependences and thus achieve a significantly faster convergence rate than data-parallel programs and a matching convergence rate and comparable computation throughput to state-of-the-art manual parallelizations including model-parallel programs. 
    more » « less
  5. Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. 
    more » « less