NSF PAR Search | NSF Public Access Repository

EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching

https://doi.org/10.1145/3579371.3589097

Nagendra, Nayana Prasad; Godala, Bhargav Reddy; Chaturvedi, Ishita; Patel, Atmn; Kanev, Svilen; Moseley, Tipp; Stark, Jared; Pokam, Gilles A.; Campanoni, Simone; August, David I. (June 2023, Proceedings of the 50th International Symposium on Computer Architecture (ISCA))

SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development

https://doi.org/10.1145/3582016.3582058

Tan, Zujun; Chon, Yebin; Kruse, Michael; Doerfert, Johannes; Xu, Ziyang; Homerding, Brian; Campanoni, Simone; August, David I. (March 2023, International Conference on Architectural Support for Programming Languages and Operating Systems)

Manually writing parallel programs is difficult and error-prone. Automatic parallelization could address this issue, but profitability can be limited by not having facts known only to the programmer. A parallelizing compiler that collaborates with the programmer can increase the coverage and performance of parallelization while reducing the errors and overhead associated with manual parallelization. Unlike collaboration involving analysis tools that report program properties or make parallelization suggestions to the programmer, decompiler-based collaboration could leverage the strength of existing parallelizing compilers to provide programmers with a natural compiler-parallelized starting point for further parallelization or refinement. Despite this potential, existing decompilers fail to do this because they do not generate portable parallel source code compatible with any compiler of the source language. This paper presents SPLENDID, an LLVM-IR to C/OpenMP decompiler that enables collaborative parallelization by producing standard parallel OpenMP code. Using published manual parallelization of the PolyBench benchmark suite as a reference, SPLENDID's collaborative approach produces programs twice as fast as either Polly-based automatic parallelization or manual parallelization alone. SPLENDID's portable parallel code is also more natural than that from existing decompilers, obtaining a 39x higher average BLEU score.
Program State Element Characterization

https://doi.org/10.1145/3579990.3580011

Deiana, Enrico Armenio; Suchy, Brian; Wilkins, Michael; Homerding, Brian; McMichen, Tommy; Dunajewski, Katarzyna; Dinda, Peter; Hardavellas, Nikos; Campanoni, Simone (February 2023, International Symposium on Code Generation and Optimization)

Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions.
WARDen: Specializing Cache Coherence for High-Level Parallel Languages

https://doi.org/10.1145/3579990.3580013

Wilkins, Michael; Westrick, Sam; Kandiah, Vijay; Bernat, Alex; Suchy, Brian; Deiana, Enrico Armenio; Campanoni, Simone; Acar, Umut A.; Dinda, Peter; Hardavellas, Nikos (February 2023, Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization)

High-level parallel languages (HLPLs) make it easier to write correct parallel programs. Disciplined memory usage in these languages enables new optimizations for hardware bottlenecks, such as cache coherence. In this work, we show how to reduce the costs of cache coherence by integrating the hardware coherence protocol directly with the programming language; no programmer effort or static analysis is required. We identify a new low-level memory property, WARD (WAW Apathy and RAW Dependence-freedom), by construction in HLPL programs. We design a new coherence protocol, WARDen, to selectively disable coherence using WARD. We evaluate WARDen with a widely-used HLPL benchmark suite on both current and future x64 machine structures. WARDen both accelerates the benchmarks (by an average of 1.46x) and reduces energy (by 23%) by eliminating unnecessary data movement and coherency messages.
WARio: efficient code generation for intermittent computing

https://doi.org/10.1145/3519939.3523454

Kortbeek, Vito; Ghosh, Souradip; Hester, Josiah; Campanoni, Simone; Pawełczak, Przemysław (June 2022, Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation)

Intermittently operating embedded computing platforms powered by energy harvesting require software frameworks to protect against errors caused by Write After Read (WAR) dependencies. A powerful method of code protection for systems with non-volatile main memory utilizes compiler analysis to insert a checkpoint inside each WAR violation in the code. However, such software frameworks are oblivious to the code structure, and therefore inefficient, when many consecutive WAR violations exist. Our insight is that by transforming the input code, i.e., moving individual write operations from unique WARs close to each other, we can significantly reduce the number of checkpoints. This idea is the foundation for WARio: a set of compiler transformations for efficient code generation for intermittent computing. WARio, on average, reduces checkpoint overhead by 58%, and up to 88%, compared to the state of the art across various benchmarks.
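To make the WAR hazard in this abstract concrete, the following is a minimal toy sketch, not WARio's implementation. It assumes a made-up four-operation instruction format and a hypothetical `execute` function that models registers as volatile and `nvm` as non-volatile memory. A single simulated power failure rolls execution back to the last checkpoint. Without a checkpoint between the read and the write of `x`, the re-executed read observes the already-updated value.

```python
# Toy model of intermittent execution. All names and the instruction
# format are hypothetical and chosen only for illustration.

def execute(program, nvm, fail_before_pc=None):
    """Run `program` against non-volatile memory `nvm`.

    Registers are volatile: one simulated power failure, triggered just
    before the instruction at index `fail_before_pc`, discards them and
    resumes from the most recent checkpoint (initially the program start).
    """
    saved_pc, saved_regs = 0, {}
    pc, regs = 0, {}
    failed = False
    while pc < len(program):
        if pc == fail_before_pc and not failed:
            failed = True
            pc, regs = saved_pc, dict(saved_regs)  # roll back volatile state
            continue
        op = program[pc]
        if op[0] == "read":            # ("read", nvm_var, reg)
            regs[op[2]] = nvm[op[1]]
        elif op[0] == "add":           # ("add", reg, constant)
            regs[op[1]] += op[2]
        elif op[0] == "write":         # ("write", nvm_var, reg)
            nvm[op[1]] = regs[op[2]]
        elif op[0] == "checkpoint":    # ("checkpoint",)
            saved_pc, saved_regs = pc + 1, dict(regs)
        else:
            raise ValueError(f"unknown operation: {op[0]}")
        pc += 1
    return nvm

# x is read and later written with no checkpoint in between: a WAR hazard.
UNPROTECTED = [
    ("read", "x", "r"),
    ("add", "r", 1),
    ("write", "x", "r"),
    ("write", "y", "r"),
]

# The same code with a checkpoint placed between the read and the write.
PROTECTED = [
    ("read", "x", "r"),
    ("add", "r", 1),
    ("checkpoint",),
    ("write", "x", "r"),
    ("write", "y", "r"),
]

if __name__ == "__main__":
    print(execute(UNPROTECTED, {"x": 0, "y": 0}))                    # no failure
    print(execute(UNPROTECTED, {"x": 0, "y": 0}, fail_before_pc=3))  # x incremented twice
    print(execute(PROTECTED, {"x": 0, "y": 0}, fail_before_pc=4))    # x incremented once
```

In this model every WAR pair needs its own checkpoint. The abstract's insight is that moving the writes of different WARs next to each other lets fewer checkpoints cover them; that transformation is not modeled here.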
NOELLE Offers Empowering LLVM Extensions

https://doi.org/10.1109/CGO53902.2022.9741276

Matni, Angelo; Deiana, Enrico Armenio; Su, Yian; Gross, Lukas; Ghosh, Souradip; Apostolakis, Sotiris; Xu, Ziyang; Tan, Zujun; Chaturvedi, Ishita; Homerding, Brian; et al (April 2022, 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO))

Modern and emerging architectures demand increasingly complex compiler analyses and transformations. As the emphasis on compiler infrastructure moves beyond support for peephole optimizations and the extraction of instruction-level parallelism, compilers should support custom tools designed to meet these demands with higher-level analysis-powered abstractions and functionalities of wider program scope. This paper introduces NOELLE, a robust open-source domain-independent compilation layer built upon LLVM providing this support. NOELLE extends abstractions and functionalities provided by LLVM enabling advanced, program-wide code analyses and transformations. This paper shows the power of NOELLE by presenting a diverse set of 11 custom tools built upon it.
CARAT CAKE: replacing paging via compiler/kernel cooperation

https://doi.org/10.1145/3503222.3507771

Suchy, Brian; Ghosh, Souradip; Kersnar, Drew; Chai, Siyuan; Huang, Zhen; Nelson, Aaron; Cuevas, Michael; Bernat, Alex; Chaudhary, Gaurav; Hardavellas, Nikos; et al (February 2022, Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.
Paths to OpenMP in the kernel

https://doi.org/10.1145/3458817.3476183

Ma, Jiacheng; Wang, Wenyi; Nelson, Aaron; Cuevas, Michael; Homerding, Brian; Liu, Conghao; Huang, Zhen; Campanoni, Simone; Hale, Kyle; Dinda, Peter (November 2021, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21))

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.
Quantifying the Semantic Gap Between Serial and Parallel Programming

https://doi.org/10.1109/IISWC53511.2021.00024

Zhang, Xiaochun; Jones, Timothy M.; Campanoni, Simone (November 2021, 2021 IEEE International Symposium on Workload Characterization (IISWC))

Automatic parallelizing compilers are often constrained in their transformations because they must conservatively respect data dependences within the program. Developers, on the other hand, often take advantage of domain-specific knowledge to apply transformations that modify data dependences but respect the application's semantics. This creates a semantic gap between the parallelism extracted automatically by compilers and manually by developers. Although prior work has proposed programming language extensions to close this semantic gap, their relative contribution is unclear and it is uncertain whether compilers can actually achieve the same performance as manually parallelized code when using them. We quantify this semantic gap in a set of sequential and parallel programs and leverage these existing programming-language extensions to empirically measure the impact of closing it for an automatic parallelizing compiler. This lets us achieve an average speedup of 12.6× on an Intel-based 28-core machine, matching the speedup obtained by the manually parallelized code. Further, we apply these extensions to widely used sequential system tools, obtaining 7.1× speedup on the same system.
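As a rough illustration of the kind of dependence this abstract refers to, the sketch below contrasts a loop with a loop-carried dependence against a chunked version that relies on knowing the accumulation is order-insensitive. It is an assumption-laden toy: the function names are hypothetical, integer addition stands in for a domain-specific property, and it does not represent the programming-language extensions or benchmarks evaluated in the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_total(values):
    """Accumulate in program order; each iteration depends on the last."""
    total = 0
    for v in values:
        total += v  # loop-carried dependence on `total`
    return total

def chunked_total(values, workers=4):
    """Accumulate independent chunks, then combine the partial results.

    This reorders the additions, which is only valid because the
    programmer knows integer addition is associative and commutative.
    """
    size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(serial_total, chunks))
    return serial_total(partials)

if __name__ == "__main__":
    data = list(range(1000))
    print(serial_total(data) == chunked_total(data))
```

The sketch shows only the change in dependence structure. Threads in standard CPython generally do not speed up CPU-bound loops like this one, so it says nothing about the speedups reported above.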
Task parallel assembly language for uncompromising parallelism

https://doi.org/10.1145/3453483.3460969

Rainey, Mike; Newton, Ryan R.; Hale, Kyle; Hardavellas, Nikos; Campanoni, Simone; Dinda, Peter; Acar, Umut A. (June 2021, PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation)