Title: Compiler-Driven Simulation of Reconfigurable Hardware Accelerators
As customized accelerator design has become increasingly popular to keep up with the demand for high-performance computing, it poses a challenge for modern simulator design: adapting to such a large variety of accelerators. Existing simulators tend toward two extremes: low-level, general approaches such as RTL simulation, which can model any hardware but require substantial effort and long execution times; and higher-level application-specific models, which can be much faster and easier to use but require one-off engineering effort. This work proposes a compiler-driven simulation workflow that can model configurable hardware accelerators. The key idea is to separate structure representation from simulation by developing an intermediate language that can flexibly represent a wide variety of hardware constructs. We design the Event Queue (EQueue) dialect of MLIR, a dialect that can model arbitrary hardware accelerators with explicit data movement and distributed event-based control; we also implement a generic simulation engine to model EQueue programs with hybrid MLIR dialects representing different abstraction levels. We demonstrate two case studies of EQueue-implemented accelerators: a systolic array for convolution and SIMD processors in a modern FPGA. In the former, we show that EQueue simulation is as accurate as a state-of-the-art simulator while offering higher extensibility and lower iteration cost via compiler passes. In the latter, we demonstrate that our simulation flow can guide designers to efficiently improve their designs using visualizable simulation outputs.
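The event-based control with explicit data movement described in the abstract can be illustrated with a toy discrete-event simulator. This is a minimal sketch under stated assumptions, not the EQueue dialect or its simulation engine: the `Simulator` class, the `build_pipeline` helper, the event names, and the cycle counts are all hypothetical and chosen only for illustration.

```python
import heapq


class Simulator:
    """Toy discrete-event simulator (hypothetical; not the EQueue engine)."""

    def __init__(self):
        self.now = 0       # current simulated time, in cycles
        self._queue = []   # min-heap of (time, seq, name, action)
        self._seq = 0      # tie-breaker so equal-time events stay ordered
        self.trace = []    # (time, name) pairs, in completion order

    def schedule(self, delay, name, action=None):
        """Register an event that completes `delay` cycles from now."""
        heapq.heappush(self._queue, (self.now + delay, self._seq, name, action))
        self._seq += 1

    def run(self):
        """Process events in time order; return the final cycle count."""
        while self._queue:
            time, _, name, action = heapq.heappop(self._queue)
            self.now = time
            self.trace.append((time, name))
            if action is not None:
                action()   # an event may launch follow-up events
        return self.now


def build_pipeline(sim, read_cycles=4, compute_cycles=2, write_cycles=4):
    """Chain read -> compute -> write-back with assumed latencies."""

    def after_read():
        sim.schedule(compute_cycles, "pe0: compute done", after_compute)

    def after_compute():
        sim.schedule(write_cycles, "dma: write-back done")

    sim.schedule(read_cycles, "dma: read done", after_read)


sim = Simulator()
build_pipeline(sim)
total_cycles = sim.run()
print(total_cycles, sim.trace)
```

Each data movement is an explicit event with its own latency, and each completed event decides what to launch next, so the structure of the modeled hardware lives in `build_pipeline` while `Simulator.run` stays generic.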
Award ID(s):
1845952 1723715
PAR ID:
10335976
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
Page Range / eLocation ID:
619 to 632
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Cathie Olschanowsky (Ed.)
    The Sparse Polyhedral Framework (SPF) provides vital support to scientific applications, but is limited in portability. SPF extends the Polyhedral Model to non-affine codes. Scientific applications need the optimizations SPF enables, but current SPF tools don’t support GPUs or other heterogeneous hardware targets. As clock speeds continue to stagnate, scientific applications need the performance enhancements enabled by both SPF and newer heterogeneous hardware. The MLIR (Multi-Level Intermediate Representation) ecosystem offers a large, extensible, and cooperating set of intermediate representations (called dialects). A typical compiler has one main intermediate representation, whereas an MLIR-based compiler will have many. Because of this flexibility, the MLIR ecosystem has many dialects designed with heterogeneous hardware platforms in mind. This work creates an MLIR SPF dialect. The dialect enables SPF optimizations and is capable of generating GPU code as well as CPU code from SPF representations. Previous C-based SPF front ends are not capable of generating GPU code. The SPF dialect representations of common sparse scientific kernels generate CPU code competitive with the existing C-based front end, and GPU code competitive with standard benchmarks.
  2. Compiler technologies in deep learning and domain-specific hardware acceleration are increasingly adopting extensible compiler frameworks such as Multi-Level Intermediate Representation (MLIR) to facilitate more efficient development. With MLIR, compiler developers can easily define their own custom IRs in the form of MLIR dialects. However, the diversity and rapid evolution of such custom IRs make it impractical to manually write a custom test generator for each dialect. To address this problem, we design a new test generator called SynthFuzz that combines grammar-based fuzzing with custom mutation synthesis. The essence of SynthFuzz is twofold: (1) It automatically infers parameterized context-dependent custom mutations from existing test cases. (2) It then concretizes the mutation's content depending on the target context and reduces the chance of inserting invalid edits by performing k-ancestor and prefix/postfix matching. It obviates the need to manually define custom mutation operators for each dialect. We compare SynthFuzz to three baselines: Grammarinator, a grammar-based fuzzer without custom mutations; MLIRSmith, a custom test generator for MLIR core dialects; and NeuRI, a custom test generator for ML models with parameterization of tensor shapes. We conduct this comprehensive comparison on four different MLIR projects. Each project defines a new set of MLIR dialects where manually writing a custom test generator would take weeks of effort. Our evaluation shows that SynthFuzz on average improves MLIR dialect pair coverage by 1.75×, which increases branch coverage by 1.22×. Further, we show that our context-dependent custom mutation increases the proportion of valid tests by up to 1.11×, indicating that SynthFuzz correctly concretizes its parameterized mutations with respect to the target context. Parameterization of the mutations reduces the fraction of tests violating the base MLIR constraints by 0.57×, increasing the time spent fuzzing dialect-specific code.
  3. In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Examples include Triton, CIRCT, and ONNX-MLIR. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serve as the foundation for the current landscape of machine learning and hardware accelerators.
  4. In recent years, the MLIR framework has had explosive growth due to the need for extensible deep learning compilers for hardware accelerators. Examples include Triton [39], CIRCT [14], and ONNX-MLIR [22]. MLIR compilers introduce significant complexities in localizing bugs or inefficiencies because of their layered optimization and transformation process with compilation passes. While existing delta debugging techniques can be used to identify a minimum subset of IR code that reproduces a given bug symptom, their naive application to MLIR is time-consuming because real-world MLIR compilers usually involve a large number of compilation passes. Compiler developers must identify a minimized set of relevant compilation passes to reduce the footprint of MLIR compiler code to be inspected for a bug fix. We propose DuoReduce, a dual-dimensional reduction approach for MLIR bug localization. DuoReduce leverages three key ideas in tandem to design an efficient MLIR delta debugger. First, DuoReduce reduces compiler passes that are irrelevant to the bug by identifying ordering dependencies among the different compilation passes. Second, DuoReduce uses MLIR-semantics-aware transformations to expedite IR code reduction. Finally, DuoReduce leverages cross-dependence between the IR code dimension and the compilation pass dimension by accounting for which IR code segments are related to which compilation passes to reduce unused passes. Experiments with three large-scale MLIR compiler projects find that DuoReduce outperforms syntax-aware reducers such as Perses and Vulcan in terms of IR code reduction by 31.6% and 21.5% respectively. If one uses these reducers by enumerating all possible compilation passes (on average 18 passes), it could take up to 145 hours. By identifying ordering dependencies among compilation passes, DuoReduce reduces this time to 9.5 minutes. By identifying which compilation passes are unused for compiling reduced IR code, DuoReduce reduces the number of passes by 14.6%. This translates to not needing to examine 281 lines of MLIR compiler code on average to fix the bugs. DuoReduce has the potential to significantly reduce debugging effort in MLIR compilers, which serve as the foundation for the current landscape of machine learning and hardware accelerators.
  5. The efficiency of an accelerator depends on three factors: mapping, deep neural network (DNN) layers, and hardware. Together these form an extremely complicated design space for DNN accelerators. To demystify this design space and guide DNN accelerator design toward better efficiency, we propose an analytical cost model, MAESTRO. MAESTRO takes as inputs a DNN model description and hardware resource information as a list, and a mapping described in a data-centric representation that we propose. The data-centric representation consists of three directives that enable concise description of mappings in a compiler-friendly form. MAESTRO quickly analyzes various forms of data reuse in an accelerator based on these inputs and generates more than 20 statistics as outputs, including total latency, energy, and throughput. MAESTRO’s fast analysis enables various optimization tools for DNN accelerators, such as the hardware design exploration tool we present as an example.