skip to main content


Title: A compiler infrastructure for accelerator generators
We present Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. Calyx combines a hardware-like structural language with a software-like control flow representation with loops and conditionals. This split representation enables a new class of hardware-focused optimizations that require both structural and control flow information which are crucial for high-level programming models for hardware design. The Calyx compiler lowers control flow constructs using finite-state machines and generates synthesizable hardware descriptions. We have implemented Calyx in an optimizing compiler that translates high-level programs to hardware. We demonstrate Calyx using two DSL-to-RTL compilers, a systolic array generator and one for a recent imperative accelerator language, and compare them to equivalent designs generated using high-level synthesis (HLS). The systolic arrays are 4.6× faster and 1.11× larger on average than HLS implementations, and the HLS-like imperative language compiler is within a few factors of a highly optimized commercial HLS toolchain. We also describe three optimizations implemented in the Calyx compiler.  more » « less
Award ID(s):
1845952 1723715
NSF-PAR ID:
10232894
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021)
Page Range / eLocation ID:
804 to 817
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As customized accelerator design has become increasingly popular to keep up with the demand for high performance computing, it poses challenges for modern simulator design to adapt to such a large variety of accelerators. Existing simulators tend to two extremes: low-level and general approaches, such as RTL simulation, that can model any hardware but require substantial effort and long execution times; and higher-level application-specific models that can be much faster and easier to use but require one-off engineering effort.This work proposes a compiler-driven simulation workflow that can model configurable hardware accelerator. The key idea is to separate structure representation from simulation by developing an intermediate language that can flexibly represent a wide variety of hardware constructs. We design the Event Queue (EQueue) dialect of MLIR, a dialect that can model arbitrary hardware accelerators with explicit data movement and distributed event-based control; we also implement a generic simulation engine to model EQueue programs with hybrid MLIR dialects representing different abstraction levels. We demonstrate two case studies of EQueue-implemented accelerators: the systolic array of convolution and SIMD processors in a modern FPGA. In the former we show EQueue simulation is as accurate as a state-of-the-art simulator, while offering higher extensibility and lower iteration cost via compiler passes. In the latter we demonstrate our simulation flow can guide designer efficiently improve their design using visualizable simulation outputs. 
    more » « less
  2. Field-programmable gate arrays (FPGAs) provide an opportunity to co-design applications with hardware accelerators, yet they remain difficult to program. High-level synthesis (HLS) tools promise to raise the level of abstraction by compiling C or C++ to accelerator designs. Repurposing legacy software languages, however, requires complex heuristics to map imperative code onto hardware structures. We find that the black-box heuristics in HLS can be unpredictable: changing parameters in the program that should improve performance can counterintuitively yield slower and larger designs. This paper proposes a type system that restricts HLS to programs that can predictably compile to hardware accelerators. The key idea is to model consumable hardware resources with a time-sensitive affine type system that prevents simultaneous uses of the same hardware structure. We implement the type system in Dahlia, a language that compiles to HLS C++, and show that it can reduce the size of HLS parameter spaces while accepting Pareto-optimal designs. 
    more » « less
  3. Graph-based intermediate representations (IRs) are widely used for powerful compiler optimizations, either interprocedurally in pure functional languages, or intraprocedurally in imperative languages. Yet so far, no suitable graph IR exists for aggressive global optimizations in languages with both effects and higher-order functions: aliasing and indirect control transfers make it difficult to maintain sufficiently granular dependency information for optimizations to be effective. To close this long-standing gap, we propose a novel typed graph IR combining a notion of reachability types with an expressive effect system to compute precise and granular effect dependencies at an affordable cost while supporting local reasoning and separate compilation. Our high-level graph IR imposes lexical structure to represent structured control flow and nesting, enabling aggressive and yet inexpensive code motion and other optimizations for impure higher-order programs. We formalize the new graph IR based on a λ-calculus with a reachability type-and-effect system along with a specification of various optimizations. We present performance case studies for tensor loop fusion, CUDA kernel fusion, symbolic execution of LLVM IR, and SQL query compilation in the Scala LMS compiler framework using the new graph IR. We observe significant speedups of up to 21x.

     
    more » « less
  4. Modern programmable network switches can implement cus- tom applications using efficient packet processing hardware, and the programming language P4 provides high-level con- structs to program such switches. The increase in speed and programmability has inspired research in dataplane program- ming, where many complex functionalities, e.g., key-value stores and load balancers, can be implemented entirely in network switches. However, dataplane programs may suffer from novel security errors that are not traditionally found in network switches. To address this issue, we present a new information-flow control type system for P4. We formalize our type system in a recently-proposed core version of P4, and we prove a sound- ness theorem: well-typed programs satisfy non-interference. We also implement our type system in a tool, P4BID, which extends the type checker in the p4c compiler, the reference compiler for the latest version of P4. We present several case studies showing that natural security, integrity, and isolation properties in networks can be captured by non-interference, and our type system can detect violations of these properties while certifying correct programs. 
    more » « less
  5. High-Level Synthesis (HLS) has enabled users to rapidly develop designs targeted for FPGAs from the behavioral description of the design. However, to synthesize an optimal design capable of taking better advantage of the target FPGA, a considerable amount of effort is needed to transform the initial behavioral description into a form that can capture the desired level of parallelism. Thus, a design space exploration (DSE) engine capable of optimizing large complex designs is needed to achieve this goal. We present a new DSE engine capable of considering code transformation, compiler directives (pragmas), and the compatibility of these optimizations. To accomplish this, we initially express the structure of the input code as a graph to guide the exploration process. To appropriately transform the code, we take advantage of ScaleHLS based on the multi-level compiler infrastructure (MLIR). Finally, we identify problems that limit the scalability of existing DSEs, which we name the “design space merging problem.” We address this issue by employing a Random Forest classifier that can successfully decrease the number of invalid design points without invoking the HLS compiler as a validation tool. We evaluated our DSE engine against the ScaleHLS DSE, outperforming it by a maximum of 59×. We additionally demonstrate the scalability of our design by applying our DSE to large-scale HLS designs, achieving a maximum speedup of 12× for the benchmarks in the MachSuite and Rodinia set.

     
    more » « less