We propose a numerical integration accelerator (INTIACC) that speeds up the solution of partial differential equations (PDEs) for scientific computing. In contrast to recent works, INTIACC applies to a variety of PDEs and boundary conditions, has enhanced nonlinear function capability, supports high-order integration algorithms, and uses floating-point arithmetic to reduce solution error by orders of magnitude. Even with these benefits, our test chip achieves a 40× speed-up over prior accelerators and orders of magnitude over CPU- and GPU-based systems.
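As a rough illustration of what "high-order integration" of a PDE with boundary conditions involves, the following minimal sketch solves the 1D heat equation with fixed (Dirichlet) endpoints using a nearest-neighbor finite-difference stencil and classical fourth-order Runge-Kutta time stepping. It is not the INTIACC algorithm or instruction stream: the function names, grid size, and step sizes are assumptions chosen for the example, and it runs in Python's double precision rather than FP32.

```python
import math

def heat_rhs(u, dx, alpha):
    """Central-difference approximation of alpha * u_xx.

    Each interior point uses only its two nearest neighbors; the
    endpoints get a zero derivative so Dirichlet values stay fixed.
    """
    du = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        du[i] = alpha * (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (dx * dx)
    return du

def rk4_step(u, dt, dx, alpha):
    """Advance the grid by one classical fourth-order Runge-Kutta step."""
    def shifted(scale, k):
        return [ui + scale * ki for ui, ki in zip(u, k)]

    k1 = heat_rhs(u, dx, alpha)
    k2 = heat_rhs(shifted(0.5 * dt, k1), dx, alpha)
    k3 = heat_rhs(shifted(0.5 * dt, k2), dx, alpha)
    k4 = heat_rhs(shifted(dt, k3), dx, alpha)
    return [ui + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]

def solve_heat(n=17, alpha=1.0, t_end=0.05, dt=5e-4):
    """Integrate u_t = alpha * u_xx on [0, 1] with u(0) = u(1) = 0.

    The initial condition sin(pi * x) has the closed-form solution
    exp(-alpha * pi^2 * t) * sin(pi * x), returned for comparison.
    """
    dx = 1.0 / (n - 1)
    u = [math.sin(math.pi * i * dx) for i in range(n)]
    u[0] = u[-1] = 0.0  # enforce the Dirichlet boundary values exactly
    for _ in range(round(t_end / dt)):
        u = rk4_step(u, dt, dx, alpha)
    decay = math.exp(-alpha * math.pi ** 2 * t_end)
    exact = [decay * math.sin(math.pi * i * dx) for i in range(n)]
    return u, exact

if __name__ == "__main__":
    u, exact = solve_heat()
    print("max abs error:", max(abs(a - b) for a, b in zip(u, exact)))
```

With these settings the remaining error comes mainly from the second-order spatial stencil, not from the time integrator. A different PDE or boundary condition would change only `heat_rhs` and the endpoint handling.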
INTIACC: A Programmable Floating-Point Accelerator for Partial Differential Equations
This article presents a 32-bit floating-point (FP32) programmable accelerator for solving a wide range of partial differential equations (PDEs) based on numerical integration methods. Unlike prior works, which use fixed-point arithmetic and apply only to specific types of PDEs, the proposed integration accelerator for PDEs, named INTIACC, consists of 16 locally interconnected processing elements (PEs), where each PE is a fully programmable reduced instruction set computer (RISC) processor with an FP32 arithmetic logic unit (FP32 ALU) and a custom-designed instruction set architecture (ISA). These features enable INTIACC to generate solutions with high precision and a wide dynamic range, and they allow users to implement different numerical algorithms to perform high-order integration methods and to evaluate nonlinear functions. In addition, we create a novel slow-global-fast-local clocking scheme in which PEs operate asynchronously with each other most of the time. We prototype the INTIACC test chip in a 65 nm process, with a core area of 0.975 mm². Running at an average local clock frequency of 570 MHz at 1 V, it offers a single-precision computation throughput of 9.12 GFLOPS. Testing results show that, with a similar energy-delay product, INTIACC is up to 40× faster than the prior state-of-the-art PDE solver.
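The quoted throughput is consistent with a simple peak estimate: 16 PEs at an average local clock of 570 MHz give 9.12 GFLOPS if each PE completes one FP32 operation per cycle. That per-cycle rate is an assumption inferred from the figures above, not a number stated in the abstract, and the helper name below is hypothetical.

```python
def peak_gflops(num_pes, clock_mhz, flops_per_cycle=1):
    """Peak throughput in GFLOPS for identical PEs at a common clock rate.

    flops_per_cycle=1 is an assumption (one FP32 operation per PE per
    cycle); the abstract does not state the per-cycle rate.
    """
    return num_pes * clock_mhz * 1e6 * flops_per_cycle / 1e9

if __name__ == "__main__":
    # 16 PEs and 570 MHz are the figures given in the abstract.
    print(f"{peak_gflops(16, 570):.2f} GFLOPS")
```

Because the PEs run on local clocks that are asynchronous most of the time, 570 MHz is an average, so this should be read as an aggregate estimate and not as a per-PE guarantee.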
- Award ID(s):
- 1840763
- PAR ID:
- 10521633
- Editor(s):
- Mathew, Sanu
- Publisher / Repository:
- IEEE
- Date Published:
- Journal Name:
- IEEE Journal of Solid-State Circuits
- ISSN:
- 0018-9200
- Page Range / eLocation ID:
- 1 to 12
- Subject(s) / Keyword(s):
- 32-bit floating point, boundary conditions (BCs), custom instruction set architecture (ISA), hybrid global-local clocking scheme, numerical integration, partial differential equations (PDEs), programmable accelerator
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Fixed-point fast sweeping WENO methods are a class of efficient high-order numerical methods for computing steady-state solutions of hyperbolic partial differential equations (PDEs). Gauss-Seidel iterations and an alternating sweeping strategy are used to cover the characteristics of hyperbolic PDEs in each sweeping order, which gives fast convergence to steady-state solutions. A useful property that distinguishes fixed-point fast sweeping WENO methods from other fast sweeping methods is that they are explicit and do not require inverting nonlinear local systems. Hence, they are easy to apply to a general hyperbolic system. To deal with the difficulties of numerical boundary treatment when high-order finite difference methods on a Cartesian mesh are used to solve hyperbolic PDEs on complex domains, inverse Lax-Wendroff (ILW) procedures were developed in the literature as a very effective approach. In this paper, we combine a fifth-order fixed-point fast sweeping WENO method with an ILW procedure to compute steady-state solutions of hyperbolic conservation laws on complex computing regions. Numerical experiments test the method on various problems, including cases where the physical boundary is not aligned with the grid. Numerical results show high-order accuracy and good performance of the method. Furthermore, the method is compared with the popular third-order total variation diminishing Runge-Kutta (TVD-RK3) time-marching method for steady-state computations. Numerical examples show that, for most examples, the fixed-point fast sweeping method saves more than half of the CPU time that TVD-RK3 needs to converge to steady-state solutions.
-
This article presents TULIP, a new architecture for variable-precision quantized neural network (QNN) inference. It is designed with the goal of maximizing energy efficiency per classification. TULIP is constructed by arranging a collection of unique processing elements (TULIP-PEs) in a single-instruction-multiple-data (SIMD) fashion. Each TULIP-PE contains binary neurons that are interconnected using multiplexers. Each neuron also has a small dedicated local register connected to it. The binary neurons are implemented as standard cells and used for implementing threshold functions, i.e., an inner-product and thresholding operation on their binary inputs. The neurons can be reconfigured with a single change in the control signals to implement all the standard operations used in a QNN. This article presents novel algorithms for implementing the operations of a QNN on the TULIP-PEs in the form of a schedule of threshold functions. TULIP was implemented as an ASIC in TSMC 40nm-LP technology. A QNN accelerator that employs a conventional multiply-and-accumulate-based arithmetic processor was also implemented in the same technology to provide a fair comparison. The results show that TULIP is 30× to 50× more energy-efficient than an equivalent design, without any penalty in performance, area, or accuracy. Furthermore, TULIP achieves these improvements without using traditional techniques such as voltage scaling or approximate computing. Finally, this article also demonstrates how the run-time tradeoff between accuracy and energy efficiency is made on the TULIP architecture.
-
As the increasing complexity of neural network (NN) models leads to high demands for computation, AMD has introduced a heterogeneous programmable system-on-chip (SoC), the Versal ACAP architecture, which features programmable logic (PL), CPUs, and dedicated AI engine (AIE) ASICs and has a theoretical throughput of up to 6.4 TFLOPs for FP32, 25.6 TOPs for INT16, and 102.4 TOPs for INT8. However, the higher level of complexity makes it non-trivial to achieve the theoretical performance even for well-studied applications like matrix-matrix multiply (MM). In this paper, we provide AutoMM, an automatic white-box framework that can systematically generate designs for MM accelerators on Versal, achieving 3.7 TFLOPs, 7.5 TOPs, and 28.2 TOPs for the FP32, INT16, and INT8 data types, respectively. Our designs are tested on board and achieve energy-efficiency gains of 7.20x (FP32), 3.26x (INT16), and 6.23x (INT8) over the AMD U250, 2.32x (FP32) over the Nvidia Jetson TX2, and 1.06x (FP32) and 1.70x (INT8) over the Nvidia A100.
-
Sparse data structures like hash tables, trees, or compressed tensors are ubiquitous, but operations on these structures are expensive and inefficient on current systems. Prior work has proposed hardware acceleration for these operations, but these techniques have two key shortcomings: they limit the types of data structures they support, and they focus on reads but do not support fine-grained updates to these structures. We present Terminus, a programmable accelerator for read and update operations on sparse data structures. Terminus extends each general-purpose core with a programmable dataflow engine capable of accelerating a wide range of structures and operations. Terminus engines are flexible yet simple, as they focus on common operations and defer rare, complex ones to cores. Terminus features a simple concurrency control mechanism based on address ranges that enables safe updates while preserving parallelism. We evaluate Terminus on serial and parallel benchmarks on a wide range of sparse data structures. Terminus improves performance by gmean 7.4x over a CPU baseline, showing that it can accelerate fine-grained reads and writes that prior accelerators for sparse structures could not support.

