skip to main content

This content will become publicly available on January 1, 2025

Title: PS-IMC: A 2385.7-TOPS/W/b Precision Scalable In-Memory Computing Macro With Bit-Parallel Inputs and Decomposable Weights for DNNs
We present a fully digital multiply and accumulate (MAC) in-memory computing (IMC) macro demonstrating one of the fastest flexible precision integer-based MACs to date. The design boasts a new bit-parallel architecture enabled by a 10T bit-cell capable of four AND operations and a decomposed precision data flow that decreases the number of shift–accumulate operations, bringing down the overall adder hardware cost by 1.57× while maintaining 100% utilization for all supported precision. It also employs a carry save adder tree that saves 21% of adder hardware. The 28-nm prototype chip achieves a speed-up of 2.6× , 10.8× , 2.42× , and 3.22× over prior SoTA in 1bW:1bI, 1bW:4bI, 4bW:4bI, and 8bW:8bI MACs, respectively.  more » « less
Award ID(s):
2342726 2144751 2349802 2314591 2328803 2414603
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Date Published:
Journal Name:
IEEE Solid-State Circuits Letters
Page Range / eLocation ID:
102 to 105
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed including adding specialized artificial intelligence (AI) processing engines, adding support for smaller precision math like 8-bit fixed point and IEEE half-precision (fp16) in DSP slices, adding shadow multipliers in logic blocks, etc. In this paper, we describe replacing a portion of the FPGA’s programmable logic area with Tensor Slices. These slices have a systolic array of processing elements at their heart that support multiple tensor operations, multiple dynamically-selectable precisions and can be dynamically fractured into individual multipliers and MACs (multiply-and-accumulate). These slices have a local crossbar at the inputs that helps with easing the routing pressure caused by a large block on the FPGA. Adding these DL-specific coarse-grained hard blocks to FPGAs increases their compute density and makes them even better hardware accelerators for DL applications, while still keeping the vast majority of the real estate on the FPGA programmable at fine-grain. 
    more » « less
  2. null (Ed.)
    Multiply-accumulate (MAC) operations are common in data processing and machine learning but costly in terms of hardware usage. Stochastic Computing (SC) is a promising approach for low-cost hardware design of complex arithmetic operations such as multiplication. Computing with deterministic unary bit-streams (defined as bit-streams with all 1s grouped together at the beginning or end of a bit-stream) has been recently suggested to improve the accuracy of SC. Conventionally, SC designs use multiplexer (MUX) units or OR gates to accumulate data in the stochastic domain. MUX-based addition suffers from scaling of data and OR-based addition from inaccuracy. This work proposes a novel technique for MAC operation on unary bit-streamsthat allows exact, non-scaled addition of multiplication results. By introducing a relative delay between the products, we control correlation between bit-streams and eliminate OR-based addition error. We evaluate the accuracy of the proposed technique compared to the state-of-the-art MAC designs. After quantization, the proposed technique demonstrates at least 37% and up to 100% decrease of the mean absolute error for uniformly distributed random input values, compared to traditional OR-based MAC designs. Further, we demonstrate that the proposed technique is practical and evaluate area, power and energy of three possible implementations. 
    more » « less
  3. null (Ed.)
    Xilinx’s AI Engine is a recent industry example of energy-efficient vector processing that includes novel support for 2D SIMD datapaths and shuffle interconnection network. The current approach to programming the AI Engine relies on a C/C++ API for vector intrinsics. While an advance over assembly- level programming, it requires the programmer to specify a number of low-level operations based on detailed knowledge of the hardware. To address these challenges, we introduce Vyasa, a new programming system that extends the Halide DSL compiler to automatically generate code for the AI Engine. We evaluated Vyasa on 36 CONV2D workloads, and achieved geometric means of 7.6 and 24.2 MACs/cycle for 32-bit and 16-bit operands (which represent 95.9% and 75.6% of the peak performance respectively). 
    more » « less
  4. The rapidly increasing size of deep-learning models has renewed interest in alternatives to digital-electronic computers as a means to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for them. In this paper, we investigate---through a combination of simulations and experiments on prototype optical hardware---the feasibility and potential energy benefits of running Transformer models on future optical accelerators that perform matrix-vector multiplication. We use simulations, with noise models validated by small-scale optical experiments, to show that optical accelerators for matrix-vector multiplication should be able to accurately run a typical Transformer architecture model for language processing. We demonstrate that optical accelerators can achieve the same (or better) perplexity as digital-electronic processors at 8-bit precision, provided that the optical hardware uses sufficiently many photons per inference, which translates directly to a requirement on optical energy per inference. We studied numerically how the requirement on optical energy per inference changes as a function of the Transformer width $d$ and found that the optical energy per multiply--accumulate (MAC) scales approximately as $\frac{1}{d}$, giving an asymptotic advantage over digital systems. We also analyze the total system energy costs for optical accelerators running Transformers, including both optical and electronic costs, as a function of model size. We predict that well-engineered, large-scale optical hardware should be able to achieve a $100 \times$ energy-efficiency advantage over current digital-electronic processors in running some of the largest current Transformer models, and if both the models and the optical hardware are scaled to the quadrillion-parameter regime, optical accelerators could have a $>8,000\times$ energy-efficiency advantage. Under plausible assumptions about future improvements to electronics and Transformer quantization techniques (5× cheaper memory access, double the digital--analog conversion efficiency, and 4-bit precision), we estimate that the energy advantage for optical processors versus electronic processors operating at 300~fJ/MAC could grow to $>100,000\times$. 
    more » « less
  5. Abstract

    As the energy and hardware investments necessary for conventional high‐precision digital computing continue to explode in the era of artificial intelligence (AI), a change in paradigm that can trade precision for energy and resource efficiency is being sought for many computing applications. Stochastic computing (SC) is an attractive alternative since, unlike digital computers, which require many logic gates and a high transistor volume to perform basic arithmetic operations such as addition, subtraction, multiplication, sorting, etc., SC can implement the same using simple logic gates. While it is possible to accelerate SC using traditional silicon complementary metal–oxide–semiconductor (CMOS) technology, the need for extensive hardware investment to generate stochastic bits (s‐bits), the fundamental computing primitive for SC, makes it less attractive. Memristor and spin‐based devices offer natural randomness but depend on hybrid designs involving CMOS peripherals for accelerating SC, which increases area and energy burden. Here, the limitations of existing and emerging technologies are overcome, and a standalone SC architecture embedded in memory and based on 2D memtransistors is experimentally demonstrated. The monolithic and non‐von‐Neumann SC architecture occupies a small hardware footprint and consumes a miniscule amount of energy (<1 nJ) for both s‐bit generation and arithmetic operations, highlighting the benefits of SC.

    more » « less