

Title: Accelerating 1-Bit LLMs via In-Memory Computing Architectures
In this paper, we present a novel hybrid computing architecture designed to accelerate inference in 1-bit large language models (LLMs). Our approach combines the strengths of analog in-memory computing (IMC) and digital systolic arrays to address the diverse precision requirements across different layers of 1-bit LLMs. Specifically, we utilize analog IMC to accelerate low-precision matrix multiplication (MatMul) operations within the projection layers, which are naturally amenable to extreme quantization. Meanwhile, digital systolic arrays are employed to efficiently handle high-precision MatMul operations in the attention heads, preserving accuracy where precision is most critical. By partitioning the computational workload based on precision needs, our hybrid architecture increases throughput and energy efficiency. Experimental evaluations demonstrate that our design delivers up to an 80x improvement in tokens processed per second and achieves a 70% increase in energy efficiency (tokens per joule) when compared to conventional digital hardware accelerators.
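The precision split the abstract describes can be illustrated with a minimal NumPy sketch. The `quantize_ternary` helper and its scaling scheme are illustrative assumptions (a common 1-bit/1.58-bit LLM quantization style), not the paper's exact quantizer:

```python
import numpy as np

def quantize_ternary(w):
    # Hypothetical quantizer: map weights to {-1, 0, +1} with a
    # per-matrix scale, in the style of 1-bit LLM schemes.
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

def projection_matmul(x, w):
    # Low-precision MatMul, the kind mapped onto analog IMC:
    # ternary weights, scale restored after accumulation.
    wq, scale = quantize_ternary(w)
    return (x @ wq) * scale

def attention_matmul(q, k):
    # High-precision MatMul kept on digital systolic arrays,
    # where accuracy is most critical.
    return q @ k.T / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 16)) * 0.02
y_quant = projection_matmul(x, w)   # projection path (extreme quantization)
scores = attention_matmul(x, x)     # attention path (full precision)
```

The partitioning decision is made per layer type: projection layers tolerate the ternary approximation, while attention-score MatMuls keep full precision.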
Award ID(s):
2409697 2340249
PAR ID:
10674875
Author(s) / Creator(s):
 ;  
Publisher / Repository:
IEEE
Date Published:
Journal Name:
Conference proceedings
ISSN:
1558-3899
ISBN:
979-8-3315-8934-9
Page Range / eLocation ID:
178 to 182
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hyperdimensional computing (HDC) is a brain-inspired computational framework that relies on long hypervectors (HVs) for learning. In HDC, computational operations consist of simple manipulations of hypervectors and can be incredibly memory-intensive. In-memory computing (IMC) can greatly improve the efficiency of HDC by reducing data movement in the system. Most existing IMC implementations of HDC are limited to binary precision, which inhibits the ability to match software-equivalent accuracies. Moreover, memory arrays used in IMC are restricted in size and cannot immediately support the direct associative search of large binary HVs (a ubiquitous operation, often over 10,000+ dimensions) required to achieve acceptable accuracies. We present a multi-bit IMC system for HDC using ferroelectric field-effect transistors (FeFETs) that simultaneously achieves software-equivalent accuracies, reduces the dimensionality of the HDC system, and improves energy consumption by 826x and latency by 30x compared to a GPU baseline. Furthermore, for the first time, we experimentally demonstrate multi-bit, array-level content-addressable memory (CAM) operations with FeFETs. We also present a scalable and efficient CAM-based architecture that supports the associative search of large HVs. Finally, we study the effects of device-, circuit-, and architecture-level non-idealities on application-level accuracy with HDC.
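The core HDC operations the abstract mentions (bundling class members into prototypes, then associative search) can be sketched in a few lines of NumPy. The bipolar encoding and majority bundling here are standard HDC conventions, not this paper's specific FeFET implementation:

```python
import numpy as np

D = 10_000  # hypervector dimensionality; 10,000+ dims are common in HDC
rng = np.random.default_rng(1)

def random_hv():
    # Random bipolar hypervector (+1/-1 entries).
    return rng.choice([-1, 1], size=D)

def bundle(hvs):
    # Bundling: elementwise majority vote combines member HVs
    # into a class prototype (odd count avoids ties).
    return np.sign(np.sum(hvs, axis=0))

def associative_search(query, prototypes):
    # Nearest prototype by dot-product similarity -- the operation
    # a CAM array accelerates in hardware.
    return int(np.argmax(prototypes @ query))

class_a = [random_hv() for _ in range(5)]
class_b = [random_hv() for _ in range(5)]
protos = np.stack([bundle(class_a), bundle(class_b)])

# Corrupt 10% of a class-A member's dimensions; the search should
# still recover class 0 thanks to hyperdimensional robustness.
noisy = class_a[0].copy()
flip = rng.choice(D, size=D // 10, replace=False)
noisy[flip] *= -1
```

At D = 10,000, even a heavily corrupted query remains far closer to its own class prototype than to any other, which is why associative search over large HVs is the dominant (and memory-bound) operation.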
  2. We present an evolution of multiple-input multiple-output (MIMO) wireless communications known as the tri-hybrid MIMO architecture. In this framework, the traditional operations of linear precoding at the transmitter are distributed across digital beamforming, analog beamforming, and reconfigurable antennas. Compared with the hybrid MIMO architecture, which combines digital and analog beamforming, the tri-hybrid approach introduces a third layer of electromagnetic beamforming through antenna reconfigurability. This added layer offers a pathway to scale MIMO spatial dimensions, important for 6G systems operating in centimeter-wave bands, where the tension between larger bandwidths and infrastructure reuse necessitates ultra-large antenna arrays. We introduce the key features of the tri-hybrid architecture by (i) reviewing the benefits and challenges of communicating with reconfigurable antennas, (ii) examining tradeoffs between spectral and energy efficiency enabled by reconfigurability, and (iii) exploring configuration challenges across the three layers. Overall, the tri-hybrid MIMO architecture offers a new approach for integrating emerging antenna technologies in the MIMO precoding framework. 
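The three-layer precoding chain the abstract describes composes naturally as a signal model. This toy NumPy sketch is illustrative only: the per-antenna mode response `gain` is a hypothetical stand-in for pattern reconfigurability, not a physical antenna model:

```python
import numpy as np

rng = np.random.default_rng(3)
Nt, Nrf, Ns = 16, 4, 2  # transmit antennas, RF chains, data streams (illustrative)

# Layer 1 -- digital baseband precoder: arbitrary complex entries.
F_bb = (rng.standard_normal((Nrf, Ns))
        + 1j * rng.standard_normal((Nrf, Ns))) / np.sqrt(2)

# Layer 2 -- analog precoder: phase shifters only, so unit-modulus entries.
F_rf = np.exp(1j * rng.uniform(0, 2 * np.pi, (Nt, Nrf)))

# Layer 3 -- electromagnetic layer: each reconfigurable antenna selects one
# of a few modes, reshaping the effective channel (hypothetical toy response).
antenna_state = rng.integers(0, 4, Nt)
gain = np.exp(1j * antenna_state * np.pi / 2)

H = rng.standard_normal((8, Nt)) + 1j * rng.standard_normal((8, Nt))
H_eff = H * gain                 # channel as seen through the antenna states

s = rng.standard_normal((Ns, 1)) + 1j * rng.standard_normal((Ns, 1))
y = H_eff @ F_rf @ F_bb @ s      # received signal through the tri-hybrid chain
```

The configuration challenge the abstract raises is visible here: `F_bb`, `F_rf`, and `antenna_state` live in very different search spaces (unconstrained complex, unit-modulus, and discrete), and they must be optimized jointly.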
  3. In-memory computing (IMC) provides energy-efficient solutions for deep neural networks (DNNs). Most IMC designs for DNNs employ fixed-point precision. However, floating-point precision is still required for DNN training and for complex inference models to maintain high accuracy. No prior floating-point IMC work in the literature immerses the floating-point computation into the weight memory storage. In this work, we propose a novel floating-point IMC macro with a configurable architecture that supports both standard 8-bit floating point (FP8) and 8-bit block floating point (BF8) with a shared exponent. The proposed FP-IMC macro, implemented in 28nm CMOS, demonstrates 12.1 TOPS/W for FP8 precision and 66.6 TOPS/W for BF8 precision, improving energy efficiency beyond state-of-the-art FP IMC macros.
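The shared-exponent idea behind BF8 can be sketched numerically: one exponent is stored per block, and each value keeps only a signed fixed-point mantissa. The bit widths and rounding below are assumptions for illustration, not the macro's exact bit layout:

```python
import numpy as np

def to_bf8_block(x, mant_bits=7):
    # Block floating point: one shared exponent for the whole block,
    # signed integer mantissas per element (illustrative layout).
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-38)))
    scale = 2.0 ** (shared_exp - mant_bits)
    mant = np.clip(np.round(x / scale), -(2 ** mant_bits), 2 ** mant_bits - 1)
    return mant.astype(np.int32), shared_exp

def from_bf8_block(mant, shared_exp, mant_bits=7):
    # Reconstruct real values from mantissas and the shared exponent.
    return mant * 2.0 ** (shared_exp - mant_bits)

x = np.array([0.81, -0.33, 0.05, 0.62])
m, e = to_bf8_block(x)
x_hat = from_bf8_block(m, e)
```

Because the exponent is shared, the per-element datapath reduces to integer multiply-accumulate, which is what lets a BF8 mode reach much higher TOPS/W than element-wise FP8 in the same array.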
  4. The increasingly central role of speech-based human-computer interaction necessitates on-device, low-latency, low-power, high-accuracy keyword spotting (KWS). State-of-the-art accuracies on speech-related tasks have been achieved by long short-term memory (LSTM) neural network (NN) models. Such models are typically computationally intensive because of their heavy use of matrix-vector multiplication (MVM) operations. Compute-in-memory (CIM) architectures, while well suited to MVM operations, have not seen widespread adoption for LSTMs. In this paper we adapt resistive random-access-memory-based CIM architectures for KWS using LSTMs. We find that a hybrid system composed of CIM cores and digital cores achieves 90% test accuracy on the Google speech dataset at a cost of 25 uJ per decision. Our optimized architecture uses 5-bit inputs and analog weights to produce 6-bit outputs. All digital computation is performed with 8-bit precision, leading to a 3.7× improvement in computational efficiency compared to equivalent digital systems at that accuracy.
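The 5-bit-in / 6-bit-out MVM datapath can be modeled functionally in a few lines. The uniform symmetric quantizer here is a hypothetical stand-in for the DAC/ADC behavior, and the float weight matrix stands in for the analog conductances:

```python
import numpy as np

def quantize(x, bits, x_max):
    # Hypothetical uniform symmetric quantizer modeling a DAC/ADC stage.
    levels = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / x_max * levels), -levels, levels) / levels * x_max

def cim_mvm(x, w, in_bits=5, out_bits=6):
    # Functional model of one CIM matrix-vector multiply:
    # 5-bit digital inputs drive analog weight columns, the analog
    # accumulation is read back out at 6-bit precision.
    xq = quantize(x, in_bits, np.max(np.abs(x)) + 1e-9)
    y = w @ xq  # analog accumulation along the bit lines
    return quantize(y, out_bits, np.max(np.abs(y)) + 1e-9)

rng = np.random.default_rng(2)
w = rng.standard_normal((8, 16)) * 0.1  # stands in for analog conductances
x = rng.standard_normal(16)             # one LSTM gate's input slice
y = cim_mvm(x, w)
```

An LSTM step would issue several such MVMs per gate; the hybrid split keeps these on CIM cores while the 8-bit element-wise gate arithmetic stays on the digital cores.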
  5. We present a fully digital multiply-and-accumulate (MAC) in-memory computing (IMC) macro demonstrating one of the fastest flexible-precision integer-based MACs to date. The design boasts a new bit-parallel architecture enabled by a 10T bit-cell capable of four AND operations and a decomposed-precision data flow that decreases the number of shift-accumulate operations, bringing down the overall adder hardware cost by 1.57× while maintaining 100% utilization for all supported precisions. It also employs a carry-save adder tree that saves 21% of adder hardware. The 28-nm prototype chip achieves speed-ups of 2.6×, 10.8×, 2.42×, and 3.22× over prior SoTA in 1bW:1bI, 1bW:4bI, 4bW:4bI, and 8bW:8bI MACs, respectively.
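The principle behind such digital IMC MACs, decomposing a multi-bit dot product into 1-bit AND operations plus shift-accumulates, can be shown in plain Python. This is a generic sketch of the bit-plane decomposition (unsigned operands for simplicity), not this macro's specific data flow:

```python
def bitplane_mac(weights, inputs, wbits=4, ibits=4):
    # Decompose a dot product of unsigned integers into bitwise ANDs
    # plus shift-accumulates -- the operation family a bit-parallel
    # IMC macro maps onto its bit-cells and adder tree.
    acc = 0
    for i in range(wbits):
        for j in range(ibits):
            # Partial product: AND weight bit-plane i with input bit-plane j,
            # then popcount across the vector (one column-wise array operation).
            partial = sum(((w >> i) & 1) & ((x >> j) & 1)
                          for w, x in zip(weights, inputs))
            acc += partial << (i + j)  # shift by the combined bit significance
    return acc

w = [3, 5, 7, 2]
x = [1, 4, 2, 6]
result = bitplane_mac(w, x)  # equals 3*1 + 5*4 + 7*2 + 2*6 = 49
```

Supporting flexible precision then amounts to varying `wbits` and `ibits`; reducing the number of shift-accumulate passes (as the decomposed-precision data flow does) directly shrinks the adder hardware this loop represents.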