

Title: Accelerating 1-Bit LLMs via In-Memory Computing Architectures
In this paper, we present a novel hybrid computing architecture designed to accelerate inference in 1-bit large language models (LLMs). Our approach combines the strengths of analog in-memory computing (IMC) and digital systolic arrays to address the diverse precision requirements across different layers of 1-bit LLMs. Specifically, we utilize analog IMC to accelerate low-precision matrix multiplication (MatMul) operations within the projection layers, which are naturally amenable to extreme quantization. Meanwhile, digital systolic arrays are employed to efficiently handle high-precision MatMul operations in the attention heads, preserving accuracy where precision is most critical. By partitioning the computational workload based on precision needs, our hybrid architecture increases throughput and energy efficiency. Experimental evaluations demonstrate that our design delivers up to an 80x improvement in tokens processed per second and achieves a 70% increase in energy efficiency (tokens per joule) when compared to conventional digital hardware accelerators.
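The precision split the abstract describes can be illustrated with a minimal NumPy sketch. The `quantize_ternary` helper and its scaling scheme are illustrative assumptions (a common 1-bit/1.58-bit LLM quantization style), not the paper's exact quantizer:

```python
import numpy as np

def quantize_ternary(w):
    # Hypothetical quantizer: map weights to {-1, 0, +1} with a
    # per-matrix scale, in the style of 1-bit LLM schemes.
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

def projection_matmul(x, w):
    # Low-precision MatMul, the kind mapped onto analog IMC:
    # ternary weights, scale restored after accumulation.
    wq, scale = quantize_ternary(w)
    return (x @ wq) * scale

def attention_matmul(q, k):
    # High-precision MatMul kept on digital systolic arrays,
    # where accuracy is most critical.
    return q @ k.T / np.sqrt(q.shape[-1])

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w = rng.standard_normal((16, 16)) * 0.02
y_quant = projection_matmul(x, w)   # projection path (extreme quantization)
scores = attention_matmul(x, x)     # attention path (full precision)
```

The partitioning decision is made per layer type: projection layers tolerate the ternary approximation, while attention-score MatMuls keep full precision.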
Award ID(s):
2409697 2340249
PAR ID:
10674875
Author(s) / Creator(s):
 ;  
Publisher / Repository:
IEEE
Date Published:
Journal Name:
Conference proceedings
ISSN:
1558-3899
ISBN:
979-8-3315-8934-9
Page Range / eLocation ID:
178 to 182
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hyperdimensional computing (HDC) is a brain-inspired computational framework that relies on long hypervectors (HVs) for learning. In HDC, computational operations consist of simple manipulations of hypervectors and can be incredibly memory-intensive. In-memory computing (IMC) can greatly improve the efficiency of HDC by reducing data movement in the system. Most existing IMC implementations of HDC are limited to binary precision, which inhibits the ability to match software-equivalent accuracies. Moreover, memory arrays used in IMC are restricted in size and cannot immediately support the direct associative search of large binary HVs (a ubiquitous operation, often over 10,000+ dimensions) required to achieve acceptable accuracies. We present a multi-bit IMC system for HDC using ferroelectric field-effect transistors (FeFETs) that simultaneously achieves software-equivalent accuracies, reduces the dimensionality of the HDC system, and improves energy consumption by 826x and latency by 30x compared to a GPU baseline. Furthermore, for the first time, we experimentally demonstrate multi-bit, array-level content-addressable memory (CAM) operations with FeFETs. We also present a scalable and efficient CAM-based architecture that supports the associative search of large HVs. Finally, we study the effects of device-, circuit-, and architecture-level non-idealities on application-level accuracy with HDC.
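The core HDC operations the abstract mentions (bundling class members into prototypes, then associative search) can be sketched in a few lines of NumPy. The bipolar encoding and majority bundling here are standard HDC conventions, not this paper's specific FeFET implementation:

```python
import numpy as np

D = 10_000  # hypervector dimensionality; 10,000+ dims are common in HDC
rng = np.random.default_rng(1)

def random_hv():
    # Random bipolar hypervector (+1/-1 entries).
    return rng.choice([-1, 1], size=D)

def bundle(hvs):
    # Bundling: elementwise majority vote combines member HVs
    # into a class prototype (odd count avoids ties).
    return np.sign(np.sum(hvs, axis=0))

def associative_search(query, prototypes):
    # Nearest prototype by dot-product similarity -- the operation
    # a CAM array accelerates in hardware.
    return int(np.argmax(prototypes @ query))

class_a = [random_hv() for _ in range(5)]
class_b = [random_hv() for _ in range(5)]
protos = np.stack([bundle(class_a), bundle(class_b)])

# Corrupt 10% of a class-A member's dimensions; the search should
# still recover class 0 thanks to hyperdimensional robustness.
noisy = class_a[0].copy()
flip = rng.choice(D, size=D // 10, replace=False)
noisy[flip] *= -1
```

At D = 10,000, even a heavily corrupted query remains far closer to its own class prototype than to any other, which is why associative search over large HVs is the dominant (and memory-bound) operation.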
  2. We present an evolution of multiple-input multiple-output (MIMO) wireless communications known as the tri-hybrid MIMO architecture. In this framework, the traditional operations of linear precoding at the transmitter are distributed across digital beamforming, analog beamforming, and reconfigurable antennas. Compared with the hybrid MIMO architecture, which combines digital and analog beamforming, the tri-hybrid approach introduces a third layer of electromagnetic beamforming through antenna reconfigurability. This added layer offers a pathway to scale MIMO spatial dimensions, important for 6G systems operating in centimeter-wave bands, where the tension between larger bandwidths and infrastructure reuse necessitates ultra-large antenna arrays. We introduce the key features of the tri-hybrid architecture by (i) reviewing the benefits and challenges of communicating with reconfigurable antennas, (ii) examining tradeoffs between spectral and energy efficiency enabled by reconfigurability, and (iii) exploring configuration challenges across the three layers. Overall, the tri-hybrid MIMO architecture offers a new approach for integrating emerging antenna technologies in the MIMO precoding framework. 
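The three-layer precoding chain the abstract describes composes naturally as a signal model. This toy NumPy sketch is illustrative only: the per-antenna mode response `gain` is a hypothetical stand-in for pattern reconfigurability, not a physical antenna model:

```python
import numpy as np

rng = np.random.default_rng(3)
Nt, Nrf, Ns = 16, 4, 2  # transmit antennas, RF chains, data streams (illustrative)

# Layer 1 -- digital baseband precoder: arbitrary complex entries.
F_bb = (rng.standard_normal((Nrf, Ns))
        + 1j * rng.standard_normal((Nrf, Ns))) / np.sqrt(2)

# Layer 2 -- analog precoder: phase shifters only, so unit-modulus entries.
F_rf = np.exp(1j * rng.uniform(0, 2 * np.pi, (Nt, Nrf)))

# Layer 3 -- electromagnetic layer: each reconfigurable antenna selects one
# of a few modes, reshaping the effective channel (hypothetical toy response).
antenna_state = rng.integers(0, 4, Nt)
gain = np.exp(1j * antenna_state * np.pi / 2)

H = rng.standard_normal((8, Nt)) + 1j * rng.standard_normal((8, Nt))
H_eff = H * gain                 # channel as seen through the antenna states

s = rng.standard_normal((Ns, 1)) + 1j * rng.standard_normal((Ns, 1))
y = H_eff @ F_rf @ F_bb @ s      # received signal through the tri-hybrid chain
```

The configuration challenge the abstract raises is visible here: `F_bb`, `F_rf`, and `antenna_state` live in very different search spaces (unconstrained complex, unit-modulus, and discrete), and they must be optimized jointly.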
  3. In-memory computing (IMC) provides energy-efficient solutions for deep neural networks (DNNs). Most IMC designs for DNNs employ fixed-point precision. However, floating-point precision is still required for DNN training and for complex inference models to maintain high accuracy. No prior floating-point IMC work in the literature immerses the floating-point computation into the weight memory storage. In this work, we propose a novel floating-point IMC macro with a configurable architecture that supports both standard 8-bit floating point (FP8) and 8-bit block floating point (BF8) with a shared exponent. The proposed FP-IMC macro, implemented in 28nm CMOS, demonstrates 12.1 TOPS/W for FP8 precision and 66.6 TOPS/W for BF8 precision, improving energy efficiency beyond state-of-the-art FP IMC macros.
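The shared-exponent idea behind BF8 can be sketched numerically: one exponent is stored per block, and each value keeps only a signed fixed-point mantissa. The bit widths and rounding below are assumptions for illustration, not the macro's exact bit layout:

```python
import numpy as np

def to_bf8_block(x, mant_bits=7):
    # Block floating point: one shared exponent for the whole block,
    # signed integer mantissas per element (illustrative layout).
    shared_exp = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-38)))
    scale = 2.0 ** (shared_exp - mant_bits)
    mant = np.clip(np.round(x / scale), -(2 ** mant_bits), 2 ** mant_bits - 1)
    return mant.astype(np.int32), shared_exp

def from_bf8_block(mant, shared_exp, mant_bits=7):
    # Reconstruct real values from mantissas and the shared exponent.
    return mant * 2.0 ** (shared_exp - mant_bits)

x = np.array([0.81, -0.33, 0.05, 0.62])
m, e = to_bf8_block(x)
x_hat = from_bf8_block(m, e)
```

Because the exponent is shared, the per-element datapath reduces to integer multiply-accumulate, which is what lets a BF8 mode reach much higher TOPS/W than element-wise FP8 in the same array.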
  4. The increasingly central role of speech-based human-computer interaction necessitates on-device, low-latency, low-power, high-accuracy keyword spotting (KWS). State-of-the-art accuracies on speech-related tasks have been achieved by long short-term memory (LSTM) neural network (NN) models. Such models are typically computationally intensive because of their heavy use of matrix-vector multiplication (MVM) operations. Compute-in-memory (CIM) architectures, while well suited to MVM operations, have not seen widespread adoption for LSTMs. In this paper we adapt resistive random-access-memory-based CIM architectures for KWS using LSTMs. We find that a hybrid system composed of CIM cores and digital cores achieves 90% test accuracy on the Google speech dataset at a cost of 25 uJ per decision. Our optimized architecture uses 5-bit inputs and analog weights to produce 6-bit outputs. All digital computation is performed with 8-bit precision, leading to a 3.7× improvement in computational efficiency compared to equivalent digital systems at that accuracy.
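The 5-bit-in / 6-bit-out MVM datapath can be modeled functionally in a few lines. The uniform symmetric quantizer here is a hypothetical stand-in for the DAC/ADC behavior, and the float weight matrix stands in for the analog conductances:

```python
import numpy as np

def quantize(x, bits, x_max):
    # Hypothetical uniform symmetric quantizer modeling a DAC/ADC stage.
    levels = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / x_max * levels), -levels, levels) / levels * x_max

def cim_mvm(x, w, in_bits=5, out_bits=6):
    # Functional model of one CIM matrix-vector multiply:
    # 5-bit digital inputs drive analog weight columns, the analog
    # accumulation is read back out at 6-bit precision.
    xq = quantize(x, in_bits, np.max(np.abs(x)) + 1e-9)
    y = w @ xq  # analog accumulation along the bit lines
    return quantize(y, out_bits, np.max(np.abs(y)) + 1e-9)

rng = np.random.default_rng(2)
w = rng.standard_normal((8, 16)) * 0.1  # stands in for analog conductances
x = rng.standard_normal(16)             # one LSTM gate's input slice
y = cim_mvm(x, w)
```

An LSTM step would issue several such MVMs per gate; the hybrid split keeps these on CIM cores while the 8-bit element-wise gate arithmetic stays on the digital cores.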
  5. We present a fully digital multiply-and-accumulate (MAC) in-memory computing (IMC) macro demonstrating one of the fastest flexible-precision integer-based MACs to date. The design boasts a new bit-parallel architecture enabled by a 10T bit-cell capable of four AND operations and a decomposed-precision data flow that decreases the number of shift-accumulate operations, bringing down the overall adder hardware cost by 1.57× while maintaining 100% utilization for all supported precisions. It also employs a carry-save adder tree that saves 21% of adder hardware. The 28-nm prototype chip achieves speed-ups of 2.6×, 10.8×, 2.42×, and 3.22× over prior SoTA in 1bW:1bI, 1bW:4bI, 4bW:4bI, and 8bW:8bI MACs, respectively.
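The principle behind such digital IMC MACs, decomposing a multi-bit dot product into 1-bit AND operations plus shift-accumulates, can be shown in plain Python. This is a generic sketch of the bit-plane decomposition (unsigned operands for simplicity), not this macro's specific data flow:

```python
def bitplane_mac(weights, inputs, wbits=4, ibits=4):
    # Decompose a dot product of unsigned integers into bitwise ANDs
    # plus shift-accumulates -- the operation family a bit-parallel
    # IMC macro maps onto its bit-cells and adder tree.
    acc = 0
    for i in range(wbits):
        for j in range(ibits):
            # Partial product: AND weight bit-plane i with input bit-plane j,
            # then popcount across the vector (one column-wise array operation).
            partial = sum(((w >> i) & 1) & ((x >> j) & 1)
                          for w, x in zip(weights, inputs))
            acc += partial << (i + j)  # shift by the combined bit significance
    return acc

w = [3, 5, 7, 2]
x = [1, 4, 2, 6]
result = bitplane_mac(w, x)  # equals 3*1 + 5*4 + 7*2 + 2*6 = 49
```

Supporting flexible precision then amounts to varying `wbits` and `ibits`; reducing the number of shift-accumulate passes (as the decomposed-precision data flow does) directly shrinks the adder hardware this loop represents.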