<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcq="http://purl.org/dc/terms/">
  <records count="1" morepages="false" start="1" end="1">
    <record rownumber="1">
      <dc:product_type>Conference Paper</dc:product_type>
      <dc:title>Accelerating 1-Bit LLMs via In-Memory Computing Architectures</dc:title>
      <dc:creator>Malekar, Jinendra [Computer Science and Engineering, University of South Carolina, Columbia, SC, 29201]; Zand, Ramtin [Computer Science and Engineering, University of South Carolina, Columbia, SC, 29201]</dc:creator>
      <dc:corporate_author/>
      <dc:editor/>
      <dc:description>In this paper, we present a novel hybrid computing architecture designed to accelerate inference in 1-bit large language models (LLMs). Our approach combines the strengths of analog in-memory computing (IMC) and digital systolic arrays to address the diverse precision requirements across different layers of 1-bit LLMs. Specifically, we utilize analog IMC to accelerate low-precision matrix multiplication (MatMul) operations within the projection layers, which are naturally amenable to extreme quantization. Meanwhile, digital systolic arrays are employed to efficiently handle high-precision MatMul operations in the attention heads, preserving accuracy where precision is most critical. By partitioning the computational workload based on precision needs, our hybrid architecture increases throughput and energy efficiency. Experimental evaluations demonstrate that our design delivers up to an 80x improvement in tokens processed per second and achieves a 70% increase in energy efficiency (tokens per joule) when compared to conventional digital hardware accelerators.</dc:description>
      <dc:publisher>IEEE</dc:publisher>
      <dc:date>2025-11-25</dc:date>
      <dc:nsf_par_id>10674875</dc:nsf_par_id>
      <dc:journal_name>Conference proceedings</dc:journal_name>
      <dc:journal_volume/>
      <dc:journal_issue/>
      <dc:page_range_or_elocation>178 to 182</dc:page_range_or_elocation>
      <dc:issn>1558-3899</dc:issn>
      <dc:isbn>979-8-3315-8934-9</dc:isbn>
      <dc:doi>https://doi.org/10.1109/MWSCAS53549.2025.11244527</dc:doi>
      <dcq:identifierAwardId>2409697; 2340249</dcq:identifierAwardId>
      <dc:subject/>
      <dc:version_number/>
      <dc:location/>
      <dc:rights/>
      <dc:institution/>
      <dc:sponsoring_org>National Science Foundation</dc:sponsoring_org>
    </record>
  </records>
</rdf:RDF>
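The abstract above describes partitioning MatMul workloads by precision: extreme-quantized projection layers mapped to analog IMC, full-precision attention MatMuls kept on digital systolic arrays. As a minimal illustrative sketch (not the paper's implementation), the Python snippet below models that split in software, assuming a BitNet-b1.58-style absmean ternarization for the projection weights, a scheme commonly used for 1-bit LLMs but not specified by this record; the function names `absmean_ternarize`, `projection_matmul`, and `attention_scores` are hypothetical.

```python
import numpy as np

def absmean_ternarize(w, eps=1e-6):
    """Assumed BitNet-b1.58-style quantization of weights to
    {-1, 0, +1} with a per-tensor scale; this is the low-precision
    form the paper maps onto analog IMC crossbars."""
    scale = np.mean(np.abs(w)) + eps
    return np.clip(np.round(w / scale), -1, 1), scale

def projection_matmul(x, w):
    """Low-precision projection MatMul (IMC-mapped path in the
    paper's partitioning): ternary weights, rescaled output."""
    w_q, scale = absmean_ternarize(w)
    return (x @ w_q) * scale

def attention_scores(q, k):
    """High-precision attention MatMul (systolic-array-mapped path):
    kept in full precision to preserve accuracy."""
    return (q @ np.swapaxes(k, -1, -2)) / np.sqrt(q.shape[-1])

# Toy usage: one projection followed by an attention-score MatMul.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)       # activations
w_proj = rng.standard_normal((64, 64)).astype(np.float32) # projection weights

h = projection_matmul(x, w_proj)  # extreme-quantized path
s = attention_scores(h, h)        # full-precision path
print(h.shape, s.shape)           # (4, 64) (4, 4)
```

This only mirrors the numerical partitioning; the reported 80x throughput and 70% energy-efficiency gains come from the hardware mapping itself, which a software sketch cannot reproduce.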