-
In this paper, we present a novel hybrid computing architecture designed to accelerate inference in 1-bit large language models (LLMs). Our approach combines the strengths of analog in-memory computing (IMC) and digital systolic arrays to address the diverse precision requirements across different layers of 1-bit LLMs. Specifically, we utilize analog IMC to accelerate low-precision matrix multiplication (MatMul) operations within the projection layers, which are naturally amenable to extreme quantization. Meanwhile, digital systolic arrays are employed to efficiently handle high-precision MatMul operations in the attention heads, preserving accuracy where precision is most critical. By partitioning the computational workload based on precision needs, our hybrid architecture increases throughput and energy efficiency. Experimental evaluations demonstrate that our design delivers up to an 80x improvement in tokens processed per second and achieves a 70% increase in energy efficiency (tokens per joule) when compared to conventional digital hardware accelerators.
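The precision-based partitioning described above can be sketched as a simple dispatcher. This is a minimal illustration only: the backend routing, layer names, and sign-based binarization with a per-tensor scale are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

def route_matmul(layer_type, x, w):
    """Dispatch a MatMul to the backend suited to its precision needs."""
    if layer_type == "projection":       # 1-bit weights -> analog IMC path
        w_bin = np.sign(w)               # extreme quantization to {-1, +1}
        scale = np.mean(np.abs(w))       # per-tensor scale (illustrative)
        return (x @ w_bin) * scale       # the MAC an analog crossbar would perform
    elif layer_type == "attention":      # high precision -> digital systolic path
        return x @ w                     # full-precision MatMul
    raise ValueError(f"unknown layer type: {layer_type}")

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 8))
y_proj = route_matmul("projection", x, w)  # approximate, binarized path
y_attn = route_matmul("attention", x, w)   # exact path
```

The point of the split is that projection-layer MatMuls tolerate the binarized approximation, while attention-path results must stay exact.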
-
The emergence of 1-bit large language models (LLMs) has sparked significant interest, promising substantial efficiency gains through extreme quantization. However, these benefits are inherently limited by the portion of the model that can be quantized. Specifically, 1-bit quantization typically targets only the projection layers, while the attention mechanisms remain in higher precision, potentially creating significant throughput bottlenecks. To address this, we present an adaptation of Amdahl's Law specifically tailored to LLMs, offering a quantitative framework for understanding the throughput limits of extreme quantization. Our analysis reveals how improvements in quantization can deliver substantial throughput gains, but only to the extent that they address critical throughput-constrained sections of the model. Through extensive experiments across diverse model architectures and hardware platforms, we highlight key trade-offs and performance ceilings, providing a roadmap for future research aimed at maximizing LLM throughput through more holistic quantization strategies.
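The throughput ceiling described above follows directly from Amdahl's Law. A minimal formulation, assuming f is the fraction of runtime spent in quantizable (projection) MatMuls and s is the speedup achieved on that quantized portion:

```python
def amdahl_throughput_speedup(f_quantizable, s_quantized):
    """Overall speedup when only a fraction f of runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - f_quantizable) + f_quantizable / s_quantized)

# If projection MatMuls are 80% of runtime, even a near-infinite speedup
# on them caps the end-to-end gain at 5x, since the unquantized 20% remains:
ceiling = amdahl_throughput_speedup(0.8, 1e9)   # -> ~5.0
modest = amdahl_throughput_speedup(0.8, 10.0)   # -> ~3.57
```

This is why the abstract emphasizes that gains accrue "only to the extent" that quantization addresses the throughput-constrained sections.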
-
Memristive crossbar-based analog processor-in-memory (PIM) architectures have the potential to deliver substantially higher energy efficiency for machine learning workloads than traditional architectures. The availability of a fast and accurate circuit-level simulation framework could enhance research and development efforts in this field. This paper introduces XbarSim, a domain-specific circuit-level solver designed to generate and solve the nodal equations of memristive crossbars, including the effects of bitline and wordline resistance, and to deploy the solver onto an FPGA emulator. XbarSim also supports partitioning larger arrays horizontally and vertically to subdivide the solver workload, improving memory locality and limiting resource requirements when deployed on an FPGA. The solver uses LU decomposition to pre-process the conductance matrix for each partition and solves for a batch of inputs to achieve high solver throughput. We demonstrate that XbarSim can achieve orders of magnitude speedup compared to HSPICE across various sizes of memristive crossbars, and the XbarSim FPGA emulator can further achieve a 2x to 3x speedup over our software version built on MATLAB.
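The factor-once, solve-many strategy behind the batched solver can be illustrated in a few lines. The conductance matrix below is a generic well-conditioned placeholder; constructing it from actual bitline/wordline resistances, as XbarSim does, is omitted.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 16                                    # nodes in one crossbar partition
G = rng.uniform(0.1, 1.0, (n, n))
G = G + G.T + n * np.eye(n)               # symmetric, diagonally dominant stand-in

lu, piv = lu_factor(G)                    # LU pre-processing: factor once per partition
I_batch = rng.standard_normal((n, 64))    # then reuse the factors for a batch of inputs
V = lu_solve((lu, piv), I_batch)          # node voltages for all 64 input vectors
```

Amortizing one O(n^3) factorization over many O(n^2) triangular solves is what makes the batched approach high-throughput compared to refactoring per input.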
-
Peng, Lu; Vaisband, Boris; Chen, Fan; Zhou, Peipei; Kvatinsky, Shahar; Xie, Jiafeng (Ed.)
In this paper, we propose the CrossNAS framework, an automated approach for exploring a vast, multidimensional search space that spans various design abstraction layers—circuits, architecture, and systems—to optimize the deployment of machine learning workloads on analog processing-in-memory (PIM) systems. CrossNAS leverages the single-path one-shot weight-sharing strategy combined with evolutionary search for the first time in the context of PIM system mapping and optimization. CrossNAS sets a new benchmark for PIM neural architecture search (NAS), outperforming previous methods in both accuracy and energy efficiency while maintaining comparable or shorter search times.
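The evolutionary-search component can be sketched as a small mutate-and-select loop. The search-space encoding and fitness function below are illustrative placeholders, not CrossNAS itself, which evaluates candidates against a weight-shared supernet.

```python
import random

random.seed(0)
# Hypothetical cross-layer search space (precision, crossbar tile size, depth).
SPACE = {"bits": [1, 2, 4, 8], "tile": [32, 64, 128], "depth": [4, 8, 16]}

def fitness(cfg):
    """Placeholder score trading off accuracy proxy against energy proxy."""
    return cfg["bits"] * 0.1 + 64.0 / cfg["tile"] - cfg["depth"] * 0.01

def mutate(cfg):
    """Re-sample one randomly chosen dimension of a parent configuration."""
    child = dict(cfg)
    key = random.choice(list(SPACE))
    child[key] = random.choice(SPACE[key])
    return child

pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(8)]
for _ in range(20):                       # generations
    pop.sort(key=fitness, reverse=True)   # keep the fittest half
    pop = pop[:4] + [mutate(random.choice(pop[:4])) for _ in range(4)]
best = max(pop, key=fitness)
```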
-
This paper proposes PixelPrune, an approach to address two primary challenges in artificial intelligence of things (AIoT) vision systems: (1) the energy-intensive analog-to-digital converters (ADCs) required in the sensing unit for converting analog pixel arrays to digital tensors, and (2) the high data transfers between the sensing unit and computing unit. Our proposed solution involves the implementation of an in-sensor binary segmentation model on analog memristive crossbars to identify the important pixels and prune out the background information. Additionally, we propose a data transfer scheme that adaptively selects between dense and sparse data transfer formats based on the sparsity ratio measured from the segmentation mask obtained by the segmentation model. Our results demonstrate that the proposed object detection system achieves significant energy savings along with a reduction of up to 95% in data transfer, all while maintaining high accuracy.
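The adaptive format selection can be sketched as a per-frame cost comparison driven by the segmentation mask's sparsity. The byte costs per pixel and per index below are illustrative assumptions, not measurements from the paper.

```python
import numpy as np

def choose_format(mask, bytes_per_pixel=1, bytes_per_index=2):
    """Pick the cheaper encoding for the pixels kept by a binary segmentation mask."""
    kept = int(mask.sum())
    dense_cost = mask.size * bytes_per_pixel                   # transfer every pixel
    sparse_cost = kept * (bytes_per_pixel + bytes_per_index)   # value + coordinate
    if sparse_cost < dense_cost:
        return "sparse", sparse_cost
    return "dense", dense_cost

mask = np.zeros((32, 32), dtype=bool)
mask[10:14, 10:14] = True                  # a 4x4 foreground region (16 pixels)
fmt, cost = choose_format(mask)            # sparse wins: 48 bytes vs. 1024
```

When the mask is mostly background the sparse format wins by a wide margin; for dense foregrounds the scheme falls back to the plain dense transfer, so the overhead of indices is never paid unnecessarily.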
-
In this work, we employ neural architecture search (NAS) to enhance the efficiency of deploying diverse machine learning (ML) tasks on in-memory computing (IMC) architectures. Initially, we design three fundamental components inspired by the convolutional layers found in VGG and ResNet models. Subsequently, we utilize Bayesian optimization to construct a convolutional neural network (CNN) model with adaptable depths, employing these components. Through the Bayesian search algorithm, we explore a vast search space comprising over 640 million network configurations to identify the optimal solution, considering various multi-objective cost functions like accuracy/latency and accuracy/energy. Our evaluation of this NAS approach for IMC architecture deployment spans three distinct image classification datasets, demonstrating the effectiveness of our method in achieving a balanced solution characterized by high accuracy and reduced latency and energy consumption.
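A multi-objective cost of the accuracy/latency or accuracy/energy kind can be expressed as a weighted scalarization, which is one common way such objectives are combined for search. The weights and candidate metrics below are hypothetical, not values from the paper.

```python
def cost(accuracy, latency_ms, energy_mj, w_lat=0.01, w_en=0.005):
    """Lower is better: reward accuracy, penalize latency and energy."""
    return -accuracy + w_lat * latency_ms + w_en * energy_mj

# Two hypothetical candidate networks from the search space.
candidates = [
    {"accuracy": 0.91, "latency_ms": 12.0, "energy_mj": 30.0},
    {"accuracy": 0.89, "latency_ms": 6.0,  "energy_mj": 15.0},
]
best = min(candidates, key=lambda c: cost(**c))  # slight accuracy loss buys big savings
```

Adjusting the weights shifts the balanced solution along the accuracy/latency/energy trade-off surface, which is the knob such a multi-objective search exposes.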