-
In this paper, we present a novel hybrid computing architecture designed to accelerate inference in 1-bit large language models (LLMs). Our approach combines the strengths of analog in-memory computing (IMC) and digital systolic arrays to address the diverse precision requirements across different layers of 1-bit LLMs. Specifically, we utilize analog IMC to accelerate low-precision matrix multiplication (MatMul) operations within the projection layers, which are naturally amenable to extreme quantization. Meanwhile, digital systolic arrays are employed to efficiently handle high-precision MatMul operations in the attention heads, preserving accuracy where precision is most critical. By partitioning the computational workload based on precision needs, our hybrid architecture increases throughput and energy efficiency. Experimental evaluations demonstrate that our design delivers up to an 80x improvement in tokens processed per second and achieves a 70% increase in energy efficiency (tokens per joule) when compared to conventional digital hardware accelerators.
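The precision-based partitioning described above can be sketched as a simple dispatcher. This is a minimal illustration only: the backend routing, layer names, and sign-based binarization with a per-tensor scale are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np

def route_matmul(layer_type, x, w):
    """Dispatch a MatMul to the backend suited to its precision needs."""
    if layer_type == "projection":       # 1-bit weights -> analog IMC path
        w_bin = np.sign(w)               # extreme quantization to {-1, +1}
        scale = np.mean(np.abs(w))       # per-tensor scale (illustrative)
        return (x @ w_bin) * scale       # the MAC an analog crossbar would perform
    elif layer_type == "attention":      # high precision -> digital systolic path
        return x @ w                     # full-precision MatMul
    raise ValueError(f"unknown layer type: {layer_type}")

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 8))
y_proj = route_matmul("projection", x, w)  # approximate, binarized path
y_attn = route_matmul("attention", x, w)   # exact path
```

The point of the split is that projection-layer MatMuls tolerate the binarized approximation, while attention-path results must stay exact.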
-
The emergence of 1-bit large language models (LLMs) has sparked significant interest, promising substantial efficiency gains through extreme quantization. However, these benefits are inherently limited by the portion of the model that can be quantized. Specifically, 1-bit quantization typically targets only the projection layers, while the attention mechanisms remain in higher precision, potentially creating significant throughput bottlenecks. To address this, we present an adaptation of Amdahl's Law specifically tailored to LLMs, offering a quantitative framework for understanding the throughput limits of extreme quantization. Our analysis reveals how improvements in quantization can deliver substantial throughput gains, but only to the extent that they address critical throughput-constrained sections of the model. Through extensive experiments across diverse model architectures and hardware platforms, we highlight key trade-offs and performance ceilings, providing a roadmap for future research aimed at maximizing LLM throughput through more holistic quantization strategies.
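The throughput ceiling described above follows directly from Amdahl's Law. A minimal formulation, assuming f is the fraction of runtime spent in quantizable (projection) MatMuls and s is the speedup achieved on that quantized portion:

```python
def amdahl_throughput_speedup(f_quantizable, s_quantized):
    """Overall speedup when only a fraction f of runtime is accelerated by factor s."""
    return 1.0 / ((1.0 - f_quantizable) + f_quantizable / s_quantized)

# If projection MatMuls are 80% of runtime, even a near-infinite speedup
# on them caps the end-to-end gain at 5x, since the unquantized 20% remains:
ceiling = amdahl_throughput_speedup(0.8, 1e9)   # -> ~5.0
modest = amdahl_throughput_speedup(0.8, 10.0)   # -> ~3.57
```

This is why the abstract emphasizes that gains accrue "only to the extent" that quantization addresses the throughput-constrained sections.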
-
Memristive crossbar-based analog processor-in-memory (PIM) architectures have the potential to deliver substantially higher energy efficiency for machine learning workloads than traditional architectures. The availability of a fast and accurate circuit-level simulation framework could enhance research and development efforts in this field. This paper introduces XbarSim, a domain-specific circuit-level solver designed to generate and solve the nodal equations of memristive crossbars, including the effects of bitline and wordline resistance, and to deploy the solver onto an FPGA emulator. XbarSim also supports partitioning larger arrays horizontally and vertically to subdivide the solver workload, improving memory locality and limiting resource requirements when deployed on an FPGA. The solver uses LU decomposition to pre-process the conductance matrix for each partition and solves for a batch of inputs to achieve high solver throughput. We demonstrate that XbarSim can achieve orders of magnitude speedup compared to HSPICE across various sizes of memristive crossbars, and the XbarSim FPGA emulator can further achieve a 2x to 3x speedup over our software version built on MATLAB.
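The factor-once, solve-many strategy behind the batched solver can be illustrated in a few lines. The conductance matrix below is a generic well-conditioned placeholder; constructing it from actual bitline/wordline resistances, as XbarSim does, is omitted.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 16                                    # nodes in one crossbar partition
G = rng.uniform(0.1, 1.0, (n, n))
G = G + G.T + n * np.eye(n)               # symmetric, diagonally dominant stand-in

lu, piv = lu_factor(G)                    # LU pre-processing: factor once per partition
I_batch = rng.standard_normal((n, 64))    # then reuse the factors for a batch of inputs
V = lu_solve((lu, piv), I_batch)          # node voltages for all 64 input vectors
```

Amortizing one O(n^3) factorization over many O(n^2) triangular solves is what makes the batched approach high-throughput compared to refactoring per input.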
-
Peng, Lu; Vaisband, Boris; Chen, Fan; Zhou, Peipei; Kvatinsky, Shahar; Xie, Jiafeng (Ed.)
In this paper, we propose the CrossNAS framework, an automated approach for exploring a vast, multidimensional search space that spans various design abstraction layers—circuits, architecture, and systems—to optimize the deployment of machine learning workloads on analog processing-in-memory (PIM) systems. CrossNAS leverages the single-path one-shot weight-sharing strategy combined with evolutionary search for the first time in the context of PIM system mapping and optimization. CrossNAS sets a new benchmark for PIM neural architecture search (NAS), outperforming previous methods in both accuracy and energy efficiency while maintaining comparable or shorter search times.
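The evolutionary-search component can be sketched as a small mutate-and-select loop. The search-space encoding and fitness function below are illustrative placeholders, not CrossNAS itself, which evaluates candidates against a weight-shared supernet.

```python
import random

random.seed(0)
# Hypothetical cross-layer search space (precision, crossbar tile size, depth).
SPACE = {"bits": [1, 2, 4, 8], "tile": [32, 64, 128], "depth": [4, 8, 16]}

def fitness(cfg):
    """Placeholder score trading off accuracy proxy against energy proxy."""
    return cfg["bits"] * 0.1 + 64.0 / cfg["tile"] - cfg["depth"] * 0.01

def mutate(cfg):
    """Re-sample one randomly chosen dimension of a parent configuration."""
    child = dict(cfg)
    key = random.choice(list(SPACE))
    child[key] = random.choice(SPACE[key])
    return child

pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(8)]
for _ in range(20):                       # generations
    pop.sort(key=fitness, reverse=True)   # keep the fittest half
    pop = pop[:4] + [mutate(random.choice(pop[:4])) for _ in range(4)]
best = max(pop, key=fitness)
```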
-
This paper proposes PixelPrune, an approach to address two primary challenges in artificial intelligence of things (AIoT) vision systems: (1) the energy-intensive analog-to-digital converters (ADCs) required in the sensing unit for converting analog pixel arrays to digital tensors, and (2) the high data transfers between the sensing unit and computing unit. Our proposed solution involves the implementation of an in-sensor binary segmentation model on analog memristive crossbars to identify the important pixels and prune out the background information. Additionally, we propose a data transfer scheme that adaptively selects between dense and sparse data transfer formats based on the sparsity ratio measured from the segmentation mask obtained by the segmentation model. Our results demonstrate that the proposed object detection system achieves significant energy savings along with a reduction of up to 95% in data transfer, all while maintaining high accuracy.
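The adaptive format selection can be sketched as a per-frame cost comparison driven by the segmentation mask's sparsity. The byte costs per pixel and per index below are illustrative assumptions, not measurements from the paper.

```python
import numpy as np

def choose_format(mask, bytes_per_pixel=1, bytes_per_index=2):
    """Pick the cheaper encoding for the pixels kept by a binary segmentation mask."""
    kept = int(mask.sum())
    dense_cost = mask.size * bytes_per_pixel                   # transfer every pixel
    sparse_cost = kept * (bytes_per_pixel + bytes_per_index)   # value + coordinate
    if sparse_cost < dense_cost:
        return "sparse", sparse_cost
    return "dense", dense_cost

mask = np.zeros((32, 32), dtype=bool)
mask[10:14, 10:14] = True                  # a 4x4 foreground region (16 pixels)
fmt, cost = choose_format(mask)            # sparse wins: 48 bytes vs. 1024
```

When the mask is mostly background the sparse format wins by a wide margin; for dense foregrounds the scheme falls back to the plain dense transfer, so the overhead of indices is never paid unnecessarily.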
-
In this work, we employ neural architecture search (NAS) to enhance the efficiency of deploying diverse machine learning (ML) tasks on in-memory computing (IMC) architectures. Initially, we design three fundamental components inspired by the convolutional layers found in VGG and ResNet models. Subsequently, we utilize Bayesian optimization to construct a convolutional neural network (CNN) model with adaptable depths, employing these components. Through the Bayesian search algorithm, we explore a vast search space comprising over 640 million network configurations to identify the optimal solution, considering various multi-objective cost functions like accuracy/latency and accuracy/energy. Our evaluation of this NAS approach for IMC architecture deployment spans three distinct image classification datasets, demonstrating the effectiveness of our method in achieving a balanced solution characterized by high accuracy and reduced latency and energy consumption.
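A multi-objective cost of the accuracy/latency or accuracy/energy kind can be expressed as a weighted scalarization, which is one common way such objectives are combined for search. The weights and candidate metrics below are hypothetical, not values from the paper.

```python
def cost(accuracy, latency_ms, energy_mj, w_lat=0.01, w_en=0.005):
    """Lower is better: reward accuracy, penalize latency and energy."""
    return -accuracy + w_lat * latency_ms + w_en * energy_mj

# Two hypothetical candidate networks from the search space.
candidates = [
    {"accuracy": 0.91, "latency_ms": 12.0, "energy_mj": 30.0},
    {"accuracy": 0.89, "latency_ms": 6.0,  "energy_mj": 15.0},
]
best = min(candidates, key=lambda c: cost(**c))  # slight accuracy loss buys big savings
```

Adjusting the weights shifts the balanced solution along the accuracy/latency/energy trade-off surface, which is the knob such a multi-objective search exposes.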