NSF PAR Search | NSF Public Access Repository


An ASIC Accelerator for QNN With Variable Precision and Tunable Energy Efficiency

https://doi.org/10.1109/TCAD.2024.3357597

Wagle, Ankit; Singh, Gian; Khatri, Sunil; Vrudhula, Sarma (July 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

This article presents TULIP, a new architecture for variable-precision quantized neural network (QNN) inference. It is designed with the goal of maximizing energy efficiency per classification. TULIP is constructed by arranging a collection of unique processing elements (TULIP-PEs) in a single-instruction–multiple-data (SIMD) fashion. Each TULIP-PE contains binary neurons that are interconnected using multiplexers. Each neuron also has a small dedicated local register connected to it. The binary neurons are implemented as standard cells and used for implementing threshold functions, i.e., an inner-product and thresholding operation on their binary inputs. The neurons can be reconfigured with a single change in the control signals to implement all the standard operations used in a QNN. This article presents novel algorithms for implementing the operations of a QNN on the TULIP-PEs in the form of a schedule of threshold functions. TULIP was implemented as an ASIC in TSMC 40nm-LP technology. A QNN accelerator that employs a conventional multiply and accumulate-based arithmetic processor was also implemented in the same technology to provide a fair comparison. The results show that TULIP is 30× to 50× more energy-efficient than an equivalent design, without any penalty in performance, area, or accuracy. Furthermore, TULIP achieves these improvements without using traditional techniques such as voltage scaling or approximate computing. Finally, this article also demonstrates how the run-time tradeoff between accuracy and energy efficiency is made on the TULIP architecture.
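As a rough illustration of the threshold function mentioned in the abstract (an inner product of binary inputs with weights, followed by a comparison against a threshold), the following minimal sketch may help. The function names, weights, and thresholds are illustrative assumptions and are not taken from the TULIP paper; the sketch only shows how a single reconfigurable neuron could realize different Boolean operations by changing its parameters.

```python
from typing import Sequence


def threshold_neuron(inputs: Sequence[int], weights: Sequence[int], threshold: int) -> int:
    """Return 1 if the weighted sum of binary inputs reaches the threshold, else 0."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    if any(x not in (0, 1) for x in inputs):
        raise ValueError("inputs must be binary (0 or 1)")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return int(weighted_sum >= threshold)


# Hypothetical configurations: the same neuron with different parameters.
def and2(a: int, b: int) -> int:
    return threshold_neuron([a, b], [1, 1], 2)  # fires only when both inputs are 1


def or2(a: int, b: int) -> int:
    return threshold_neuron([a, b], [1, 1], 1)  # fires when at least one input is 1


def majority3(a: int, b: int, c: int) -> int:
    return threshold_neuron([a, b, c], [1, 1, 1], 2)  # fires when two or more inputs are 1
```

Changing only the weights and the threshold switches the computed function, which is the general idea behind reconfiguring a neuron through its control signals; the actual TULIP-PE schedule is described in the paper itself.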
Full Text Available
Scaled Population Division for Approximate Computing

https://doi.org/10.1109/ISLPED58423.2023.10244709

Bharathi, Kunal; Khatri, Sunil P.; Hu, Jiang (August 2023, Proceedings International Symposium on Low Power Electronics and Design)

In this paper we present an approximate division scheme for Scaled Population (SP) arithmetic, a technique that addresses the limitations of stochastic computing (SC). SP arithmetic circuits are designed (a) to perform all operations with a constant delay and (b) to use scaling operations that help reduce errors compared to SC circuits. As part of this work, we also present a method to correlate two SP numbers with a constant delay. We compare our SP divider with SC dividers, as well as fixed-point dividers, in terms of area, power, and delay. Our 512-bit SP divider has a delay (power) that is 0.08× (0.06×) that of the equivalent fixed-point binary divider. Compared to an equivalent SC divider, the power-delay product of our divider is 13× better. Index Terms: Approximate Arithmetic, Stochastic Computing, Computer Arithmetic, Approximate Division, Fast Division
Full Text Available
A Novel ASIC Design Flow Using Weight-Tunable Binary Neurons as Standard Cells

https://doi.org/10.1109/TCSI.2022.3164995

Wagle, Ankit; Singh, Gian; Khatri, Sunil; Vrudhula, Sarma (July 2022, IEEE Transactions on Circuits and Systems I: Regular Papers)

In this paper, we describe a design of a mixed-signal circuit for a binary neuron (a.k.a. perceptron, threshold logic gate) and a methodology for automatically embedding such cells in ASICs. The binary neuron, referred to as an FTL (flash threshold logic) cell, uses floating gate or flash transistors whose threshold voltages serve as a proxy for the weights of the neuron. Algorithms for mapping the weights to the flash transistor threshold voltages are presented. The threshold voltages are determined to maximize both the robustness of the cell and its speed. A single FTL cell is shown to be significantly smaller (79.4%), consume less power (61.6%), and operate faster (40.3%) than conventional CMOS logic equivalents. Also included are the architecture and the algorithms to program the flash devices of an FTL. The FTL cells are implemented as standard cells, and are designed to allow commercial synthesis and P&R tools to automatically use them in the synthesis of ASICs. Substantial reductions in area and power without sacrificing performance are demonstrated on several ASIC benchmarks by the automatic embedding of FTL cells. The paper also demonstrates how FTL cells can be used for fixing timing errors after fabrication.
Full Text Available
Scaled Population Subtraction for Approximate Computing

https://doi.org/10.1109/ICCD50377.2020.00065

Bharathi, Kunal; Hu, Jiang; Khatri, Sunil P. (October 2020, IEEE International Conference on Computer Design)
In this paper we present Scaled Population Subtraction to fill a void in Scaled Population arithmetic. Scaled Population (SP) arithmetic is a scheme inspired by stochastic computing (SC), a non-conventional approximate computing method that is well known for its simplicity, area efficiency, and resilience to bit errors. SP arithmetic reduces the numerical errors compared to SC and also solves the serialization limitation of SC, since it is designed to have an O(1) gate delay. Previously, SP was limited to addition and multiplication and did not have a way to perform subtraction. This paper introduces the basic SP subtraction idea, followed by a detailed study of several ways that the basic design can be improved to reduce the computational error. Our best SP design reduces the error by 32.3% compared to our basic SP subtraction idea. We also study the trade-off between the design complexity of the SP subtractor and its output error. In addition, our implementation of the SP subtractor exhibits improved delay, power, and area compared to fixed-point realizations of the same size.
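For readers unfamiliar with the stochastic computing (SC) baseline that the abstract refers to, the sketch below shows the conventional unipolar SC encoding, in which a value in [0, 1] is the fraction of 1s in a bitstream and multiplication reduces to a bitwise AND of two independent streams. This is a generic illustration of SC only, not the Scaled Population scheme from the paper; the helper names, stream length, and seed are assumptions chosen for the example.

```python
import random
from typing import List


def to_stream(p: float, length: int, rng: random.Random) -> List[int]:
    """Encode a probability p in [0, 1] as a random bitstream of the given length."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return [1 if rng.random() < p else 0 for _ in range(length)]


def from_stream(bits: List[int]) -> float:
    """Decode a bitstream back to a value: the fraction of bits that are 1."""
    return sum(bits) / len(bits)


def sc_multiply(a: List[int], b: List[int]) -> List[int]:
    """Multiply two unipolar SC values with one AND gate per bit position."""
    if len(a) != len(b):
        raise ValueError("streams must have the same length")
    return [x & y for x, y in zip(a, b)]


# Usage: the product is only approximate, and accuracy depends on the stream length.
rng = random.Random(0)  # fixed seed so the run is repeatable
a = to_stream(0.5, 4096, rng)
b = to_stream(0.25, 4096, rng)
estimate = from_stream(sc_multiply(a, b))  # close to 0.5 * 0.25 = 0.125
```

In hardware, the bits of an SC stream are typically processed one per clock cycle, which is the serialization limitation the abstract says SP arithmetic is designed to avoid.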
Full Text Available
