Approximate computing, which is promising for digital signal processing applications, has been extensively considered to trade off limited accuracy for improvements in other circuit metrics such as area, power, and performance. In this paper, approximate arithmetic circuits are proposed using emerging nanoscale spintronic devices. Leveraging the intrinsic current-mode thresholding operation of spintronic devices, we first present a hybrid spin-CMOS majority gate design based on a composite spintronic device structure consisting of a magnetic domain wall motion stripe and a magnetic tunnel junction. We further propose a compact and energy-efficient accuracy-configurable adder design based on the majority gate. Unlike most previous approximate circuit designs, which hardwire a fixed degree of approximation, this design adapts to the varying degrees of error resilience inherent in different applications. Subsequently, we propose two new approximate compressors for use in fast multiplier designs. Device-circuit SPICE simulation shows 34.58% and 66% reductions in power consumption for the accurate and approximate modes of the accuracy-configurable adder, respectively, compared to the recently reported domain wall motion-based full adder design. In addition, the proposed accuracy-configurable adder and approximate compressors can be efficiently utilized in the discrete cosine transform (DCT), a widely used digital image processing algorithm. The results indicate that the DCT and inverse DCT (IDCT) using the approximate multiplier achieve ~2x energy savings and a 3x speed-up compared to an exactly designed circuit, while delivering comparable output quality.
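To make the approximate-compressor idea concrete, the snippet below models an exact 4:2 compressor next to one simple textbook-style approximate variant that drops the carry-in and forces the outgoing carry to zero. It is a generic software sketch for illustration only, not the specific spintronic compressor designs proposed in the paper.

```python
def exact_compressor(x1, x2, x3, x4, cin):
    """Exact 4:2 compressor built from two cascaded full adders; cout does not depend on cin."""
    s1 = x1 ^ x2 ^ x3
    cout = (x1 & x2) | (x2 & x3) | (x1 & x3)
    sum_ = s1 ^ x4 ^ cin
    carry = (s1 & x4) | (x4 & cin) | (s1 & cin)
    return sum_, carry, cout

def approx_compressor(x1, x2, x3, x4, cin):
    """One simple approximate variant: ignore cin, force cout to 0, simplify sum/carry logic."""
    sum_ = (x1 ^ x2) | (x3 ^ x4)
    carry = (x1 & x2) | (x3 & x4)
    return sum_, carry, 0

def weighted_value(sum_, carry, cout):
    # a 4:2 compressor encodes x1 + x2 + x3 + x4 + cin as sum + 2*carry + 2*cout
    return sum_ + 2 * carry + 2 * cout

errors = 0
for pattern in range(32):                                   # all 2^5 input combinations
    bits = [(pattern >> i) & 1 for i in range(5)]
    if weighted_value(*exact_compressor(*bits)) != weighted_value(*approx_compressor(*bits)):
        errors += 1
print(f"approximate compressor disagrees on {errors} of 32 input patterns")
```

Exhaustively checking all 32 input patterns in this way gives the kind of error profile that determines where such a compressor can be placed in a multiplier tree without visibly degrading the output.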
SECO: A Scalable Accuracy Approximate Exponential Function Via Cross-Layer Optimization
From signal processing to emerging deep neural networks, a range of applications exhibit intrinsic error resilience. For such applications, approximate computing opens up new possibilities for energy-efficient computing by producing slightly inaccurate results using greatly simplified hardware. Adopting this approach, a variety of basic arithmetic units, such as adders and multipliers, have been effectively redesigned to generate approximate results for many error-resilient applications. In this work, we propose SECO, an approximate exponential function unit (EFU). Exponentiation is a key operation in many signal processing applications and, more importantly, in spiking neuron models, but its energy-efficient implementation has been inadequately explored. We also introduce a cross-layer design method for SECO to optimize the energy-accuracy trade-off. At the algorithm level, SECO offers runtime scaling between energy efficiency and accuracy based on an approximate Taylor expansion, where the error is minimized by optimizing parameters using discrete gradient descent at design time. At the circuit level, our error analysis method efficiently explores the design space to select the energy-accuracy-optimal approximate multiplier at design time. In tandem, the cross-layer design and runtime optimization method are able to generate energy-efficient and accurate approximate EFU designs that are up to 99.7% accurate at an energy cost of 3.73 pJ per exponential operation. SECO is also evaluated on the adaptive exponential integrate-and-fire neuron model, yielding only 0.002% timing error and 0.067% value error compared to the precise neuron model.
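The core accuracy knob described above can be sketched in a few lines: truncate the Taylor series of exp(x) at a runtime-configurable number of terms. This is only a conceptual sketch; SECO's design-time coefficient optimization via discrete gradient descent and its circuit-level approximate-multiplier selection are not modeled here.

```python
import math

def approx_exp(x: float, n_terms: int) -> float:
    """Evaluate exp(x) with a Taylor series truncated at n_terms terms (the accuracy knob)."""
    result, term = 1.0, 1.0
    for k in range(1, n_terms):
        term *= x / k            # x**k / k!, computed incrementally
        result += term
    return result

# more terms -> more energy in hardware, but a smaller error
for n in (2, 4, 6, 8):
    approx = approx_exp(0.5, n)
    rel_err = abs(approx - math.exp(0.5)) / math.exp(0.5)
    print(f"{n} terms: exp(0.5) ~= {approx:.6f}, relative error = {rel_err:.2e}")
```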
- Award ID(s): 1845469
- PAR ID: 10156912
- Date Published:
- Journal Name: 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED)
- Page Range / eLocation ID: 1 - 6
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Iterative computing, where the output accuracy gradually improves over multiple iterations, enables dynamic reconfiguration of energy-quality trade-offs by adjusting the latency (i.e., number of iterations). In order to take full advantage of the dynamic reconfigurability of iterative computing hardware, an efficient method for determining the optimal latency is crucial. In this paper, we introduce an integer linear programming (ILP)-based scheduling method to determine the optimal latency of iterative computing hardware. We consider the input-dependence of output accuracy of approximate hardware using data-driven error modeling for accurate quality estimation. The proposed method finds optimal or near-optimal latency with a significant speedup compared to exhaustive search and decision tree-based optimization.
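As a rough sketch of what such a latency-selection ILP can look like, the snippet below picks an iteration count per input so as to maximize estimated quality under a total latency budget. It assumes the PuLP Python library, and the quality table, candidate iteration counts, and budget are made-up placeholders standing in for the paper's data-driven error model.

```python
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, value

# quality[i][k]: estimated output quality of input i after iterations[k] iterations
# (placeholder numbers standing in for a data-driven error model)
quality = [
    [0.60, 0.80, 0.95, 0.99],
    [0.40, 0.70, 0.85, 0.97],
    [0.75, 0.90, 0.96, 0.99],
]
iterations = [1, 2, 3, 4]        # candidate iteration counts (latencies)
latency_budget = 7               # total iterations allowed across all inputs

prob = LpProblem("iteration_scheduling", LpMaximize)
x = [[LpVariable(f"x_{i}_{k}", cat=LpBinary) for k in range(len(iterations))]
     for i in range(len(quality))]

for i in range(len(quality)):                  # each input gets exactly one latency
    prob += lpSum(x[i]) == 1
prob += lpSum(x[i][k] * iterations[k]          # stay within the total latency budget
              for i in range(len(quality)) for k in range(len(iterations))) <= latency_budget
prob += lpSum(x[i][k] * quality[i][k]          # objective: maximize total estimated quality
              for i in range(len(quality)) for k in range(len(iterations)))

prob.solve()
for i in range(len(quality)):
    chosen = next(k for k in range(len(iterations)) if value(x[i][k]) > 0.5)
    print(f"input {i}: run {iterations[chosen]} iteration(s)")
```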
We propose a novel method for approximate hardware implementation of univariate math functions with significantly fewer hardware resources compared to previous approaches. Examples of such functions include exp(x) and the activation function GELU(x), both used in transformer networks; gamma(x), which is used in image processing; and other functions such as tanh(x), cosh(x), sq(x), and sqrt(x). The method builds on previous works on hybrid binary-unary computing. The novelty in our approach is that we break a function into a number of sub-functions such that implementing each sub-function becomes cheap, and converting the output of the sub-functions to binary becomes almost trivial. Our method also uses self-similarity in functions to further reduce the cost. We compare our method to the conventional binary, previous stochastic computing, and hybrid binary-unary methods on several functions at 8-, 12-, and 16-bit resolutions. While preserving high accuracy, our method outperforms previous works in terms of hardware cost: for example, while tolerating less than 0.01 mean absolute error, it reduces the (area x latency) cost on average by 5, 7, and 2 orders of magnitude compared to the conventional binary, stochastic computing, and hybrid binary-unary methods, respectively. Ultimately, we demonstrate the potential benefits of our method for natural language processing and image processing applications: we deploy it to implement major blocks in an encoding layer of the BERT language model, as well as the Roberts Cross edge detection algorithm, both of which rely on non-linear functions.
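The sub-function idea can be loosely illustrated in software by splitting the input domain into segments and replacing the function on each segment with a cheap local piece, here a linear one. This is only an analogy under that simplification; the paper's hardware relies on hybrid binary-unary encodings and self-similarity, which are not modeled below.

```python
import math

def build_piecewise(f, lo, hi, n_segments):
    """Precompute per-segment linear sub-functions of f on [lo, hi] (a design-time step)."""
    step = (hi - lo) / n_segments
    pieces = []
    for s in range(n_segments):
        a, b = lo + s * step, lo + (s + 1) * step
        slope = (f(b) - f(a)) / (b - a)
        pieces.append((a, f(a), slope))          # segment start, value at start, local slope
    return pieces, step, lo

def eval_piecewise(pieces, step, lo, x):
    """Evaluate the approximation: one segment lookup plus one multiply-add."""
    idx = min(int((x - lo) / step), len(pieces) - 1)
    a, fa, slope = pieces[idx]
    return fa + slope * (x - a)

pieces, step, lo = build_piecewise(math.tanh, -4.0, 4.0, 16)
for x in (-2.0, 0.3, 1.7):
    approx = eval_piecewise(pieces, step, lo, x)
    print(f"tanh({x}) ~= {approx:.4f}  (exact {math.tanh(x):.4f})")
```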
Approximate computing (AC) leverages the inherent error resilience of applications and is used in many big-data domains such as multimedia, computer vision, signal processing, and machine learning to improve system performance and power consumption. Like approximate circuits and algorithms, the memory subsystem can also be approximated to enhance performance and save power significantly. This paper proposes an efficient and effective systematic methodology to construct an approximate non-volatile magneto-resistive RAM (MRAM) framework using commercial off-the-shelf (COTS) MRAM chips. In the proposed scheme, an extensive experimental characterization of memory errors is performed by manipulating the write latency of MRAM chips, which exploits the inherent (intrinsic/extrinsic process variation) stochastic switching behavior of magnetic tunnel junctions (MTJs). The experimental results, involving error-resilient image compression and machine learning applications, reveal that the proposed AC framework provides a significant performance improvement and reduces MRAM write energy by ~47.5% on average with negligible or no loss in output quality.
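A toy behavioral model of the mechanism is sketched below: shortening the write pulse raises the probability that a magnetic tunnel junction fails to switch, so stored words pick up stochastic bit errors. The per-bit error rates used here are illustrative placeholders, not measurements from the COTS chips characterized in the paper.

```python
import random

# hypothetical per-bit write-failure probability for each write-latency setting
BIT_ERROR_RATE = {"nominal": 0.0, "reduced_10pct": 1e-6, "reduced_25pct": 1e-3}

def approximate_write(word: int, width: int, latency_mode: str, rng=random) -> int:
    """Write a word; each bit independently fails (modeled as a flip) with probability p_fail."""
    p_fail = BIT_ERROR_RATE[latency_mode]
    stored = 0
    for bit in range(width):
        intended = (word >> bit) & 1
        written = intended if rng.random() >= p_fail else 1 - intended
        stored |= written << bit
    return stored

random.seed(0)
flips = sum(bin(approximate_write(0xFFFF, 16, "reduced_25pct") ^ 0xFFFF).count("1")
            for _ in range(10_000))
print(f"observed bit-error rate at reduced write latency: {flips / (16 * 10_000):.2e}")
```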
As we reach the limit of Moore's Law, researchers are exploring different paradigms to achieve unprecedented performance. Approximate Computing (AC), which relies on the ability of applications to tolerate some error in the results in order to trade off accuracy for performance, has shown significant promise. Despite the success of AC in domains such as Machine Learning, its acceptance in High-Performance Computing (HPC) is limited due to stringent accuracy requirements. We need tools and techniques to identify regions of the code that are amenable to approximation and to assess their impact on the application output quality, so as to guide developers toward selective approximation. To this end, we propose CHEF-FP, a flexible, scalable, and easy-to-use source-code transformation tool based on Automatic Differentiation (AD) for analyzing approximation errors in HPC applications. CHEF-FP uses Clad, an efficient AD tool built as a plugin to the Clang compiler on top of the LLVM compiler infrastructure, as a backend and utilizes its AD abilities to evaluate approximation errors in C++ code. CHEF-FP works at the source level by injecting error-estimation code into the generated adjoints. This enables the error-estimation code to undergo compiler optimizations, resulting in improved analysis time and reduced memory usage. We also provide theoretical and architectural augmentations to source-code transformation-based AD tools to perform floating-point (FP) error analysis. In this paper, we primarily focus on analyzing errors introduced by mixed-precision AC techniques, the most popular approximation technique in HPC. We also show the applicability of our tool in estimating other kinds of errors by evaluating it on codes that use approximate functions. Moreover, we demonstrate the speedups achieved by CHEF-FP during analysis time compared to the existing state-of-the-art tool, a result of its ability to generate and insert approximation-error estimation code directly into the derivative source. The generated code also becomes a candidate for better compiler optimizations, contributing to lower runtime performance overhead.
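The intuition behind derivative-based error estimation can be sketched numerically: a first-order bound on the output error is the sum of |∂F/∂x_i| · Δx_i over the inputs whose precision is reduced. In the sketch below, finite differences stand in for the adjoints that CHEF-FP obtains automatically through Clad, and the example function and rounding errors are purely illustrative.

```python
def estimate_error(f, xs, deltas, h=1e-6):
    """First-order estimate: sum_i |df/dx_i| * delta_i, with finite-difference derivatives."""
    base = f(*xs)
    total = 0.0
    for i, (x, d) in enumerate(zip(xs, deltas)):
        bumped = list(xs)
        bumped[i] = x + h
        dfdx = (f(*bumped) - base) / h   # finite-difference stand-in for the AD-generated adjoint
        total += abs(dfdx) * d
    return total

# example: error introduced by storing x and y in a lower-precision format
f = lambda x, y: x * x * y + y
xs = (1.5, 2.0)
deltas = (2 ** -11, 2 ** -11)            # roughly half-precision rounding error per input
print(f"estimated output error: {estimate_error(f, xs, deltas):.3e}")
```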