In this paper we present Scaled Population Subtraction to fill a void in Scaled Population arithmetic. Scaled population (SP) arithmetic is a scheme that is inspired by stochastic computing (SC), a non-conventional approximate computing method that is well known for its simplicity, area efficiency and resilience to bit errors. SP arithmetic reduces the numerical errors compared to SC and also solves the serialization limitation of SC, since it is designed to have a O(1) gate delay. Previously, SP was limited to only addition and multiplication and did not have a way to perform subtraction. This paper introduces the basic SP subtraction idea, followed by a detailed study of several ways that the basic design can be improved to reduce the computational error. Our best SP design significantly improves the error compared to our basic SP subtraction idea (reducing it by 32.3%). We also study the trade-off between design complexity of the SP subtractor against output error. Also, our implementation of the SP subtractor exhibits an improved delay, power and area compared to fixed point realizations with the same size.
Scaled Population Arithmetic for Efficient Stochastic Computing
We propose a new Scaled Population (SP) based arithmetic computation approach that achieves considerable improvements over existing stochastic computing (SC) techniques. First, SP arithmetic introduces scaling operations that significantly reduce the numerical errors as compared to SC. Experiments show accuracy improvements of a single multiplication and addition operation by 6.3X and 4X, respectively. Secondly, SP arithmetic erases the inherent serialization associated with stochastic computing, thereby significantly improves the computational delays. We design each of the operations of SP arithmetic to take O(1) gate delays, and eliminate the need of serially iterating over the bits of the population vector. Our SP approach improves the area, delay and power compared with conventional stochastic computing on an FPGA-based implementation. We also apply our SP scheme on a handwritten digit recognition application (MNIST), improving the recognition accuracy by 32.79% compared to SC.
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- ACM/IEEE Asia and South-Pacific Design Automation Conference
- Page Range or eLocation-ID:
- Sponsoring Org:
- National Science Foundation
More Like this
Multiply-accumulate (MAC) operations are common in data processing and machine learning but costly in terms of hardware usage. Stochastic Computing (SC) is a promising approach for low-cost hardware design of complex arithmetic operations such as multiplication. Computing with deterministic unary bit-streams (defined as bit-streams with all 1s grouped together at the beginning or end of a bit-stream) has been recently suggested to improve the accuracy of SC. Conventionally, SC designs use multiplexer (MUX) units or OR gates to accumulate data in the stochastic domain. MUX-based addition suffers from scaling of data and OR-based addition from inaccuracy. This work proposes a novel technique for MAC operation on unary bit-streamsthat allows exact, non-scaled addition of multiplication results. By introducing a relative delay between the products, we control correlation between bit-streams and eliminate OR-based addition error. We evaluate the accuracy of the proposed technique compared to the state-of-the-art MAC designs. After quantization, the proposed technique demonstrates at least 37% and up to 100% decrease of the mean absolute error for uniformly distributed random input values, compared to traditional OR-based MAC designs. Further, we demonstrate that the proposed technique is practical and evaluate area, power and energy of three possible implementations.
Stochastic computing (SC) is a re-emerging computing paradigm providing low-cost and noise-tolerant designs for a wide range of arithmetic operations. SC circuits operate on uniform bit-streams with the value determined by the probability of observing 1’s in the bit-stream. The accuracy of SC operations highly depends on the correlation between input bit-streams. While some operations such as minimum and maximum value functions require highly correlated inputs, some other such as multiplication operation need uncorrelated or independent inputs for accurate computation. Developing low-cost and accurate correlation manipulation circuits is an important research in SC as these circuits can manage correlation between bit-streams without expensive bit-stream regeneration. This work proposes a novel in-stream correlator and decorrelator circuit that manages 1) correlation between stochastic bit-streams, and 2) distribution of 1’s in the output bit-streams. Compared to state-of-the-art solutions, our designs achieve lower hardware cost and higher accuracy. The output bit-streams enjoy a low-discrepancy distribution of bits which leads to higher quality of results. The effectiveness of the proposed circuits is shown with two case studies: SC design of sorting and median filtering
Brain-inspired Hyper-dimensional(HD) computing is a novel and efficient computing paradigm. However, highly parallel architectures such as Processing-in-Memory(PIM) are bottle-necked by reduction operations required such as accumulation. To reduce this bottle-neck of HD computing in PIM, we present Stochastic-HD that combines the simplicity of operations in Stochastic Computing (SC) with the complex task solving capabilities of the latest HD computing algorithms. Stochastic-HD leverages deterministic SC, which enables all of HD operations to be done as highly parallel bitwise operations and removes all reduction operations, thus improving the throughput of PIM. To this end, we propose an in-memory hardware design for Stochastic-HD that exploits its high level of parallelism and robustness to approximation. Our hardware uses in-memory bitwise operations along with associative memory-like operations to enable a fast and energy-efficient implementation. With Stochastic-HD, we were able to reach a comparable accuracy with the Baseline-HD. Furthermore, by proposing an integrated Stochastic-HD retraining approach Stochastic-HD is able to reduce the accuracy loss to just 0.3%. We additionally accelerate the retraining process in our hardware design to create an end-to-end accelerator for Stochastic-HD. Finally, we also add support for HD Clustering to Stochastic-HD, which is the first to map the HD Clustering operations to themore »
Stochastic computing (SC) reduces the complexity of computation by representing numbers with long streams of independent bits. However, increasing performance in SC comes with either an increase in area or a loss in accuracy. Processing in memory (PIM) computes data in-place while having high memory density and supporting bit-parallel operations with low energy consumption. In this article, we propose COSMO, an architecture for co mputing with s tochastic numbers in me mo ry, which enables SC in memory. The proposed architecture is general and can be used for a wide range of applications. It is a highly dense and parallel architecture that supports most SC encodings and operations in memory. It maximizes the performance and energy efficiency of SC by introducing several innovations: (i) in-memory parallel stochastic number generation, (ii) efficient implication-based logic in memory, (iii) novel memory bit line segmenting, (iv) a new memory-compatible SC addition operation, and (v) enabling flexible block allocation. To show the generality and efficiency of our stochastic architecture, we implement image processing, deep neural networks (DNNs), and hyperdimensional (HD) computing on the proposed hardware. Our evaluations show that running DNN inference on COSMO is 141× faster and 80× more energy efficient as compared tomore »