skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Thursday, January 16 until 2:00 AM ET on Friday, January 17 due to maintenance. We apologize for the inconvenience.


Title: Approximate Hybrid Binary-Unary Computing with Applications in BERT Language Model and Image Processing
We propose a novel method for approximate hardware implementation of univariate math functions with significantly fewer hardware resources compared to previous approaches. Examples of such functions include exp(x) and the activation function GELU(x), both used in transformer networks, gamma(x), which is used in image processing, and other functions such as tanh(x), cosh(x), sq(x), and sqrt(x). The method builds on previous works on hybrid binary-unary computing. The novelty in our approach is that we break a function into a number of sub-functions such that implementing each sub-function becomes cheap, and converting the output of the sub-functions to binary becomes almost trivial. Our method also uses self-similarity in functions to further reduce the cost. We compare our method to the conventional binary, previous stochastic computing, and hybrid binary-unary methods on several functions at 8-, 12-, and 16-bit resolutions. While preserving high accuracy, our method outperforms previous works in terms of hardware cost, e.g., tolerating less than 0.01 mean absolute error, our method reduces the (area x latency) cost on average by 5, 7, and 2 orders of magnitude, compared to the conventional binary, stochastic computing, and hybrid binary-unary methods, respectively. Ultimately, we demonstrate the potential benefits of our method for natural language processing and image processing applications. We deploy our method to implement major blocks in an encoding layer of BERT language model, and also the Roberts Cross edge detection algorithm. Both include non-linear functions.  more » « less
Award ID(s):
2016390
PAR ID:
10478286
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays
ISBN:
9781450394178
Page Range / eLocation ID:
165 to 175
Subject(s) / Keyword(s):
hardware accelerators, approximate computing, unary computing, stochastic computing, BERT language model, image processing
Format(s):
Medium: X
Location:
Monterey CA USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Unary computing is a relatively new method for implementing non-linear functions using few hardware resources compared to binary computing. In its original form, unary computing provides no trade-off between accuracy and hardware cost. In this work, we propose a novel self-similarity-based method to optimize the previous hybrid binary-unary method and provide it with the trade-off between accuracy and hardware cost by introducing controlled levels of approximation. Given a target maximum error, our method breaks a function into sub-functions and tries to find the minimum set of unique sub-functions that can derive all the other ones through trivial bit-wise transformations. We compare our method to previous works such as HBU (hybrid binary-unary) and FloPoCo-PPA (piece-wise polynomial approximation) on a number of non-linear functions including Log, Exp, Sigmoid, GELU, Sin, and Sqr, which are used in neural networks and image processing applications. Without any loss of accuracy, our method can improve the area-delay-product hardware cost of HBU on average by 7% at 8-bit, 20% at 10-bit, and 35% at 12-bit resolutions. Given the approximation of the least significant bit, our method reduces the hardware cost of HBU on average by 21% at 8-bit, 49% at 10-bit, and 60% at 12-bit resolutions, and using the same error budget as given to FloPoCo-PPA, it reduces the hardware cost of FloPoCo-PPA on average by 79% at 8-bit, 58% at 10-bit, and 9% at 12-bit resolutions. We finally show the benefits of our method by implementing a 10-bit homomorphic filter, which is used in image processing applications. Our method can implement the filter with no quality loss at lower hardware cost than what the previous approximate and exact methods can achieve. 
    more » « less
  2. Unary computing is a relatively new method for implementing arbitrary nonlinear functions that uses unpacked thermometer number encoding, enabling much lower hardware costs. In its original form, unary computing provides no trade-off between accuracy and hardware cost. In this work, we propose a novel self-similarity-based method to optimize the previous hybrid binary-unary work and provide it with the trade-off between accuracy and hardware cost by introducing controlled levels of approximation. Looking for self-similarity between different parts of a function allows us to implement a very small subset of core unique subfunctions and derive the rest of the subfunctions from this core using simple linear transformations. We compare our method to previous works such as FloPoCo-LUT (lookup table), HBU (hybrid binary-unary) and FloPoCo-PPA (piecewise polynomial approximation) on several 8–12-bit nonlinear functions including Log, Exp, Sigmoid, GELU, Sin, and Sqr, which are frequently used in neural networks and image processing applications. The area × delay hardware cost of our method is on average 32%–60% better than previous methods in both exact and approximate implementations. We also extend our method to multivariate nonlinear functions and show on average 78%–92% improvement over previous work. 
    more » « less
  3. Constant coefficient multipliers are widely used in digital signal processing and machine learning architectures. Researchers have proposed HBU-CCM (hybrid binary-unary constant coefficient multiplier), which is an approximate method that outperforms conventional binary and FloPoCo-KCM (table-based real multiplier) methods in terms of hardware cost at the expense of accuracy due to aliasing issues. SimBU (self-similarity-based hybrid binary-unary) is another method that was recently proposed to implement general nonlinear functions using self-similarities leading to few hardware resources. In this work, we use a simplified version of the SimBU algorithm to address the aliasing issues of HBU-CCM and improve accuracy. We also implement a convolution kernel for a Gaussian blurring filter to evaluate our method and compare it to previous works. Our method outperforms conventional binary and FloPoCo-KCM methods in terms of hardware cost with desired accuracy and with no aliasing error as opposed to HBU-CCM. 
    more » « less
  4. Sorting is a fundamental function in many applications from data processing to database systems. For high performance, sorting-hardware based sorting designs are implemented by conventional binary or emerging stochastic computing (SC) approaches. Binary designs are fast and energy-efficient but costly to implement. SC-based designs, on the other hand, are area and power-efficient but slow and energy-hungry. So, the previous studies of the hardware-based sorting further faced scalability issues. In this work, we propose a novel scalable low-cost design for implementing sorting networks. We borrow the concept of SC for the area- and power efficiency but use weighted stochastic bit-streams to address the high latency and energy consumption issue of SC designs. A new lock and swap (LAS) unit is proposed to sort weighted bit-streams. The LAS-based sorting network can determine the result of comparing different input values early and then map the inputs to the corresponding outputs based on shorter weighted bit-streams. Experimental results show that the proposed design approach achieves much better hardware scalability than prior work. Especially, as increasing the number of inputs, the proposed scheme can reduce the energy consumption by about 3.8% - 93% compared to prior binary and SC-based designs. 
    more » « less
  5. Sorting data is needed in many application domains. Traditionally, the data is read from memory and sent to a general-purpose processor or application-specific hardware for sorting. The sorted data is then written back to the memory. Reading/writing data from/to memory and transferring data between memory and processing unit incur significant latency and energy overhead. In this work, we develop the first architectures for in-memory sorting of data to the best of our knowledge. We propose two architectures. The first architecture is applicable to the conventional format of representing data, i.e., weighted binary radix. The second architecture is proposed for developing unary processing systems, where data is encoded as uniform unary bit-streams. As we present, each of the two architectures has different advantages and disadvantages, making one or the other more suitable for a specific application. However, the common property of both is a significant reduction in the processing time compared to prior sorting designs. Our evaluations show on average 37 × and 138 × energy reduction for binary and unary designs, respectively, compared to conventional CMOS off-memory sorting systems in a 45nm technology. We designed a 3 × 3 and a 5 × 5 Median filter using the proposed sorting solutions, which we used for processing 64 × 64 pixel images. Our results show a reduction of 14 × and 634 × in energy and latency, respectively, with the proposed binary, and 5.6 × and 152 × 10 3 in energy and latency with the proposed unary approach compared to those of the off-memory binary and unary designs for the 3 × 3 Median filtering system. 
    more » « less