Lookup tables are widely used in hardware to store arrays of constant values. For instance, complex mathematical functions in hardware are typically implemented through table-based methods such as plain tabulation, piecewise linear approximation, and bipartite or multipartite table methods, all of which rely primarily on lookup tables to evaluate the functions. Storing extensive tables of constant values, however, can lead to excessive hardware costs on resource-constrained edge devices such as FPGAs. In this paper, we propose CompressedLUT, a lossless compression scheme for arrays of arbitrary data implemented as lookup tables. Our method exploits decomposition, self-similarities, higher-bit compression, and multilevel compression techniques to maximize table size savings with no accuracy loss. CompressedLUT uses addition and arithmetic right shift alongside several small lookup tables to retrieve the original data during the decoding phase. Using such cost-effective elements keeps area low and throughput high. For evaluation, we compressed a number of different lookup tables, either obtained by direct tabulation of 12-bit elementary functions or generated by other table-based methods for approximating functions at higher resolutions: the multipartite table method at 24-bit, a piecewise polynomial approximation method at 36-bit, and the hls4ml library at 18-bit resolution. We implemented the compressed tables on FPGAs using HLS to show the efficiency of our method in terms of hardware costs compared to previous works. On average, our method achieved 60% table size compression and 2.33 times higher throughput per slice than conventional implementations. In comparison, the previous TwoTable and LDTC works compressed the lookup tables by 33% and 37% on average, resulting in 1.63 and 1.29 times higher throughput than the conventional implementations, respectively. CompressedLUT is available as an open-source tool.
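As a rough picture of the decoding data path the abstract describes, the toy C++ below reconstructs each entry of a small table from a handful of auxiliary tables, one addition, and one arithmetic right shift. All names (`bias`, `idx`, `rsh`, `ust`), bit widths, and table contents are illustrative assumptions, not the tool's generated output:

```cpp
#include <cstdint>
#include <cstdio>

// Toy decoder in the spirit of the abstract: a few small tables, one
// addition, and one arithmetic right shift reconstruct each original entry.
constexpr unsigned W_IN  = 4;                   // input address width (assumed)
constexpr unsigned W_SUB = 2;                   // log2(sub-table length) (assumed)
constexpr unsigned N_SUB = 1u << (W_IN - W_SUB);

// In a real flow these tables would come from the offline compression step.
const int32_t bias[N_SUB] = {0, 3, 8, 12};      // per-sub-table offset
const uint8_t idx [N_SUB] = {0, 0, 1, 0};       // which unique sub-table to read
const uint8_t rsh [N_SUB] = {0, 1, 0, 0};       // arithmetic right-shift amount
const int32_t ust [] = {0, 2, 4, 6,             // unique sub-table 0
                        1, 1, 3, 5};            // unique sub-table 1

int32_t decode(uint32_t x) {
    uint32_t sub    = x >> W_SUB;               // which sub-table
    uint32_t offset = x & ((1u << W_SUB) - 1);  // position inside it
    int32_t  r      = ust[(idx[sub] << W_SUB) | offset];
    return bias[sub] + (r >> rsh[sub]);         // shift, then add the bias back
}

int main() {
    for (uint32_t x = 0; x < (1u << W_IN); ++x)
        std::printf("T[%2u] = %d\n", x, decode(x));
}
```

All of these elements map to cheap FPGA primitives, which is consistent with the low-area, high-throughput claim above.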
CompressedLUT
CompressedLUT on GitHub
Lookup tables are widely used in hardware applications to store arrays of constant values. They can be used directly to evaluate nonlinear functions, or as part of other approximation methods (e.g., piecewise linear approximation and bipartite tables) for computing such functions. CompressedLUT is a tool for lossless compression of lookup tables and generation of their hardware files in Verilog and C++ for RTL and HLS designs.
CompressedLUT has been developed as a part of the following publication. Please refer to it for more information.
Alireza Khataei and Kia Bazargan. 2024. CompressedLUT: An Open Source Tool for Lossless Compression of Lookup Tables for Function Evaluation and Beyond. In Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’24), March 3–5, 2024, Monterey, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3626202.3637575
- Award ID(s):
- 2016390
- PAR ID:
- 10533921
- Publisher / Repository:
- Zenodo
- Date Published:
- Format(s):
- Medium: X
- Right(s):
- Creative Commons Attribution 4.0 International
- Sponsoring Org:
- National Science Foundation
More Like this
Random forests use ensembles of decision trees to boost accuracy for machine learning tasks. However, large ensembles slow down inference on platforms that process each tree in an ensemble individually. We present Bolt, a platform that restructures whole random forests, not just individual trees, to speed up inference. Conceptually, Bolt maps every path in each tree to a lookup table which, if cache were large enough, would allow inference with just one memory access. When the size of the lookup table exceeds cache capacity, Bolt employs a novel combination of lossless compression, parameter selection, and bloom filters to shrink the table while preserving fast inference. We compared inference speed in Bolt to three state-of-the-art platforms: Python Scikit-Learn, Ranger, and Forest Packing. We evaluated these platforms using datasets with vision, natural language processing, and categorical applications. We observed that on ensembles of shallow decision trees Bolt can run 2-14X faster than competing platforms and that Bolt's speedups persist as the number of decision trees in an ensemble increases.
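The path-to-table idea lends itself to a short illustration. The C++ sketch below is not Bolt's actual code; the `FlatTree` layout and field names are assumptions. It encodes the outcome of every split as one bit, concatenates the bits into an index, and answers inference with a single table read, which is also why the table grows exponentially with the number of splits:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative sketch of mapping tree paths to a lookup table: one bit per
// split, all bits concatenated into an address, one memory access per query.
struct FlatTree {
    std::vector<float>   thresholds;  // one threshold per internal split
    std::vector<int>     features;    // feature index tested by each split
    std::vector<uint8_t> leaf_value;  // precomputed prediction for every
                                      // possible bit pattern (2^splits entries)

    uint8_t predict(const std::vector<float>& x) const {
        uint32_t index = 0;
        for (size_t s = 0; s < thresholds.size(); ++s) {
            // One bit per split: did this sample go right?
            index = (index << 1) | (x[features[s]] > thresholds[s] ? 1u : 0u);
        }
        return leaf_value[index];     // a single read resolves the whole path
    }
};
```

When `leaf_value` no longer fits in cache, the abstract's combination of compression, parameter selection, and bloom filters takes over (not shown here).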
Non-uniform message quantization techniques such as reconstruction-computation-quantization (RCQ) improve error-correction performance and decrease hardware complexity of low-density parity-check (LDPC) decoders that use a flooding schedule. Layered MinSum RCQ (L-msRCQ) enables message quantization to be utilized for layered decoders and irregular LDPC codes. We investigate field-programmable gate array (FPGA) implementations of L-msRCQ decoders. Three design methods for message quantization are presented, which we name the Lookup, Broadcast, and Dribble methods. The decoding performance and hardware complexity of these schemes are compared to a layered offset MinSum (OMS) decoder. Simulation results on a (16384, 8192) protograph-based raptor-like (PBRL) LDPC code show that a 4-bit L-msRCQ decoder using the Broadcast method can achieve a 0.03 dB improvement in error-correction performance while using 12% fewer registers than the OMS decoder. A Broadcast-based 3-bit L-msRCQ decoder uses 15% fewer lookup tables, 18% fewer registers, and 13% fewer routed nets than the OMS decoder, but results in a 0.09 dB loss in performance.
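To make the reconstruction-computation-quantization loop concrete, here is an illustrative C++ fragment. The 4-bit label width matches the abstract, but the reconstruction values and decision thresholds are invented placeholders; a real decoder derives them from the code and channel conditions:

```cpp
#include <cstdint>

// Illustrative RCQ-style message handling: a small table reconstructs each
// 4-bit label into a higher-precision value, the min-sum computation runs at
// that precision, and a threshold quantizer maps the result back to 4 bits.
constexpr int B = 4;                          // message label width
constexpr float recon[1 << B] = {             // non-uniform reconstruction
    -7.5f, -5.0f, -3.4f, -2.2f, -1.4f, -0.8f, -0.4f, -0.1f,
     0.1f,  0.4f,  0.8f,  1.4f,  2.2f,  3.4f,  5.0f,  7.5f};
constexpr float thresh[(1 << B) - 1] = {      // quantizer decision levels
    -6.0f, -4.0f, -2.8f, -1.8f, -1.1f, -0.6f, -0.25f, 0.0f,
     0.25f, 0.6f,  1.1f,  1.8f,  2.8f,  4.0f,  6.0f};

float reconstruct(uint8_t label) { return recon[label]; }

uint8_t quantize(float v) {
    uint8_t q = 0;
    while (q < (1 << B) - 1 && v > thresh[q]) ++q;  // linear scan for clarity
    return q;
}
```

Because messages travel between processing nodes as 4-bit labels rather than full-precision values, the routing and storage costs shrink, which is where the register and LUT savings cited above come from.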
We introduce the Differentiable Weightless Neural Network (DWN), a model based on interconnected lookup tables. Training of DWNs is enabled by a novel Extended Finite Difference technique for approximate differentiation of binary values. We propose Learnable Mapping, Learnable Reduction, and Spectral Regularization to further improve the accuracy and efficiency of these models. We evaluate DWNs in three edge computing contexts: (1) an FPGA-based hardware accelerator, where they demonstrate superior latency, throughput, energy efficiency, and model area compared to state-of-the-art solutions, (2) a low-power microcontroller, where they achieve preferable accuracy to XGBoost while subject to stringent memory constraints, and (3) ultra-low-cost chips, where they consistently outperform small models in both accuracy and projected hardware area. DWNs also compare favorably against leading approaches for tabular datasets, with higher average rank. Overall, our work positions DWNs as a pioneering solution for edge-compatible high-throughput neural networks.
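Inference in a lookup-table network can be pictured with the following C++ sketch. The `LutLayer` structure, its wiring, and its table contents are assumptions made for illustration; training those tables requires the paper's Extended Finite Difference technique, which is not shown:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Illustrative forward pass for one layer of lookup-table "neurons": each
// unit reads a few binary inputs, concatenates them into an address, and
// outputs the stored bit. No multiplications or weights are involved.
struct LutLayer {
    std::vector<std::vector<int>>     wiring;  // input bits feeding each unit
    std::vector<std::vector<uint8_t>> lut;     // 2^(fan-in) entries per unit

    std::vector<uint8_t> forward(const std::vector<uint8_t>& bits) const {
        std::vector<uint8_t> out(lut.size());
        for (size_t u = 0; u < lut.size(); ++u) {
            uint32_t addr = 0;
            for (int b : wiring[u])
                addr = (addr << 1) | (bits[b] & 1u);  // build the address
            out[u] = lut[u][addr];                    // one table read per unit
        }
        return out;
    }
};
```

Since each unit is literally a small LUT, such layers map almost one-to-one onto FPGA fabric, which is consistent with the latency and area results the abstract reports.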
Gene Expression Matrices (GEMs) are a fundamental data type in the genomics domain. As the size and scope of genomics experiments increase, researchers are struggling to process large GEMs through downstream workflows with currently accepted practices. In this paper, we propose a methodology to reduce the size of GEMs using multiple approaches. Our method partitions data into discrete fields based on data type and employs state-of-the-art lossless and lossy compression algorithms to reduce the input data size. This work explores a variety of lossless and lossy compression methods to determine which methods work the best for each component of a GEM. We evaluate the accuracy of the compressed GEMs by running them through the Knowledge Independent Network Construction (KINC) workflow and comparing the quality of the resulting gene co-expression network with a lossless control to verify result fidelity. Results show that utilizing a combination of lossy and lossless compression results in compression ratios up to 9.77× on a Yeast GEM, while still preserving the biological integrity of the data. Usage of the compression methodology on the Cancer Cell Line Encyclopedia (CCLE) GEM resulted in compression ratios up to 9.26×. By using this methodology, researchers in the genomics domain may be able to process previously inaccessible GEMs while realizing significant reduction in computational costs.
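The field-partitioning idea can be sketched in C++ as follows. The `SplitGem` layout and the fixed-decimal rounding are stand-ins chosen for illustration: label fields stay byte-exact so they can be coded losslessly, while the numeric block is coarsened (a simple proxy for the paper's lossy stage) so a general-purpose lossless coder such as gzip compresses each field better:

```cpp
#include <cmath>
#include <string>
#include <vector>

// Illustrative field partitioning for a gene expression matrix: separate
// the text-like fields from the numeric block so each can be compressed
// with the method best suited to its data type.
struct SplitGem {
    std::vector<std::string> gene_ids;    // lossless field: row labels
    std::vector<std::string> sample_ids;  // lossless field: column labels
    std::vector<float>       values;      // lossy field: expression levels
};

// Round expression values to a fixed number of decimals. Fewer distinct
// values means more repetition, which downstream lossless coders exploit.
void quantize_values(SplitGem& g, int decimals) {
    const float scale = std::pow(10.0f, decimals);
    for (float& v : g.values)
        v = std::round(v * scale) / scale;
}
```

The acceptable amount of coarsening is an empirical question, which is why the paper validates the compressed matrices through the KINC workflow before trusting the ratios.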