NSF PAR Search | NSF Public Access Repository

C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs

https://doi.org/10.1145/3174243.3174253

Wang, Shuo; Li, Zhe; Ding, Caiwen; Yuan, Bo; Qiu, Qinru; Wang, Yanzhi; Liang, Yun (January 2018, ACM/SIGDA Intl. Symp. on Field-Programmable Gate Arrays (FPGA))

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework

Wang, Yanzhi; Ding, Caiwen; Yuan, Geng; Liao, Siyu; Li, Zhe; Ma, Xiaolong; Yuan, Bo; Qian, Xuehai; Tang, Jian; Qiu, Qinru; et al (January 2018, The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18))

Implementation of a Near-Optimal Complex Root Clustering Algorithm

https://doi.org/10.1007/978-3-319-96418-8_28

Imbach, Rémi; Pan, Victor; Yap, Chee (January 2018, International Congress on Mathematical Software (ICMS) 2018)

Energy-efficient, high-performance, highly-compressed deep neural network design using block-circulant matrices

https://doi.org/10.1109/ICCAD.2017.8203813

Liao, Siyu; Li, Zhe; Lin, Xue; Qiu, Qinru; Wang, Yanzhi; Yuan, Bo (November 2017, 2017 IEEE/ACM International Conference on Computer-Aided Design (ICCAD))

Deep neural networks (DNNs) have emerged as the most powerful machine learning technique in numerous artificial intelligence applications. However, the large sizes of DNNs make them both computation and memory intensive, thereby limiting the hardware performance of dedicated DNN accelerators. In this paper, we propose a holistic framework for energy-efficient, high-performance, highly-compressed DNN hardware design. First, we propose block-circulant matrix-based DNN training and inference schemes, which theoretically guarantee Big-O complexity reduction in both computational cost (from O(n^2) to O(n log n)) and storage requirement (from O(n^2) to O(n)) of DNNs. Second, we optimize the hardware architecture, especially the key fast Fourier transform (FFT) module, to improve the overall performance in terms of energy efficiency, computation performance and resource cost. Third, we propose a design flow to perform hardware-software co-optimization with the purpose of achieving a good balance between test accuracy and hardware performance of DNNs. Based on the proposed design flow, two block-circulant matrix-based DNNs on two different datasets are implemented and evaluated on an FPGA. The fixed-point quantization and the proposed block-circulant matrix-based inference scheme enable the network to achieve up to 3.5 TOPS computation performance and 3.69 TOPS/W energy efficiency, while memory is reduced by 108X to 116X with negligible accuracy degradation.
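
As a rough sketch of where these complexity bounds come from, assuming the standard identity that a circulant matrix-vector product equals a circular convolution: the symbols below (the block size k, the m x n weight matrix W, the defining vectors w_ij) are notation introduced here for illustration and are not taken from the abstract.

```latex
% W is an m x n weight matrix split into (m/k) x (n/k) circulant blocks W_{ij};
% each block is fully determined by its first column, a length-k vector w_{ij}.
a_i = \sum_{j=1}^{n/k} W_{ij}\, x_j
    = \sum_{j=1}^{n/k} \mathrm{IFFT}\bigl(\mathrm{FFT}(w_{ij}) \circ \mathrm{FFT}(x_j)\bigr)

% Storage:     mn parameters      ->  mn/k parameters
% Computation: O(mn)              ->  O((mn/k) \log k)
% Single square block (m = n = k): O(n^2) -> O(n) storage, O(n^2) -> O(n \log n) computation
```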
CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices

https://doi.org/10.1145/3123939.3124552

Ding, Caiwen; Yuan, Geng; Ma, Xiaolong; Zhang, Yipeng; Tang, Jian; Qiu, Qinru; Lin, Xue; Yuan, Bo; Liao, Siyu; Wang, Yanzhi; et al (January 2017, Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture)

Large-scale deep neural networks (DNNs) are both compute and memory intensive. As the size of DNNs continues to grow, it is critical to improve the energy efficiency and performance while maintaining accuracy. For DNNs, the model size is an important factor affecting performance, scalability and energy efficiency. Weight pruning achieves good compression ratios but suffers from three drawbacks: 1) the irregular network structure after pruning, which affects performance and throughput; 2) the increased training complexity; and 3) the lack of a rigorous guarantee of compression ratio and inference accuracy. To overcome these limitations, this paper proposes CirCNN, a principled approach to represent weights and process neural networks using block-circulant matrices. CirCNN utilizes Fast Fourier Transform (FFT)-based fast multiplication, simultaneously reducing the computational complexity (both in inference and training) from O(n^2) to O(n log n) and the storage complexity from O(n^2) to O(n), with negligible accuracy loss. Compared to other approaches, CirCNN is distinct due to its mathematical rigor: the DNNs based on CirCNN can converge to the same "effectiveness" as DNNs without compression. We propose the CirCNN architecture, a universal DNN inference engine that can be implemented in various hardware/software platforms with configurable network architecture (e.g., layer type, size, scales, etc.). In the CirCNN architecture: 1) Due to the recursive property, FFT can be used as the key computing kernel, which ensures universal and small-footprint implementations. 2) The compressed but regular network structure avoids the pitfalls of network pruning and facilitates high performance and throughput with a highly pipelined and parallel design. To demonstrate the performance and energy efficiency, we test CirCNN on FPGA, ASIC and embedded processors. Our results show that the CirCNN architecture achieves very high energy efficiency and performance with a small hardware footprint. Based on the FPGA implementation and ASIC synthesis results, CirCNN achieves 6X to 102X energy efficiency improvements compared with the best state-of-the-art results.
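
The FFT-based multiplication described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the first-column convention for defining each circulant block, and the power-of-two block size are all assumptions made for this example. The recursive radix-2 FFT is chosen only to echo the "recursive property" the abstract mentions.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(values):
    """Inverse FFT: run the transform with the conjugate twiddles, then scale by 1/n."""
    n = len(values)
    return [v / n for v in fft(values, inverse=True)]


def circulant_matvec_direct(w, x):
    """O(k^2) reference: C x with C[i][j] = w[(i - j) % k] (w is the first column)."""
    k = len(w)
    return [sum(w[(i - j) % k] * x[j] for j in range(k)) for i in range(k)]


def circulant_matvec_fft(w, x):
    """O(k log k) version: C x = IFFT(FFT(w) * FFT(x)), a circular convolution."""
    fw, fx = fft(w), fft(x)
    return [v.real for v in ifft([a * b for a, b in zip(fw, fx)])]


def block_circulant_matvec(blocks, x, k):
    """Multiply a block-circulant matrix by x.

    blocks[i][j] is the length-k defining vector of circulant block (i, j),
    so only p * q * k values are stored instead of p * q * k * k.
    """
    p, q = len(blocks), len(blocks[0])
    # Transform each input segment once and reuse it for every block row.
    fx = [fft(x[j * k:(j + 1) * k]) for j in range(q)]
    y = []
    for i in range(p):
        acc = [0j] * k
        for j in range(q):
            fw = fft(blocks[i][j])
            acc = [a + b * c for a, b, c in zip(acc, fw, fx[j])]
        # Accumulate in the frequency domain, then apply one inverse FFT per block row.
        y.extend(v.real for v in ifft(acc))
    return y


if __name__ == "__main__":
    blocks = [[[1.0, 2.0], [0.0, 1.0]],
              [[3.0, -1.0], [2.0, 2.0]]]
    print(block_circulant_matvec(blocks, [1.0, 2.0, 3.0, 4.0], k=2))
```

In this sketch the sum over blocks is accumulated in the frequency domain, so each block row needs only one inverse FFT; whether the paper's hardware follows the same ordering is not stated in the abstract.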