skip to main content


Title: Evaluating the Potential for Hardware Acceleration of Four NTRU-Based Key Encapsulation Mechanisms Using Software/Hardware Codesign
The speed of NTRU-based Key Encapsulation Mechanisms (KEMs) in software, especially on embedded software platforms, is limited by the long execution time of its primary operation, polynomial multiplication. In this paper, we investigate the potential for speeding up the implementations of four NTRU-based KEMs, using software/hardware codesign, when targeting Xilinx Zynq UltraScale+ multiprocessor system-on-chip (MPSoC). All investigated algorithms compete in Round 1 of the NIST PQC standardization process. They include: ntru-kem from the NTRUEncrypt submission, Streamlined NTRU Prime and NTRU LPRime KEMs of the NTRU Prime candidate, and NTRU- HRSS-KEM from the submission of the same name. The most time-consuming operation, polynomial multiplication, is implemented in the Programmable Logic (PL) of Zynq UltraScale+ (i.e., in hardware) using constant-time hardware architectures most appropriate for a given algorithm. The remaining operations are executed in the Processing System (PS) of Zynq, based on the ARM Cortex-A53 Application Processing Unit. The speed-ups of our software/hardware codesigns vs. purely software implementations, running on the same Zynq platform, are determined experimentally, and analyzed in the paper. Our experiments reveal substantial differences among the investigated candidates in terms of their potential to benefit from hardware accelerators, with the special focus on accelerators aimed at offloading to hardware only the most time-consuming operation of a given cryptosystems. The demonstrated speed-ups vs. functionally equivalent purely software implementations vary between 4.0 and 42.7 for encapsulation, and between 6.4 and 149.7 for decapsulation.  more » « less
Award ID(s):
1801512
NSF-PAR ID:
10108523
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Post-Quantum Cryptography, 10th International Conference, PQCrypto 2019 Chongqing, China, May 8–10, 2019 Revised Selected Papers, Lecture Notes in Computer Science
Volume:
11505
ISSN:
0302-9743
Page Range / eLocation ID:
23 - 43
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. When quantum computers become scalable and reliable, they are likely to break all public-key cryptography standards, such as RSA and Elliptic Curve Cryptography. The projected threat of quantum computers has led the U.S. National Institute of Standards and Technology (NIST) to an effort aimed at replacing existing public-key cryptography standards with new quantum-resistant alternatives. In December 2017, 69 candidates were accepted by NIST to Round 1 of the NIST Post-Quantum Cryptography (PQC) standardization process. NTRUEncrypt is one of the most well-known PQC algorithms that has withstood cryptanalysis. The speed of NTRUEncrypt in software, especially on embedded software platforms, is limited by the long execution time of its primary operation, polynomial multiplication. In this paper, we investigate speeding up NTRUEncrypt using software/hardware codesign on a Xilinx Zynq UltraScale+ multiprocessor system-on-chip (MPSoC). Polynomial multiplication is implemented in the Programmable Logic (PL) of Zynq using two approaches: traditional Register-Transfer Level (RTL) and High-Level Synthesis (HLS). The remaining operations of NTRUEncrypt are executed in software on the Processing System (PS) of Zynq, using the bare-metal mode. The speed-up of our software/hardware codesigns vs. purely software implementations is determined experimentally and analyzed in the paper. The results are reported for the RTL-based and HLS-based hardware accelerators, and compared to the best available software implementation, included in the NIST submission package. The speed-ups for encryption were 2.4 and 3.9, depending on the selected parameter set. For decryption, the corresponding speed-ups were 4.0 and 6.8. In addition, for the polynomial multiplication operation itself, the speed up was in excess of 75. Our code for the NTRUEncrypt polynomial multiplier accelerator is being made open-source for further evaluation on multiple software/hardware platforms. 
    more » « less
  2. When quantum computers become scalable and reliable, they are likely to break all public-key cryptography standards, such as RSA and Elliptic Curve Cryptography. The projected threat of quantum computers has led the U.S. National Institute of Standards and Technology (NIST) to an effort aimed at replacing existing public-key cryptography standards with new quantum-resistant alternatives. In December 2017, 69 candidates were accepted by NIST to Round 1 of the NIST Post-Quantum Cryptography (PQC) standardization process. NTRUEncrypt is one of the most well-known PQC algorithms that has withstood cryptanalysis. The speed of NTRUEncrypt in software, especially on embedded software platforms, is limited by the long execution time of its primary operation, polynomial multiplication. In this paper, we investigate speeding up NTRUEncrypt using software/hardware codesign on a Xilinx Zynq UltraScale+ multiprocessor system-on-chip (MPSoC). Polynomial multiplication is implemented in the Programmable Logic (PL) of Zynq using two approaches: traditional Register-Transfer Level (RTL) and High-Level Synthesis (HLS). The remaining operations of NTRUEncrypt are executed in software on the Processing System (PS) of Zynq, using the bare-metal mode. The speed-up of our software/hardware codesigns vs. purely software implementations is determined experimentally and analyzed in the paper. The results are reported for the RTL-based and HLS-based hardware accelerators, and compared to the best available software implementation, included in the NIST submission package. The speed-ups for encryption were 2.4 and 3.9, depending on the selected parameter set. For decryption, the corresponding speed-ups were 4.0 and 6.8. In addition, for the polynomial multiplication operation itself, the speed up was in excess of 75. Our code for the NTRUEncrypt polynomial multiplier accelerator is being made open-source for further evaluation on multiple software/hardware platforms. 
    more » « less
  3. Cheon, Jung Hee ; Tillich, Jean-Pierre (Ed.)
    This paper focuses on high-speed NEON-based constant-time implementations of multiplication of large polynomials in the NIST PQC KEM Finalists: NTRU, Saber, and CRYSTALS-Kyber. We use the Number Theoretic Transform (NTT)-based multiplication in Kyber, the Toom-Cook algorithm in NTRU, and both types of multiplication in Saber. Following these algorithms and using Apple M1, we improve the decapsulation performance of the NTRU, Kyber, and Saber-based KEMs at the security level 3 by the factors of 8.4, 3.0, and 1.6, respectively, compared to the reference implementations. On Cortex-A72, we achieve the speed-ups by factors varying between 5.7 and 7.5x for the Forward/Inverse NTT in Kyber, and between 6.0 and 7.8x for Toom-Cook in NTRU, over the best existing implementations in pure C. For Saber, when using NEON instructions on Cortex-A72, the implementation based on NTT outperforms the implementation based on the Toom-Cook algorithm by 14% in the case of the MatrixVectorMul function but is slower by 21% in the case of the InnerProduct function. Taking into account that in Saber, keys are not available in the NTT domain, the overall performance of the NTT-based version is very close to the performance of the Toom-Cook version. The differences for the entire decapsulation at the three major security levels (1, 3, and 5) are −4, −2, and +2%, respectively. Our benchmarking results demonstrate that our NEON-based implementations run on an Apple M1 ARM processor are comparable to those obtained using the best AVX2-based implementations run on an AMD EPYC 7742 processor. Our work is the first NEON-based ARMv8 implementation of each of the three NIST PQC KEM finalists. 
    more » « less
  4. It has been predicted that within the next tenfifteen years, quantum computers will have computational power sufficient to break current public-key cryptography schemes. When that happens, all traditional methods of dealing with the growing computational capabilities of potential attackers, such as increasing key sizes, will be futile. The only viable solution is to develop new standards based on algorithms that are resistant to quantum computer attacks and capable of being executed on traditional computing platforms, such as microprocessors and FPGAs. Leading candidates for new standards include lattice-based post-quantum cryptography (PQC) algorithms. In this paper, we present the results of implementing and benchmarking three lattice-based key encapsulation mechanisms (KEMs) that have progressed to Round 2 of the NIST standardization process. Our implementations are based on a software/hardware codesign approach, which is particularly applicable to the current stage of the NIST PQC standardization process, where the large number and high complexity of the candidates make traditional hardware benchmarking extremely challenging. We propose and justify the choice of a suitable system-on-chip platform and design methodology. The obtained results indicate the potential for very substantial speed-ups vs. purely software implementations, reaching 28x for encapsulation and 20x for decapsulation. 
    more » « less
  5. The rapid advancement in quantum technology has initiated a new round of post-quantum cryptography (PQC) related exploration. The key encapsulation mechanism (KEM) Saber is an important module lattice-based PQC, which has been selected as one of the PQC finalists in the ongoing National Institute of Standards and Technology (NIST) standardization process. On the other hand, however, efficient hardware implementation of KEM Saber has not been well covered in the literature. In this paper, therefore, we propose a novel cyclic-row oriented processing (CROP) strategy for efficient implementation of the key arithmetic operation of KEM Saber, i.e., the polynomial multiplication. The proposed work consists of three layers of interdependent efforts: (i) first of all, we have formulated the main operation of KEM Saber into desired mathematical forms to be further developed into CROP based algorithms, i.e., the basic version and the advanced higher-speed version; (ii) then, we have followed the proposed CROP strategy to innovatively transfer the derived two algorithms into desired polynomial multiplication structures with the help of a series of algorithm-architecture co-implementation techniques; (iii) finally, detailed complexity analysis and implementation results have shown that the proposed polynomial multiplication structures have better area-time complexities than the state-of-the-art solutions. Specifically, the field-programmable gate array (FPGA) implementation results show that the proposed design, e.g., the basic version has at least less 11.2% area-delay product (ADP) than the best competing one (Cyclone V device). The proposed high-performance polynomial multipliers offer not only efficient operation for output results delivery but also possess low-complexity feature brought by CROP strategy. The outcome of this work is expected to provide useful references for further development and standardization process of KEM Saber. 
    more » « less