skip to main content


Title: A High-Level Synthesis Approach to the Software/Hardware Codesign of NTT-Based Post-Quantum Cryptography Algorithms
Due to an emerging threat of quantum computing, one of the major challenges facing the cryptographic community is a timely transition from traditional public-key cryptosystems, such as RSA and Elliptic Curve Cryptography, to a new class of algorithms, collectively referred to as Post-Quantum Cryptography (PQC). Several promising candidates for a new PQC standard can have their software and hardware implementations accelerated using the Number Theoretic Transform (NTT). In this paper, we present an improved hardware architecture for NTT, with the hardware-friendly modular reduction, and demonstrate that this architecture can be efficiently implemented in hardware using High-Level Synthesis (HLS). The novel feature of the proposed architecture is an original memory write-back scheme, which assists in preparing coefficients for performing later NTT stages, saving memory storage used for precomputed constants. Our design is the most efficient for the case when log2N is even. The latency of our proposed architecture is approximately equal to (N log2(N) +3N)/4 clock cycles. As a proof of concept, we implemented the NTT operation for several parameter sets used in the PQC algorithms NewHope, FALCON, qTESLA, and CRYSTALS-DILITHIUM.  more » « less
Award ID(s):
1801512
PAR ID:
10174988
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2019 International Conference on Field-Programmable Technology (ICFPT), Tianjin, China
Page Range / eLocation ID:
371 to 374
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Post-quantum cryptography (PQC) has drawn significant attention from various communities recently and one of the recent advances is the hardware acceleration of PQC algorithms. While Hamming Quasi-Cyclic (HQC) is one of the recently announced National Institute of Standards and Technology (NIST) fourth-round PQC standardization candidates, very few related hardware implementation works have been reported, particularly lacking solid works on important components such as the sampler. As a fixed-weight sparse vector sampler with constant-time operation is critical to the hardware HQC accelerator, in this paper, we present a novel hardware-implemented LOw-latency and ConStant-timing fixed-weight sampler (LOCS). In total, we have proposed three stages of efforts. First of all, a new algorithm for efficient realization of the fixed-weight sparse vector generation based on Fisher-Yates shuffle algorithm is proposed. Then, we have innovatively designed the algorithm into a new hardware sampler: LOCS. Finally, we have conducted a thorough comparison to showcase the efficiency of the proposed sampler, e.g., the proposed LOCS involves 66.7% less latency time than the state-of-the-art design (n=17,669) while remaining constant-time operation. To the authors' best knowledge, this is the first hardware-implemented pure constant-time (no failure probability) fixed-weight sampler for HQC. 
    more » « less
  2. Compared to traditional hardware development methodologies, High-Level Synthesis (HLS) offers a faster time-to-market and lower design cost at the expense of implementation efficiency. Although Software/Hardware Codesign has been used in many areas, its usability for benchmarking of candidates in cryptographic competitions has been largely unexplored. This paper provides a comparison of the HLS- and RTL-based design methodologies when applied to the hardware design of the Number Theoretic Transform (NTT) – a core arithmetic function of lattice-based Post-Quantum Cryptography (PQC). As a next step, we apply Software/Hardware Codesign approach to the implementation of three PQC schemes based on NTT. Then, we integrate our HLS implementation into the Xilinx SDSoC environment. We demonstrate that an overhead of SDSoC compared to traditional Bare Metal approach is acceptable. This paper also shows that an HLS implementation obtained by modeling a block diagram is typically much better than an implementation obtained by using design space exploration. We conclude that the HLS/SDSoC and RTL/Bare Metal approaches generate comparable results. 
    more » « less
  3. CRYSTAL-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process. This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints. Specifically, matrix-vector multiplication and number theoretic transform (NTT)-based polynomial multiplication are critical operations and bottle-necks that require optimization. To address this challenge, we propose an algorithm and hardware co-design approach to systematically optimize matrix-vector multiplication and NTT-based polynomial multiplication by employing a novel sub-structure sharing technique in order to reduce computational complexity, i.e., the number of modular multiplications and modular additions/subtractions consumed. The sub-structure sharing approach is inspired by prior fast parallel approaches based on polyphase decomposition. The proposed efficient feed-forward architecture achieves high speed, low latency, and full utilization of all hardware components, which can significantly enhance the overall efficiency of the Kyber scheme. The FPGA implementation results show that our proposed design, using the fast two-parallel structure, leads to an approximate reduction of 90% in execution time (μs) , along with a 66× improvement in throughput performance. 
    more » « less
  4. This paper demonstrates an architecture for accelerating the polynomial multiplication using number theoretic transform (NTT). Kyber is one of the finalists in the third round of the NIST post-quantum cryptography standardization process. Simultaneously, the performance of NTT execution is its main challenge, requiring large memory and complex memory access pattern. In this paper, an efficient NTT architecture is presented to improve the respective computation time. We propose several optimization strategies for efficiency improvement targeting different performance requirements for various applications. Our NTT architecture, including four butterfly cores, occupies only 798 LUTs and 715 FFs on a small Artix-7 FPGA, showing more than 44% improvement compared to the best previous work. We also implement a coprocessor architecture for Kyber KEM benefiting from our high-speed NTT core to accomplish three phases of the key exchange in 9, 12, and 19 μs, respectively, operating at 200 MHz. 
    more » « less
  5. The ML-KEM post-quantum cryptography (PQC) scheme requires matrix-vector polynomial multiplication and polynomial arithmetic operations in the number theoretic transform (NTT) domain. Prior optimization approach KyberMat leverages the transposed-form fast filtering structure and sub-structure sharing technique, reducing the computational complexity. In this paper, a novel and area-efficient design builds upon the KyberMat framework, using the hierarchical interleaved folding algorithm to reduce hardware resources. Two design strategies are utilized in the proposed design. The proposed design initially scales down the NTT/inverse NTT processors via folding transformation, while utilizing a fixed number of DSPs and LUTs across different security levels of ML-KEM. This work further introduces a recursive summing unit along with the interleaving method to ensure continuous data processing and ultimately improve hardware utilization and throughput. The experimental result shows that our proposed area-efficient design achieves an average reduction of 71.55% in DSPs and 63.89% in LUTs among three different security levels, compared to the KyberMat framework. 
    more » « less