The recent research in post-quantum cryptography (PQC) field has gradually switched to efficient implementation of PQC algorithms on hardware platforms. As polynomial multiplication is typically one of the critical operations within lattice-based PQC, its hardware acceleration has drawn significant attention from the research community recently. We propose a high-speed processing strategy to construct a new High-performance Polynomial Multiplication Accelerator (HPMA) for key encapsulation mechanism (KEM) Saber. Firstly, we have given a detailed mathematical derivation to obtain a low-latency processing algorithm for Saber polynomial multiplication. Then, we have innovatively used the derived the proposed algorithm to construct a new structure HPMA for FPGA implementation. Lastly, we have demonstrated the superior performance of the proposed HPMA-Saber by comparing with state-of-the-art works. The proposed design strategy is highly efficient and the obtained results can be useful for the PQC research community.
more »
« less
LOCS: LOw-Latency and ConStant-Timing Implementation of Fixed-Weight Sampler for HQC
Post-quantum cryptography (PQC) has drawn significant attention from various communities recently and one of the recent advances is the hardware acceleration of PQC algorithms. While Hamming Quasi-Cyclic (HQC) is one of the recently announced National Institute of Standards and Technology (NIST) fourth-round PQC standardization candidates, very few related hardware implementation works have been reported, particularly lacking solid works on important components such as the sampler. As a fixed-weight sparse vector sampler with constant-time operation is critical to the hardware HQC accelerator, in this paper, we present a novel hardware-implemented LOw-latency and ConStant-timing fixed-weight sampler (LOCS). In total, we have proposed three stages of efforts. First of all, a new algorithm for efficient realization of the fixed-weight sparse vector generation based on Fisher-Yates shuffle algorithm is proposed. Then, we have innovatively designed the algorithm into a new hardware sampler: LOCS. Finally, we have conducted a thorough comparison to showcase the efficiency of the proposed sampler, e.g., the proposed LOCS involves 66.7% less latency time than the state-of-the-art design (n=17,669) while remaining constant-time operation. To the authors' best knowledge, this is the first hardware-implemented pure constant-time (no failure probability) fixed-weight sampler for HQC.
more »
« less
- Award ID(s):
- 2020625
- PAR ID:
- 10464960
- Date Published:
- Journal Name:
- 2023 IEEE International Symposium on Circuits and Systems (ISCAS)
- Page Range / eLocation ID:
- 1 to 5
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Post-quantum cryptography (PQC) has gained sub-stantial attention from various communities recently. Along with the ongoing National Institute of Standards and Technology (NIST) PQC standardization process that targets the general-purpose PQC algorithms, the research community is also looking for efficient lightweight PQC schemes. Among this direction of efforts, Ring-Binary-Learning-with-Errors (RBLWE)-based encryption scheme (RBLWE-ENC) is regarded as a promising lightweight PQC fitting Internet-of-Things (IoT) and edge computing applications. As hardware implementation for PQC algorithms has become one of the major advances in the field, in this paper, we follow this trend to present an efficient implementation of RBLWE-ENC lightweight accelerator on the field-programmable gate array (FPGA) platform. Overall, we have demonstrated three coherent interdependent stages of efforts: (i) we have presented detailed derivation processes to formulate the proposed algorithmic operation; (ii) we have then implemented the proposed algorithm into a desired hardware accelerator; and (iii) we provided thorough complexity analysis and comparison to showcase the superior performance of the proposed accelerator over the state-of-the-art designs, e.g., the proposed accelerator with v=8 has at least 66.67% less area-time complexities than the existing ones (Virtex-7 FPGA). We hope the outcome of this work can facilitate lightweight PQC development.more » « less
-
Along the rapid development of large-scale quantum computers, post-quantum cryptography (PQC) has drawn significant attention from research community recently as it is proven that the existing public-key cryptosystems are vulnerable to the quantum attacks. Meanwhile, the recent trend in the PQC field has gradually switched to the hardware acceleration aspect. Following this trend, this work presents a novel implementation of a High-performance Polynomial Multiplication hardware Accelerator for NTRU (HPMA-NTRU) under different parameter settings, one of the lattice-based PQC algorithm that is currently under the consideration by the National Institute of Standards and Technology (NIST) PQC standardization process. In total, we have carried out three layers of efforts to obtain the proposed work. First of all, we have proposed a new schoolbook algorithm based strategy to derive the desired polynomial multiplication algorithm for NTRU. Then, we have mapped the algorithm to build a high-performance polynomial multiplication hardware accelerator and have extended this hardware accelerator to different parameter settings with proper adjustment. Finally, through a series of complexity analysis and implementation based comparison, we have shown that the proposed hardware accelerator obtains better area-time complexities than the state-of-the-art one. The outcome of this work is important and will impact the ongoing NIST PQC standardization process and can be deployed further to construct efficient NTRU cryptoprocessors.more » « less
-
Due to an emerging threat of quantum computing, one of the major challenges facing the cryptographic community is a timely transition from traditional public-key cryptosystems, such as RSA and Elliptic Curve Cryptography, to a new class of algorithms, collectively referred to as Post-Quantum Cryptography (PQC). Several promising candidates for a new PQC standard can have their software and hardware implementations accelerated using the Number Theoretic Transform (NTT). In this paper, we present an improved hardware architecture for NTT, with the hardware-friendly modular reduction, and demonstrate that this architecture can be efficiently implemented in hardware using High-Level Synthesis (HLS). The novel feature of the proposed architecture is an original memory write-back scheme, which assists in preparing coefficients for performing later NTT stages, saving memory storage used for precomputed constants. Our design is the most efficient for the case when log2N is even. The latency of our proposed architecture is approximately equal to (N log2(N) +3N)/4 clock cycles. As a proof of concept, we implemented the NTT operation for several parameter sets used in the PQC algorithms NewHope, FALCON, qTESLA, and CRYSTALS-DILITHIUM.more » « less
-
The recently announced National Institute of Standards and Technology (NIST) Post-quantum cryptography (PQC) third-round standardization process has released its candidates to be standardized and FALCON is one of them. On the other hand, however, very few hardware implementation works for FALCON have been released due to its very complicated computation procedure and intensive complexity. With this background, in this paper, we propose an efficient hardware structure to implement residue numeral system (RNS) decomposition within NTRUSolve (a key arithmetic component for key generation of FALCON). In total, we have proposed three stages of coherent interdependent efforts to finish the proposed work. First, we have identified the necessary algorithmic operation related to RNS decomposition. Then, we have innovatively designed a hardware structure to realize these algorithms. Finally, field-programmable gate array (FPGA)-based implementation has been carried out to verify the superior performance of the proposed hardware structure. For instance, the proposed hardware design involves at least 3.91x faster operational time than the software implementation. To the authors’ best knowledge, this is the first paper about the hardware acceleration of RNS decomposition for FALCON, and we hope the outcome of this work will facilitate the research in this area.more » « less