Data privacy has become a significant concern due to the rapid development of cloud services, Internet of Things, edge devices, and other applications. Homomorphic encryption (HE) addresses the issue by enabling computations to be performed without the decryption of the encrypted message. However, the bottleneck of designing homomorphic encryption hardware is the complexity of computation. To tackle the long integer arithmetic, the residue number system based on the Chinese remainder theorem is used. In this paper, we propose a novel modular reduction architecture that computes the mapping of residual polynomials in parallel with high speed and low latency. We implement our proposed design in the Xilinx Ultrascale+ FPGA board (VCU118). When the input sizes are 360-bit (1440-bit), the frequency is 180MHz (168MHz) with 4 pipelining stages. Also, the area delay product (ADP) of DSP blocks of our design is reduced by 23 and 31 percent, respectively, for 360 and 1440 bits, compared to prior work.
more »
« less
This content will become publicly available on August 1, 2026
Efficient Generalized Integer Division and Modular Reduction Architectures for Homomorphic Encryption
Homomorphic encryption allows computations to be performed directly on ciphertexts, serving as a key enabler for privacy-preserving cloud computing. The computations over ciphertexts involve large integer modular multiplications. Besides, the overall complexity of ciphertext multiplication can be further reduced by utilizing quotients of integer divisions. An efficient quotient computation architecture was developed in our previous work for the case that the divisor has three nonzero bits. This paper first generalizes our prior design to accommodate divisors with more nonzero bits. To further reduce the latency, a new non-iterative quotient computation method is developed by simplifying the Barrett reduction. Rigorous mathematical proofs are provided for both proposed schemes, and they are also adopted for modular reduction. Besides, this paper developed efficient hardware implementation architectures for our proposed algorithms, and optimizations are carried out to reduce the complexity further. Compared to the best prior integer division (modular reduction) schemes, our proposed designs can achieve around 60% (57%) smaller area and 70% (71%) shorter latency when the modulus has 64 bits.
more »
« less
- Award ID(s):
- 2428806
- PAR ID:
- 10635669
- Publisher / Repository:
- Springer
- Date Published:
- Journal Name:
- Journal of Signal Processing Systems
- Volume:
- 97
- Issue:
- 2-4
- ISSN:
- 1939-8018
- Page Range / eLocation ID:
- 157 to 171
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
CRYSTAL-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process. This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints. Specifically, matrix-vector multiplication and number theoretic transform (NTT)-based polynomial multiplication are critical operations and bottle-necks that require optimization. To address this challenge, we propose an algorithm and hardware co-design approach to systematically optimize matrix-vector multiplication and NTT-based polynomial multiplication by employing a novel sub-structure sharing technique in order to reduce computational complexity, i.e., the number of modular multiplications and modular additions/subtractions consumed. The sub-structure sharing approach is inspired by prior fast parallel approaches based on polyphase decomposition. The proposed efficient feed-forward architecture achieves high speed, low latency, and full utilization of all hardware components, which can significantly enhance the overall efficiency of the Kyber scheme. The FPGA implementation results show that our proposed design, using the fast two-parallel structure, leads to an approximate reduction of 90% in execution time (μs) , along with a 66× improvement in throughput performance.more » « less
-
Generalized integrated interleaved (GII) codes can nest BCH sub-codewords to form stronger BCH codewords. They are among the best candidates for error correction in the new storage class memories (SCMs). However, SCMs require short codeword length and low redundancy. In this case, the nested key equation solver (KES), which is a key step in GII decoding, has a small number of iterations. The initialization and/or scalar pre-computation in previous nested KES designs have large area and may take even longer time than the iterations themselves. This paper proposes an efficient nested KES design for short GII-BCH codes. The polynomial updating is decomposed into two steps to reduce the critical path without requiring scalar pre-computation. Besides, the KES is reformulated to reduce the number of clock cycles without incurring any area overhead. For an example code over GF(2^{10}) that protects 2560 bits with 10% redundancy, the proposed design achieves at least 25% area reduction and 37% reduction on the area-time product averaged over the nested decoding rounds compared to prior efforts.more » « less
-
Fully Homomorphic Encryption (FHE) presents a paradigm-shifting framework for performing computations on encrypted data, offering revolutionary implications for privacy-preserving technologies. This paper introduces a novel hardware implementation of scheme switching between two leading FHE schemes targeting different computational needs, i.e., arithmetic HE scheme CKKS, and Boolean HE scheme FHEW. The proposed architecture facilitates dynamic switching between the schemes with improved throughput and latency compared to the software baseline. The proposed architecture computation modules support scheme switching operations involving coefficient conversion, modular switching, and key switching. We also optimize the hardware designs for the pre-processing and post-processing blocks, involving key generation, encryption, and decryption. The effectiveness of our proposed design is verified on the Xilinx U280 Datacenter Acceleration FPGA. We demonstrate that the proposed scheme switching accelerator yields a 365× performance improvement over the software counterpart.more » « less
-
High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). Parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes three novel contributions. First, cascaded parallel NTT and iNTT architectures are proposed such that any buffer requirement for permuting the product of the NTTs before it is input to the iNTT is eliminated. This is achieved by using different folding sets for the NTTs and iNTT. Second, a novel approach to expand the set of feasible special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Third, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate such as in the cloud, as these feed-forward architectures can be pipelined at arbitrary levels. Pipelining and latency tradeoffs are also investigated. Compared to a prior design, the proposed architecture reduces latency by a factor of 49.2, and the area-time products (ATP) for the lookup table and DSP, ATP(LUT) and ATP(DSP), respectively, by 89.2% and 92.5%. Specifically, we show that for n =4096 and a 180-bit coefficient, the proposed 2-parallel architecture requires 6.3 Watts of power while operating at 240 MHz, with 6 moduli, each of length 30 bits, using Xilinx Virtex Ultrascale+ FPGA.more » « less
An official website of the United States government
