Search for: All records

Award ID contains: 2243053


  1. Quantum computers are a revolutionary class of computational platforms with applications in combinatorial and global optimization, machine learning, and other domains involving computationally hard problems. While these machines typically operate on qubits—quantum information elements that can occupy superpositions of the basis |0⟩ and |1⟩ states—recent advances have demonstrated the practical implementation of higher-dimensional quantum systems (qudits) across various hardware platforms. In these hardware realizations, the higher-order states are less stable, and thus remain coherent for a shorter duration than the basis |0⟩ and |1⟩ states. Moreover, formal methods for designing efficient encoder circuits for these systems remain underexplored. This limitation motivates the development of efficient circuit techniques for qudit systems (d-level quantum systems). Previous works have typically established generating gate sets for higher-dimensional codes by generalizing the methods used for qubits. In this work, we introduce a systematic framework for optimizing encoder circuits for prime-dimension stabilizer codes. This framework is based on novel generating gate sets whose elements map directly to efficient Clifford gate sequences. We demonstrate the effectiveness of this method on key codes, achieving a 13%–44% reduction in encoder circuit gate count for the qutrit (d = 3) [[9,5,3]]_3, [[5,1,3]]_3, and [[7,1,3]]_3 codes, and a 9%–21% reduction for the ququint (d = 5) [[10,6,3]]_5 code when compared to prior work. We also achieved circuit depth reductions of up to 42%.
  2. Fast Fourier Transform (FFT) algorithms play a fundamental role in modern digital signal processing (DSP). A composite length FFT decomposes a longer-length FFT into multiple shorter-length FFTs, enabling a bottom-up construction approach. If the input signal is complex-valued (real-valued), the DFT computation is referred to as CFFT (RFFT). If the input is real-valued, redundancies and symmetries in the RFFT can be exploited to design more efficient algorithms. In this work, we extend the concept of two-stage composite FFTs to multistage composite FFTs. In a multistage FFT, the decomposition and parenthesization of sub-blocks significantly affect the structure’s efficiency. Although the set of sub-blocks remains the same, different parenthesizations lead to varying numbers of twiddle factor multiplications in the interconnect stages. We adopt radix-2 FFTs as the fundamental building blocks and apply dynamic programming to determine the optimal decomposition and parenthesization for arbitrary power-of-two FFT lengths. Our work presents optimal strategies for both complex-valued and real-valued FFTs. Compared to the radix-2 FFT, our multistage FFT reduces the number of complex multiplications by approximately 20% for complex-valued inputs and 30% for real-valued inputs. These results demonstrate that the multistage FFT approach not only simplifies the design process but also improves computational efficiency by significantly reducing the number of complex multiplications. Furthermore, the composite multistage structure enables an efficient bottom-up design methodology, requiring only two types of sub-blocks for CFFT and four types for RFFT implementations. Designers can follow the proposed straightforward design strategies, which guarantee the construction of optimal CFFT and RFFT structures. 
  3. Fast time-domain algorithms have been developed in signal processing applications to reduce the multiplication complexity. For example, fast convolution structures using Cook-Toom and Winograd algorithms are well understood. Short-length fast convolutions can be iterated to obtain fast convolution structures for long lengths. In this paper, we show that well-known fast convolution structures form the basis for the design of fast algorithms in four other problem domains: fast parallel filters, fast polynomial modular multiplication, fast pointwise multiplication in the DFT domain, and fast pointwise multiplication in the NTT domain. Fast polynomial modular multiplication and fast pointwise multiplication problems are important for cryptosystem applications such as post-quantum cryptography and homomorphic encryption. By establishing the equivalence of these problems, we show that a fast structure from one domain can be used to design a fast structure for another domain. This understanding is important because there are many well-known solutions for fast convolution that can be used in other signal processing and cryptosystem applications.
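As a minimal sketch of the kind of short fast convolution these equivalences build on, the 2-point linear convolution below trades four multiplications for three using the Karatsuba scheme, a special case of the Cook-Toom construction; the function names are ours, for illustration only.

```python
def conv2_schoolbook(a, b):
    """Linear convolution of two length-2 sequences: 4 multiplications."""
    a0, a1 = a
    b0, b1 = b
    return [a0 * b0, a0 * b1 + a1 * b0, a1 * b1]

def conv2_karatsuba(a, b):
    """Same result with 3 multiplications (2-point Cook-Toom / Karatsuba)."""
    a0, a1 = a
    b0, b1 = b
    m0 = a0 * b0
    m1 = a1 * b1
    m2 = (a0 + a1) * (b0 + b1)     # the single extra multiplication
    return [m0, m2 - m0 - m1, m1]  # middle coefficient recovered by subtraction
```

As the abstract notes, short kernels of this kind can be iterated to build fast convolution structures for longer lengths.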
  4. In our prior work, LayerPipe, we introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve a desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed-gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure: inner layers require fewer delays, while outer layers require longer delays. When pipelining is applied at every layer, each delay depends only on the number of remaining downstream stages; when layers are pipelined in groups, all layers in a group share the same delay assignment. These insights not only explain previously observed scheduling patterns but also expose an often-overlooked challenge: pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication-computation tradeoffs.
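The delayed-gradient idea can be illustrated on a toy problem: each "layer" below is a single scalar weight updated with a gradient that is several steps stale, with outer layers assigned longer delays than inner ones, echoing the assignment described in the abstract. The quadratic loss, delay values, and learning rate are our illustrative choices, not the paper's.

```python
from collections import deque

def delayed_gradient_descent(targets, delays, lr=0.05, steps=500):
    """Minimize sum_l (w_l - t_l)^2, applying each layer's gradient
    only after that layer's assigned delay (stale-gradient update)."""
    weights = [0.0] * len(targets)
    buffers = [deque() for _ in targets]               # per-layer gradient queues
    for _ in range(steps):
        for l, (t, d) in enumerate(zip(targets, delays)):
            buffers[l].append(2.0 * (weights[l] - t))  # fresh gradient, queued
            if len(buffers[l]) > d:                    # gradient is now d steps old
                weights[l] -= lr * buffers[l].popleft()
    return weights

# Outer layers get longer delays than inner layers in this toy assignment.
w = delayed_gradient_descent(targets=[1.0, -2.0, 3.0], delays=[3, 2, 1])
```

For a small enough learning rate the stale updates still converge to the same fixed point as ordinary gradient descent, which is the property that makes pipelined (delayed) training viable.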
  5. Post-Quantum Cryptography (PQC) and Homomorphic Encryption (HE) are emerging security primitives that strengthen data protection against adversaries equipped with quantum computing capabilities. Although PQC and HE rely on similar underlying arithmetic operations, their hardware implementations are typically developed independently due to differences in key parameters such as polynomial length and modulus bit-width. This work presents a unified lattice-based polynomial modular accelerator that efficiently supports both PQC and HE primitives, bridging these two domains toward future secure computing architectures. The proposed design introduces highly reconfigurable modular computation units that enable low-overhead runtime configuration across the parameter ranges commonly used in PQC and HE schemes. This unified architecture eliminates the need for separate domain-specific accelerators by reusing shared computation structures and workload patterns across both cryptographic schemes. 
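One common building block for runtime-reconfigurable modular arithmetic is a reduction whose precomputed constant is simply recomputed when the modulus changes. The Barrett reduction sketch below is our generic software illustration of that principle, not the paper's hardware design; it works across modulus bit-widths by deriving a single constant from the configured modulus.

```python
def barrett_setup(q):
    """Precompute the Barrett constant for modulus q (redone at
    runtime whenever the modulus is reconfigured)."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu

def barrett_reduce(x, q, k, mu):
    """Reduce 0 <= x < q*q modulo q using one multiply by mu, one
    shift, and at most two conditional subtractions."""
    t = (x * mu) >> (2 * k)   # approximate quotient, off by at most 2
    r = x - t * q
    while r >= q:             # at most two corrections for Barrett
        r -= q
    return r
```

Swapping between PQC-sized and HE-sized moduli then amounts to rerunning the setup step, which is the kind of low-overhead runtime configuration the abstract describes.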
  6. Polynomial modular multiplication is an important operation used in post-quantum cryptography and homomorphic encryption, which are based on ring learning with errors (RLWE) problems. For long polynomial lengths, this operation can be efficiently computed using the number theoretic transform (NTT) and inverse NTT (INTT). In particular, negative wrapped convolution (NWC) has been proposed to compute this operation while eliminating zero padding. Low-complexity structures for the NTT (LC-NTT) and INTT (LC-INTT) have been derived in prior work using a divide-and-conquer approach. This paper presents an alternate derivation of the LC-NTT and LC-INTT structures from traditional NTT and INTT structures. Specifically, we show that using twiddle factor pushing (pulling) from left to right (right to left), we can derive the prior LC-NTT (LC-INTT) structures. We present systematic algorithms for twiddle factor pushing and pulling to derive the equivalent architectures. The alternate approach may provide opportunities for optimizing hardware implementations of polynomial modular multiplication.
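The negative wrapped convolution that eliminates zero padding can be demonstrated end to end in a few lines. The sketch below uses a deliberately tiny parameter set (n = 4, q = 17, psi = 2, a primitive 2n-th root of unity mod q) and a naive O(n^2) transform in place of the low-complexity structures discussed above; the parameters and function names are illustrative, not from the paper.

```python
n, q = 4, 17           # toy parameters: arithmetic mod (x^4 + 1) over Z_17
psi = 2                # primitive 8th root of unity mod 17 (psi^4 = -1 mod 17)
omega = psi * psi % q  # primitive 4th root of unity for the length-4 NTT

def ntt(a, w):
    """Naive O(n^2) number theoretic transform with root w."""
    return [sum(a[i] * pow(w, i * k, q) for i in range(n)) % q
            for k in range(n)]

def nwc(a, b):
    """Negacyclic product a*b mod (x^n + 1, q) via the weighted NTT."""
    aw = [a[i] * pow(psi, i, q) % q for i in range(n)]    # pre-weight by psi^i
    bw = [b[i] * pow(psi, i, q) % q for i in range(n)]
    cw = [x * y % q for x, y in zip(ntt(aw, omega), ntt(bw, omega))]
    inv_n = pow(n, -1, q)                                 # Python 3.8+ modular inverse
    c = [v * inv_n % q for v in ntt(cw, pow(omega, -1, q))]  # inverse NTT
    return [c[i] * pow(psi, -i, q) % q for i in range(n)]    # post-weight by psi^-i

def nwc_schoolbook(a, b):
    """Direct reduction of the product modulo x^n + 1, for checking."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            s = 1 if i + j < n else -1   # wrap-around picks up a minus sign
            c[k] = (c[k] + s * a[i] * b[j]) % q
    return c
```

Note that no length-2n transforms appear anywhere: the psi-weighting replaces the zero padding that a plain cyclic-convolution approach would require.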
  7. Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications.
  8. Convolution is a fundamental operation with diverse applications in signal processing, computer vision, and machine learning. This article reviews three distinct convolutions: linear convolution (also referred to as aperiodic convolution), positive-wrapped convolution (PWC) (also known as circular convolution), and negative-wrapped convolution (NWC). Additionally, we propose an alternative approach to computing linear convolution without zero padding by leveraging the PWC and NWC. We compare two fast Fourier transform (FFT)-based methods to compute linear convolution: the traditional zero-padded PWC method and a new method based on the PWC and NWC. Through a detailed analysis of the flowgraphs (FGs), we demonstrate the equivalence of these methods while highlighting their unique characteristics. We show that computing the NWC using the weighted PWC method is equivalent to a part of the linear convolution computation with zero padding. Furthermore, it is possible to extract the PWC and NWC from structures to compute linear convolution with zero padding, where the last butterfly stage can be eliminated. This article aims to establish a clear connection among PWC, NWC, and linear convolution, illustrating new perspectives on computing different convolutions. 
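The decomposition described above can be checked numerically: for two length-n sequences, the k-th outputs of the PWC and NWC equal c_k + c_{k+n} and c_k - c_{k+n} of the length-(2n-1) linear convolution c, so the linear result is recovered by a half-sum and half-difference, with no zero padding. The sketch below uses direct (schoolbook) PWC and NWC on integer inputs rather than the FFT-based structures analyzed in the article.

```python
def pwc(a, b):
    """Positive-wrapped (circular) convolution of equal-length sequences."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]

def nwc(a, b):
    """Negative-wrapped convolution: wrapped terms enter with a minus sign."""
    n = len(a)
    return [sum((1 if i <= k else -1) * a[i] * b[(k - i) % n]
                for i in range(n)) for k in range(n)]

def linear_from_pwc_nwc(a, b):
    """Recover the linear convolution c (length 2n-1) from PWC and NWC:
    c_k = (P_k + N_k)/2 and c_{k+n} = (P_k - N_k)/2."""
    n = len(a)
    p, m = pwc(a, b), nwc(a, b)
    low = [(pk + mk) // 2 for pk, mk in zip(p, m)]
    high = [(pk - mk) // 2 for pk, mk in zip(p, m)]
    return low + high[:n - 1]      # the last entry of 'high' is always 0
```

Since P_k + N_k and P_k - N_k are always even, the halving is exact on integer inputs.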
  9. Polynomial multiplication over the quotient ring is a critical operation in Ring Learning with Errors (Ring-LWE) based cryptosystems that are used for post-quantum cryptography and homomorphic encryption. This operation can be efficiently implemented using number-theoretic transform (NTT)-based architectures. Among these, pipelined parallel NTT-based polynomial multipliers are attractive for cloud computing, as they are well suited for high-throughput and low-latency applications. For a given polynomial length, a pipelined parallel NTT-based multiplier can be designed with varying degrees of parallelism, resulting in different tradeoffs: higher parallelism reduces latency but increases area and power consumption, and vice versa. In this paper, we develop a predictive model based on synthesized results for pipelined parallel NTT-based polynomial multipliers and analyze design tradeoffs in terms of area, power, energy, area-time product, and area-energy product across parallelism levels up to 128. We predict that, for very long polynomials, area and power differences between designs with varying levels of parallelism become negligible. In contrast, area-time product and energy per polynomial multiplication decrease with increased parallelism. Our findings suggest that, given area and power constraints, the highest feasible level of parallelism optimizes latency, area-time product, and energy per polynomial multiplication.
  10. Homomorphic encryption enables computations on ciphertext that preserve data privacy. However, its practical deployment has been hindered by significant computational overhead compared to plaintext computation. In response to this challenge, we present HERMES, a novel hardware acceleration system designed to exploit the computation flow of the CKKS homomorphic encryption bootstrapping process. Our architecture makes several contributions. First, we analyze the properties of the CKKS computation data flow and propose a new scheduling strategy that partitions the computation into general-purpose and special-purpose modular computation modules, allowing smaller resource consumption and flexible scheduling. The computation modules are also reconfigurable, reducing memory access overhead during intermediate computation. Finally, we optimize the CKKS computation dataflow to improve its regularity and reduce control overhead.