skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on August 10, 2026

Title: Low-Complexity NTT and INTT Structures via Twiddle Shifting
Polynomial modular multiplication is an important operation used in post-quantum cryptography and homomorphic encryption, which are based on ring learning with errors (RLWE) problems. For long polynomial lengths, this operation can be efficiently computed using number theoretic transform (NTT) and inverse NTT (INTT). In particular, negative wrapped convolution (NWC) has been proposed to compute this operation where zero padding is eliminated. Low-complexity structures for NTT (LCNTT) and INTT (LC-INTT) have been derived in prior work by using a divide-and-conquer approach. This paper presents an alternate derivation of the LC-NTT and LC-INTT structures from traditional NTT and INTT structures. Specifically, we show that using twiddle factor pushing (pulling) from left to right (right to left), we can derive the prior LC-NTT (LC-INTT) structures. We present systematic algorithms for twiddle factor pushing and pulling to derive the equivalent architectures. The alternate approach may provide opportunities for optimizing hardware implementations of polynomial modular multiplication.  more » « less
Award ID(s):
2243053
PAR ID:
10649628
Author(s) / Creator(s):
 ;  
Publisher / Repository:
IEEE Midwest Symposium on Circuits and Systems
Date Published:
ISSN:
1558-3899
Page Range / eLocation ID:
444 to 448
Subject(s) / Keyword(s):
Number theoretic transform (NTT) Polynomial multiplication. Twiddle shifting Post-quantum cryptography (PQC) Homomorphic encryption (HE)
Format(s):
Medium: X
Location:
Lansing, MI
Sponsoring Org:
National Science Foundation
More Like this
  1. CRYSTAL-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process. This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints. Specifically, matrix-vector multiplication and number theoretic transform (NTT)-based polynomial multiplication are critical operations and bottle-necks that require optimization. To address this challenge, we propose an algorithm and hardware co-design approach to systematically optimize matrix-vector multiplication and NTT-based polynomial multiplication by employing a novel sub-structure sharing technique in order to reduce computational complexity, i.e., the number of modular multiplications and modular additions/subtractions consumed. The sub-structure sharing approach is inspired by prior fast parallel approaches based on polyphase decomposition. The proposed efficient feed-forward architecture achieves high speed, low latency, and full utilization of all hardware components, which can significantly enhance the overall efficiency of the Kyber scheme. The FPGA implementation results show that our proposed design, using the fast two-parallel structure, leads to an approximate reduction of 90% in execution time (μs) , along with a 66× improvement in throughput performance. 
    more » « less
  2. High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). Parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes three novel contributions. First, cascaded parallel NTT and iNTT architectures are proposed such that any buffer requirement for permuting the product of the NTTs before it is input to the iNTT is eliminated. This is achieved by using different folding sets for the NTTs and iNTT. Second, a novel approach to expand the set of feasible special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Third, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate such as in the cloud, as these feed-forward architectures can be pipelined at arbitrary levels. Pipelining and latency tradeoffs are also investigated. Compared to a prior design, the proposed architecture reduces latency by a factor of 49.2, and the area-time products (ATP) for the lookup table and DSP, ATP(LUT) and ATP(DSP), respectively, by 89.2% and 92.5%. Specifically, we show that for n =4096 and a 180-bit coefficient, the proposed 2-parallel architecture requires 6.3 Watts of power while operating at 240 MHz, with 6 moduli, each of length 30 bits, using Xilinx Virtex Ultrascale+ FPGA. 
    more » « less
  3. null (Ed.)
    The Number Theoretic Transform (NTT) enables faster polynomial multiplication and is becoming a fundamental component of next-generation cryptographic systems. NTT hardware designs have two prevalent problems related to design-time flexibility. First, algorithms have different arithmetic structures causing the hardware designs to be manually tuned for each setting. Second, applications have diverse throughput/area needs but the hardware have been designed for a fixed, pre-defined number of processing elements. This paper proposes a parametric NTT hardware generator that takes arithmetic configurations and the number of processing elements as inputs to produce an efficient hardware with the desired parameters and throughput. We illustrate the employment of the proposed design in two applications with different needs: A homomorphically encrypted deep neural network inference (CryptoNets) and a post-quantum digital signature scheme (qTESLA). We propose the first NTT hardware acceleration for both applications on FPGAs. Compared to prior software and high-level synthesis solutions, the results show that our hardware can accelerate NTT up to 28× and 48×, respectively. Therefore, our work paves the way for high-level, automated, and modular design of next-generation cryptographic hardware solutions. 
    more » « less
  4. Polynomial multiplication over the quotient ring is a critical operation in Ring Learning with Errors (Ring-LWE) based cryptosystems that are used for post-quantum cryptography and homomorphic encryption. This operation can be efficiently implemented using number-theoretic transform (NTT)-based architectures. Among these, pipelined parallel NTTbased polynomial multipliers are attractive for cloud computing as these are well suited for high throughput and low latency applications. For a given polynomial length, a pipelined parallel NTT-based multiplier can be designed with varying degrees of parallelism, resulting in different tradeoffs. Higher parallelism reduces latency but increases area and power consumption,and vice versa. In this paper, we develop a predictive model based on synthesized results for pipelined parallel NTT-based polynomial multipliers and analyze design tradeoffs in terms of area, power, energy, area-time product, and area-energy product across parallelism levels up to 128. We predict that, for very long polynomials, area and power differences between designs with varying levels of parallelism become negligible. In contrast, areatime product and energy per polynomial multiplication decrease with increased parallelism. Our findings suggest that, given area and power constraints, the highest feasible level of parallelism optimizes latency, area-time product, and energy per polynomial multiplication. 
    more » « less
  5. Bitslicing is a software implementation technique that treats an N-bit processor datapath as N parallel single-bit datapaths. Bitslicing is particularly useful to implement data-parallel algorithms, algorithms that apply the same operation sequence to every element of a vector. Indeed, a bit-wise processor instruction applies the same logical operation to every single-bit slice. A second benefit of bitsliced execution is that the natural spatial redundancy of bitsliced software can support countermeasures against fault attacks. A k-redundant program on an N-bit processor then runs as N/k parallel redundant slices. In this contribution, we combine these two benefits of bitslicing to implement a fault countermeasure for the number-theoretic transform (NTT). The NTT eiciently implements a polynomial multiplication. The internal symmetry of the NTT algorithm lends itself to a data-parallel implementation, and hence it is a good candidate for the redundantly bitsliced implementation. We implement a redundantly bitsliced NTT on an advanced 667MHz ARM Cortex-A9 processor, and study the fault coverage for the protected NTT under optimized electromagnetic fault injection (EMFI). Our work brings two major contributions. First, we show for the irst time how to develop a redundantly bitsliced version of the NTT. We integrate the protected NTT into a full Dilithium signature sequence. Second, we demonstrate an EMFI analysis on a prototype implementation of the Dilithium signature sequence on ARM Cortex-M9. We perform a detailed EM fault-injection parameter search to optimize the location, intensity and timing of injected EM pulses. We demonstrate that, under optimized fault injection parameters, about 10% of the injected faults become potentially exploitable. However, the redundantly bitsliced NTT design is able to catch the majority of these potentially exploitable faults, even when the remainder of the Dilithium algorithm as well as the control low is left unprotected. To our knowledge, this is the irst demonstration of a bitslice-redundant design of the NTT that offers distributed fault detection throughout the execution of the algorithm. 
    more » « less