Search for: All records

Award ID contains: 2243053


  1. Quantum computers are a revolutionary class of computational platforms with applications in combinatorial and global optimization, machine learning, and other domains involving computationally hard problems. While these machines typically operate on qubits—quantum information elements that can occupy superpositions of the basis |0⟩ and |1⟩ states—recent advances have demonstrated the practical implementation of higher-dimensional quantum systems (qudits) across various hardware platforms. In these hardware realizations, the higher-order states are less stable, and thus remain coherent for a shorter duration than the basis |0⟩ and |1⟩ states. Moreover, formal methods for designing efficient encoder circuits for these systems remain underexplored. This limitation motivates the development of efficient circuit techniques for qudit systems (d-level quantum systems). Previous works have typically established generating gate sets for higher-dimensional codes by generalizing the methods used for qubits. In this work, we introduce a systematic framework for optimizing encoder circuits for prime-dimension stabilizer codes. This framework is based on novel generating gate sets whose elements map directly to efficient Clifford gate sequences. We demonstrate the effectiveness of this method on key codes, achieving a 13%–44% reduction in encoder circuit gate count for the qutrit (d = 3) [[9,5,3]]_3, [[5,1,3]]_3, and [[7,1,3]]_3 codes, and a 9%–21% reduction for the ququint (d = 5) [[10,6,3]]_5 code when compared to prior work. We also achieved circuit depth reductions of up to 42%.
  2. Fast Fourier Transform (FFT) algorithms play a fundamental role in modern digital signal processing (DSP). A composite length FFT decomposes a longer-length FFT into multiple shorter-length FFTs, enabling a bottom-up construction approach. If the input signal is complex-valued (real-valued), the DFT computation is referred to as CFFT (RFFT). If the input is real-valued, redundancies and symmetries in the RFFT can be exploited to design more efficient algorithms. In this work, we extend the concept of two-stage composite FFTs to multistage composite FFTs. In a multistage FFT, the decomposition and parenthesization of sub-blocks significantly affect the structure’s efficiency. Although the set of sub-blocks remains the same, different parenthesizations lead to varying numbers of twiddle factor multiplications in the interconnect stages. We adopt radix-2 FFTs as the fundamental building blocks and apply dynamic programming to determine the optimal decomposition and parenthesization for arbitrary power-of-two FFT lengths. Our work presents optimal strategies for both complex-valued and real-valued FFTs. Compared to the radix-2 FFT, our multistage FFT reduces the number of complex multiplications by approximately 20% for complex-valued inputs and 30% for real-valued inputs. These results demonstrate that the multistage FFT approach not only simplifies the design process but also improves computational efficiency by significantly reducing the number of complex multiplications. Furthermore, the composite multistage structure enables an efficient bottom-up design methodology, requiring only two types of sub-blocks for CFFT and four types for RFFT implementations. Designers can follow the proposed straightforward design strategies, which guarantee the construction of optimal CFFT and RFFT structures. 
  3. Fast time-domain algorithms have been developed in signal processing applications to reduce the multiplication complexity. For example, fast convolution structures using Cook-Toom and Winograd algorithms are well understood. Short-length fast convolutions can be iterated to obtain fast convolution structures for long lengths. In this paper, we show that well-known fast convolution structures form the basis for the design of fast algorithms in four other problem domains: fast parallel filters, fast polynomial modular multiplication, fast pointwise multiplication in the DFT domain, and fast pointwise multiplication in the NTT domain. Fast polynomial modular multiplication and fast pointwise multiplication problems are important for cryptosystem applications such as post-quantum cryptography and homomorphic encryption. By establishing the equivalence of these problems, we show that a fast structure from one domain can be used to design a fast structure for another domain. This understanding is important because there are many well-known solutions for fast convolution that can be used in other signal processing and cryptosystem applications.
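As a minimal sketch of the kind of short fast convolution these equivalences build on, the 2-point linear convolution below trades four multiplications for three using the Karatsuba scheme, a special case of the Cook-Toom construction; the function names are ours, for illustration only.

```python
def conv2_schoolbook(a, b):
    """Linear convolution of two length-2 sequences: 4 multiplications."""
    a0, a1 = a
    b0, b1 = b
    return [a0 * b0, a0 * b1 + a1 * b0, a1 * b1]

def conv2_karatsuba(a, b):
    """Same result with 3 multiplications (2-point Cook-Toom / Karatsuba)."""
    a0, a1 = a
    b0, b1 = b
    m0 = a0 * b0
    m1 = a1 * b1
    m2 = (a0 + a1) * (b0 + b1)     # the single extra multiplication
    return [m0, m2 - m0 - m1, m1]  # middle coefficient recovered by subtraction
```

As the abstract notes, short kernels of this kind can be iterated to build fast convolution structures for longer lengths.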
  4. In our prior work, LayerPipe, we introduced an approach to accelerate training of convolutional, fully connected, and spiking neural networks by overlapping forward and backward computation. However, despite empirical success, a principled understanding of how much gradient delay needs to be introduced at each layer to achieve a desired level of pipelining was not addressed. This paper, LayerPipe2, fills that gap by formally deriving LayerPipe using variable delayed-gradient adaptation and retiming. We identify where delays may be legally inserted and show that the required amount of delay follows directly from the network structure: inner layers require fewer delays, while outer layers require longer delays. When pipelining is applied at every layer, each delay depends only on the number of remaining downstream stages; when layers are pipelined in groups, all layers in a group share the same delay assignment. These insights not only explain previously observed scheduling patterns but also expose an often-overlooked challenge: pipelining implicitly requires storage of historical weights. We overcome this storage bottleneck by developing a pipeline-aware moving average that reconstructs the required past states rather than storing them explicitly. This reduces memory cost without sacrificing the accuracy guarantees that make pipelined learning viable. The result is a principled framework that illustrates how to construct LayerPipe architectures, predicts their delay requirements, and mitigates their storage burden, thereby enabling scalable pipelined training with controlled communication-computation tradeoffs.
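The delayed-gradient idea can be illustrated on a toy problem: each "layer" below is a single scalar weight updated with a gradient that is several steps stale, with outer layers assigned longer delays than inner ones, echoing the assignment described in the abstract. The quadratic loss, delay values, and learning rate are our illustrative choices, not the paper's.

```python
from collections import deque

def delayed_gradient_descent(targets, delays, lr=0.05, steps=500):
    """Minimize sum_l (w_l - t_l)^2, applying each layer's gradient
    only after that layer's assigned delay (stale-gradient update)."""
    weights = [0.0] * len(targets)
    buffers = [deque() for _ in targets]               # per-layer gradient queues
    for _ in range(steps):
        for l, (t, d) in enumerate(zip(targets, delays)):
            buffers[l].append(2.0 * (weights[l] - t))  # fresh gradient, queued
            if len(buffers[l]) > d:                    # gradient is now d steps old
                weights[l] -= lr * buffers[l].popleft()
    return weights

# Outer layers get longer delays than inner layers in this toy assignment.
w = delayed_gradient_descent(targets=[1.0, -2.0, 3.0], delays=[3, 2, 1])
```

For a small enough learning rate the stale updates still converge to the same fixed point as ordinary gradient descent, which is the property that makes pipelined (delayed) training viable.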
  5. Post-Quantum Cryptography (PQC) and Homomorphic Encryption (HE) are emerging security primitives that strengthen data protection against adversaries equipped with quantum computing capabilities. Although PQC and HE rely on similar underlying arithmetic operations, their hardware implementations are typically developed independently due to differences in key parameters such as polynomial length and modulus bit-width. This work presents a unified lattice-based polynomial modular accelerator that efficiently supports both PQC and HE primitives, bridging these two domains toward future secure computing architectures. The proposed design introduces highly reconfigurable modular computation units that enable low-overhead runtime configuration across the parameter ranges commonly used in PQC and HE schemes. This unified architecture eliminates the need for separate domain-specific accelerators by reusing shared computation structures and workload patterns across both cryptographic schemes. 
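One common building block for runtime-reconfigurable modular arithmetic is a reduction whose precomputed constant is simply recomputed when the modulus changes. The Barrett reduction sketch below is our generic software illustration of that principle, not the paper's hardware design; it works across modulus bit-widths by deriving a single constant from the configured modulus.

```python
def barrett_setup(q):
    """Precompute the Barrett constant for modulus q (redone at
    runtime whenever the modulus is reconfigured)."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu

def barrett_reduce(x, q, k, mu):
    """Reduce 0 <= x < q*q modulo q using one multiply by mu, one
    shift, and at most two conditional subtractions."""
    t = (x * mu) >> (2 * k)   # approximate quotient, off by at most 2
    r = x - t * q
    while r >= q:             # at most two corrections for Barrett
        r -= q
    return r
```

Swapping between PQC-sized and HE-sized moduli then amounts to rerunning the setup step, which is the kind of low-overhead runtime configuration the abstract describes.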
  6. Polynomial modular multiplication is an important operation used in post-quantum cryptography and homomorphic encryption, which are based on ring learning with errors (RLWE) problems. For long polynomial lengths, this operation can be efficiently computed using the number theoretic transform (NTT) and inverse NTT (INTT). In particular, negative wrapped convolution (NWC) has been proposed to compute this operation while eliminating zero padding. Low-complexity structures for the NTT (LC-NTT) and INTT (LC-INTT) have been derived in prior work using a divide-and-conquer approach. This paper presents an alternate derivation of the LC-NTT and LC-INTT structures from traditional NTT and INTT structures. Specifically, we show that using twiddle factor pushing (pulling) from left to right (right to left), we can derive the prior LC-NTT (LC-INTT) structures. We present systematic algorithms for twiddle factor pushing and pulling to derive the equivalent architectures. The alternate approach may provide opportunities for optimizing hardware implementations of polynomial modular multiplication.
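The negative wrapped convolution that eliminates zero padding can be demonstrated end to end in a few lines. The sketch below uses a deliberately tiny parameter set (n = 4, q = 17, psi = 2, a primitive 2n-th root of unity mod q) and a naive O(n^2) transform in place of the low-complexity structures discussed above; the parameters and function names are illustrative, not from the paper.

```python
n, q = 4, 17           # toy parameters: arithmetic mod (x^4 + 1) over Z_17
psi = 2                # primitive 8th root of unity mod 17 (psi^4 = -1 mod 17)
omega = psi * psi % q  # primitive 4th root of unity for the length-4 NTT

def ntt(a, w):
    """Naive O(n^2) number theoretic transform with root w."""
    return [sum(a[i] * pow(w, i * k, q) for i in range(n)) % q
            for k in range(n)]

def nwc(a, b):
    """Negacyclic product a*b mod (x^n + 1, q) via the weighted NTT."""
    aw = [a[i] * pow(psi, i, q) % q for i in range(n)]    # pre-weight by psi^i
    bw = [b[i] * pow(psi, i, q) % q for i in range(n)]
    cw = [x * y % q for x, y in zip(ntt(aw, omega), ntt(bw, omega))]
    inv_n = pow(n, -1, q)                                 # Python 3.8+ modular inverse
    c = [v * inv_n % q for v in ntt(cw, pow(omega, -1, q))]  # inverse NTT
    return [c[i] * pow(psi, -i, q) % q for i in range(n)]    # post-weight by psi^-i

def nwc_schoolbook(a, b):
    """Direct reduction of the product modulo x^n + 1, for checking."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            s = 1 if i + j < n else -1   # wrap-around picks up a minus sign
            c[k] = (c[k] + s * a[i] * b[j]) % q
    return c
```

Note that no length-2n transforms appear anywhere: the psi-weighting replaces the zero padding that a plain cyclic-convolution approach would require.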
  7. Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications.
  8. Convolution is a fundamental operation with diverse applications in signal processing, computer vision, and machine learning. This article reviews three distinct convolutions: linear convolution (also referred to as aperiodic convolution), positive-wrapped convolution (PWC) (also known as circular convolution), and negative-wrapped convolution (NWC). Additionally, we propose an alternative approach to computing linear convolution without zero padding by leveraging the PWC and NWC. We compare two fast Fourier transform (FFT)-based methods to compute linear convolution: the traditional zero-padded PWC method and a new method based on the PWC and NWC. Through a detailed analysis of the flowgraphs (FGs), we demonstrate the equivalence of these methods while highlighting their unique characteristics. We show that computing the NWC using the weighted PWC method is equivalent to a part of the linear convolution computation with zero padding. Furthermore, it is possible to extract the PWC and NWC from structures to compute linear convolution with zero padding, where the last butterfly stage can be eliminated. This article aims to establish a clear connection among PWC, NWC, and linear convolution, illustrating new perspectives on computing different convolutions. 
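The decomposition described above can be checked numerically: for two length-n sequences, the k-th outputs of the PWC and NWC equal c_k + c_{k+n} and c_k - c_{k+n} of the length-(2n-1) linear convolution c, so the linear result is recovered by a half-sum and half-difference, with no zero padding. The sketch below uses direct (schoolbook) PWC and NWC on integer inputs rather than the FFT-based structures analyzed in the article.

```python
def pwc(a, b):
    """Positive-wrapped (circular) convolution of equal-length sequences."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]

def nwc(a, b):
    """Negative-wrapped convolution: wrapped terms enter with a minus sign."""
    n = len(a)
    return [sum((1 if i <= k else -1) * a[i] * b[(k - i) % n]
                for i in range(n)) for k in range(n)]

def linear_from_pwc_nwc(a, b):
    """Recover the linear convolution c (length 2n-1) from PWC and NWC:
    c_k = (P_k + N_k)/2 and c_{k+n} = (P_k - N_k)/2."""
    n = len(a)
    p, m = pwc(a, b), nwc(a, b)
    low = [(pk + mk) // 2 for pk, mk in zip(p, m)]
    high = [(pk - mk) // 2 for pk, mk in zip(p, m)]
    return low + high[:n - 1]      # the last entry of 'high' is always 0
```

Since P_k + N_k and P_k - N_k are always even, the halving is exact on integer inputs.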
  9. Polynomial multiplication over the quotient ring is a critical operation in Ring Learning with Errors (Ring-LWE) based cryptosystems that are used for post-quantum cryptography and homomorphic encryption. This operation can be efficiently implemented using number-theoretic transform (NTT)-based architectures. Among these, pipelined parallel NTT-based polynomial multipliers are attractive for cloud computing, as they are well suited for high-throughput and low-latency applications. For a given polynomial length, a pipelined parallel NTT-based multiplier can be designed with varying degrees of parallelism, resulting in different tradeoffs: higher parallelism reduces latency but increases area and power consumption, and vice versa. In this paper, we develop a predictive model based on synthesized results for pipelined parallel NTT-based polynomial multipliers and analyze design tradeoffs in terms of area, power, energy, area-time product, and area-energy product across parallelism levels up to 128. We predict that, for very long polynomials, area and power differences between designs with varying levels of parallelism become negligible. In contrast, area-time product and energy per polynomial multiplication decrease with increased parallelism. Our findings suggest that, given area and power constraints, the highest feasible level of parallelism optimizes latency, area-time product, and energy per polynomial multiplication.
  10. Homomorphic encryption enables computations on ciphertext that preserve data privacy. However, its practical deployment has been hindered by significant computational overhead compared to plaintext computation. In response to this challenge, we present HERMES, a novel hardware acceleration system designed to exploit the computation flow of the CKKS homomorphic encryption bootstrapping process. Our architecture makes several contributions. First, we analyze the properties of the CKKS computation data flow and propose a new scheduling strategy that partitions the computation into general-purpose and special-purpose modular computation modules, allowing smaller resource consumption and flexible scheduling. The computation modules are also reconfigurable, reducing memory access overhead during intermediate computation. Finally, we optimize the CKKS computation dataflow to improve its regularity and reduce control overhead.