This content will become publicly available on July 10, 2023

Title: MATCHA: a fast and energy-efficient accelerator for fully homomorphic encryption over the torus
Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to be performed directly on ciphertexts using homomorphic logic gates. However, each TFHE gate is extremely slow (> 0.2 ms) on state-of-the-art hardware platforms such as GPUs and FPGAs. Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating-point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling, which accelerates TFHE gates without decryption errors, by means of approximate multiplication-less integer FFTs and IFFTs and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3× and the throughput per Watt by 6.3×.
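The abstract does not spell out how MATCHA's approximate multiplication-less integer FFT is constructed. As a rough illustration of the general idea only, the sketch below approximates a twiddle-factor coefficient by a short sum of signed powers of two, so that multiplying an integer sample by it needs only shifts and adds. The function names, the greedy selection, and the term limits are assumptions made for this example, not details taken from the paper.

```python
import math


def dyadic_terms(c, bits=8, max_terms=4):
    """Greedily approximate a real constant c as a sum of signed powers of two.

    Returns a list of (sign, shift) pairs, each standing for sign * 2**-shift.
    Hypothetical helper for illustration; not MATCHA's actual scheme.
    """
    terms = []
    residual = c
    for _ in range(max_terms):
        best = None
        for shift in range(bits + 1):
            for sign in (1, -1):
                err = abs(residual - sign * 2.0 ** -shift)
                if best is None or err < best[0]:
                    best = (err, sign, shift)
        err, sign, shift = best
        if err >= abs(residual):
            break  # no candidate term improves the approximation
        terms.append((sign, shift))
        residual -= sign * 2.0 ** -shift
    return terms


def shift_add_multiply(x, terms):
    """Multiply integer x by the dyadic approximation using shifts and adds only."""
    acc = 0
    for sign, shift in terms:
        acc += sign * (x >> shift)
    return acc


# Example: approximate cos(pi/4), a typical FFT twiddle coefficient.
coeff = math.cos(math.pi / 4)
terms = dyadic_terms(coeff)
sample = 1 << 16
approx = shift_add_multiply(sample, terms)
exact = coeff * sample
print(terms, approx, round(exact, 2))
```

Fewer terms mean fewer adders but a larger approximation error, which is the kind of accuracy-versus-cost trade-off that an approximate integer transform has to balance against the noise budget of the ciphertext.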
Authors:
; ;
Award ID(s):
1908992
Publication Date:
NSF-PAR ID:
10353599
Journal Name:
ACM/IEEE Design Automation Conference
Page Range or eLocation-ID:
235 to 240
Sponsoring Org:
National Science Foundation
More Like this
  1. Although Convolutional Neural Networks (CNNs) have demonstrated state-of-the-art inference accuracy in various intelligent applications, each CNN inference involves millions of expensive floating-point multiply-accumulate (MAC) operations. To process CNN inferences energy-efficiently, prior work proposes an electro-optical accelerator that processes power-of-2 quantized CNNs with electro-optical ripple-carry adders and optical binary shifters. The electro-optical accelerator also uses SRAM registers to store intermediate data. However, electro-optical ripple-carry adders and SRAMs seriously limit the operating frequency and inference throughput of the electro-optical accelerator, due to the long critical path of the adder and the long access latency of SRAMs. In this paper, we propose a photonic nonvolatile memory (NVM)-based accelerator, LightBulb, to process binarized CNNs with high-frequency photonic XNOR gates and popcount units. LightBulb also adopts photonic racetrack memory as input/output registers to achieve a high operating frequency. Compared to prior electro-optical accelerators, LightBulb improves the CNN inference throughput by 17× to 173× and the inference throughput per Watt by 17.5× to 660× on average.
  2. High-fidelity single- and two-qubit gates are essential building blocks for a fault-tolerant quantum computer. While there has been much progress in suppressing single-qubit gate errors in superconducting qubit systems, two-qubit gates still suffer from error rates that are orders of magnitude higher. One limiting factor is the residual ZZ-interaction, which originates from a coupling between computational states and higher-energy states. While this interaction is usually viewed as a nuisance, here we experimentally demonstrate that it can be exploited to produce a universal set of fast single- and two-qubit entangling gates in a coupled transmon qubit system. To implement arbitrary single-qubit rotations, we design a new protocol called the two-axis gate that is based on a three-part composite pulse. It rotates a single qubit independently of the state of the other qubit despite the strong ZZ-coupling. We achieve single-qubit gate fidelities as high as 99.1% from randomized benchmarking measurements. We then demonstrate both a CZ gate and a CNOT gate. Because the system has a strong ZZ-interaction, a CZ gate can be achieved by letting the system freely evolve for a gate time t_g = 53.8 ns. To design the CNOT gate, we utilize an analytical microwave pulse shape based on the SWIPHT protocol for realizing fast, low-leakage gates. We obtain fidelities of 94.6% and 97.8% for the CNOT and CZ gates, respectively, from quantum process tomography.
  3. Homomorphic encryption is a promising technology for enabling various privacy-preserving applications such as secure biomarker search. However, current implementations are not practical due to large performance overheads. A homomorphic encryption scheme has recently been proposed that allows bitwise comparison without the computationally intensive multiplication and bootstrapping operations. Even so, this scheme still suffers from a memory-bound performance bottleneck due to large ciphertext expansion. In this work, we propose HEGA, a near-data processing architecture that leverages this scheme with 3D-stacked memory to accelerate privacy-preserving biomarker search. We observe that homomorphic encryption-based search, like other emerging applications, can greatly benefit from the large throughput, capacity, and energy savings of 3D-stacked memory-based near-data processing architectures. Our near-data acceleration solution can speed up biomarker search by 6.3× with 5.7× energy savings compared to an 8-core Intel Xeon processor.
  4. Continuous advancements in LiDAR technology have enabled compelling wind turbulence measurements within the atmospheric boundary layer with range gates shorter than 20 m and sampling frequencies on the order of 10 Hz. However, estimates of the radial velocity from the back-scattered laser beam are inevitably affected by an averaging process within each range gate, generally modeled as a convolution between the actual velocity projected along the LiDAR line-of-sight and a weighting function representing the energy distribution of the laser pulse along the range gate. As a result, the spectral energy of the turbulent velocity fluctuations is damped within the inertial sub-range, with a corresponding reduction of the velocity variance, which prevents taking full advantage of the spatio-temporal resolution achieved by the LiDAR technology. In this article, we propose to correct this turbulent energy damping on the LiDAR measurements by reversing the effect of a low-pass filter, which can be estimated directly from the LiDAR measurements. LiDAR data acquired from three different field campaigns are analyzed to describe the proposed technique, investigate the variability of the filter parameters and, for one dataset, assess the procedure for spectral LiDAR correction against sonic anemometer data. It is found that the order of the low-pass filter used for modeling the energy damping on the LiDAR velocity measurements has negligible effects on the correction of the second-order statistics of the wind velocity. In contrast, its cutoff frequency plays a significant role in the spectral correction encompassing the smoothing effects connected with the LiDAR gate length.
  5. Ultra-fast and low-power superconductor single-flux-quantum (SFQ)-based CNN systolic accelerators are built to enhance the CNN inference throughput. However, shift-register (SHIFT)-based scratchpad memory (SPM) arrays prevent an SFQ CNN accelerator from exceeding 40% of its peak throughput, due to the lack of random access capability. This paper first documents our study of a variety of cryogenic memory technologies, including Vortex Transition Memory (VTM), Josephson-CMOS SRAM, MRAM, and Superconducting Nanowire Memory, during which we found that none of these technologies allows an SFQ CNN accelerator to achieve high throughput, small area, and low power simultaneously. Second, we present a heterogeneous SPM architecture, SMART, composed of SHIFT arrays and a random access array to improve the inference throughput of an SFQ CNN systolic accelerator. Third, we propose a fast, low-power, and dense pipelined random access CMOS-SFQ array by building SFQ passive-transmission-line-based H-Trees that connect CMOS sub-banks. Finally, we create an ILP-based compiler to deploy CNN models on SMART. Experimental results show that, with the same chip area overhead, compared to the latest SHIFT-based SFQ CNN accelerator, SMART improves the inference throughput by 3.9× (2.2×) and reduces the inference energy by 86% (71%) when inferring a single image (a batch of images).
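
The first related abstract above mentions processing binarized CNNs with XNOR gates and popcount units. As a software analogy only, the sketch below shows why that pairing suffices for a binarized dot product: with values in {-1, +1} packed one per bit, XNOR marks the positions where two vectors agree, and the dot product equals twice the number of agreements minus the vector length. The helper names and the bit encoding (1 for +1, 0 for -1) are assumptions chosen for this example and say nothing about the photonic hardware itself.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer, bit i set when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binarized_dot(a_bits, b_bits, n):
    """Dot product of two n-element {-1, +1} vectors via XNOR and popcount."""
    mask = (1 << n) - 1                # keep only the n valid bit positions
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: 1 where the two signs match
    matches = bin(agree).count("1")    # popcount
    return 2 * matches - n             # matches contribute +1, mismatches -1


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
print(binarized_dot(pack(a), pack(b), len(a)), reference)
```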