

Title: A Multiplatform Parallel Approach for Lattice Sieving Algorithms
Lattice sieving is currently the leading class of algorithms for solving the shortest vector problem over lattices. The computational difficulty of this problem is the basis for constructing secure post-quantum public-key cryptosystems based on lattices. In this paper, we present a novel massively parallel approach for solving the shortest vector problem using lattice sieving and hardware acceleration. We combine previously reported algorithms with an appropriate caching strategy and develop a hardware architecture. The main advantage of the proposed approach is eliminating the overhead of data transfer between a CPU and a hardware accelerator. The authors believe that this is the first such architecture reported in the literature to date and predict that it will achieve up to 8 times higher throughput than a multi-core high-performance CPU. The presented methods can be adapted to other sieving algorithms that are hard to implement in FPGAs due to communication and memory bottlenecks.
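The core operation that sieving accelerators stream in parallel is the pairwise reduction of lattice vectors against a large database. The sketch below is plain Python/NumPy and illustrative only: it shows a GaussSieve-style reduction step, not the paper's actual pipeline, caching strategy, or FPGA architecture.

```python
# Minimal sketch of the pairwise reduction step at the heart of sieving
# algorithms such as GaussSieve; illustrative only, not the paper's design.
import numpy as np

def reduce_against(v, w):
    """Shorten v using w: v <- v - round(<v,w>/<w,w>) * w (never increases ||v||)."""
    c = int(round(np.dot(v, w) / np.dot(w, w)))
    return v - c * w, c != 0

def sieve_reduce(v, database):
    """Reduce a new vector against every database vector until it is stable.
    This inner loop is the part a hardware accelerator streams over the database."""
    changed = True
    while changed:
        changed = False
        for w in database:
            v, did = reduce_against(v, w)
            changed |= did
    return v

# toy usage on a tiny 3-dimensional database
db = [np.array([3, 1, 0]), np.array([0, 2, 5]), np.array([1, 1, 1])]
print(sieve_reduce(np.array([7, 5, 6]), db))
```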
Award ID(s):
1801512
NSF-PAR ID:
10281378
Author(s) / Creator(s):
Editor(s):
Qiu, Meikang
Date Published:
Journal Name:
International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2020
Volume:
LNCS 12452
Page Range / eLocation ID:
661 - 680
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hazay, Carmit ; Stam, Martijn (Ed.)
    We study the computational problem of finding a shortest non-zero vector in a rotation of ℤⁿ, which we call ℤSVP. It has been a long-standing open problem to determine if a polynomial-time algorithm for ℤSVP exists, and there is by now a beautiful line of work showing how to solve it efficiently in certain very special cases. However, despite all of this work, the fastest known algorithm that is proven to solve ℤSVP is still simply the fastest known algorithm for solving SVP (i.e., the problem of finding shortest non-zero vectors in arbitrary lattices), which runs in 2^(n+o(n)) time. We therefore set aside the (perhaps impossible) goal of finding an efficient algorithm for ℤSVP and instead ask what else we can say about the problem. E.g., can we find any non-trivial speedup over the best known SVP algorithm? And, if ℤSVP actually is hard, then what consequences would follow? Our results are as follows. We show that ℤSVP is in a certain sense strictly easier than SVP on arbitrary lattices. In particular, we show how to reduce ℤSVP to an approximate version of SVP in the same dimension (in fact, even to approximate unique SVP, for any constant approximation factor). Such a reduction seems very unlikely to work for SVP itself, so we view this as a qualitative separation of ℤSVP from SVP. As a consequence of this reduction, we obtain a 2^(n/2+o(n))-time algorithm for ℤSVP, i.e., the first non-trivial speedup over the best known algorithm for SVP on general lattices. (In fact, this reduction works for a more general class of lattices: semi-stable lattices with not-too-large λ₁.) We show a simple public-key encryption scheme that is secure if (an appropriate variant of) ℤSVP is actually hard. Specifically, our scheme is secure if it is difficult to distinguish (in the worst case) a rotation of ℤⁿ from either a lattice with all non-zero vectors longer than √(n/log n) or a lattice with smoothing parameter significantly smaller than the smoothing parameter of ℤⁿ. The latter result has an interesting qualitative connection with reverse Minkowski theorems, which in some sense say that “ℤⁿ has the largest smoothing parameter.” We show a distribution of bases B for rotations of ℤⁿ such that, if ℤSVP is hard for any input basis, then ℤSVP is hard on input B. This gives a satisfying theoretical resolution to the problem of sampling hard bases for ℤⁿ, which was studied by Blanks and Miller [9]. This worst-case to average-case reduction is also crucially used in the analysis of our encryption scheme. (In recent independent work that appeared as a preprint before this work, Ducas and van Woerden showed essentially the same thing for general lattices [15], and they also used this to analyze the security of a public-key encryption scheme. Similar ideas also appeared in [5, 11, 20] in different contexts.) We perform experiments to determine how practical basis reduction performs on bases of ℤⁿ that are generated in different ways and how heuristic sieving algorithms perform on ℤⁿ. Our basis reduction experiments complement and add to those performed by Blanks and Miller, as we work with a larger class of algorithms (i.e., larger block sizes) and study the “provably hard” distribution of bases described above. Our sieving experiments confirm that heuristic sieving algorithms perform as expected on ℤⁿ.
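As context for the basis-reduction experiments on ℤⁿ, the sketch below (plain NumPy, illustrative only) samples a basis of ℤⁿ by applying random unimodular row operations to the identity. This mimics one simple way of generating such experimental inputs; it is not the "provably hard" distribution of bases constructed in the paper.

```python
# Illustrative only: sample a basis of the integer lattice Z^n by applying
# random unimodular row operations to the identity.  This is NOT the
# "provably hard" distribution of bases described in the paper.
import numpy as np

def random_zn_basis(n, steps=30, seed=0):
    rng = np.random.default_rng(seed)
    B = np.eye(n, dtype=np.int64)                    # rows generate Z^n
    for _ in range(steps):
        i, j = rng.choice(n, size=2, replace=False)
        B[i] += int(rng.integers(-1, 2)) * B[j]      # determinant-preserving row op
    return B

B = random_zn_basis(6)
print(int(round(np.linalg.det(B.astype(float)))))    # +/-1: the lattice is still Z^n
```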
  2. Lattice-based algorithms in cryptanalysis often search for a target vector satisfying integer linear constraints as a shortest or closest vector in some lattice. In this work, we observe that these formulations may discard non-linear information from the underlying application that can be used to distinguish the target vector even when it is far from being uniquely close or short. We formalize lattice problems augmented with a predicate distinguishing a target vector and give algorithms for solving instances of these problems. We apply our techniques to lattice-based approaches for solving the Hidden Number Problem, a popular technique for recovering secret DSA or ECDSA keys in side-channel attacks, and demonstrate that our algorithms succeed in recovering the signing key for instances that were previously believed to be unsolvable using lattice approaches. We carried out extensive experiments using our estimation and solving framework, which we also make available with this work. 
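As a concrete reference point for the classical formulation that this work augments with a predicate, the sketch below builds a textbook Hidden Number Problem lattice for ECDSA with bounded nonces. All names are illustrative, the scaling is simplified, and this is not the paper's framework or API; it is a plain-Python sketch of the standard CVP formulation.

```python
# Textbook Hidden Number Problem (HNP) embedding for ECDSA with bounded
# nonces -- the classical formulation that predicate-augmented algorithms
# refine.  Given signatures (r_i, s_i) on hashes h_i whose nonces satisfy
# 0 <= k_i < K, the secret key d obeys
#     k_i = t_i*d + u_i (mod q),  t_i = r_i/s_i (mod q),  u_i = h_i/s_i (mod q).
# Names and scaling details are illustrative, not the paper's API.

def hnp_cvp_instance(sigs, hashes, q):
    """Return (basis, target) for the CVP formulation of the HNP.

    basis rows:  q*e_i (i = 1..m)  and  (t_1, ..., t_m, 1)
    target:      (-u_1, ..., -u_m, 0)
    The lattice vector closest to `target` differs from it by (k_1, ..., k_m, d).
    Practical attacks rescale the last coordinate (or use Kannan's embedding)
    so all coordinates are comparably small, then run LLL/BKZ; that part is
    deliberately omitted here.
    """
    m = len(sigs)
    t = [(r * pow(s, -1, q)) % q for (r, s) in sigs]
    u = [(h * pow(s, -1, q)) % q for h, (r, s) in zip(hashes, sigs)]
    basis = [[q if j == i else 0 for j in range(m + 1)] for i in range(m)]
    basis.append(t + [1])
    target = [-ui for ui in u] + [0]
    return basis, target
```

The predicate-based approach described in the abstract goes beyond this formulation by filtering candidate vectors with application-specific, non-linear checks (for example, whether a candidate key reproduces a known public key) instead of relying on the target being uniquely short or close.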
  3. Graph processing has recently received intensive interest in light of a wide range of needs to understand relationships. It is well known for poor locality and high memory bandwidth requirements. On conventional architectures, graph workloads incur a significant amount of data movement and energy consumption, which has motivated several hardware graph processing accelerators. Current graph processing accelerators rely on memory access optimizations or on placing computation logic close to memory. Distinct from all existing approaches, we leverage an emerging memory technology to accelerate graph processing with analog computation. This paper presents GRAPHR, the first ReRAM-based graph processing accelerator. GRAPHR follows the principle of near-data processing and explores the opportunity of performing massively parallel analog operations with low hardware and energy cost. Analog computation is suitable for graph processing because: 1) the algorithms are iterative and can inherently tolerate imprecision; 2) both probability calculations (e.g., PageRank and Collaborative Filtering) and typical graph algorithms involving integers (e.g., BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a vertex program of a graph algorithm can be expressed as sparse matrix-vector multiplication (SpMV), it can be efficiently performed by a ReRAM crossbar. We show that this assumption is generally true for a large set of graph algorithms. GRAPHR is a novel accelerator architecture consisting of two components: memory ReRAM and graph engines (GEs). The core graph computations are performed in sparse matrix format in the GEs (ReRAM crossbars). Vector/matrix-based graph computation is not new, but ReRAM offers the unique opportunity to realize massive parallelism with unprecedented energy efficiency and low hardware cost. With small subgraphs processed by the GEs, the gain from performing parallel operations outweighs the waste due to sparsity. The experimental results show that GRAPHR achieves a 16.01X speedup (up to 132.67X) and 33.82X energy savings on geometric mean compared to a CPU baseline system. Compared to a GPU, GRAPHR achieves a 1.69X to 2.19X speedup and consumes 4.77X to 8.91X less energy. GRAPHR gains a speedup of 1.16X to 4.12X over, and is 3.67X to 10.96X more energy efficient than, a PIM-based architecture.
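To make the SpMV insight concrete, here is a plain software sketch (NumPy/SciPy, illustrative only; it does not model the ReRAM crossbars or analog computation) of PageRank written as repeated sparse matrix-vector products, the kind of vertex program GRAPHR maps onto crossbars.

```python
# PageRank expressed as repeated SpMV: the vertex program form that maps
# onto ReRAM crossbars in GRAPHR.  Plain software sketch, illustrative only.
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_spmv(adj, d=0.85, iters=50):
    """adj: dense 0/1 adjacency matrix (row i -> column j means edge i->j)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    out_deg[out_deg == 0] = 1                      # avoid division by zero
    M = csr_matrix(adj / out_deg[:, None]).T       # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)              # one SpMV per iteration
    return r

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(pagerank_spmv(A))
```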
  4. Arbitrary-precision integer multiplication is the core kernel of many applications, including scientific computing and cryptographic algorithms. Existing accelerators for arbitrary-precision integer multiplication include CPUs, GPUs, FPGAs, and ASICs. To leverage the hardware's intrinsic low-bit function units (32/64-bit), arbitrary-precision integer multiplication can be computed using Karatsuba or schoolbook decomposition, which splits the two large operands into several small operands and generates a set of low-bit multiplications that can be processed either spatially or sequentially on the low-bit function units, e.g., CPU vector instructions, GPU CUDA cores, or FPGA digital signal processing (DSP) blocks. Among these accelerators, reconfigurable computing, e.g., FPGA accelerators, promises to provide both good energy efficiency and flexibility. We implement the state-of-the-art (SOTA) FPGA accelerator and compare it with the SOTA libraries on CPUs and GPUs. Surprisingly, we find that the FPGA has the lowest energy efficiency, i.e., 0.29x that of the CPU and 0.17x that of the GPU of the same fabrication generation. Therefore, key questions arise: Where do the energy efficiency gains of CPUs and GPUs come from? Can reconfigurable computing do better? If so, how can that be achieved? We first identify that the biggest energy efficiency gains of CPUs and GPUs come from their dedicated vector units, i.e., vector instruction units in CPUs and CUDA cores in GPUs. FPGAs use DSPs and lookup tables (LUTs) to compose the needed computation, which incurs overhead compared to using vector units directly. A new form of reconfigurable computing, e.g., "FPGA+vector units", is a novel and feasible solution to improve energy efficiency. In this paper, we propose to map arbitrary-precision integer multiplication onto such an "FPGA+vector units" platform, i.e., the AMD/Xilinx Versal ACAP architecture, a heterogeneous reconfigurable computing platform that features 400 AI engine tensor cores (AIEs) running at 1 GHz, FPGA programmable logic (PL), and a general-purpose CPU in one system fabricated with TSMC 7nm technology. Designing on Versal ACAP poses several challenges, and we propose AIM, Arbitrary-precision Integer Multiplication on Versal ACAP, to automate and optimize the design. The AIM accelerator is composed of AIEs, PL, and the CPU. The AIM framework includes analytical models to guide design space exploration and automatic code generation to facilitate the system design and on-board design verification. We deploy the AIM framework on three different applications, including large integer multiplication (LIM), RSA, and Mandelbrot, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experimental results show that, compared to existing accelerators, AIM achieves up to 12.6x and 2.1x energy efficiency gains over the Intel Xeon Ice Lake 6346 CPU and the NVIDIA A5000 GPU, respectively, making reconfigurable computing the most energy-efficient platform among the three.
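As a concrete reference for the decomposition described above, the sketch below (plain Python, illustrative only; it is not the AIM mapping or code generator) shows recursive Karatsuba splitting, where each level turns one large product into three half-size products that can be scheduled onto low-bit function units.

```python
# Illustrative only: recursive Karatsuba decomposition.  Each level turns one
# large product into three half-size products, which is what gets scheduled
# onto low-bit function units (CPU vector lanes, GPU cores, FPGA DSPs, AIEs).
def karatsuba(x, y, base_bits=64):
    """Multiply non-negative integers; products at or below base_bits stand in
    for the "low-bit" multiplications a hardware unit would execute directly."""
    if x.bit_length() <= base_bits or y.bit_length() <= base_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    x_hi, x_lo = x >> half, x & ((1 << half) - 1)
    y_hi, y_lo = y >> half, y & ((1 << half) - 1)
    z2 = karatsuba(x_hi, y_hi, base_bits)                          # high part
    z0 = karatsuba(x_lo, y_lo, base_bits)                          # low part
    z1 = karatsuba(x_hi + x_lo, y_hi + y_lo, base_bits) - z2 - z0  # middle part
    return (z2 << (2 * half)) + (z1 << half) + z0

a, b = 12345678901234567890, 98765432109876543210
assert karatsuba(a, b) == a * b
```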
  5. Abstract

    Quantum annealing is a promising approach to heuristically solving difficult combinatorial optimization problems. However, the connectivity limitations of current devices lead to an exponential degradation of performance on general problems. We propose an architecture for a quantum annealer that achieves full connectivity and full programmability while using a number of physical resources only linear in the number of spins. We do so by applying carefully engineered periodic modulations to oscillator-based qubits, resulting in a Floquet Hamiltonian in which all the interactions are tunable. This flexibility comes at the cost of the coupling strengths between qubits being smaller than they would be with direct coupling, which increases the demand on coherence times with increasing problem size. We analyze a specific hardware proposal of our architecture based on Josephson parametric oscillators. Our results show how the minimum-coherence-time requirements imposed by our scheme scale, and we find that the requirements are not prohibitive for fully connected problems with up to at least 1000 spins. Our approach could also have an impact beyond quantum annealing, since it readily extends to bosonic quantum simulators and would allow the study of models with arbitrary connectivity between lattice sites.
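For reference, the cost function such a fully connected, fully programmable annealer must realize is the standard all-to-all Ising Hamiltonian; this is the textbook form, not the paper's Floquet-engineered implementation.

```latex
% Standard all-to-all Ising cost Hamiltonian targeted by a fully connected,
% fully programmable annealer (textbook form; the Floquet engineering that
% realizes it in the proposed architecture is not shown here).
\[
H_{\mathrm{Ising}} = \sum_{1 \le i < j \le N} J_{ij}\,\sigma^{z}_{i}\sigma^{z}_{j}
                   + \sum_{i=1}^{N} h_{i}\,\sigma^{z}_{i},
\qquad \text{with all } J_{ij},\, h_{i} \text{ independently programmable.}
\]
```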

     