Title: Swizzle Inventor: Data Movement Synthesis for GPU Kernels
Utilizing memory and register bandwidth in modern architectures may require swizzles — non-trivial mappings of data and computations onto hardware resources — such as shuffles. We develop Swizzle Inventor to help programmers implement swizzle programs, by writing program sketches that omit swizzles and delegating their creation to an automatic synthesizer. Our synthesis algorithm scales to real-world programs, allowing us to invent new GPU kernels for stencil computations, matrix transposition, and a finite field multiplication algorithm (used in cryptographic applications). The synthesized 2D convolution and finite field multiplication kernels are on average 1.5–3.2x and 1.1–1.7x faster, respectively, than expert-optimized CUDA kernels.
Award ID(s): 1723352, 1723445, 1651225
NSF-PAR ID: 10113353
Journal Name: ASPLOS
Page Range / eLocation ID: 65 to 78
Format(s): Medium: X
Sponsoring Org: National Science Foundation
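As a rough illustration of the sketch-and-synthesize workflow described in the abstract, here is a minimal Python toy (not the tool's actual sketch language or solver-based search): the kernel leaves the lane-mapping as a hole, and an enumerative search fills the hole so the kernel matches a specification. The names `spec`, `kernel_sketch`, and the candidate family `(a*i + b) % N` are illustrative assumptions, not part of the paper.

```python
# Toy sketch-and-synthesize loop: the data-movement pattern is a "hole"
# filled by search. Swizzle Inventor verifies candidates symbolically for
# all inputs; this toy only checks one concrete test vector.

from itertools import product

N = 8  # hypothetical warp width for the toy example

def spec(data):
    """Specification: each lane i should end up with data[(i + 1) % N]."""
    return [data[(i + 1) % N] for i in range(N)]

def kernel_sketch(data, a, b):
    """Sketch: a shuffle whose source-lane expression (a*i + b) % N is unknown."""
    return [data[(a * i + b) % N] for i in range(N)]

def synthesize():
    """Enumerate candidate (a, b) coefficients and keep one that matches the spec."""
    test = list(range(N))
    for a, b in product(range(N), repeat=2):
        if kernel_sketch(test, a, b) == spec(test):
            return a, b
    return None

print(synthesize())  # e.g. (1, 1): lane i reads from lane i+1 mod N
```

The real system searches far richer swizzle spaces and proves the filled sketch equivalent to the specification for all inputs; the toy above is only meant to show the shape of the problem.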
More Like this
  1. Fully Homomorphic Encryption over the Torus (TFHE) allows arbitrary computations to happen directly on ciphertexts using homomorphic logic gates. However, each TFHE gate on state-of-the-art hardware platforms such as GPUs and FPGAs is extremely slow (> 0.2 ms). Moreover, even the latest FPGA-based TFHE accelerator cannot achieve high energy efficiency, since it frequently invokes expensive double-precision floating-point FFT and IFFT kernels. In this paper, we propose a fast and energy-efficient accelerator, MATCHA, to process TFHE gates. MATCHA supports aggressive bootstrapping key unrolling to accelerate TFHE gates without decryption errors, using approximate multiplication-less integer FFTs and IFFTs and a pipelined datapath. Compared to prior accelerators, MATCHA improves the TFHE gate processing throughput by 2.3x, and the throughput per Watt by 6.3x.
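To make the "integer FFT" ingredient concrete, here is a small hedged Python sketch showing only the quantized-twiddle idea: twiddle factors are rounded to fixed-point integers so the butterflies run in integer arithmetic, at the cost of a bounded approximation error. This is not MATCHA's datapath; in particular it still performs multiplications, which MATCHA's multiplication-less design avoids.

```python
# Minimal fixed-point radix-2 FFT over (real, imag) integer pairs.
# Illustrative only: real TFHE accelerators use negacyclic transforms and
# much more careful error analysis than this toy.

import math

def fft_fixed_point(x, frac_bits=12):
    """x: list of (real, imag) integer pairs; length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_fixed_point(x[0::2], frac_bits)
    odd = fft_fixed_point(x[1::2], frac_bits)
    scale = 1 << frac_bits
    out = [None] * n
    for k in range(n // 2):
        # twiddle e^{-2*pi*i*k/n} rounded to fixed point -- the source of the
        # (bounded) approximation error
        wr = round(math.cos(2 * math.pi * k / n) * scale)
        wi = round(-math.sin(2 * math.pi * k / n) * scale)
        er, ei = even[k]
        odr, odi = odd[k]
        tr = (wr * odr - wi * odi) >> frac_bits
        ti = (wr * odi + wi * odr) >> frac_bits
        out[k] = (er + tr, ei + ti)
        out[k + n // 2] = (er - tr, ei - ti)
    return out

# Usage: transform an 8-point integer signal
print(fft_fixed_point([(v, 0) for v in [1, 2, 3, 4, 5, 6, 7, 8]]))
```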
  2. We have developed a memory- and operation-count-efficient 2.5D inversion algorithm for electrical resistivity (ER) data that can handle the fine discretization domains imposed by other geophysical (e.g., ground-penetrating radar or seismic) data. Due to numerical stability criteria and available computational memory, joint inversion of different types of geophysical data can impose different grid discretization constraints on the model parameters. Our algorithm enables the ER data sensitivities to be directly joined with other geophysical data without the need to interpolate or coarsen the discretization. We use the adjoint method directly on the discretized steady-state Maxwell's equation to compute the data sensitivity to the conductivity. In doing so, we make no finite-difference approximation of the Jacobian of the data and avoid the need to store large, dense matrices. Rather, we exploit matrix-vector multiplication of sparse matrices and find successful convergence using gradient descent for our inversion routine, without having to resort to the Hessian of the objective function. By assuming a 2.5D subsurface, we reduce memory requirements linearly compared to a 3D gradient-descent inversion, and by a power of two compared to storing a 2D Hessian. Moreover, our method outperforms 3D Gauss-Newton conjugate-gradient schemes in operation count, with an advantage that scales cubically in our favor with respect to the thickness of the 3D domain. We physically appraise the domain of the recovered conductivity using a cutoff on the electric current density present in our survey. We evaluate two case studies to assess the validity of our algorithm: first on a 2.5D synthetic example, and then on field data acquired in a controlled alluvial aquifer, where we were able to match the recovered conductivity to borehole observations.
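The adjoint trick described above can be sketched in a few lines. The sketch below uses hypothetical helper callables `A_of_m`, `dA_dm`, and a measurement operator `P` (none of which come from the paper) and only shows how two sparse solves plus sparse matrix-vector products replace an explicit Jacobian or Hessian.

```python
# Schematic adjoint-state gradient of a least-squares data misfit.
# All operators are placeholders; the point is that no dense Jacobian
# or Hessian is ever formed or stored.

import numpy as np
import scipy.sparse.linalg as spla

def misfit_gradient(A_of_m, dA_dm, m, q, d_obs, P):
    """A_of_m(m): sparse system matrix for conductivity model m
    dA_dm(m, k): sparse derivative of A w.r.t. model cell k
    q: source vector, d_obs: observed data, P: sparse measurement operator."""
    A = A_of_m(m)
    u = spla.spsolve(A.tocsc(), q)              # forward solve: A u = q
    r = P @ u - d_obs                           # data residual
    lam = spla.spsolve(A.T.tocsc(), P.T @ r)    # adjoint solve: A^T lam = P^T r
    # gradient of 0.5*||P u - d_obs||^2 for each model parameter,
    # using only sparse matrix-vector products
    g = np.array([-lam @ (dA_dm(m, k) @ u) for k in range(len(m))])
    return g
```

This gradient is exactly what a plain gradient-descent loop over the conductivity model needs, which matches the abstract's point about avoiding the Hessian entirely.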
  3. Matrix multiplication is one of the bottleneck computations for training the weights within deep neural networks. To speed up the training phase, we propose to use faster algorithms for matrix multiplication known as Arbitrary Precision Approximating (APA) algorithms. APA algorithms perform asymptotically fewer arithmetic operations than the classical algorithm, but they compute an approximate result with an error that can be made arbitrarily small in exact arithmetic. Practical APA algorithms provide significant reduction in computation time and still provide enough accuracy for many applications like neural network training. We demonstrate that APA algorithms can be efficiently implemented and parallelized for multicore CPUs to obtain up to 28% and 21% speedups over the fastest implementation of the classical algorithm using one core and 12 cores, respectively. Furthermore, using these algorithms to train a Multi-Layer Perceptron (MLP) network yields no significant reduction in the training or testing error. Our performance results on a large MLP network show overall sequential and multithreaded performance improvements of up to 25% and 13%, respectively. We also demonstrate up to 15% improvement when training the fully connected layers of the VGG-19 image classification network. 
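As context for how fast matrix-multiplication algorithms reduce operation counts, below is a sketch of Strassen's exact algorithm, which shares the recursive block structure that APA algorithms also use. The paper's APA schemes substitute base-case formulas that use even fewer multiplications at the cost of a small, controllable error; those formulas are not reproduced here.

```python
# Strassen recursion: 7 recursive products per level instead of 8.
# Shown only to illustrate the recursive structure of fast matmul.

import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices whose dimension is a power of two."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                      # fall back to the classical algorithm
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```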
  4. We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6× for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU. 
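For reference, the per-matrix numerics of a batched Cholesky factorization look like the NumPy sketch below. The paper's contribution lies in how each small factorization is mapped to GPU thread blocks and registers with optimal memory traffic and tunable concurrency; this sketch only shows the arithmetic each batch entry performs.

```python
# Unblocked Cholesky applied independently to every small SPD matrix in a
# batch. On a GPU, the outer loop corresponds to the batch grid and each
# matrix would live in registers or shared memory.

import numpy as np

def batched_cholesky(batch):
    """batch: array of shape (num_matrices, n, n), each SPD. Returns lower factors."""
    out = np.zeros_like(batch)
    for b, A in enumerate(batch):
        n = A.shape[0]
        L = np.zeros_like(A)
        for j in range(n):                 # unblocked factorization, column by column
            L[j, j] = np.sqrt(A[j, j] - L[j, :j] @ L[j, :j])
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        out[b] = L
    return out

# Usage: factor 1000 random 8x8 SPD matrices and check one of them
G = np.random.rand(1000, 8, 8)
spd = G @ G.transpose(0, 2, 1) + 8 * np.eye(8)
L = batched_cholesky(spd)
print(np.allclose(L[0] @ L[0].T, spd[0]))
```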
  5. We consider the problem of finding lower bounds on the I/O complexity of arbitrary computations in a two-level memory hierarchy. Executions of complex computations can be formalized as an evaluation order over the underlying computation graph. However, prior methods for finding I/O lower bounds leverage the graph structures of specific problems (e.g., matrix multiplication) and cannot be applied to arbitrary graphs. In this paper, we first present a novel method to bound the I/O of any computation graph using the first few eigenvalues of the graph's Laplacian. We further extend this bound to the parallel setting. This spectral bound is not only efficiently computable by power iteration, but can also be computed in closed form for graphs with known spectra. We apply our spectral method to compute closed-form analytical bounds on two computation graphs (the Bellman-Held-Karp algorithm for the traveling salesman problem and the Fast Fourier Transform), as well as provide a probabilistic bound for random Erdős–Rényi graphs. We empirically validate our bound on four computation graphs, and find that our method provides tighter bounds than current empirical methods and behaves similarly to previously published I/O bounds.
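The spectral ingredient can be sketched as follows: build the Laplacian of the computation graph and extract a few extreme eigenvalues by power iteration with deflation. The abstract does not state whether the bound uses the largest or the smallest nontrivial eigenvalues, so the sketch simply computes the largest few; the expression that converts eigenvalues into an I/O lower bound is the paper's and is not reproduced here.

```python
# Top-k Laplacian eigenvalues by power iteration with deflation.
# The graph and the number of eigenvalues are illustrative choices.

import numpy as np

def laplacian(adj):
    """Combinatorial Laplacian L = D - A of an undirected graph."""
    return np.diag(adj.sum(axis=1)) - adj

def top_eigenvalues(L, k=3, iters=1000, seed=0):
    """Largest-k eigenvalues via power iteration, deflating each one found."""
    rng = np.random.default_rng(seed)
    M = L.astype(float)
    eigs = []
    for _ in range(k):
        v = rng.standard_normal(M.shape[0])
        for _ in range(iters):
            w = M @ v
            v = w / np.linalg.norm(w)
        lam = v @ M @ v
        eigs.append(lam)
        M = M - lam * np.outer(v, v)   # deflate so the next run finds the next eigenvalue
    return eigs

# Usage: a 4-node path graph as a tiny stand-in for a computation graph
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
print(top_eigenvalues(laplacian(adj)))
```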