skip to main content

Title: NeuroVectorizer: end-to-end vectorization with deep reinforcement learning
One of the key challenges arising when compilers vectorize loops for today’s SIMD-compatible architectures is to decide if vectorization or interleaving is beneficial. Then, the compiler has to determine the number of instructions to pack together and the interleaving level (stride). Compilers are designed today to use fixed-cost models that are based on heuristics to make vectorization decisions on loops. However, these models are unable to capture the data dependency, the computation graph, or the organization of instructions. Alternatively, software engineers often hand-write the vectorization factors of every loop. This, however, places a huge burden on them, since it requires prior experience and significantly increases the development time. In this work, we explore a novel approach for handling loop vectorization and propose an end-to-end solution using deep reinforcement learning (RL). We conjecture that deep RL can capture different instructions, dependencies, and data structures to enable learning a sophisticated model that can better predict the actual performance cost and determine the optimal vectorization factors. We develop an end-to-end framework, from code to vectorization, that integrates deep RL in the LLVM compiler. Our proposed framework takes benchmark codes as input and extracts the loop codes. These loop codes are then fed to a loop embedding generator that learns an embedding for these loops. Finally, the learned embeddings are used as input to a Deep RL agent, which dynamically determines the vectorization factors for all the loops. We further extend our framework to support random search, decision trees, supervised neural networks, and nearest-neighbor search. We evaluate our approaches against the currently used LLVM vectorizer and loop polyhedral optimization techniques. Our experiments show 1.29×−4.73× performance speedup compared to baseline and only 3% worse than the brute-force search on a wide range of benchmarks.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
CGO 2020: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization
Page Range / eLocation ID:
242 to 255
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Because of the increasing demand for intensive computation in deep neural networks, researchers have developed both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to benefit from mixed precision without hardware specialization because of the overhead of data casting. Recently, hardware vendors offer tensorized instructions specialized for mixed-precision tensor operations, such as Intel VNNI, Nvidia Tensor Core, and ARM DOT. These instructions involve a new computing idiom, which reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this emerging idiom makes it hard to utilize these instructions. In practice, one approach is to use vendor-provided libraries for computationally-intensive kernels, but this is inflexible and prevents further optimizations. Another approach is to manually write hardware intrinsics, which is error-prone and difficult for programmers. Some prior works tried to address this problem by creating compilers for each instruction. This requires excessive efforts when it comes to many tensorized instructions. In this work, we develop a compiler framework, UNIT, to unify the compilation for tensorized instructions. The key to this approach is a unified semantics abstraction which makes the integration of new instructions easy, and the reuse of the analysis and transformations possible. Tensorized instructions from different platforms can be compiled via UNIT with moderate effort for favorable performance. Given a tensorized instruction and a tensor operation, UNIT automatically detects the applicability of the instruction, transforms the loop organization of the operation, and rewrites the loop body to take advantage of the tensorized instruction. According to our evaluation, UNIT is able to target various mainstream hardware platforms. The generated end-to-end inference model achieves 1.3 x speedup over Intel oneDNN on an x86 CPU, 1.75x speedup over Nvidia cuDNN on an Nvidia GPU, and 1.13x speedup over a carefully tuned TVM solution for ARM DOT on an ARM CPU. 
    more » « less
  2. This paper presents Giallar, a fully-automated verification toolkit for quantum compilers. Giallar requires no manual specifications, invariants, or proofs, and can automatically verify that a compiler pass preserves the semantics of quantum circuits. To deal with unbounded loops in quantum compilers, Giallar abstracts three loop templates, whose loop invariants can be automatically inferred. To efficiently check the equivalence of arbitrary input and output circuits that have complicated matrix semantics representation, Giallar introduces a symbolic representation for quantum circuits and a set of rewrite rules for showing the equivalence of symbolic quantum circuits. With Giallar, we implemented and verified 44 (out of 56) compiler passes in 13 versions of the Qiskit compiler, the open-source quantum compiler standard, during which three bugs were detected in and confirmed by Qiskit. Our evaluation shows that most of Qiskit compiler passes can be automatically verified in seconds and verification imposes only a modest overhead to compilation performance. 
    more » « less
  3. Existing approaches for autonomous control of pan-tilt-zoom (PTZ) cameras use multiple stages where object detection and localization are performed separately from the control of the PTZ mechanisms. These approaches require manual labels and suffer from performance bottlenecks due to error propagation across the multi-stage flow of information. The large size of object detection neural networks also makes prior solutions infeasible for real-time deployment in resource-constrained devices. We present an end-to-end deep reinforcement learning (RL) solution called Eagle1 to train a neural network policy that directly takes images as input to control the PTZ camera. Training reinforcement learning is cumbersome in the real world due to labeling effort, runtime environment stochasticity, and fragile experimental setups. We introduce a photo-realistic simulation framework for training and evaluation of PTZ camera control policies. Eagle achieves superior camera control performance by maintaining the object of interest close to the center of captured images at high resolution and has up to 17% more tracking duration than the state-of-the-art. Eagle policies are lightweight (90x fewer parameters than Yolo5s) and can run on embedded camera platforms such as Raspberry PI (33 FPS) and Jetson Nano (38 FPS), facilitating real-time PTZ tracking for resource-constrained environments. With domain randomization, Eagle policies trained in our simulator can be transferred directly to real-world scenarios2. 
    more » « less
  4. With benefits of fast query speed and low storage cost, hashing-based image retrieval approaches have garnered considerable attention from the research community. In this paper, we propose a novel Error-Corrected Deep Cross Modal Hashing (CMH-ECC) method which uses a bitmap specifying the presence of certain facial attributes as an input query to retrieve relevant face images from the database. In this architecture, we generate compact hash codes using an end-to-end deep learning module, which effectively captures the inherent relationships between the face and attribute modality. We also integrate our deep learning module with forward error correction codes to further reduce the distance between different modalities of the same subject. Specifically, the properties of deep hashing and forward error correction codes are exploited to design a cross modal hashing framework with high retrieval performance. Experimental results using two standard datasets with facial attributes-image modalities indicate that our CMH-ECC face image retrieval model outperforms most of the current attribute-based face image retrieval approaches. 
    more » « less
  5. Convolutional neural networks (CNNs) are becoming increasingly deeper, wider, and non-linear because of the growing demand on prediction accuracy and analysis quality. The wide and deep CNNs, however, require a large amount of computing resources and processing time. Many previous works have studied model pruning to improve inference performance, but little work has been done for effectively reducing training cost. In this paper, we propose ClickTrain: an efficient and accurate end-to-end training and pruning framework for CNNs. Different from the existing pruning-during-training work, ClickTrain provides higher model accuracy and compression ratio via fine-grained architecture-preserving pruning. By leveraging pattern-based pruning with our proposed novel accurate weight importance estimation, dynamic pattern generation and selection, and compiler-assisted computation optimizations, ClickTrain generates highly accurate and fast pruned CNN models for direct deployment without any extra time overhead, compared with the baseline training. ClickTrain also reduces the end-to-end time cost of the pruning-after-training method by up to 2.3X with comparable accuracy and compression ratio. Moreover, compared with the state-of-the-art pruning-during-training approach, ClickTrain provides significant improvements both accuracy and compression ratio on the tested CNN models and datasets, under similar limited training time. 
    more » « less