Title: Supporting Very Large Models using Automatic Dataflow Graph Partitioning
This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators, as used by platforms such as MXNet and TensorFlow. To partition each operator automatically, we propose describing an operator's semantics in a simple language inspired by Halide. To optimally partition the different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models, and achieves a 25%–400% speedup over alternative approaches to training very large models.
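To make the cost-minimizing search concrete, here is a deliberately simplified sketch: choosing one partition strategy per operator on a chain-shaped dataflow graph so that total re-partitioning traffic between consecutive operators is minimized. The strategy names, the cost table, and the dynamic program are illustrative assumptions; Tofu's actual algorithm handles general DAGs and recursively partitions across groups of workers.

```python
# Minimal sketch, NOT Tofu's implementation: pick per-operator partition
# strategies on a chain of operators to minimize communication cost.
from functools import lru_cache

# Hypothetical candidate strategies (e.g. split a tensor along rows or cols).
STRATEGIES = ("row", "col")

# COMM_COST[(s_prev, s_cur)]: made-up cost of re-partitioning a tensor that
# was produced under s_prev so the next operator can consume it under s_cur.
COMM_COST = {
    ("row", "row"): 0, ("col", "col"): 0,
    ("row", "col"): 4, ("col", "row"): 4,  # all-to-all shuffle (toy number)
}

def min_total_cost(num_ops: int) -> int:
    """Minimum total communication cost over all strategy assignments."""
    @lru_cache(maxsize=None)
    def best(op: int, prev: str) -> int:
        if op == num_ops:
            return 0
        return min(COMM_COST[(prev, s)] + best(op + 1, s) for s in STRATEGIES)
    return min(best(1, s) for s in STRATEGIES)

print(min_total_cost(5))  # -> 0: keeping one split end-to-end needs no shuffle
```

On real models the search space is much richer (several splittable dimensions per operator, and recursion over worker groups), but the structure of minimizing communication over strategy choices is the same.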
Award ID(s):
1816717
PAR ID:
10311676
Author(s) / Creator(s):
Date Published:
Journal Name:
EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. It is known that deeper and wider neural networks can achieve better accuracy, but limited GPU memory makes it difficult to continue increasing model size. One promising solution is to support swapping between GPU and CPU memory. However, existing work on swapping handles only certain models and does not achieve satisfactory performance. Deep learning computation is commonly expressed as a dataflow graph, which can be analyzed to improve swapping. We propose SwapAdvisor, which performs joint optimization along three dimensions based on a given dataflow graph: operator scheduling, memory allocation, and swap decisions. SwapAdvisor explores the vast search space using a custom-designed genetic algorithm. Evaluations using a variety of large models show that SwapAdvisor can train models up to 12 times the GPU memory limit while achieving 53–99% of the throughput of a hypothetical baseline with infinite GPU memory.
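As a rough illustration of the genetic-algorithm search described above, the sketch below evolves an operator ordering against a made-up peak-memory objective. The chromosome encoding, fitness function, and parameters are hypothetical stand-ins; SwapAdvisor's real chromosomes jointly encode scheduling, memory allocation, and swap decisions.

```python
# Toy genetic algorithm over operator schedules; all numbers are made up.
import random

random.seed(0)
NUM_OPS = 8
MEM = [random.randint(1, 10) for _ in range(NUM_OPS)]  # per-op working set

def peak_memory(order):
    # Toy fitness: pretend adjacent operators' working sets must coexist,
    # so we minimize the worst adjacent pair (a stand-in for real peak memory).
    return max(MEM[a] + MEM[b] for a, b in zip(order, order[1:]))

def crossover(p1, p2):
    # Order crossover (OX): keep a slice of p1, fill the rest in p2's order;
    # this always yields a valid permutation of operators.
    i, j = sorted(random.sample(range(NUM_OPS), 2))
    hole = set(p1[i:j])
    rest = [g for g in p2 if g not in hole]
    return rest[:i] + p1[i:j] + rest[i:]

def mutate(order):
    i, j = random.sample(range(NUM_OPS), 2)
    order[i], order[j] = order[j], order[i]

pop = [random.sample(range(NUM_OPS), NUM_OPS) for _ in range(30)]
for _ in range(100):  # generations
    pop.sort(key=peak_memory)
    survivors = pop[:10]  # elitist selection
    children = [crossover(random.choice(survivors), random.choice(survivors))
                for _ in range(20)]
    for child in children:
        if random.random() < 0.2:
            mutate(child)
    pop = survivors + children
pop.sort(key=peak_memory)
print(pop[0], peak_memory(pop[0]))  # best schedule found and its cost
```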
  2. Existing GPU graph analytics frameworks are typically built from specialized, bottom-up implementations of graph operators that are customized to graph computation. In this work we describe Mini-Gunrock, a lightweight graph analytics framework on the GPU. Unlike existing frameworks, Mini-Gunrock is built from graph operators implemented with generic transform-based data-parallel primitives. Using this method to bridge the gap between programmability and high performance for GPU graph analytics, we demonstrate operator performance on scale-free graphs with an average 1.5x speedup over Gunrock's corresponding operators. Mini-Gunrock's graph operators, optimizations, and application code are 10x smaller than Gunrock's, with comparable overall performance.
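To give a flavor of building a graph operator from generic bulk primitives rather than a bespoke kernel, here is a small sketch of a frontier-expanding advance step expressed with only vectorized NumPy primitives (repeat, cumsum, arange) over a CSR graph. The graph, names, and the use of NumPy as a stand-in for GPU transforms are illustrative assumptions, not Mini-Gunrock's API.

```python
# Sketch: a graph "advance" operator composed from bulk data-parallel
# primitives instead of a hand-written gather kernel.
import numpy as np

# Toy directed graph in CSR form: vertex v's out-edges are
# col_indices[row_offsets[v]:row_offsets[v + 1]].
row_offsets = np.array([0, 2, 4, 5, 6])
col_indices = np.array([1, 2, 2, 3, 3, 0])

def advance(frontier: np.ndarray) -> np.ndarray:
    """Expand a vertex frontier to its neighbors using bulk primitives only."""
    starts = row_offsets[frontier]
    degrees = row_offsets[frontier + 1] - starts
    # Map each output slot to its source edge with repeat/cumsum/arange,
    # the kind of load-balanced gather a transform primitive performs.
    local = np.arange(degrees.sum()) - np.repeat(np.cumsum(degrees) - degrees,
                                                 degrees)
    edge_ids = np.repeat(starts, degrees) + local
    return np.unique(col_indices[edge_ids])

print(advance(np.array([0, 2])))  # neighbors of vertices 0 and 2 -> [1 2 3]
```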
  3. Neural-network-enabled data analysis in real-time scientific applications imposes stringent requirements on inference latency. Meanwhile, recent deep learning (DL) model designs tend to replace a single branch with multiple branches for higher prediction accuracy and robustness, which makes inter-operator parallelization an effective approach to improving inference latency. However, existing inter-operator parallelization techniques for inference acceleration focus mainly on utilization optimization within a single GPU. As the data size of an input sample and the scale of DL models keep growing, the limited resources of a single GPU are insufficient to support the parallel execution of large operators. To break this limitation, we study hybrid inter-operator parallelism both across multiple GPUs and within each GPU. In this paper, we design and implement a hierarchical inter-operator scheduler (HIOS) to automatically distribute large operators onto different GPUs and group small operators on the same GPU for parallel execution. In particular, we propose a novel scheduling algorithm, named HIOS-LP, which consists of inter-GPU operator parallelization through iterative longest-path (LP) mapping and intra-GPU operator parallelization based on a sliding window. In addition to extensive simulation results, experiments with modern convolutional neural network benchmarks demonstrate that HIOS-LP outperforms the state-of-the-art inter-operator scheduling algorithm IOS by up to 17% in real systems.
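The iterative longest-path idea can be sketched as repeatedly peeling the heaviest remaining path off the operator DAG and pinning it to one GPU, so the critical chain avoids cross-device hops. The graph, weights, two-GPU round-robin, and all names below are illustrative assumptions, not HIOS's implementation.

```python
# Toy "iterative longest-path mapping" over a weighted operator DAG.
def topo_order(succ, alive):
    """Depth-first topological order restricted to still-unplaced nodes."""
    seen, order = set(), []
    def dfs(v):
        seen.add(v)
        for u in succ[v]:
            if u in alive and u not in seen:
                dfs(u)
        order.append(v)
    for v in alive:
        if v not in seen:
            dfs(v)
    return order[::-1]

def longest_path(succ, weight, alive):
    """Heaviest path in the remaining DAG, by DP over reverse topo order."""
    order = topo_order(succ, alive)
    dist = {v: weight[v] for v in alive}
    nxt = {v: None for v in alive}
    for v in reversed(order):  # successors are settled before v
        for u in succ[v]:
            if u in alive and weight[v] + dist[u] > dist[v]:
                dist[v], nxt[v] = weight[v] + dist[u], u
    v, path = max(alive, key=lambda x: dist[x]), []
    while v is not None:
        path.append(v)
        v = nxt[v]
    return path

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}   # toy operator DAG
weight = {"a": 3, "b": 5, "c": 1, "d": 2}                   # toy op latencies
alive, placement, gpu = set(succ), {}, 0
while alive:  # peel the critical path, pin it to one device, repeat
    path = longest_path(succ, weight, alive)
    for v in path:
        placement[v] = gpu
    alive -= set(path)
    gpu = (gpu + 1) % 2  # pretend there are two GPUs
print(placement)  # {'a': 0, 'b': 0, 'd': 0, 'c': 1}
```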
  4. Coarse-grained reconfigurable arrays (CGRAs) have gained attention in recent years due to their promising power efficiency compared to traditional von Neumann architectures. To program these architectures using ordinary languages such as C, a dataflow compiler must transform the original sequential, imperative program into an equivalent dataflow graph, composed of dataflow operators running in parallel. This transformation is challenging since the asynchronous nature of dataflow graphs allows out-of-order execution of operators, leading to behaviors not present in the original imperative programs. We address this challenge by developing a translation validation technique for dataflow compilers to ensure that the dataflow program has the same behavior as the original imperative program on all possible inputs and schedules of execution. We apply this method to a state-of-the-art dataflow compiler targeting the RipTide CGRA architecture. Our tool uncovers 8 compiler bugs where the compiler outputs incorrect dataflow graphs, including a data race that is otherwise hard to discover via testing. After repairing these bugs, our tool verifies the correct compilation of all programs in the RipTide benchmark suite. 
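As a toy version of the translation-validation idea (checking that a dataflow graph agrees with the imperative original on all inputs and schedules), the sketch below runs every dependency-respecting schedule of a tiny hand-built graph against its sequential counterpart on sampled inputs. The real tool reasons about all inputs symbolically; everything here, including the example program, is an illustrative assumption.

```python
# Toy translation-validation check: sequential program vs. every legal
# schedule of a hand-built dataflow graph, on sampled inputs.
import itertools, random

def imperative(x, y):
    a = x + y
    b = a * 2
    c = x - y
    return b + c

# Dataflow graph: node -> (compute function, graph-node dependencies).
# Inputs x and y live in the environment from the start, so they are
# not listed as dependencies.
graph = {
    "a": (lambda env: env["x"] + env["y"], ()),
    "b": (lambda env: env["a"] * 2, ("a",)),
    "c": (lambda env: env["x"] - env["y"], ()),
    "out": (lambda env: env["b"] + env["c"], ("b", "c")),
}

def legal(perm):
    """A schedule is legal if every node runs after its dependencies."""
    done = set()
    for node in perm:
        if not set(graph[node][1]) <= done:
            return False
        done.add(node)
    return True

random.seed(1)
for _ in range(25):
    x, y = random.randint(-9, 9), random.randint(-9, 9)
    want = imperative(x, y)
    for perm in filter(legal, itertools.permutations(graph)):
        env = {"x": x, "y": y}
        for node in perm:
            env[node] = graph[node][0](env)
        assert env["out"] == want, (perm, x, y)
print("sequential program and all dataflow schedules agree")
```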
  5. GPUs have become ubiquitous in the cloud due to the dramatic performance gains they enable in domains such as machine learning and computer vision. However, offloading GPU computation to the cloud requires placing enormous trust in providers and administrators. Recent proposals for GPU trusted execution environments (TEEs) are promising but fail to address very real side-channel concerns. To illustrate the severity of the problem, we demonstrate a novel attack that enables an attacker to correctly classify images from ImageNet by observing only the timing of GPU kernel execution, rather than the images themselves. We present Telekine, which enables applications to use GPU acceleration in the cloud securely, based on a novel GPU stream abstraction that ensures execution and interaction through untrusted components are independent of any secret data. Given a GPU with support for a TEE, Telekine employs a novel variant of API remoting to partition application-level software into components to ensure secret-dependent behaviors occur only on trusted components. Telekine can securely train modern image recognition models on MXNet with a 10%–22% performance penalty relative to an insecure baseline with a locally attached GPU. It runs graph algorithms using Galois on one and two GPUs with 18%–41% overhead.
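The heart of such a stream abstraction, making the untrusted side's observations independent of secrets, can be sketched as a pump that emits fixed-size commands at a fixed rate and pads idle slots with no-ops. All names, sizes, and rates below are hypothetical; the real system applies this discipline to GPU command streams via API remoting.

```python
# Sketch of a data-oblivious command stream: constant rate, constant size.
import queue, threading, time

work = queue.Queue()
NOOP = b"\x00" * 64  # fixed-size padding command

def pump(transport, ticks=20, period=0.01):
    """Drain `work` on a fixed schedule, padding idle ticks with no-ops,
    so message timing and size reveal nothing about application activity."""
    for _ in range(ticks):
        time.sleep(period)  # fixed tick, independent of any secret data
        try:
            cmd = work.get_nowait()
        except queue.Empty:
            cmd = NOOP
        transport(cmd.ljust(64, b"\x00")[:64])  # fixed-size messages

sent = []
t = threading.Thread(target=pump, args=(sent.append,))
t.start()
for i in range(5):  # the application enqueues real commands sporadically
    work.put(b"kernel-%d" % i)
    time.sleep(0.025)
t.join()
print(len(sent), "messages, sizes:", {len(m) for m in sent})  # 20 messages, {64}
```

The deliberate trade here is throughput for obliviousness: the transport always carries the same traffic pattern, which is consistent with the moderate overheads the abstract reports.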