Title: Supporting Very Large Models using Automatic Dataflow Graph Partitioning
This paper presents Tofu, a system that partitions very large DNN models across multiple GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow graph of fine-grained tensor operators, as used by platforms such as MXNet and TensorFlow. To partition each operator automatically, we propose describing an operator's semantics in a simple language inspired by Halide. To optimally partition the different operators in a dataflow graph, Tofu uses a recursive search algorithm that minimizes the total communication cost. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models, and achieves a 25%–400% speedup over alternative approaches to training very large models.
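To make the cost-minimizing search concrete, here is a deliberately simplified sketch: choosing one partition strategy per operator on a chain-shaped dataflow graph so that total re-partitioning traffic between consecutive operators is minimized. The strategy names, the cost table, and the dynamic program are illustrative assumptions; Tofu's actual algorithm handles general DAGs and recursively partitions across groups of workers.

```python
# Minimal sketch, NOT Tofu's implementation: pick per-operator partition
# strategies on a chain of operators to minimize communication cost.
from functools import lru_cache

# Hypothetical candidate strategies (e.g. split a tensor along rows or cols).
STRATEGIES = ("row", "col")

# COMM_COST[(s_prev, s_cur)]: made-up cost of re-partitioning a tensor that
# was produced under s_prev so the next operator can consume it under s_cur.
COMM_COST = {
    ("row", "row"): 0, ("col", "col"): 0,
    ("row", "col"): 4, ("col", "row"): 4,  # all-to-all shuffle (toy number)
}

def min_total_cost(num_ops: int) -> int:
    """Minimum total communication cost over all strategy assignments."""
    @lru_cache(maxsize=None)
    def best(op: int, prev: str) -> int:
        if op == num_ops:
            return 0
        return min(COMM_COST[(prev, s)] + best(op + 1, s) for s in STRATEGIES)
    return min(best(1, s) for s in STRATEGIES)

print(min_total_cost(5))  # -> 0: keeping one split end-to-end needs no shuffle
```

On real models the search space is much richer (several splittable dimensions per operator, and recursion over worker groups), but the structure of minimizing communication over strategy choices is the same.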
Award ID(s):
1816717
PAR ID:
10311676
Author(s) / Creator(s):
Date Published:
Journal Name:
EuroSys '19: Proceedings of the Fourteenth EuroSys Conference 2019
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. It is known that deeper and wider neural networks can achieve better accuracy, but limited GPU memory makes it difficult to continue increasing model size. One promising solution is to support swapping between GPU and CPU memory. However, existing work on swapping handles only certain models and does not achieve satisfactory performance. Deep learning computation is commonly expressed as a dataflow graph, which can be analyzed to improve swapping. We propose SwapAdvisor, which performs joint optimization along three dimensions based on a given dataflow graph: operator scheduling, memory allocation, and swap decisions. SwapAdvisor explores the vast search space using a custom-designed genetic algorithm. Evaluations using a variety of large models show that SwapAdvisor can train models up to 12 times the GPU memory limit while achieving 53–99% of the throughput of a hypothetical baseline with infinite GPU memory.
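As a rough illustration of the genetic-algorithm search described above, the sketch below evolves an operator ordering against a made-up peak-memory objective. The chromosome encoding, fitness function, and parameters are hypothetical stand-ins; SwapAdvisor's real chromosomes jointly encode scheduling, memory allocation, and swap decisions.

```python
# Toy genetic algorithm over operator schedules; all numbers are made up.
import random

random.seed(0)
NUM_OPS = 8
MEM = [random.randint(1, 10) for _ in range(NUM_OPS)]  # per-op working set

def peak_memory(order):
    # Toy fitness: pretend adjacent operators' working sets must coexist,
    # so we minimize the worst adjacent pair (a stand-in for real peak memory).
    return max(MEM[a] + MEM[b] for a, b in zip(order, order[1:]))

def crossover(p1, p2):
    # Order crossover (OX): keep a slice of p1, fill the rest in p2's order;
    # this always yields a valid permutation of operators.
    i, j = sorted(random.sample(range(NUM_OPS), 2))
    hole = set(p1[i:j])
    rest = [g for g in p2 if g not in hole]
    return rest[:i] + p1[i:j] + rest[i:]

def mutate(order):
    i, j = random.sample(range(NUM_OPS), 2)
    order[i], order[j] = order[j], order[i]

pop = [random.sample(range(NUM_OPS), NUM_OPS) for _ in range(30)]
for _ in range(100):  # generations
    pop.sort(key=peak_memory)
    survivors = pop[:10]  # elitist selection
    children = [crossover(random.choice(survivors), random.choice(survivors))
                for _ in range(20)]
    for child in children:
        if random.random() < 0.2:
            mutate(child)
    pop = survivors + children
pop.sort(key=peak_memory)
print(pop[0], peak_memory(pop[0]))  # best schedule found and its cost
```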
  2. Existing GPU graph analytics frameworks are typically built from specialized, bottom-up implementations of graph operators that are customized to graph computation. In this work we describe Mini-Gunrock, a lightweight graph analytics framework on the GPU. Unlike existing frameworks, Mini-Gunrock is built from graph operators implemented with generic transform-based data-parallel primitives. Using this method to bridge the gap between programmability and high performance for GPU graph analytics, we demonstrate operator performance on scale-free graphs with an average 1.5x speedup over Gunrock's corresponding operators. Mini-Gunrock's graph operators, optimizations, and application code are 10x smaller than Gunrock's, with comparable overall performance.
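To give a flavor of building a graph operator from generic bulk primitives rather than a bespoke kernel, here is a small sketch of a frontier-expanding advance step expressed with only vectorized NumPy primitives (repeat, cumsum, arange) over a CSR graph. The graph, names, and the use of NumPy as a stand-in for GPU transforms are illustrative assumptions, not Mini-Gunrock's API.

```python
# Sketch: a graph "advance" operator composed from bulk data-parallel
# primitives instead of a hand-written gather kernel.
import numpy as np

# Toy directed graph in CSR form: vertex v's out-edges are
# col_indices[row_offsets[v]:row_offsets[v + 1]].
row_offsets = np.array([0, 2, 4, 5, 6])
col_indices = np.array([1, 2, 2, 3, 3, 0])

def advance(frontier: np.ndarray) -> np.ndarray:
    """Expand a vertex frontier to its neighbors using bulk primitives only."""
    starts = row_offsets[frontier]
    degrees = row_offsets[frontier + 1] - starts
    # Map each output slot to its source edge with repeat/cumsum/arange,
    # the kind of load-balanced gather a transform primitive performs.
    local = np.arange(degrees.sum()) - np.repeat(np.cumsum(degrees) - degrees,
                                                 degrees)
    edge_ids = np.repeat(starts, degrees) + local
    return np.unique(col_indices[edge_ids])

print(advance(np.array([0, 2])))  # neighbors of vertices 0 and 2 -> [1 2 3]
```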
  3. Neural-network-enabled data analysis in real-time scientific applications imposes stringent requirements on inference latency. Meanwhile, recent deep learning (DL) model designs tend to replace a single branch with multiple branches for higher prediction accuracy and robustness, which makes inter-operator parallelization an effective approach to improving inference latency. However, existing inter-operator parallelization techniques for inference acceleration focus mainly on utilization optimization within a single GPU. As the data size of an input sample and the scale of DL models keep growing, the limited resources of a single GPU are insufficient to support the parallel execution of large operators. To break this limitation, we study hybrid inter-operator parallelism both across multiple GPUs and within each GPU. In this paper, we design and implement a hierarchical inter-operator scheduler (HIOS) to automatically distribute large operators onto different GPUs and group small operators on the same GPU for parallel execution. In particular, we propose a novel scheduling algorithm, named HIOS-LP, which consists of inter-GPU operator parallelization through iterative longest-path (LP) mapping and intra-GPU operator parallelization based on a sliding window. In addition to extensive simulation results, experiments with modern convolutional neural network benchmarks demonstrate that HIOS-LP outperforms the state-of-the-art inter-operator scheduling algorithm IOS by up to 17% in real systems.
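The iterative longest-path idea can be sketched as repeatedly peeling the heaviest remaining path off the operator DAG and pinning it to one GPU, so the critical chain avoids cross-device hops. The graph, weights, two-GPU round-robin, and all names below are illustrative assumptions, not HIOS's implementation.

```python
# Toy "iterative longest-path mapping" over a weighted operator DAG.
def topo_order(succ, alive):
    """Depth-first topological order restricted to still-unplaced nodes."""
    seen, order = set(), []
    def dfs(v):
        seen.add(v)
        for u in succ[v]:
            if u in alive and u not in seen:
                dfs(u)
        order.append(v)
    for v in alive:
        if v not in seen:
            dfs(v)
    return order[::-1]

def longest_path(succ, weight, alive):
    """Heaviest path in the remaining DAG, by DP over reverse topo order."""
    order = topo_order(succ, alive)
    dist = {v: weight[v] for v in alive}
    nxt = {v: None for v in alive}
    for v in reversed(order):  # successors are settled before v
        for u in succ[v]:
            if u in alive and weight[v] + dist[u] > dist[v]:
                dist[v], nxt[v] = weight[v] + dist[u], u
    v, path = max(alive, key=lambda x: dist[x]), []
    while v is not None:
        path.append(v)
        v = nxt[v]
    return path

succ = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}   # toy operator DAG
weight = {"a": 3, "b": 5, "c": 1, "d": 2}                   # toy op latencies
alive, placement, gpu = set(succ), {}, 0
while alive:  # peel the critical path, pin it to one device, repeat
    path = longest_path(succ, weight, alive)
    for v in path:
        placement[v] = gpu
    alive -= set(path)
    gpu = (gpu + 1) % 2  # pretend there are two GPUs
print(placement)  # {'a': 0, 'b': 0, 'd': 0, 'c': 1}
```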
  4. Coarse-grained reconfigurable arrays (CGRAs) have gained attention in recent years due to their promising power efficiency compared to traditional von Neumann architectures. To program these architectures using ordinary languages such as C, a dataflow compiler must transform the original sequential, imperative program into an equivalent dataflow graph, composed of dataflow operators running in parallel. This transformation is challenging since the asynchronous nature of dataflow graphs allows out-of-order execution of operators, leading to behaviors not present in the original imperative programs. We address this challenge by developing a translation validation technique for dataflow compilers to ensure that the dataflow program has the same behavior as the original imperative program on all possible inputs and schedules of execution. We apply this method to a state-of-the-art dataflow compiler targeting the RipTide CGRA architecture. Our tool uncovers 8 compiler bugs where the compiler outputs incorrect dataflow graphs, including a data race that is otherwise hard to discover via testing. After repairing these bugs, our tool verifies the correct compilation of all programs in the RipTide benchmark suite. 
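As a toy version of the translation-validation idea (checking that a dataflow graph agrees with the imperative original on all inputs and schedules), the sketch below runs every dependency-respecting schedule of a tiny hand-built graph against its sequential counterpart on sampled inputs. The real tool reasons about all inputs symbolically; everything here, including the example program, is an illustrative assumption.

```python
# Toy translation-validation check: sequential program vs. every legal
# schedule of a hand-built dataflow graph, on sampled inputs.
import itertools, random

def imperative(x, y):
    a = x + y
    b = a * 2
    c = x - y
    return b + c

# Dataflow graph: node -> (compute function, graph-node dependencies).
# Inputs x and y live in the environment from the start, so they are
# not listed as dependencies.
graph = {
    "a": (lambda env: env["x"] + env["y"], ()),
    "b": (lambda env: env["a"] * 2, ("a",)),
    "c": (lambda env: env["x"] - env["y"], ()),
    "out": (lambda env: env["b"] + env["c"], ("b", "c")),
}

def legal(perm):
    """A schedule is legal if every node runs after its dependencies."""
    done = set()
    for node in perm:
        if not set(graph[node][1]) <= done:
            return False
        done.add(node)
    return True

random.seed(1)
for _ in range(25):
    x, y = random.randint(-9, 9), random.randint(-9, 9)
    want = imperative(x, y)
    for perm in filter(legal, itertools.permutations(graph)):
        env = {"x": x, "y": y}
        for node in perm:
            env[node] = graph[node][0](env)
        assert env["out"] == want, (perm, x, y)
print("sequential program and all dataflow schedules agree")
```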
  5. GPUs have become ubiquitous in the cloud due to the dramatic performance gains they enable in domains such as machine learning and computer vision. However, offloading GPU computation to the cloud requires placing enormous trust in providers and administrators. Recent proposals for GPU trusted execution environments (TEEs) are promising but fail to address very real side-channel concerns. To illustrate the severity of the problem, we demonstrate a novel attack that enables an attacker to correctly classify images from ImageNet by observing only the timing of GPU kernel execution, rather than the images themselves. We present Telekine, which enables applications to use GPU acceleration in the cloud securely, based on a novel GPU stream abstraction that ensures execution and interaction through untrusted components are independent of any secret data. Given a GPU with support for a TEE, Telekine employs a novel variant of API remoting to partition application-level software into components to ensure secret-dependent behaviors occur only on trusted components. Telekine can securely train modern image recognition models on MXNet with a 10%–22% performance penalty relative to an insecure baseline with a locally attached GPU. It runs graph algorithms using Galois on one and two GPUs with 18%–41% overhead.
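The heart of such a stream abstraction, making the untrusted side's observations independent of secrets, can be sketched as a pump that emits fixed-size commands at a fixed rate and pads idle slots with no-ops. All names, sizes, and rates below are hypothetical; the real system applies this discipline to GPU command streams via API remoting.

```python
# Sketch of a data-oblivious command stream: constant rate, constant size.
import queue, threading, time

work = queue.Queue()
NOOP = b"\x00" * 64  # fixed-size padding command

def pump(transport, ticks=20, period=0.01):
    """Drain `work` on a fixed schedule, padding idle ticks with no-ops,
    so message timing and size reveal nothing about application activity."""
    for _ in range(ticks):
        time.sleep(period)  # fixed tick, independent of any secret data
        try:
            cmd = work.get_nowait()
        except queue.Empty:
            cmd = NOOP
        transport(cmd.ljust(64, b"\x00")[:64])  # fixed-size messages

sent = []
t = threading.Thread(target=pump, args=(sent.append,))
t.start()
for i in range(5):  # the application enqueues real commands sporadically
    work.put(b"kernel-%d" % i)
    time.sleep(0.025)
t.join()
print(len(sent), "messages, sizes:", {len(m) for m in sent})  # 20 messages, {64}
```

The deliberate trade here is throughput for obliviousness: the transport always carries the same traffic pattern, which is consistent with the moderate overheads the abstract reports.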