skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, June 13 until 2:00 AM ET on Friday, June 14 due to maintenance. We apologize for the inconvenience.


Title: Beyond Data and Model Parallelism for Deep Neural Networks
Existing deep learning systems commonly parallelize deep neural network (DNN) training using data or model parallelism, but these strategies often result in suboptimal parallelization performance. We introduce SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operator, Attribute, and Parameter dimensions. We present FlexFlow, a deep learning engine that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine. To accelerate this search, FlexFlow introduces a novel execution simulator that can accurately predict a parallelization strategy’s performance and is three orders of magnitude faster than prior approaches that execute each strategy. We evaluate FlexFlow with six real-world DNN benchmarks on two GPU clusters and show that FlexFlow increases training throughput by up to 3.3× over state-of-the-art approaches, even when including its search time, and also improves scalability.  more » « less
Award ID(s):
1651570
NSF-PAR ID:
10129644
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
SysML 2019
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, the pace of innovations in the fields of machine learning (ML) has accelerated, researchers in SysML have created algorithms and systems that parallelize ML training over multiple devices or computational nodes. As ML models become more structurally complex, many systems have struggled to provide all-round performance on a variety of models. Particularly, ML scale-up is usually underestimated in terms of the amount of knowledge and time required to map from an appropriate distribution strategy to the model. Applying parallel training systems to complex models adds nontrivial development overheads in addition to model prototyping, and often results in lower-than-expected performance. This tutorial identifies research and practical pain points in parallel ML training, and discusses latest development of algorithms and systems on addressing these challenges in both usability and performance. In particular, this tutorial presents a new perspective of unifying seemingly different distributed ML training strategies. Based on it, introduces new techniques and system architectures to simplify and automate ML parallelization. This tutorial is built upon the authors' years' of research and industry experience, comprehensive literature survey, and several latest tutorials and papers published by the authors and peer researchers. The tutorial consists of four parts. The first part will present a landscape of distributed ML training techniques and systems, and highlight the major difficulties faced by real users when writing distributed ML code with big model or big data. The second part dives deep to explain the mainstream training strategies, guided with real use case. By developing a new and unified formulation to represent the seemingly different data- and model- parallel strategies, we describe a set of techniques and algorithms to achieve ML auto-parallelization, and compiler system architectures for auto-generating and exercising parallelization strategies based on models and clusters. The third part of this tutorial exposes a hidden layer of practical pain points in distributed ML training: hyper-parameter tuning and resource allocation, and introduces techniques to improve these aspects. The fourth part is designed as a hands-on coding session, in which we will walk through the audiences on writing distributed training programs in Python, using the various distributed ML tools and interfaces provided by the Ray ecosystem. 
    more » « less
  2. Deep neural network (DNN) accelerators as an example of domain-specific architecture have demonstrated great success in DNN inference. However, the architecture acceleration for equally important DNN training has not yet been fully studied. With data forward, error backward and gradient calculation, DNN training is a more complicated process with higher computation and communication intensity. Because the recent research demonstrates a diminishing specialization return, namely, “accelerator wall”, we believe that a promising approach is to explore coarse-grained parallelism among multiple performance-bounded accelerators to support DNN training. Distributing computations on multiple heterogeneous accelerators to achieve high throughput and balanced execution, however, remaining challenging. We present ACCPAR, a principled and systematic method of determining the tensor partition among heterogeneous accelerator arrays. Compared to prior empirical or unsystematic methods, ACCPAR considers the complete tensor partition space and can reveal previously unknown new parallelism configurations. ACCPAR optimizes the performance based on a cost model that takes into account both computation and communication costs of a heterogeneous execution environment. Hence, our method can avoid the drawbacks of existing approaches that use communication as a proxy of the performance. The enhanced flexibility of tensor partitioning in ACCPAR allows the flexible ratio of computations to be distributed among accelerators with different performances. The proposed search algorithm is also applicable to the emerging multi-path patterns in modern DNNs such as ResNet. We simulate ACCPAR on a heterogeneous accelerator array composed of both TPU-v2 and TPU-v3 accelerators for the training of large-scale DNN models such as Alexnet, Vgg series and Resnet series. The average performance improvements of the state-of-the-art “one weird trick” (OWT) and HYPAR, and ACCPAR, normalized to the baseline data parallelism scheme where each accelerator replicates the model and processes different input data in parallel, are 2.98×, 3.78×, and 6.30×, respectively. 
    more » « less
  3. Neural architecture search (NAS) is a promising technique to design efficient and high-performance deep neural networks (DNNs). As the performance requirements of ML applications grow continuously, the hardware accelerators start playing a central role in DNN design. This trend makes NAS even more complicated and time-consuming for most real applications. This paper proposes FLASH, a very fast NAS methodology that co-optimizes the DNN accuracy and performance on a real hardware platform. As the main theoretical contribution, we first propose the NN-Degree, an analytical metric to quantify the topological characteristics of DNNs with skip connections (e.g., DenseNets, ResNets, Wide-ResNets, and MobileNets). The newly proposed NN-Degree allows us to do training-free NAS within one second and build an accuracy predictor by training as few as 25 samples out of a vast search space with more than 63 billion configurations. Second, by performing inference on the target hardware, we fine-tune and validate our analytical models to estimate the latency, area, and energy consumption of various DNN architectures while executing standard ML datasets. Third, we construct a hierarchical algorithm based on simplicial homology global optimization (SHGO) to optimize the model-architecture co-design process, while considering the area, latency, and energy consumption of the target hardware. We demonstrate that, compared to the state-of-the-art NAS approaches, our proposed hierarchical SHGO-based algorithm enables more than four orders of magnitude speedup (specifically, the execution time of the proposed algorithm is about 0.1 seconds). Finally, our experimental evaluations show that FLASH is easily transferable to different hardware architectures, thus enabling us to do NAS on a Raspberry Pi-3B processor in less than 3 seconds. 
    more » « less
  4. Traditional linear subspace-based reduced order models (LS-ROMs) can be used to significantly accelerate simulations in which the solution space of the discretized system has a small dimension (with a fast decaying Kolmogorov 𝑛-width). However, LS-ROMs struggle to achieve speed-ups in problems whose solution space has a large dimension, such as highly nonlinear problems whose solutions have large gradients. Such an issue can be alleviated by combining nonlinear model reduction with operator learning. Over the past decade, many nonlinear manifold-based reduced order models (NM-ROM) have been proposed. In particular, NM-ROMs based on deep neural networks (DNN) have received increasing interest. This work takes inspiration from adaptive basis methods and specifically focuses on developing an NM-ROM based on Convolutional Neural Network-based autoencoders (CNNAE) with iteration-dependent trainable kernels. Additionally, we investigate DNN-based and quadratic operator inference strategies between latent spaces. A strategy to perform vectorized implicit time integration is also proposed. We demonstrate that the proposed CNN-based NM-ROM, combined with DNN- based operator inference, generally performs better than commonly employed strategies (in terms of prediction accuracy) on a benchmark advection-dominated problem. The method also presents substantial gain in terms of training speed per epoch, with a training time about one order of magnitude smaller than the one associated with a state-of-the-art technique performing with the same level of accuracy. 
    more » « less
  5. Spatial join is an important operation for combining spatial data. Parallelization is essential for improving spatial join performance. However, load imbalance due to data skew limits the scalability of parallel spatial join. There are many work sharing techniques to address this problem in a parallel environment. One of the techniques is to use data and space partitioning and then scheduling the partitions among threads/processes with the goal of minimizing workload differences across threads/processes. However, load imbalance still exists due to differences in join costs of different pairs of input geometries in the partitions. For the load imbalance problem, we have designed a work stealing spatial join system (WSSJ-DM) on a distributed memory environment. Work stealing is an approach for dynamic load balancing in which an idle processor steals computational tasks from other processors. This is the first work that uses work stealing concept (instead of work sharing) to parallelize spatial join computation on a large compute cluster. We have evaluated the scalability of the system on shared and distributed memory. Our experimental evaluation shows that work stealing is an effective strategy. We compared WSSJ-DM with work sharing implementations of spatial join on a high performance computing environment using partitioned and un-partitioned datasets. Static and dynamic load balancing approaches were used for comparison. We study the effect of memory affinity in work stealing operations involved in spatial join on a multi-core processor. WSSJ-DM performed spatial join using ST_Intersection on Lakes (8.4M polygons) and Parks (10M polygons) in 30 seconds using 35 compute nodes on a cluster (1260 CPU cores). A work sharing Master-Worker implementation took 160 seconds in contrast. 
    more » « less