MapReduce jobs need to shuffle a large amount of data over the network between mapper and reducer nodes, and this shuffle time accounts for a large fraction of the total running time of MapReduce jobs. Optimizing the makespan of the shuffle phase can therefore greatly improve MapReduce performance. A large fraction of production jobs in data centers are recurring with predictable characteristics, and these recurring jobs split the network into periodic busy and idle time slots, which allows the shuffle data to be scheduled against the predictable future network status in order to reduce the shuffle makespan. In this paper, we formulate the shuffle scheduling problem with the aim of minimizing the makespan of the MapReduce shuffle phase by leveraging the predictable periodic network status. We then propose a simple yet effective network-aware shuffle scheduling algorithm (NAS) that reduces the number of idle time slots required to transfer the shuffle data and thereby reduces the shuffle makespan. We also prove that NAS is a 3/2-approximation algorithm for the shuffle scheduling problem when all future idle time slots have the same duration. We finally evaluate the algorithm through simulation. Experimental results demonstrate that the proposed algorithm can effectively reduce the makespan of the MapReduce shuffle phase and increase network utilization.
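The scheduling idea can be pictured with a toy greedy packer: shuffle flows are drained into the predicted idle network slots in order, and the finish time of the last transfer serves as a proxy for the shuffle makespan. The sketch below is only an illustration of slot-aware scheduling under assumed flow sizes, slot predictions, and a single bandwidth value; it is not the NAS algorithm analyzed in the paper.

```python
# Illustrative greedy packer: drain shuffle flows into predicted idle network
# slots and report when the last byte finishes. Simplified stand-in only;
# all sizes, slots, and the bandwidth value are assumed numbers.

def schedule_shuffle(flow_sizes, idle_slots, bandwidth):
    """flow_sizes: MB per flow; idle_slots: (start_s, duration_s) windows;
    bandwidth: MB/s usable during an idle slot. Returns the finish time."""
    remaining = sorted(flow_sizes, reverse=True)     # largest flows first
    finish = 0.0
    for start, duration in sorted(idle_slots):
        capacity = duration * bandwidth              # MB transferable in this slot
        used = 0.0
        while remaining and used < capacity:
            take = min(remaining[0], capacity - used)
            remaining[0] -= take
            used += take
            if remaining[0] <= 1e-9:
                remaining.pop(0)
        if used > 0:
            finish = start + used / bandwidth        # time this slot was last used
        if not remaining:
            break
    if remaining:
        raise ValueError("predicted idle slots cannot absorb all shuffle data")
    return finish

if __name__ == "__main__":
    flows = [120, 80, 60, 40]                        # MB of shuffle data per flow
    slots = [(0, 10), (20, 10), (60, 15)]            # predicted idle windows (s)
    print(schedule_shuffle(flows, slots, bandwidth=10.0))   # -> 70.0
```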
FLCD: A Flexible Low Complexity Design of Coded Distributed Computing
We propose a flexible low complexity design (FLCD) of coded distributed computing (CDC) with empirical evaluation on Amazon Elastic Compute Cloud (Amazon EC2). CDC can expedite MapReduce-like computation by trading increased map computation for reduced communication load and shuffle time. A main novelty of FLCD is to exploit the design freedom in defining map and reduce functions to develop asymptotically homogeneous systems that support varying intermediate value (IV) sizes under a general MapReduce framework. Compared to existing designs with constant IV sizes, FLCD offers greater flexibility in adapting to network parameters and significantly reduces implementation complexity by requiring fewer input files and shuffle groups. The FLCD scheme is the first low-complexity CDC design that can operate on a network with an arbitrary number of nodes and an arbitrary computation load. We perform empirical evaluations of FLCD by executing the TeraSort algorithm on an Amazon EC2 cluster; this is the first time that theoretical predictions of CDC shuffle time have been validated by empirical evaluations. The evaluations demonstrate a 2.0x to 4.24x speedup compared to conventional uncoded MapReduce, a 12% to 52% reduction in total time, and a wider range of operating network parameters compared to existing CDC schemes.
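As a rough intuition for the computation/communication tradeoff that CDC exploits, the sketch below evaluates the canonical coded communication load (1/r)(1 - r/K) for K nodes and computation load r and converts it into an estimated shuffle time. The load expression is the standard CDC result rather than FLCD's specific design, and the byte counts and bandwidth are made-up numbers.

```python
# Back-of-the-envelope CDC tradeoff: raising the map computation load r lowers
# the fraction of intermediate values (IVs) that must cross the network.
# Uses the canonical coded load (1/r)*(1 - r/K); FLCD's exact IV sizing
# differs, so treat the outputs as illustrative only.

def shuffle_load(K: int, r: int) -> float:
    """Normalized communication load for K nodes and computation load r."""
    assert 1 <= r <= K
    return (1.0 / r) * (1.0 - r / K)

def shuffle_time(total_iv_bytes: float, bandwidth_bps: float, K: int, r: int) -> float:
    """Estimated shuffle time (s): bytes that must be shuffled / bandwidth."""
    return shuffle_load(K, r) * total_iv_bytes / bandwidth_bps

if __name__ == "__main__":
    K, total_iv, bw = 10, 8e9, 1e9            # 10 nodes, 8 GB of IVs, 1 GB/s
    for r in (1, 2, 5):                       # r = 1 is conventional uncoded MapReduce
        print(r, round(shuffle_time(total_iv, bw, K, r), 2), "s")
```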
- Award ID(s): 1817154
- PAR ID: 10297794
- Date Published:
- Journal Name: IEEE Transactions on Cloud Computing
- ISSN: 2372-0018
- Page Range / eLocation ID: 1 to 1
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Our extensive real measurements over Amazon EC2 show that virtual instances often have different computing speeds even when they share the same configuration. This motivates us to study heterogeneous Coded Storage Elastic Computing (CSEC) systems, where machines with different computing speeds join and leave the network arbitrarily across computing steps. In CSEC systems, a Maximum Distance Separable (MDS) code is used for coded storage so that the file placement does not have to be redefined with each elastic event. Computation assignment algorithms are used to minimize the computation time given the computation speeds of the different machines. While previous studies of heterogeneous CSEC do not include stragglers (slow machines during the computation), we develop a new framework for heterogeneous CSEC that introduces straggler tolerance. Based on this framework, we design a novel algorithm using our previously proposed approach for heterogeneous CSEC such that the system can handle any subset of stragglers of a specified size while minimizing the computation time. Furthermore, we establish a trade-off between computation time and straggler tolerance. Another major limitation of existing CSEC designs is the lack of practical evaluations using real applications. In this paper, we evaluate the performance of our designs on Amazon EC2 for power iteration and linear regression applications. Evaluation results show that the proposed heterogeneous CSEC algorithms outperform the state-of-the-art designs by more than 30%.
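A minimal way to see the straggler-tolerance requirement: if an (n, k) MDS code is used and up to s machines may straggle, every coded row position must be computed by at least k + s machines. The sketch below only sizes each machine's workload (via a binary search on the finish time) under assumed speeds; it omits the explicit row-to-machine assignment and is not the algorithm proposed in the paper.

```python
# Workload sizing sketch for heterogeneous coded elastic computing with
# straggler tolerance: each of F coded row positions must be covered by
# k + s machines, and per-machine shares are chosen to minimize the finish
# time max_i(load_i / speed_i). Speeds and parameters are hypothetical.

def size_assignment(speeds, F, k, s, iters=60):
    """Return (finish_time, per-machine row counts) under coverage k + s."""
    need = (k + s) * F                        # total row-slots to cover
    if need > len(speeds) * F:
        raise ValueError("not enough machines for this straggler tolerance")
    lo, hi = 0.0, need / min(speeds)          # bounds on the finish time
    for _ in range(iters):                    # binary search on finish time T
        mid = (lo + hi) / 2
        covered = sum(min(F, mid * v) for v in speeds)   # rows doable by time T
        if covered >= need:
            hi = mid
        else:
            lo = mid
    loads = [min(F, hi * v) for v in speeds]
    return hi, [round(x) for x in loads]      # counts rounded for display

if __name__ == "__main__":
    speeds = [3.0, 2.0, 2.0, 1.0, 1.0]        # rows per second per machine
    print(size_assignment(speeds, F=1000, k=3, s=1))   # -> (~500.0, [1000, 1000, 1000, 500, 500])
```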
Optimizing the performance of Distributed Deep Neural Network (DDNN) training has recently become increasingly compelling, as DNN models grow complex and training datasets grow large. While existing works on communication scheduling mostly focus on overlapping computation and communication to improve DDNN training performance, GPU and network resources remain under-utilized in DDNN training clusters. To tackle this issue, in this paper we design and implement a predictable communication scheduling strategy named Prophet to schedule gradient transfers in an adequate order, with the aim of maximizing GPU and network resource utilization. Leveraging the observed stepwise pattern of gradient transfer start times, Prophet first uses the monitored network bandwidth and the profiled time intervals among gradients to predict the appropriate number of gradients that can be grouped into blocks. These gradient blocks are then transferred one by one to guarantee high utilization of GPU and network resources while ensuring the priority of gradient transfers (i.e., low-priority gradients cannot preempt high-priority gradients in the network transfer). Prophet makes forward propagation start as early as possible so as to greedily reduce the waiting (idle) time of GPU resources during DDNN training. Prototype experiments with representative DNN models trained on Amazon EC2 demonstrate that Prophet can improve DDNN training performance by up to 40% compared with state-of-the-art priority-based communication scheduling strategies, with negligible runtime overhead.
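A toy version of the grouping step reads as follows: given a profiled interval between gradient-ready bursts and a measured bandwidth, consecutive gradients are merged into a block until the block's transfer time fills one interval. All parameter values are illustrative, and Prophet's actual predictor works with per-gradient intervals and priorities rather than a single constant interval.

```python
# Toy grouping sketch: merge consecutive gradients (kept in priority order)
# into blocks whose transfer time roughly fills one profiled ready interval,
# so the link stays busy without reordering priorities. Values are assumed.

def group_gradients(grad_sizes, interval, bandwidth):
    """grad_sizes: bytes per gradient in priority order; interval: profiled
    seconds between gradient-ready bursts; bandwidth: bytes/s. Returns blocks."""
    blocks, current, t = [], [], 0.0
    for i, size in enumerate(grad_sizes):
        current.append(i)
        t += size / bandwidth                 # transfer time of the growing block
        if t >= interval:                     # block now fills one ready interval
            blocks.append(current)
            current, t = [], 0.0
    if current:
        blocks.append(current)
    return blocks

if __name__ == "__main__":
    sizes = [4e6, 4e6, 2e6, 8e6, 1e6, 1e6]    # bytes, highest priority first
    print(group_gradients(sizes, interval=0.05, bandwidth=2e8))   # -> [[0, 1, 2], [3, 4, 5]]
```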
Despite extensive investigation of job scheduling in data-intensive computation frameworks, less consideration has been given to optimizing job partitioning for resource utilization and efficient processing. Instead, partitioning and job sizing are a form of dark art, typically left to developer intuition and trial-and-error experimentation. In this work, we propose that just as job scheduling and resource allocation are outsourced to a trusted mechanism external to the workload, so too should be the responsibility for partitioning data as a determinant of task size. Job partitioning essentially involves determining the partition sizes to match the resource allocation at the finest granularity. This is a complex, multi-dimensional problem that is highly application specific: resource allocation, computational runtime, shuffle and reduce communication requirements, and task startup overheads all strongly influence the most effective task size for efficient processing. Depending on the partition size, the job completion time can differ by as much as 10 times. Fortunately, we observe a general trend underlying the tradeoff between full resource utilization and system overhead across different settings; the optimal job partition size balances these two conflicting forces. Given this trend, we design Libra to automate job partitioning as a framework extension. We integrate Libra with Spark and evaluate its performance on EC2. Compared to state-of-the-art techniques, Libra can reduce individual job execution time by 25% to 70%.
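The utilization-versus-overhead tradeoff can be mimicked with a simple cost model: tasks run in waves over a fixed number of execution slots, and each task pays a startup overhead plus a processing time proportional to its partition size. The model and constants below are illustrative only and are not Libra's actual estimator.

```python
# Toy partition-size model: larger partitions amortize per-task startup
# overhead but can leave slots idle in the last wave; smaller partitions
# pack slots evenly but pay more overhead. Constants are made up.
import math

def completion_time(data_mb, part_mb, slots, mb_per_s=50.0, overhead_s=1.0):
    tasks = math.ceil(data_mb / part_mb)              # number of tasks
    waves = math.ceil(tasks / slots)                  # scheduling waves of tasks
    return waves * (part_mb / mb_per_s + overhead_s)  # waves * per-task time

def best_partition(data_mb, slots, candidates):
    return min(candidates, key=lambda p: completion_time(data_mb, p, slots))

if __name__ == "__main__":
    sizes = [8, 16, 32, 64, 128, 256, 512]            # candidate partition sizes (MB)
    for p in sizes:
        print(p, round(completion_time(10_000, p, slots=40), 1))
    print("best:", best_partition(10_000, 40, sizes)) # -> 256 under this toy model
```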
Elasticity is an important feature of modern cloud computing systems and can result in computation failure or significantly increased computing time. Elasticity means that virtual machines over the cloud can be preempted on short notice (e.g., hours or minutes) if a high-priority job appears; on the other hand, new virtual machines may become available over time to compensate for the lost computing resources. Coded Storage Elastic Computing (CSEC), introduced by Yang et al. in 2018, is an effective and efficient approach to overcoming elasticity with relatively low storage and computation overhead. However, one limitation of CSEC is that it may only apply to certain types of computation (e.g., linear) and may be challenging to apply to more involved computations, because coded data storage and approximation are often needed. Hence, it may be preferable to use uncoded storage by directly copying data into the virtual machines. In addition, based on our own measurements, virtual machines on Amazon EC2 clusters often have heterogeneous computation speeds even when they have exactly the same configuration (e.g., CPU, RAM, I/O cost). In this paper, we introduce a new optimization framework for Uncoded Storage Elastic Computing (USEC) systems with heterogeneous computing speeds to minimize the overall computation time. Under this framework, we propose optimal solutions for USEC systems with or without straggler tolerance using different storage placements. Our proposed algorithms are evaluated using power iteration applications on Amazon EC2.
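For intuition, the sketch below applies a greedy assignment in a USEC-like setting: each data row is replicated uncoded on a few machines, and each row is computed by whichever of its hosts has the smallest estimated finish time given its speed. This is a simple heuristic with made-up placements and speeds, not the optimal placement and assignment derived in the paper.

```python
# Greedy illustration of uncoded elastic computing with heterogeneous speeds:
# rows are replicated (uncoded) across machines; each row goes to the host
# whose estimated finish time stays smallest. Placements/speeds are assumed.

def assign_rows(placements, speeds):
    """placements[r]: machines holding row r; speeds[m]: rows/s of machine m.
    Returns (per-machine row lists, estimated finish time)."""
    load = {m: 0 for m in speeds}
    rows = {m: [] for m in speeds}
    for r, hosts in enumerate(placements):
        m = min(hosts, key=lambda h: (load[h] + 1) / speeds[h])  # least-loaded host
        load[m] += 1
        rows[m].append(r)
    finish = max(load[m] / speeds[m] for m in speeds)
    return rows, finish

if __name__ == "__main__":
    speeds = {0: 3.0, 1: 2.0, 2: 1.0}                 # rows per second
    placements = [[0, 1], [1, 2], [0, 2], [0, 1], [1, 2], [0, 2]]
    print(assign_rows(placements, speeds))            # finish time 1.0 in this toy case
```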