skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: FT-VMP: Fault-Tolerant Virtual Machine Placement in Cloud Data Centers
Virtual machine (VM) replication is an effective technique in cloud data centers to achieve fault-tolerance, load-balance, and quick-responsiveness to user requests. In this paper we study a new fault-tolerant VM placement problem referred to as FT-VMP. Given that different VM has different fault-tolerance requirement (i.e., different VM requires different number of replica copies) and compatibility requirement (i.e., some VMs and their replicas cannot be placed into some physical machines (PMs) due to software or platform incompatibility), FT-VMP studies how to place VM replica copies inside cloud data centers in order to minimize the number of PMs storing VM replicas, under the constraints that i) for fault-tolerant purpose, replica copies of the same VM cannot be placed inside the same PM and ii) each PM has a limited amount of storage capacity. We first prove that FT-VMP is NP-hard. We then design an integer linear programming (ILP)-based algorithm to solve it optimally. As ILP takes time to compute thus is not suitable for large scale cloud data centers, we design a suite of efficient and scalable heuristic fault-tolerant VM placement algorithms. We show that a) ILP-based algorithm outperforms the state-of-the-art VM replica placement in a wide range of network dynamics and b) that all our fault-tolerant VM placement algorithms are able to turn off significant number of PMs to save energy in cloud data centers. In particular, we show that our algorithms can consolidate (i.e., turn off) around 100 PMs in a small data center of 256 PMs and 700 PMs in a large data center of 1028PMs.  more » « less
Award ID(s):
1911191
PAR ID:
10174515
Author(s) / Creator(s):
;
Date Published:
Journal Name:
International Conference on Computer Communications and Networks (ICCCN 2020)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Fault-tolerant virtual machine (VM) placement refers to the process of placing multiple copies of the same VM cloud application inside cloud data centers. The challenge is how to place required number of VM replicas while minimizing the number of physical machines (PMs) that store them, in order to save energy consumption of cloud data centers. We refer to it as fault-tolerant VM placement problem. In our previous work, we have proposed a greedy algorithm to solve this problem. In this paper, we compare it with an existing research that is based on well-known Welsh Powell Graph-Coloring Algorithm to place items into bins while considering the conflicts between items and items and items and bins. Via extensive simulations, we show that our greedy algorithm can turn off 40-50% more PMs than existing work and can place upto four times as many VM replicas as existing work, achieving much stronger fault-tolerance with less energy consumption. We also compare both algorithms with the optimal integer lin- ear programming (ILP)-based algorithm, which serves as the benchmark of the comparison. 
    more » « less
  2. null (Ed.)
    Replicated state machines are linearizable, fault-tolerant groups of replicas that are coordinated using a consensus algorithm. Copilot replication is the first 1-slowdown-tolerant consensus protocol: it delivers normal latency despite the slowdown of any 1 replica. Copilot uses two distinguished replicas—the pilot and copilot—to proactively add redundancy to all stages of processing a client’s command. Copilot uses dependencies and deduplication to resolve potentially differing orderings proposed by the pilots. To avoid dependencies leading to either pilot being able to slow down the group, Copilot uses fast takeovers that allow a fast pilot to complete the ongoing work of a slow pilot. Copilot includes two optimizations—ping-pong batching and null dependency elimination—that improve its performance when there are 0 and 1 slow pilots respectively. Our evaluation of Copilot shows its performance is lower but competitive with Multi-Paxos and EPaxos when no replicas are slow. When a replica is slow, Copilot is the only protocol that avoids high latencies. 
    more » « less
  3. null (Ed.)
    Replicated state machines are linearizable, fault-tolerant groups of replicas that are coordinated using a consensus algorithm. Copilot replication is the first 1-slowdown-tolerant consensus protocol: it delivers normal latency despite the slowdown of any 1 replica. Copilot uses two distinguished replicas—the pilot and copilot—to proactively add redundancy to all stages of processing a client’s command. Copilot uses dependencies and deduplication to resolve potentially differing orderings proposed by the pilots. To avoid dependencies leading to either pilot being able to slow down the group, Copilot uses fast takeovers that allow a fast pilot to complete the ongoing work of a slow pilot. Copilot includes two optimizations—ping-pong batching and null dependency elimination—that improve its performance when there are 0 and 1 slow pilots respectively. Our evaluation of Copilot shows its performance is lower but competitive with Multi-Paxos and EPaxos when no replicas are slow. When a replica is slow, Copilot is the only protocol that avoids high latencies. 
    more » « less
  4. We introduce Canal, a programmable, topic-based, publish/subscribe system that is designed for multi-tier cloud deployments (e.g. edge-cloud, multi-cloud, IoT-cloud, etc.). Canal implements a triggered computational (i.e. “serverless”) programming model and provides developers with a uniform and portable programming interface. To achieve scalability and reliability, Canal combines the use of a distributed hash table (DHT) and replica consensus protocol to distribute and replicate functions, state, and data. Canal also decouples replica placement from the DHT topology to allow developers to optimize function placement for different objectives. We evaluate Canal using a real-world multi-tier IoT deployment and we use Canal to compare placement strategies, end-to-end performance, and failure recovery using both benchmarks and a real-world IoT-edge application. Our results show that Canal is able to achieve both low latency and reliability in this setting. 
    more » « less
  5. We propose a new algorithmic framework for traffic-optimal virtual network function (VNF) placement and migration for policy-preserving data centers (PPDCs). As dy- namic virtual machine (VM) traffic must traverse a sequence of VNFs in PPDCs, it generates more network traffic, consumes higher bandwidth, and causes additional traffic delays than a traditional data center. We design optimal, approximation, and heuristic traffic-aware VNF placement and migration algorithms to minimize the total network traffic in the PPDC. In particular, we propose the first traffic-aware constant-factor approximation algorithm for VNF placement, a Pareto-optimal solution for VNF migration, and a suite of efficient dynamic-programming (DP)-based heuristics that further improves the approximation solution. At the core of our framework are two new graph- theoretical problems that have not been studied. Using flow characteristics found in production data centers and realistic traffic patterns, we show that a) our VNF migration techniques are effective in mitigating dynamic traffic in PPDCs, reducing the total traffic cost by up to 73%, b) our VNF placement algorithms yield traffic costs 56% to 64% smaller than those by existing techniques, and c) our VNF migration algorithms outperform the state-of-the-art VM migration algorithms by up to 63% in reducing dynamic network traffic. 
    more » « less