skip to main content

Title: Self-Learning Threshold-Based Load Balancing
We consider a large-scale service system where incoming tasks have to be instantaneously dispatched to one out of many parallel server pools. The user-perceived performance degrades with the number of concurrent tasks and the dispatcher aims at maximizing the overall quality of service by balancing the load through a simple threshold policy. We demonstrate that such a policy is optimal on the fluid and diffusion scales, while only involving a small communication overhead, which is crucial for large-scale deployments. In order to set the threshold optimally, it is important, however, to learn the load of the system, which may be unknown. For that purpose, we design a control rule for tuning the threshold in an online manner. We derive conditions that guarantee that this adaptive threshold settles at the optimal value, along with estimates for the time until this happens. In addition, we provide numerical experiments that support the theoretical results and further indicate that our policy copes effectively with time-varying demand patterns. Summary of Contribution: Data centers and cloud computing platforms are the digital factories of the world, and managing resources and workloads in these systems involves operations research challenges of an unprecedented scale. Due to the massive size, more » complex dynamics, and wide range of time scales, the design and implementation of optimal resource-allocation strategies is prohibitively demanding from a computation and communication perspective. These resource-allocation strategies are essential for certain interactive applications, for which the available computing resources need to be distributed optimally among users in order to provide the best overall experienced performance. This is the subject of the present article, which considers the problem of distributing tasks among the various server pools of a large-scale service system, with the objective of optimizing the overall quality of service provided to users. A solution to this load-balancing problem cannot rely on maintaining complete state information at the gateway of the system, since this is computationally unfeasible, due to the magnitude and complexity of modern data centers and cloud computing platforms. Therefore, we examine a computationally light load-balancing algorithm that is yet asymptotically optimal in a regime where the size of the system approaches infinity. The analysis is based on a Markovian stochastic model, which is studied through fluid and diffusion limits in the aforementioned large-scale regime. The article analyzes the load-balancing algorithm theoretically and provides numerical experiments that support and extend the theoretical results. « less
; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
INFORMS Journal on Computing
Page Range or eLocation-ID:
39 to 54
Sponsoring Org:
National Science Foundation
More Like this
  1. The increased use of micro-services to build web applications has spurred the rapid growth of Function-as-a-Service (FaaS) or serverless computing platforms. While FaaS simplifies provisioning and scaling for application developers, it introduces new challenges in resource management that need to be handled by the cloud provider. Our analysis of popular serverless workloads indicates that schedulers need to handle functions that are very short-lived, have unpredictable arrival patterns, and require expensive setup of sandboxes. The challenge of running a large number of such functions in a multi-tenant cluster makes existing scheduling frameworks unsuitable. We present Archipelago, a platform that enables low latency request execution in a multi-tenant serverless setting. Archipelago views each application as a DAG of functions, and every DAG in associated with a latency deadline. Archipelago achieves its per-DAG request latency goals by: (1) partitioning a given cluster into a number of smaller worker pools, and associating each pool with a semi-global scheduler (SGS), (2) using a latency-aware scheduler within each SGS along with proactive sandbox allocation to reduce overheads, and (3) using a load balancing layer to route requests for different DAGs to the appropriate SGS, and automatically scale the number of SGSs per DAG. Our testbed resultsmore »show that Archipelago meets the latency deadline for more than 99% of realistic application request workloads, and reduces tail latencies by up to 36X compared to state-of-the-art serverless platforms.« less
  2. We consider a distributed server system consisting of a large number of servers, each with limited capacity on multiple resources (CPU, memory, etc.). Jobs with different rewards arrive over time and require certain amounts of resources for the duration of their service. When a job arrives, the system must decide whether to admit it or reject it, and if admitted, in which server to schedule it. The objective is to maximize the expected total reward received by the system. This problem is motivated by control of cloud computing clusters, in which jobs are requests for virtual machines (VMs) or containers that reserve resources for various services, and rewards represent service priority of requests or price paid per time unit of service. We study this problem in an asymptotic regime where the number of servers and jobs’ arrival rates scale by a factor L, as L becomes large. We propose a resource reservation policy that asymptotically achieves at least 1/2, and under certain monotone property on jobs’ rewards and resources, at least [Formula: see text] of the optimal expected reward. The policy automatically scales the number of VM slots for each job type as the demand changes and decides in whichmore »servers the slots should be created in advance, without the knowledge of traffic rates.« less
  3. With the deployment of artificial intelligent (AI) algorithms in a large variety of applications, there creates an increasing need for high-performance computing capabilities. As a result, different hardware platforms have been utilized for acceleration purposes. Among these hardware-based accelerators, the field-programmable gate arrays (FPGAs) have gained a lot of attention due to their re-programmable characteristics, which provide customized control logic and computing operators. For example, FPGAs have recently been adopted for on-demand cloud services by the leading cloud providers like Amazon and Microsoft, providing acceleration for various compute-intensive tasks. While the co-residency of multiple tenants on a cloud FPGA chip increases the efficiency of resource utilization, it also creates unique attack surfaces that are under-explored. In this paper, we exploit the vulnerability associated with the shared power distribution network on cloud FPGAs. We present a stealthy power attack that can be remotely launched by a malicious tenant, shutting down the entire chip and resulting in denial-of-service for other co-located benign tenants. Specifically, we propose stealthy-shutdown: a well-timed power attack that can be implemented in two steps: (1) an attacker monitors the realtime FPGA power-consumption detected by ring-oscillator-based voltage sensors, and (2) when capturing high power-consuming moments, i.e., the power consumptionmore »by other tenants is above a certain threshold, she/he injects a well-timed power load to shut down the FPGA system. Note that in the proposed attack strategy, the power load injected by the attacker only accounts for a small portion of the overall power consumption; therefore, such attack strategy remains stealthy to the cloud FPGA operator. We successfully implement and validate the proposed attack on three FPGA evaluation kits with running real-world applications. The proposed attack results in a stealthy-shutdown, demonstrating severe security concerns of co-tenancy on cloud FPGAs. We also offer two countermeasures that can mitigate such power attacks.« less
  4. Cloud computing has become an emerging trend for the software industry with the requirement of large infrastructure and resources. The future success of cloud computing depends on the effectiveness of instantiation of the infrastructure and utilization of available resources. Load Balancing ensures the fulfillment of these conditions to improve the cloud environment for the users. Load Balancing dynamically distributes the workload among the nodes in such a way that no single resource is either overwhelmed with tasks or underutilized. In this paper we propose a threshold based load balancing algorithm to ensure the equal distribution of the workload among the nodes. The main objective of the algorithms is to stop the VMs in the cloud being overloaded with tasks or being idle for lack allocation of tasks, when there are active tasks. We have simulated our proposed algorithm in the Cloudanalyst simulator with real world data scenarios. Simulation results shows that our proposed threshold based algorithm can provide a better response time for the task/requests and data processing time for the datacenters compared to the existing algorithms such as First Come First Serve (FCFS), Round Robin(RR) and Equally Spread Current Execution Load Balancing algorithm(ESCELB).
  5. Applications in cloud platforms motivate the study of efficient load balancing under job-server constraints and server heterogeneity. In this paper, we study load balancing on a bipartite graph where left nodes correspond to job types and right nodes correspond to servers, with each edge indicating that a job type can be served by a server. Thus edges represent locality constraints, i.e., an arbitrary job can only be served at servers which contain certain data and/or machine learning (ML) models. Servers in this system can have heterogeneous service rates. In this setting, we investigate the performance of two policies named Join-the-Fastest-of-the-Shortest-Queue (JFSQ) and Join-the-Fastest-of-the-Idle-Queue (JFIQ), which are simple variants of Join-the-Shortest-Queue and Join-the-Idle-Queue, where ties are broken in favor of the fastest servers. Under a "well-connected'' graph condition, we show that JFSQ and JFIQ are asymptotically optimal in the mean response time when the number of servers goes to infinity. In addition to asymptotic optimality, we also obtain upper bounds on the mean response time for finite-size systems. We further show that the well-connectedness condition can be satisfied by a random bipartite graph construction with relatively sparse connectivity.