skip to main content


Search for: All records

Creators/Authors contains: "Ali-Eldin, Ahmed"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Traditional power reduction techniques such as DVFS or RAPL are challenging to use with web services because they significantly affect the services’ latency and throughput. Previous work sug- gested the use of controllers based on control theory or machine learning to reduce performance degradation under constrained power. However, generating these controllers is challenging as ev- ery web service applications running in a data center requires a power-performance model and a fine-tuned controller. In this paper, we present DDPC, a system for autonomic data-driven controller generation for power-latency management. DDPC automates the process of designing and deploying controllers for dynamic power allocation to manage the power-performance trade-offs for latency- sensitive web applications such as a social network. For each application, DDPC uses system identification techniques to learn an adaptive power-performance model that captures the application’s power-latency trade-offs which is then used to generate and deploy a Proportional-Integral (PI) power controller with gain-scheduling to dynamically manage the power allocation to the server running application using RAPL. We evaluate DDPC with two realistic latency-sensitive web applications under varying load scenarios. Our results show that DDPC is capable of autonomically generating and deploying controllers within a few minutes reducing the ac- tive power allocation of a web-server by more than 50% compared to state-of-the-art techniques while maintaining the latency well below the target of the application. 
    more » « less
  2. Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud plat- forms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multi- ple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substan- tially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen,1 a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evalu- ate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen flexibility by implementing state-of-the-art adapta- tion policies using Dělen’s API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications. 
    more » « less
  3. Since emerging edge applications such as Internet of Things (IoT) analytics and augmented reality have tight latency constraints, hardware AI accelerators have been recently proposed to speed up deep neural network (DNN) inference run by these applications. Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads. In this article, we design analytic models to capture the performance of DNN inference workloads on shared edge accelerators, such as GPU and edgeTPU, under different multiplexing and concurrency behaviors. After validating our models using extensive experiments, we use them to design various cluster resource management algorithms to intelligently manage multiple applications on edge accelerators while respecting their latency constraints. We implement a prototype of our system in Kubernetes and show that our system can host 2.3× more DNN applications in heterogeneous multi-tenant edge clusters with no latency violations when compared to traditional knapsack hosting algorithms. 
    more » « less
  4. Edge computing has emerged as a popular paradigm for running latency-sensitive applications due to its ability to offer lower network latencies to end-users. In this paper, we argue that despite its lower network latency, the resource-constrained nature of the edge can result in higher end-to-end latency, especially at higher utilizations, when compared to cloud data centers. We study this edge performance inversion problem through an analytic comparison of edge and cloud latencies and analyze conditions under which the edge can yield worse performance than the cloud. To verify our analytic results, we conduct a detailed experimental comparison of the edge and the cloud latencies using a realistic application and real cloud workloads. Both our analytical and experimental results show that even at moderate utilizations, the edge queuing delays can offset the benefits of lower network latencies, and even result in performance inversion where running in the cloud would provide superior latencies. We finally discuss practical implications of our results and provide insights into how application designers and service providers should design edge applications and systems to avoid these pitfalls. 
    more » « less
  5. null (Ed.)
    Serverless computing has emerged as a new paradigm for running short-lived computations in the cloud. Due to its ability to handle IoT workloads, there has been considerable interest in running serverless functions at the edge. However, the constrained nature of the edge and the latency sensitive nature of workloads result in many challenges for serverless platforms. In this paper, we present LaSS, a platform that uses model-driven approaches for running latency-sensitive serverless computations on edge resources. LaSS uses principled queuing-based methods to determine an appropriate allocation for each hosted function and auto-scales the allocated resources in response to workload dynamics. LaSS uses a fair-share allocation approach to guarantee a minimum of allocated resources to each function in the presence of overload. In addition, it utilizes resource reclamation methods based on container deflation and termination to reassign resources from over-provisioned functions to under-provisioned ones. We implement a prototype of our approach on an OpenWhisk serverless edge cluster and conduct a detailed experimental evaluation. Our results show that LaSS can accurately predict the resources needed for serverless functions in the presence of highly dynamic workloads, and reprovision container capacity within hundreds of milliseconds while maintaining fair share allocation guarantees. 
    more » « less
  6. Transient computing has become popular in public cloud environments for running delay-insensitive batch and data processing applications at low cost. Since transient cloud servers can be revoked at any time by the cloud provider, they are considered unsuitable for running interactive application such as web services. In this paper, we present VM deflation as an alternative mechanism to server preemption for reclaiming resources from transient cloud servers under resource pressure. Using real traces from top-tier cloud providers, we show the feasibility of using VM deflation as a resource reclamation mechanism for interactive applications in public clouds. We show how current hypervisor mechanisms can be used to implement VM deflation and present cluster deflation policies for resource management of transient and on-demand cloud VMs. Experimental evaluation of our deflation system on a Linux cluster shows that microservice-based applications can be deflated by up to 50% with negligible performance overhead. Our cluster-level deflation policies allow overcommitment levels as high as 50%, with less than a 1% decrease in application throughput, and can enable cloud platforms to increase revenue by 30% 
    more » « less
  7. Data centers and clouds are increasingly offering low-cost computational resources in the form of transient virtual machines. Whenever demand for computational resources exceeds their availability, transient resources can reclaimed by preempting the transient VMs. Conventionally, these transient VMs are used by low-priority applications that can tolerate the disruption caused by preemptions. In this paper we propose an alternative approach for reclaiming resources, called resource deflation. Resource deflation allows applications to dynamically shrink (and expand) in response to resource pressure, instead of being preempted outright. Deflatable VMs allow applications to continue running even under resource pressure, and increase the utility of low-priority transient resources. Deflation uses a dynamic, multi-level cascading reclamation technique that allows applications, operating systems, and hypervisors to implement their own policies for handling resource pressure. For distributed data processing, machine learning, and deep neural network training, our multi-level approach reduces the performance degradation by up to 2x compared to existing preemption-based approaches. When deflatable VMs are deployed on a cluster, our policies allow up to 1.6x utilization without the risk of preemption. 
    more » « less
  8. Amazon introduced spot instances in December 2009, enabling “customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price.” Amazon’s real-time computational spot market was novel in multiple respects. For example, it was the first (and to date only) large-scale public implementation of market-based resource allocation based on dynamic pricing after decades of research, and it provided users with useful information, control knobs, and options for optimizing the cost of running cloud applications. Spot instances also introduced the concept of transient cloud servers derived from variable idle capacity that cloud platforms could revoke at any time. Transient servers have since become central to efficient resource management of modern clusters and clouds. As a result, Amazon’s spot market was the motivation for substantial research over the past decade. Yet, in November 2017, Amazon effectively ended its real-time spot market by announcing that users no longer needed to place bids and that spot prices will “...adjust more gradually, based on longer-term trends in supply and demand.” The changes made spot instances more similar to the fixed-price transient servers offered by other cloud platforms. Unfortunately, while these changes made spot instances less complex, they eliminated many benefits to sophisticated users in optimizing their applications. This paper provides a retrospective on Amazon’s real-time spot market, including its advantages and disadvantages for allocating transient servers compared to current fixed-price approaches. We also discuss some fundamental problems with Amazon’s spot market, which we identified in prior work (from 2016), that predicted its eventual end. We then discuss potential options for allocating transient servers that combine the advantages of Amazon’s real-time spot market, while also addressing the problems that likely led to its elimination. 
    more » « less
  9. Cloud platforms often execute parallel batch applications, such as distributed machine learning (ML), that include numerous synchronization barriers. These barriers, which prevent any task from advancing beyond a specified point until all tasks have reached that point, significantly degrade application performance by reducing it to that of the slowest "straggler" task. To address the problem, researchers have proposed numerous straggler mitigation techniques, including speculatively re-executing straggler tasks and various relaxations of strict barrier semantics. While these techniques improve parallel application performance, they incur a cost in terms of the resources wasted re-executing tasks or waiting. Importantly, these costs, which are often implicit in prior work that targets dedicated resources, become explicit in the cloud, which charges for resources at fine-grained intervals. In addition, the cost difference between techniques is exacerbated in cloud platforms, since they charge substantially less for transient resources that effectively yield a probabilistic performance across a wide range. While transient resources' low list price is attractive, revocations increase the frequency and severity of stragglers, which decreases parallel job performance and increases overall execution cost. To better understand the cost of synchronization, we develop simple analytical models of different straggler mitigation techniques and compare their cost and performance on on-demand and transient resources. Our analysis shows that i) transient servers offer complex tradeoffs compared to on-demand servers, and can result in higher overall costs despite their highly discounted price due to their probabilistic performance; ii) common approaches to straggler mitigation, which is a well-studied problem, are less effective using transient servers that cause frequent and severe stragglers; and iii) a recent approach to flexible synchronization offers the best cost and performance. 
    more » « less