skip to main content


Title: MIRAS: Model-based Reinforcement Learning for Microservice Resource Allocation over Scientific Workflows
Microservice, an architectural design that decomposes applications into loosely coupled services, is adopted in modern software design, including cloud-based scientific workflow processing. The microservice design makes scientific workflow systems more modular, more flexible, and easier to develop. However, cloud deployment of microservice workflow execution systems doesn't come for free, and proper resource management decisions have to be made in order to achieve certain performance objective (e.g., response time) within constraint operation cost. Nevertheless, effective online resource allocation decisions are hard to achieve due to dynamic workloads and the complicated interactions of microservices in each workflow. In this paper, we propose an adaptive resource allocation approach for microservice workflow system based on recent advances in reinforcement learning. Our approach (1) assumes little prior knowledge of the microservice workflow system and does not require any elaborately designed model or crafted representative simulator of the underlying system, and (2) avoids high sample complexity which is a common drawback of model-free reinforcement learning when applied to real-world scenarios. We show that our proposed approach automatically achieves effective policy for resource allocation with limited number of time-consuming interactions with the microservice workflow system. We perform extensive evaluations to validate the effectiveness of our approach and demonstrate that it outperforms existing resource allocation approaches with read-world emulated workflows.  more » « less
Award ID(s):
1827126
NSF-PAR ID:
10192135
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS)
Page Range / eLocation ID:
122 to 132
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Edge Cloud (EC) is poised to brace massive machine type communication (mMTC) for 5G and IoT by providing compute and network resources at the edge. Yet, the EC being regionally domestic with a smaller scale, faces the challenges of bandwidth and computational throughput. Resource management techniques are considered necessary to achieve efficient resource allocation objectives. Software Defined Network (SDN) enabled EC architecture is emerging as a potential solution that enables dynamic bandwidth allocation and task scheduling for latency sensitive and diverse mobile applications in the EC environment. This study proposes a novel Heuristic Reinforcement Learning (HRL) based flowlevel dynamic bandwidth allocation framework and validates it through end-to-end implementation using OpenFlow meter feature. OpenFlow meter provides granular control and allows demand-based flow management to meet the diverse QoS requirements germane to IoT traffics. The proposed framework is then evaluated by emulating an EC scenario based on real NSF COSMOS testbed topology at The City College of New York. A specific heuristic reinforcement learning with linear-annealing technique and a pruning principle are proposed and compared with the baseline approach. Our proposed strategy performs consistently in both Mininet and hardware OpenFlow switches based environments. The performance evaluation considers key metrics associated with real-time applications: throughput, end-to-end delay, packet loss rate, and overall system cost for bandwidth allocation. Furthermore, our proposed linear annealing method achieves faster convergence rate and better reward in terms of system cost, and the proposed pruning principle remarkably reduces control traffic in the network. 
    more » « less
  2. The HPC community is actively researching and evaluating tools to support execution of scientific applications in cloud-based environ- ments. Among the various technologies, containers have recently gained importance as they have significantly better performance compared to full-scale virtualization, support for microservices and DevOps, and work seamlessly with workflow and orchestration tools. Docker is currently the leader in containerization technology because it offers low overhead, flexibility, portability of applications, and reproducibility. Singularity is another container solution that is of interest as it is designed specifically for scientific applications. It is important to conduct performance and feature analysis of the container technologies to understand their applicability for each application and target execution environment. This paper presents a (1) performance evaluation of Docker and Singularity on bare metal nodes in the Chameleon cloud (2) mecha- nism by which Docker containers can be mapped with InfiniBand hardware with RDMA communication and (3) analysis of mapping elements of parallel workloads to the containers for optimal re- source management with container-ready orchestration tools. Our experiments are targeted toward application developers so that they can make informed decisions on choosing the container tech- nologies and approaches that are suitable for their HPC workloads on cloud infrastructure. Our performance analysis shows that sci- entific workloads for both Docker and Singularity based containers can achieve near-native performance. Singularity is designed specifically for HPC workloads. However, Docker still has advantages over Singularity for use in clouds as it provides overlay networking and an intuitive way to run MPI applications with one container per rank for fine-grained resources allocation. Both Docker and Singularity make it possible to directly use the underlying network fabric from the containers for coarse- grained resource allocation. 
    more » « less
  3. Scientific research and development campaigns are materialized by workflows of applications executing on high-performance computing (HPC) systems. These applications con-sist of tasks that can have inter- or intra-application flows of data to achieve the research goals successfully. These dataflows create dependencies among the tasks and cause resource con-tention on shared storage systems, thus limiting the aggregated I/O bandwidth achieved by the workflow. However, these I/O performance issues are often solved by tedious and manual efforts that demand holistic knowledge about the data dependencies in the workflow and the information about the infrastructure being utilized. Taking this into consideration, we design DFMan, a graph-based dataflow management and optimization framework for maximizing I/O bandwidth by leveraging the powerful storage stack on HPC systems to manage data sharing optimally among the tasks in the workflows. In particular, we devise a graph-based optimization algorithm that can leverage an intuitive graph representation of dataflow- and system-related information, and automatically carry out co-scheduling of task and data placement. According to our experiments, DFMan optimizes a wide variety of scientific workflows such as Hurricane 3D on Cloud Model 1 (CM1), Montage Carina Nebula (NGC3372), and an emulated dataflow kernel of the Multiscale Machine-learned Modeling Infrastructure (MuMMI I/O) on the Lassen supercomputer, and improves their aggregated I/O bandwidth by up to 5.42 x, 2.12 x and 1.29 x, respectively, compared to the baseline bandwidth. 
    more » « less
  4. We consider a large-scale service system where incoming tasks have to be instantaneously dispatched to one out of many parallel server pools. The user-perceived performance degrades with the number of concurrent tasks and the dispatcher aims at maximizing the overall quality of service by balancing the load through a simple threshold policy. We demonstrate that such a policy is optimal on the fluid and diffusion scales, while only involving a small communication overhead, which is crucial for large-scale deployments. In order to set the threshold optimally, it is important, however, to learn the load of the system, which may be unknown. For that purpose, we design a control rule for tuning the threshold in an online manner. We derive conditions that guarantee that this adaptive threshold settles at the optimal value, along with estimates for the time until this happens. In addition, we provide numerical experiments that support the theoretical results and further indicate that our policy copes effectively with time-varying demand patterns. Summary of Contribution: Data centers and cloud computing platforms are the digital factories of the world, and managing resources and workloads in these systems involves operations research challenges of an unprecedented scale. Due to the massive size, complex dynamics, and wide range of time scales, the design and implementation of optimal resource-allocation strategies is prohibitively demanding from a computation and communication perspective. These resource-allocation strategies are essential for certain interactive applications, for which the available computing resources need to be distributed optimally among users in order to provide the best overall experienced performance. This is the subject of the present article, which considers the problem of distributing tasks among the various server pools of a large-scale service system, with the objective of optimizing the overall quality of service provided to users. A solution to this load-balancing problem cannot rely on maintaining complete state information at the gateway of the system, since this is computationally unfeasible, due to the magnitude and complexity of modern data centers and cloud computing platforms. Therefore, we examine a computationally light load-balancing algorithm that is yet asymptotically optimal in a regime where the size of the system approaches infinity. The analysis is based on a Markovian stochastic model, which is studied through fluid and diffusion limits in the aforementioned large-scale regime. The article analyzes the load-balancing algorithm theoretically and provides numerical experiments that support and extend the theoretical results. 
    more » « less
  5. The past decades witnessed the fast and wide deployment of Internet. The Internet has bred the ubiquitous computing environment that is spanning the cloud, edge, mobile devices, and IoT. Software running over such a ubiquitous computing environment environment is eating the world. A recently emerging trend of Internet-based software systems is “ resource adaptive ,” i.e., software systems should be robust and intelligent enough to the changes of heterogeneous resources, both physical and logical, provided by their running environment. To keep pace of such a trend, we argue that some considerations should be taken into account for the future operating system design and implementation. From the structural perspective, rather than the “monolithic OS” that manages the aggregated resources on the single machine, the OS should be dynamically composed over the distributed resources and flexibly adapt to the resource and environment changes. Meanwhile, the OS should leverage advanced machine/deep learning techniques to derive configurations and policies and automatically learn to tune itself and schedule resources. This article envisions our recent thinking of the new OS abstraction, namely, ServiceOS , for future resource-adaptive intelligent software systems. The idea of ServiceOS is inspired by the delivery model of “ Software-as-a-Service ” that is supported by the Service-Oriented Architecture (SOA). The key principle of ServiceOS is based on resource disaggregation, resource provisioning as a service, and learning-based resource scheduling and allocation. The major goal of this article is not providing an immediately deployable OS. Instead, we aim to summarize the challenges and potentially promising opportunities and try to provide some practical implications for researchers and practitioners. 
    more » « less