skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Quantum Computing in the Cloud: Analyzing job and machine characteristics
As the popularity of quantum computing continues to grow, quantum machine access over the cloud is critical to both academic and industry researchers across the globe. And as cloud quantum computing demands increase exponentially, the analysis of resource consumption and execution characteristics are key to efficient management of jobs and resources at both the vendor-end as well as the client-end. While the analysis of resource consumption and management are popular in the classical HPC domain, it is severely lacking for more nascent technology like quantum computing. This paper is a first-of-its-kind academic study, analyzing various trends in job execution and resources consumption / utilization on quantum cloud systems. We focus on IBM Quantum systems and analyze characteristics over a two year period, encompassing over 6000 jobs which contain over 600,000 quantum circuit executions and correspond to almost 10 billion “shots” or trials over 20+ quantum machines. Specifically, we analyze trends focused on, but not limited to, execution times on quantum machines, queuing/waiting times in the cloud, circuit compilation times, machine utilization, as well as the impact of job and machine characteristics on all of these trends. Our analysis identifies several similarities and differences with classical HPC cloud systems. Based on our insights, we make recommendations and contributions to improve the management of resources and jobs on future quantum cloud systems.  more » « less
Award ID(s):
1730449 2110860 2016136 1818914
PAR ID:
10313645
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2021 IEEE International Symposium on Workload Characterization (IISWC)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As the popularity of quantum computing continues to grow, efficient quantum machine access over the cloud is critical to both academic and industry researchers across the globe. And as cloud quantum computing demands increase exponentially, the analysis of resource consumption and execution characteristics are key to efficient management of jobs and resources at both the vendor-end as well as the client-end. While the analysis and optimization of job / resource consumption and management are popular in the classical HPC domain, it is severely lacking for more nascent technology like quantum computing.This paper proposes optimized adaptive job scheduling to the quantum cloud taking note of primary characteristics such as queuing times and fidelity trends across machines, as well as other characteristics such as quality of service guarantees and machine calibration constraints. Key components of the proposal include a) a prediction model which predicts fidelity trends across machine based on compiled circuit features such as circuit depth and different forms of errors, as well as b) queuing time prediction for each machine based on execution time estimations.Overall, this proposal is evaluated on simulated IBM machines across a diverse set of quantum applications and system loading scenarios, and is able to reduce wait times by over 3x and improve fidelity by over 40% on specific usecases, when compared to traditional job schedulers. 
    more » « less
  2. As the popularity of quantum computing continues to grow, efficient quantum machine access over the cloud is critical to both academic and industry researchers across the globe. And as cloud quantum computing demands increase exponentially, the analysis of resource consumption and execution characteristics are key to efficient management of jobs and resources at both the vendor-end as well as the client-end. While the analysis and optimization of job / resource consumption and management are popular in the classical HPC domain, it is severely lacking for more nascent technology like quantum computing.This paper proposes optimized adaptive job scheduling to the quantum cloud taking note of primary characteristics such as queuing times and fidelity trends across machines, as well as other characteristics such as quality of service guarantees and machine calibration constraints. Key components of the proposal include a) a prediction model which predicts fidelity trends across machine based on compiled circuit features such as circuit depth and different forms of errors, as well as b) queuing time prediction for each machine based on execution time estimations.Overall, this proposal is evaluated on simulated IBM machines across a diverse set of quantum applications and system loading scenarios, and is able to reduce wait times by over 3x and improve fidelity by over 40% on specific usecases, when compared to traditional job schedulers. 
    more » « less
  3. Grid Engine is a Distributed Resource Manager (DRM), that manages the resources of distributed systems (such as Grid, HPC, or Cloud systems) and executes designated jobs which have requested to occupy or consume those resources. Grid Engine applies scheduling policies to allocate resources for jobs while simultaneously attempting to maintain optimal utilization of all machines in the distributed system. However, due to the complexity of Grid Engine's job submission commands and complicated resource management policies, the number of faulty job submissions in data centers increases with the number of jobs being submitted. To combat the increase in faulty jobs, Grid Engine allows administrators to design and implement Job Submission Verifiers (JSV) to verify jobs before they enter into Grid Engine. In this paper, we will discuss a Job Submission Verifier that was designed and implemented for Univa Grid Engine, a commercial version of Grid Engine, and thoroughly evaluated at the High Performance Computing Center of Texas Tech University. Our newly developed JSV communicates with Univa Grid Engine (UGE) components to verify whether a submitted job should be accepted as is, or modified then accepted, or rejected due to improper requests for resources. It had a substantial positive impact on reducing the number of faulty jobs submitted to UGE by far. For instance, it corrected 28.6% of job submissions and rejected 0.3% of total jobs from September 2018 to February 2019, that may otherwise lead to long or infinite waiting time in the job queue. 
    more » « less
  4. null (Ed.)
    High-throughput computing (HTC) workloads seek to complete as many jobs as possible over a long period of time. Such workloads require efficient execution of many parallel jobs and can occupy a large number of resources for a longtime. As a result, full utilization is the normal state of an HTC facility. The widespread use of container orchestrators eases the deployment of HTC frameworks across different platforms,which also provides an opportunity to scale up HTC workloads with almost infinite resources on the public cloud. However, the autoscaling mechanisms of container orchestrators are primarily designed to support latency-sensitive microservices, and result in unexpected behavior when presented with HTC workloads. In this paper, we design a feedback autoscaler, High Throughput Autoscaler (HTA), that leverages the unique characteristics ofthe HTC workload to autoscales the resource pools used by HTC workloads on container orchestrators. HTA takes into account a reference input, the real-time status of the jobs’ queue, as well as two feedback inputs, resource consumption of jobs, and the resource initialization time of the container orchestrator. We implement HTA using the Makeflow workload manager, WorkQueue job scheduler, and the Kubernetes cluster manager. We evaluate its performance on both CPU-bound and IO-bound workloads. The evaluation results show that, by using HTA, we improve resource utilization by 5.6×with a slight increase in execution time (about 15%) for a CPU-bound workload, and shorten the workload execution time by up to 3.65×for an IO-bound workload. 
    more » « less
  5. High-performance computing (HPC) resources are used for compute-demanding calculations in various fields of science and engineering. They are large computational facilities utilized by many users simultaneously. High utilization often leads to high waiting times. Simulating users' behavior on such a system can help with future system design, develop user interventions, and ultimately improve the user’s experience and resource utilization. Here, we present HPCMod, an Agent-Based Modeling Framework for Modeling Users on HPC Resources. The key concept of the framework is the representation of the user's computational needs: the user project is represented as a collection of possibly dependent compute tasks. Each task can be executed as a single compute job or a series of jobs, depending on the task size. Some tasks can be too big to be executed in one chunk; such a situation often occurs during molecular dynamics simulation. There are multiple ways in which tasks can be split into jobs, and users will make their decisions based on previous experience, application parallel scalability, and available resources. For example, a user's compute task requires 32 node hours; it can be executed in multiple ways: a single 32-hour job on one node, two sequential 16-hour jobs on one node, one 16-hour job on two nodes, and so on. In the HPCMod, we implemented three models: 1) historical replay of compute jobs, 2) simulation of reconstituted compute tasks using historical job sizes, and 3) adaptive compute tasks splitting where users can modify jobs parameters given available resources till the execution of the next job in line. The framework was tested on a ten-node test system and a larger 1,736-node system modeled after a portion of TACC Stampede-2. The HPC resource model implements a first in first out (FIFO) scheduler with backfill scheduling. The initial results showed that on a tiny system, adaptive task-splitting is beneficial for the user but leads to a larger number of jobs. On a large system, the adaptive task-splitting was also very beneficial, decreasing waiting times for users using this strategy almost two times; however, other users got a 5% increase in their wait time. Further investigation is needed as the current task reconstitution algorithm is deterministic and does not allow quantification of job recombination uncertainties. The Julia-based implementation is fast: five years of historic workflow consisting of a million jobs and a one-hour stepping took around three minutes. 
    more » « less