skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Demonstrating a Pre-Exascale, Cost-Effective Multi-Cloud Environment for Scientific Computing: Producing a fp32 ExaFLOP hour worth of IceCube simulation data in a single workday
Scientific computing needs are growing dramatically with time and are expanding in science domains that were previously not compute intensive. When compute workflows spike well in excess of the capacity of their local compute resource, capacity should be temporarily provisioned from somewhere else to both meet deadlines and to increase scientific output. Public Clouds have become an attractive option due to their ability to be provisioned with minimal advance notice. The available capacity of cost-effective instances is not well understood. This paper presents expanding the IceCube's production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure and the Google Cloud Platform. Using this setup, we sustained for a whole workday about 15k GPUs, corresponding to around 170 PFLOP32s, integrating over one EFLOP32 hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind Cloud instance selection, a description of the setup and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.  more » « less
Award ID(s):
1841479 1826967 1941481 1841530 1148698 1730158
PAR ID:
10211900
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
PEARC '20: Practice and Experience in Advanced Research Computing
Page Range / eLocation ID:
85 to 90
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sadayappan, Ponnuswamy; Chamberlain, Bradford L.; Juckeland, Guido; Ltaief, Hatem (Ed.)
    As we approach the Exascale era, it is important to verify that the existing frameworks and tools will still work at that scale. Moreover, public Cloud computing has been emerging as a viable solution for both prototyping and urgent computing. Using the elasticity of the Cloud, we have thus put in place a pre-exascale HTCondor setup for running a scientific simulation in the Cloud, with the chosen application being IceCube's photon propagation simulation. I.e. this was not a purely demonstration run, but it was also used to produce valuable and much needed scientific results for the IceCube collaboration. In order to reach the desired scale, we aggregated GPU resources across 8 GPU models from many geographic regions across Amazon Web Services, Microsoft Azure, and the Google Cloud Platform. Using this setup, we reached a peak of over 51k GPUs corresponding to almost 380 PFLOP32s, for a total integrated compute of about 100k GPU hours. In this paper we provide the description of the setup, the problems that were discovered and overcome, as well as a short description of the actual science output of the exercise. 
    more » « less
  2. De_Vita, R; Espinal, X; Laycock, P; Shadura, O (Ed.)
    The OSG-operated Open Science Pool is an HTCondor-based virtual cluster that aggregates resources from compute clusters provided by several organizations. Most of the resources are not owned by OSG, so demand-based dynamic provisioning is important for maximizing usage without incurring excessive waste. OSG has long relied on GlideinWMS for most of its resource provisioning needs but is limited to resources that provide a Grid-compliant Compute Entrypoint. To work around this limitation, the OSG Software Team has developed a glidein container that resource providers could use to directly contribute to the OSPool. The problem with that approach is that it is not demand-driven, relegating it to backfill scenarios only. To address this limitation, a demand-driven direct provisioner of Kubernetes resources has been developed and successfully used on the NRP. The setup still relies on the OSG-maintained backfill container image but automates the provisioning matchmaking and successive requests. That provisioner has also been extended to support Lancium, a green computing cloud provider with a Kubernetes-like proprietary interface. The provisioner logic has been intentionally kept very simple, making this extension a low-cost project. Both NRP and Lancium resources have been provisioned exclusively using this mechanism for many months. 
    more » « less
  3. null (Ed.)
    Cloud providers offer instances with similar compute capabilities (for example, instances with different generations of GPUs like K80s, P100s, V100s) across many regions, availability zones, and on-demand and spot markets, with prices governed independently by individual supplies and demands. In this paper, using machine learning model training as an example application, we explore the potential cost reductions possible by leveraging this cross-cloud instance market. We present quantitative results on how the prices of cloud instances change with time, and how total costs can be decreased by considering this dynamic pricing market. Our preliminary experiments show that a) the optimal instance choice for a model is dependent on both the objective (e.g., cost, time, or combination) and the model’s performance characteristics, b) the cost of moving training jobs between instances is cheap, c) jobs do not need to be preempted more frequently than once a day to leverage the benefits from spot instance price variations, and d) the cost of training a model can be decreased by as much as 3.5× compared to a static policy. We also look at contexts where users specify higherlevel objectives over collections of jobs, show examples of policies for these contexts, and discuss additional challenges involved in making these cost reductions viable. 
    more » « less
  4. null (Ed.)
    In this paper we describe how high performance computing in the Google Cloud Platform can be utilized in an urgent and emergency situation to process large amounts of traffic data efficiently and on demand. Our approach provides a solution to an urgent need for disaster management using massive data processing and high performance computing. The traffic data used in this demonstration is collected from the public camera systems on Interstate highways in the Southeast United States. Our solution launches a parallel processing system that is the size of a Top 5 supercomputer using the Google Cloud Platform. Results show that the parallel processing system can be launched in a few hours, that it is effective at fast processing of high volume data, and can be de-provisioned in a few hours. We processed 211TB of video utilizing 6,227,593 core hours over the span of about eight hours with an average cost of around $0.008 per vCPU hour, which is less than the cost of many on-premise HPC systems. 
    more » « less
  5. To enable the sustainable use of their ocean resources, capacity for ocean science and observations is important for every coastal nation. In many developing areas of the world, capability for ocean science and observations is not yet adequate to meet management needs. International organizations have employed a variety of capacity development approaches to assist developing countries in building self-sustaining ocean science and observational communities. This article describes the lessons learned from visiting scientist programs conducted for more than a decade by the Partnership for Observation of the Global Ocean (POGO) and the Scientific Committee on Oceanic Research (SCOR) that dispatched ocean scientists to developing countries to train hundreds of individuals in a variety of ocean science and observation topics and techniques. From these programs, SCOR and POGO have learned that training in-country has multiple benefits to trainees, host institutions, and trainers, benefits that are not achievable when students leave their countries. These benefits include more cost-effective training on issues relevant to the host institutions using locally available technology, as well as the ability to reach a large number of trainees. Lessons learned from the POGO and SCOR programs can be used to inform the future capacity-development activities of POGO and SCOR, as well as other organizations, to improve, enhance, and expand the use of in-country training and mentoring. Such approaches could contribute to the capacity development efforts of the UN Decade of Ocean Science for Sustainable Development. 
    more » « less