-
The increasing computational demand from growing data rates and complex machine learning (ML) algorithms in large-scale scientific experiments has driven the adoption of the Services for Optimized Network Inference on Coprocessors (SONIC) approach. SONIC accelerates ML inference by offloading it to local or remote coprocessors to optimize resource utilization. Leveraging its portability across different types of coprocessors, SONIC enhances data processing and model deployment efficiency for cutting-edge research in high energy physics (HEP) and multi-messenger astrophysics (MMA). We developed the SuperSONIC project, a scalable server infrastructure for SONIC, enabling the deployment of computationally intensive tasks to Kubernetes clusters equipped with graphics processing units (GPUs). Using the NVIDIA Triton Inference Server, SuperSONIC decouples client workflows from server infrastructure, standardizing communication, optimizing throughput, load balancing, and monitoring. SuperSONIC has been successfully deployed for the CMS and ATLAS experiments at the CERN Large Hadron Collider (LHC), the IceCube Neutrino Observatory (IceCube), and the Laser Interferometer Gravitational-Wave Observatory (LIGO), and has been tested on Kubernetes clusters at Purdue University, the National Research Platform (NRP), and the University of Chicago. SuperSONIC addresses the challenges of the Cloud-native era by providing a reusable, configurable framework that enhances the efficiency of accelerator-based inference deployment across diverse scientific domains and industries.
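As an illustration of the offloading pattern described above, the sketch below sends a single inference request from a lightweight Python client to a remote NVIDIA Triton Inference Server over gRPC, which is the kind of client-server interaction SuperSONIC standardizes. The server address, model name, and tensor names are placeholders chosen for illustration, not the actual SuperSONIC or experiment configuration.

```python
# Minimal sketch of a SONIC-style client offloading inference to a remote
# NVIDIA Triton Inference Server over gRPC. The endpoint, model name, and
# tensor names below are hypothetical, not the real SuperSONIC setup.
import numpy as np
import tritonclient.grpc as grpcclient

TRITON_URL = "supersonic.example.org:8001"   # hypothetical server endpoint
MODEL_NAME = "particle_tagger"               # hypothetical model name

client = grpcclient.InferenceServerClient(url=TRITON_URL)

# Build one inference request: a single batch of feature vectors.
features = np.random.rand(1, 128).astype(np.float32)
inputs = [grpcclient.InferInput("INPUT__0", list(features.shape), "FP32")]
inputs[0].set_data_from_numpy(features)
outputs = [grpcclient.InferRequestedOutput("OUTPUT__0")]

# The client stays lightweight: all GPU work happens on the (remote) server.
result = client.infer(model_name=MODEL_NAME, inputs=inputs, outputs=outputs)
scores = result.as_numpy("OUTPUT__0")
print(scores.shape)
```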
-
Batch systems struggle with workloads comprising millions of short-runtime tasks, since scheduling is most efficient for long-running jobs. In addition, the heterogeneity of modern computing systems makes task bundling impractical. Building on HTCondor, the Event Workflow Management System (EWMS) provides an efficient solution that thrives under both paradigms, while adhering to user-friendly and self-healing design principles. Here, we describe this method, its implementation, and a real-world application.
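The sketch below illustrates the general pilot pattern that makes millions of short tasks viable on a batch system: a single long-running job repeatedly pulls lightweight tasks from a queue, amortizing scheduling overhead over many tasks. It is a self-contained toy, not EWMS code; the in-process queue, task payload, and timings are stand-ins.

```python
# Illustrative pilot-style worker (not EWMS code): one long-lived batch slot
# processes many short tasks pulled from a queue instead of one job per task.
import queue
import time

def pilot_worker(task_queue: "queue.Queue[int]") -> None:
    """Run inside a single long-running batch slot; process tasks until drained."""
    done = 0
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            break  # queue drained; the pilot exits and releases the slot
        time.sleep(0.001)  # stand-in for a short (millisecond-scale) task
        done += 1
    print(f"pilot processed {done} tasks in one batch slot")

tasks: "queue.Queue[int]" = queue.Queue()
for i in range(10_000):   # millions in production; kept small here
    tasks.put(i)
pilot_worker(tasks)
```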
-
De Vita, R.; Espinal, X.; Laycock, P.; Shadura, O. (Ed.)
The IceCube Neutrino Observatory is a cubic-kilometer neutrino telescope located at the geographic South Pole. To accurately and promptly reconstruct the arrival direction of candidate neutrino events for Multi-Messenger Astrophysics use cases, IceCube employs Skymap Scanner workflows managed by the SkyDriver service. The Skymap Scanner performs maximum-likelihood tests on individual pixels generated by the Hierarchical Equal Area isoLatitude Pixelation (HEALPix) algorithm. Each test is computationally independent, which allows for massive parallelization. This workload is distributed using the Event Workflow Management System (EWMS), a message-based workflow management system designed to scale to trillions of pixels per day. SkyDriver orchestrates multiple distinct Skymap Scanner workflows behind a REST interface, providing an easy-to-use reconstruction service for real-time candidate, cataloged, and simulated events. Here, we outline the SkyDriver service technique and the initial development of EWMS.
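To make the per-pixel parallelism concrete, the sketch below enumerates HEALPix pixels with healpy, converts each pixel center to a sky direction, and evaluates a placeholder test statistic per pixel. The chosen nside resolution and the test statistic are illustrative stand-ins, not the Skymap Scanner's actual likelihood.

```python
# Sketch of the per-pixel parallelism behind a HEALPix-based sky scan
# (illustrative only; the real Skymap Scanner likelihood is far more involved).
import numpy as np
import healpy as hp

NSIDE = 64                      # hypothetical resolution; npix = 12 * NSIDE**2
npix = hp.nside2npix(NSIDE)

def scan_pixel(ipix: int) -> float:
    """Placeholder per-pixel test statistic; each call is fully independent."""
    theta, phi = hp.pix2ang(NSIDE, ipix)   # pixel center -> sky direction
    return float(np.cos(theta) ** 2)       # stand-in for a likelihood value

# Because pixels are independent, this loop can be scattered across
# thousands of workers and the results merged into a sky map afterwards.
skymap = np.array([scan_pixel(i) for i in range(npix)])
print(f"npix={npix}, best pixel={int(np.argmax(skymap))}")
```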
-
De Vita, R.; Espinal, X.; Laycock, P.; Shadura, O. (Ed.)
The IceCube Neutrino Observatory is a cubic-kilometer neutrino telescope located at the geographic South Pole. Understanding detector systematic effects is a continuous process. This requires the Monte Carlo simulation to be updated periodically to quantify potential changes and improvements in science results as the modeling of systematic effects becomes more detailed. IceCube's largest systematic effect comes from the optical properties of the ice in which the detector is embedded. Over the last few years there have been considerable improvements in the understanding of the ice, which require a significant processing campaign to update the simulation. IceCube normally stores such results in a central storage system at the University of Wisconsin–Madison, but that system ran out of disk space in 2022. The Prototype National Research Platform (PNRP) project therefore offered to provide both GPU compute and storage capacity to IceCube in support of this activity. Storage access was provided via XRootD-based OSDF Origins, a first for IceCube computing. We report on the overall experience of using PNRP resources, including both successes and pain points.
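As a rough illustration of the kind of access such XRootD-based Origins provide, the sketch below opens and reads a file through an XRootD URL using the XRootD Python bindings. The origin hostname and file path are hypothetical and do not reflect IceCube's real storage layout.

```python
# Minimal sketch of reading data through an XRootD endpoint such as an OSDF
# Origin. The URL below is a placeholder, not IceCube's actual storage path.
from XRootD import client

URL = "root://osdf-origin.example.org//icecube/sim/photon_table.bin"  # hypothetical

f = client.File()
status, _ = f.open(URL)
if not status.ok:
    raise RuntimeError(f"open failed: {status.message}")

# Read the first kilobyte; a real workflow would stream the whole object.
status, data = f.read(offset=0, size=1024)
if status.ok:
    print(f"read {len(data)} bytes")
f.close()
```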
-
Sadayappan, Ponnuswamy; Chamberlain, Bradford L.; Juckeland, Guido; Ltaief, Hatem (Ed.)
As we approach the Exascale era, it is important to verify that existing frameworks and tools will still work at that scale. Moreover, public Cloud computing has been emerging as a viable solution for both prototyping and urgent computing. Using the elasticity of the Cloud, we have thus put in place a pre-exascale HTCondor setup for running a scientific simulation in the Cloud, with the chosen application being IceCube's photon propagation simulation. That is, this was not purely a demonstration run; it also produced valuable and much-needed scientific results for the IceCube collaboration. In order to reach the desired scale, we aggregated GPU resources across eight GPU models from many geographic regions across Amazon Web Services, Microsoft Azure, and the Google Cloud Platform. Using this setup, we reached a peak of over 51k GPUs, corresponding to almost 380 PFLOP32s, for a total integrated compute of about 100k GPU hours. In this paper we describe the setup, the problems that were discovered and overcome, and briefly summarize the actual science output of the exercise.
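A back-of-envelope reading of the quoted figures, using only the numbers stated above; the per-GPU value is merely an average over a heterogeneous eight-model mix and is not itself reported in the abstract.

```python
# Back-of-envelope check on the quoted peak numbers (values from the abstract;
# the per-GPU figure is only an average over a heterogeneous GPU mix).
peak_gpus = 51_000           # peak GPU count
peak_pflop32s = 380.0        # aggregate FP32 throughput at peak, in PFLOP/s
total_gpu_hours = 100_000    # integrated compute of the run

avg_tflop32_per_gpu = peak_pflop32s * 1e3 / peak_gpus
print(f"average FP32 throughput per GPU: ~{avg_tflop32_per_gpu:.1f} TFLOP/s")

# Rough integrated FP32 work if every GPU-hour ran near that average rate
# (illustrative only; utilization and the GPU mix varied during the run).
eflop32_hours = avg_tflop32_per_gpu * total_gpu_hours / 1e6
print(f"rough integrated work: ~{eflop32_hours:.2f} EFLOP32-hours")
```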
-
Scientific computing needs are growing dramatically with time and are expanding into science domains that were previously not compute intensive. When compute workflows spike well in excess of the capacity of their local compute resource, additional capacity should be provisioned temporarily from elsewhere, both to meet deadlines and to increase scientific output. Public Clouds have become an attractive option due to their ability to be provisioned with minimal advance notice. The available capacity of cost-effective instances, however, is not well understood. This paper presents the expansion of IceCube's production HTCondor pool using cost-effective GPU instances in preemptible mode gathered from the three major Cloud providers, namely Amazon Web Services, Microsoft Azure, and the Google Cloud Platform. Using this setup, we sustained about 15k GPUs for a whole workday, corresponding to around 170 PFLOP32s, and integrated over one EFLOP32-hour worth of science output for a price tag of about $60k. In this paper, we provide the reasoning behind the Cloud instance selection, a description of the setup, and an analysis of the provisioned resources, as well as a short description of the actual science output of the exercise.
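A similar back-of-envelope reading of the cost figures quoted above; the eight-hour workday duration is an assumption introduced here for illustration, not a number stated in the abstract.

```python
# Rough cost-efficiency figures implied by the quoted numbers (illustrative;
# the 8-hour "workday" length is an assumption, not stated in the abstract).
sustained_gpus = 15_000      # GPUs sustained over the workday
total_cost_usd = 60_000      # approximate price tag of the run
eflop32_hours = 1.0          # integrated FP32 science output
assumed_hours = 8            # hypothetical workday length

cost_per_pflop32_hour = total_cost_usd / (eflop32_hours * 1e3)
cost_per_gpu_hour = total_cost_usd / (sustained_gpus * assumed_hours)
print(f"~${cost_per_pflop32_hour:.0f} per PFLOP32-hour of science output")
print(f"~${cost_per_gpu_hour:.2f} per preemptible GPU-hour (under the 8 h assumption)")
```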
