NSF PAR Search | NSF Public Access Repository

Event Workflow Management System: A Robust Technique for Massively Divisible and Distributed Workflows

https://doi.org/10.1145/3708035.3736051

Evans-Jacquez, Eric; Aydemir, Brian; Bockelman, Brian; Livny, Miron; Riedel, Benedikt; Ross, Ian; Schultz, David (July 2025, ACM)

Batch systems face issues with workloads comprising millions of tasks with short runtimes—scheduling is most efficient for long-running jobs. In addition, the nature of heterogeneous computing systems makes task bundling impractical. Building on HTCondor, the Event Workflow Management System (EWMS) provides an efficient solution to thrive with both paradigms, while featuring user-friendly and self-healing principles. Here, we describe this method, its implementation, and a real-world application.
Free, publicly-accessible full text available July 18, 2026
IceCube SkyDriver – A SaaS Solution for Event Reconstruction using the Skymap Scanner

https://doi.org/10.1051/epjconf/202429504023

Evans-Jacquez, Eric; Schultz, David; Bockelman, Brian; Lincetto, Massimiliano; Livny, Miron; Riedel, Benedikt; Yuan, Tianlu (May 2024, EPJ Web of Conferences)
De_Vita, R; Espinal, X; Laycock, P; Shadura, O (Ed.)
The IceCube Neutrino Observatory is a cubic kilometer neutrino telescope located at the geographic South Pole. To accurately and promptly reconstruct the arrival direction of candidate neutrino events for Multi-Messenger Astrophysics use cases, IceCube employs Skymap Scanner workflows managed by the SkyDriver service. The Skymap Scanner performs maximum-likelihood tests on individual pixels generated from the Hierarchical Equal Area isoLatitude Pixelation (HEALPix) algorithm. Each test is computationally independent, which allows for massive parallelization. This workload is distributed using the Event Workflow Management System (EWMS)—a message-based workflow management system designed to scale to trillions of pixels per day. SkyDriver orchestrates multiple distinct Skymap Scanner workflows behind a REST interface, providing an easy-to-use reconstruction service for real-time candidate, cataloged, and simulated events. Here, we outline the SkyDriver service technique and the initial development of EWMS.
Full Text Available
Coffea-Casa: Building composable analysis facilities for the HL-LHC

https://doi.org/10.1051/epjconf/202429507009

Albin, Sam; Attebury, Garhan; Bloom, Kenneth; Bockelman, Brian; Lundstedt, Carl; Shadura, Oksana; Thiltges, John (January 2024, EPJ Web of Conferences)
De_Vita, R; Espinal, X; Laycock, P; Shadura, O (Ed.)
The large data volumes expected from the High Luminosity LHC (HL-LHC) present challenges to existing paradigms and facilities for end-user data analysis. Modern cyberinfrastructure tools provide a diverse set of services that can be composed into a system that provides physicists with powerful tools that give them straightforward access to large computing resources, with low barriers to entry. The Coffea-Casa analysis facility (AF) provides an environment for end users enabling the execution of increasingly complex analyses such as those demonstrated by the Analysis Grand Challenge (AGC) and capturing the features that physicists will need for the HL-LHC. We describe the development progress of the Coffea-Casa facility featuring its modularity while demonstrating the ability to port and customize the facility software stack to other locations. The facility also supports batch systems while staying Kubernetes-native. We present the evolved architecture of the facility, such as the integration of advanced data delivery services (e.g. ServiceX) and making data caching services (e.g. XCache) available to end users of the facility. We also highlight the composability of modern cyberinfrastructure tools. To enable machine learning pipelines at Coffea-Casa analysis facilities, a set of industry ML solutions adopted for HEP columnar analysis were integrated on top of existing facility services. These services also feature transparent access for user workflows to GPUs available at a facility via inference servers while using Kubernetes as enabling technology.
Full Text Available
CRIU - Checkpoint Restore in Userspace for computational simulations and scientific applications

https://doi.org/10.1051/epjconf/202429507046

Andrijauskas, Fabio; Sfiligoi, Igor; Davila, Diego; Arora, Aashay; Guiang, Jonathan; Bockelman, Brian; Thain, Greg; Würthwein, Frank (January 2024, EPJ Web of Conferences)
De_Vita, R; Espinal, X; Laycock, P; Shadura, O (Ed.)
Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue their computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing programmatically, it would be preferable if the batch scheduling system could do that independently. This paper evaluates the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool for GNU/Linux environments, emphasizing the OSG’s OSPool HTCondor setup. CRIU allows checkpointing the process state into a disk image and can deal with both open files and established network connections seamlessly. Furthermore, it can checkpoint traditional Linux processes and containerized workloads. The functionality seems adequate for many scenarios supported in the OSPool. However, some limitations prevent it from being usable in all circumstances.
Full Text Available
Falcon: Fair and Efficient Online File Transfer Optimization

https://doi.org/10.1109/TPDS.2023.3282872

Arifuzzaman, Md; Bockelman, Brian; Basney, James; Arslan, Engin (January 2023, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
SciAuth: A Lightweight End-to-End Capability-Based Authorization Environment for Scientific Computing

https://doi.org/10.1145/3491418.3535160

Aydemir, Brian; Basney, Jim; Bockelman, Brian; Gaynor, Jeff; Weitzel, Derek (July 2022, Practice and Experience in Advanced Research Computing)

We introduce a new end-to-end software environment that enables experimentation with using SciTokens for capability-based authorization in scientific computing. This set of interconnected Docker containers enables science projects to gain experience with the SciTokens model prior to adoption. It is a product of our SciAuth project, which supports the adoption of the SciTokens model through community engagement, support for coordinated adoption of community standards, assistance with software integration, security analysis and threat modeling, training, and workforce development.
Full Text Available
Cache management for large data transfers and multipath forwarding strategies in Named Data Networking

https://doi.org/10.1016/j.comnet.2021.108437

Alhowaidi, Mohammad; Nadig, Deepak; Hu, Boyang; Ramamurthy, Byrav; Bockelman, Brian (November 2021, Computer Networks)
Full Text Available
Harnessing HPC resources for CMS jobs using a Virtual Private Network

https://doi.org/10.1051/epjconf/202125102032

Tovar, Benjamin; Bockelman, Brian; Hildreth, Michael; Lannon, Kevin; Thain, Douglas (January 2021, EPJ Web of Conferences)
Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A. (Ed.)
The processing needs for the High Luminosity (HL) upgrade for the LHC require the CMS collaboration to harness the computational power available on non-CMS resources, such as High-Performance Computing centers (HPCs). These sites often limit the external network connectivity of their computational nodes. In this paper we describe a strategy in which all network connections of CMS jobs inside a facility are routed to a single point of external network connectivity using a Virtual Private Network (VPN) server by creating virtual network interfaces in the computational nodes. We show that when the computational nodes and the host running the VPN server have the namespaces capability enabled, the setup can run entirely in user space with no other root permissions required. The VPN server host may be a privileged node inside the facility configured for outside network access, or an external service that the nodes are allowed to contact. When namespaces are not enabled at the client side, the setup falls back to using a SOCKS server instead of virtual network interfaces. We demonstrate the strategy by executing CMS Monte Carlo production requests on opportunistic non-CMS resources at the University of Notre Dame. For these jobs, cvmfs support is tested via fusermount (cvmfsexec) and via the native fuse module.
Full Text Available
Systematic benchmarking of HTTPS third party copy on 100Gbps links using XRootD

https://doi.org/10.1051/epjconf/202125102001

Fajardo, Edgar; Arora, Aashay; Davila, Diego; Gao, Richard; Würthwein, Frank; Bockelman, Brian (January 2021, EPJ Web of Conferences)
Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A. (Ed.)
The High Luminosity Large Hadron Collider provides a data challenge. The amount of data recorded from the experiments and transported to hundreds of sites will see a thirty-fold increase in annual data volume. This creates the need for a systematic approach to comparing the performance of different Third Party Copy (TPC) transfer protocols. Two contenders, XRootD-HTTPS and GridFTP, are evaluated on their performance in transferring files from one server to another over 100Gbps interfaces. The benchmarking is done by scheduling pods on the Pacific Research Platform Kubernetes cluster to ensure reproducible and repeatable results. This opens a future pathway for network testing of any TPC transfer protocol.
Full Text Available
Coffea-casa: an analysis facility prototype

https://doi.org/10.1051/epjconf/202125102061

Adamec, Matous; Attebury, Garhan; Bloom, Kenneth; Bockelman, Brian; Lundstedt, Carl; Shadura, Oksana; Thiltges, John (January 2021, EPJ Web of Conferences)
Biscarat, C.; Campana, S.; Hegner, B.; Roiser, S.; Rovelli, C.I.; Stewart, G.A. (Ed.)
Data analysis in HEP has often relied on batch systems and event loops; users are given a non-interactive interface to computing resources and consider data event-by-event. The “Coffea-casa” prototype analysis facility is an effort to provide users with alternate mechanisms to access computing resources and enable new programming paradigms. Instead of the command-line interface and asynchronous batch access, a notebook-based web interface and interactive computing are provided. Instead of writing event loops, the column-based Coffea library is used. In this paper, we describe the architectural components of the facility, the services offered to end users, and how it integrates into a larger ecosystem for data access and authentication.
Full Text Available
