This content will become publicly available on June 6, 2026

Title: Resilient execution of distributed X-ray image analysis workflows
Long-running scientific workflows, such as tomographic data analysis pipelines, are prone to a variety of failures, including hardware and network disruptions, as well as software errors. These failures can substantially degrade performance and increase turnaround times, particularly in large-scale, geographically distributed, and time-sensitive environments like synchrotron radiation facilities. In this work, we propose and evaluate resilience strategies aimed at mitigating the impact of failures in tomographic reconstruction workflows. Specifically, we introduce an asynchronous, non-blocking checkpointing mechanism and a dynamic load redistribution technique with lazy recovery, designed to enhance workflow reliability and minimize failure-induced overheads. These approaches facilitate progress preservation, balanced load distribution, and efficient recovery in error-prone environments. To evaluate their effectiveness, we implement a 3D tomographic reconstruction pipeline and deploy it across Argonne's leadership computing infrastructure and synchrotron facilities. Our results demonstrate that the proposed resilience techniques significantly reduce failure impact, by up to 500×, while maintaining negligible overhead (<3%).
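The abstract's core mechanism, asynchronous non-blocking checkpointing, can be sketched in a few lines: the worker enqueues state snapshots and keeps computing while a background thread persists them. This is an illustrative toy, not the paper's implementation; the class name, JSON format, and single-file layout are all assumptions.

```python
import json
import os
import queue
import tempfile
import threading

class AsyncCheckpointer:
    """Toy asynchronous, non-blocking checkpointer: save() only enqueues
    a snapshot; a background thread does the actual disk I/O."""

    def __init__(self, path):
        self.path = path
        self._queue = queue.Queue()
        self._writer = threading.Thread(target=self._drain, daemon=True)
        self._writer.start()

    def save(self, state):
        # Non-blocking from the worker's perspective: copy and enqueue.
        self._queue.put(dict(state))

    def _drain(self):
        while True:
            state = self._queue.get()
            if state is None:  # shutdown sentinel
                break
            # Write to a temp file, then rename atomically, so a crash
            # mid-write never leaves a torn checkpoint behind.
            fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
            with os.fdopen(fd, "w") as f:
                json.dump(state, f)
            os.replace(tmp, self.path)

    def close(self):
        self._queue.put(None)
        self._writer.join()

    def restore(self):
        # Lazy recovery: load the last persisted state only on demand.
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return None
```

The atomic temp-file-plus-rename step is what keeps recovery safe: a failure during the write leaves the previous checkpoint intact rather than a half-written file.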
Award ID(s):
2411386
PAR ID:
10637362
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Frontiers
Date Published:
Journal Name:
Frontiers in High Performance Computing
Volume:
3
ISSN:
2813-7337
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The dramatic increase in the scale of current and planned high-end HPC systems is leading to new challenges, such as the growing costs of data movement and I/O and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflow on the HPC system, have emerged as an attractive approach to addressing data-related challenges by moving computation closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this paper, we present CoREC, a scalable resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. The paper also presents optimizations for load balancing and conflict-avoiding encoding, and a low-overhead, lazy data recovery scheme. We have implemented the CoREC runtime, deployed it with the DataSpaces staging service on Titan at ORNL, and present an experimental evaluation in the paper. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.
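CoREC's hybrid of replication and erasure coding can be illustrated with a toy single-parity XOR code plus an access-frequency policy. Everything here is an assumption for illustration (function names, the hot/cold threshold, and the XOR code itself; production systems like CoREC would use stronger codes such as Reed-Solomon).

```python
from functools import reduce

def encode_xor(data: bytes, k: int):
    """Split data into k equal chunks plus one XOR parity chunk.
    This toy code tolerates the loss of any single chunk."""
    size = -(-len(data) // k)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)
    return chunks, parity

def recover_xor(chunks, parity, lost: int):
    """Rebuild the lost chunk: XOR of all survivors and the parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost] + [parity]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)

def place(access_count: int, hot_threshold: int = 100) -> str:
    """Hybrid policy sketch: replicate hot objects for read latency,
    erasure-code cold ones for storage efficiency."""
    return "replicate" if access_count >= hot_threshold else "erasure-code"
```

The storage trade-off the abstract alludes to falls out directly: 3-way replication costs 3× the data size, while a k+1 XOR group costs (k+1)/k×, at the price of a reconstruction step on failure.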
  2. In IoT deployments, it is often necessary to replicate data in failure-prone and resource-constrained computing environments to meet the data availability requirements of smart applications. In this paper, we evaluate the impact of correlated failures on an off-the-shelf probabilistic replica placement strategy for IoT systems via trace-driven simulation. We extend this strategy to handle both correlated failures as well as resource scarcity by estimating the amount of storage capacity required to meet data availability requirements. These advancements lay the foundation for building computing systems that are capable of handling the unique challenge of reliable data access in low-resource environments. 
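The storage-estimation idea in the replica-placement abstract reduces, in its simplest independent-failure form, to choosing the smallest replica count that meets an availability target. This sketch is not the paper's strategy (which additionally handles correlated failures, e.g., by counting failure domains rather than nodes); names and formulas here are illustrative.

```python
import math

def replicas_needed(p_fail: float, target_avail: float) -> int:
    """Smallest k with 1 - p_fail**k >= target_avail, assuming
    independent node failures with per-node failure probability p_fail.
    Correlated failures would require placing replicas in distinct
    failure domains and applying the same bound to domains."""
    return math.ceil(math.log(1.0 - target_avail) / math.log(p_fail))

def storage_required(object_mb: float, p_fail: float, target_avail: float) -> float:
    """Estimated storage to meet the availability target for one object."""
    return object_mb * replicas_needed(p_fail, target_avail)
```

For example, with 10% node failure probability, three replicas already give 99.9% availability, so a 100 MB object needs roughly 300 MB of raw capacity.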
  3. By processing sensory data in the vicinity of its generation, edge computing reduces latency, improves responsiveness, and saves network bandwidth in data-intensive applications. However, existing edge computing solutions operate under the assumption that the edge infrastructure will comprise a set of pre-deployed, custom-configured computing devices, connected by a reliable local network. Although edge computing has great potential to provision the necessary computational resources in highly dynamic and volatile environments, including disaster recovery scenes and wilderness expeditions, extant distributed system architectures in this domain are not resilient against partial failure, caused by network disconnections. In this paper, we present a novel edge computing system architecture that delivers failure-resistant and efficient applications by dynamically adapting to handle failures; if the edge server becomes unreachable, device clusters start executing the assigned tasks by communicating P2P, until the edge server becomes reachable again. Our experimental results with the reference implementation show high responsiveness and resilience in the face of partial failure. These results indicate that the presented solution can integrate the individual capacities of mobile devices into powerful edge clouds, providing efficient and reliable services for end-users in highly dynamic and volatile environments. 
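The failure-adaptive dispatch described above (offload to the edge server when reachable, fall back to P2P execution across the device cluster otherwise) can be sketched as follows. The class and method names are hypothetical stand-ins, not the paper's API, and the doubling "computation" is a placeholder workload.

```python
class Task:
    """A divisible workload: can be split across peers and re-merged."""
    def __init__(self, items):
        self.items = items

    def split(self, n):
        return [Task(self.items[i::n]) for i in range(n)]

    @staticmethod
    def merge(parts):
        return sum(parts, [])

class Peer:
    def execute(self, task):
        return [x * 2 for x in task.items]  # stand-in computation

class EdgeServer(Peer):
    def __init__(self, reachable=True):
        self.reachable = reachable

    def execute(self, task):
        if not self.reachable:
            raise ConnectionError("edge server unreachable")
        return super().execute(task)

def run_task(task, edge, peers):
    """Prefer the edge server; on disconnection, execute the task
    peer-to-peer across the device cluster and merge partial results."""
    try:
        return edge.execute(task)
    except ConnectionError:
        parts = [p.execute(t) for p, t in zip(peers, task.split(len(peers)))]
        return Task.merge(parts)
```

The key property is that the caller gets the same result either way; only latency and resource placement change when the cluster degrades to P2P mode.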
  4. Post-disaster reconnaissance is vital for assessing the impact of a natural disaster on the built environment and informing improvements in design, construction, risk mitigation, and our understanding of extreme events. The data obtained from reconnaissance can also be utilized to improve disaster recovery planning by maximizing resource efficiency, minimizing waste, and promoting resilience in future disasters. This paper aims to investigate existing reconnaissance reports and datasets to identify the factors that impact the reusability of buildings post-disaster and to recommend strategies that align with circular economy goals. The study adopted a three-step research methodology to attain the proposed goals: (1) thematic analysis was used to evaluate types of damages reported in the reconnaissance reports; (2) a supervised machine-learning algorithm was employed to analyze reconnaissance datasets; and (3) a concept map was developed based on interviews of 109 stakeholders in disaster-prone communities to recommend strategies to adopt circular economy practices post-disaster. The study results highlight the recurring risks of damage to different parts of the building and how circular economy resilience practices like deconstruction can minimize waste and maximize resource efficiency during post-disaster recovery. The findings of the study promote a more regenerative economy to build resilience to the challenges of future extreme weather events. 
  5. Abstract This article analyzes the role of dynamic economic resilience in relation to recovery from disasters in general and illustrates its potential to reduce disaster losses in a case study of the Wenchuan earthquake of 2008. We first offer operational definitions of the concept linked to policies to promote increased levels and speed of investment in repair and reconstruction to implement this resilience. We then develop a dynamic computable general equilibrium (CGE) model that incorporates major features of investment and traces the time‐path of the economy as it recovers with and without dynamic economic resilience. The results indicate that resilience strategies could have significantly reduced GDP losses from the Wenchuan earthquake by 47.4% during 2008–2011 by accelerating the pace of recovery and could have further reduced losses slightly by shortening the recovery by one year. The results can be generalized to conclude that shortening the recovery period is not nearly as effective as increasing reconstruction investment levels and steepening the time‐path of recovery. This is an important distinction that should be made in the typically vague and singular reference to increasing the speed of recovery in many definitions of dynamic resilience. 
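The distinction this last abstract draws, between steepening the recovery time-path and merely shortening the recovery period, can be made concrete with a toy loss calculation (cumulative output gap below a baseline). The numbers and linear/quadratic paths below are illustrative assumptions, not the paper's CGE model or its Wenchuan results.

```python
def loss(path, baseline=100.0):
    """Cumulative GDP loss: area between baseline output and the path."""
    return sum(baseline - y for y in path)

def linear(drop, T, baseline=100.0):
    """Output falls by `drop`, then recovers linearly over T periods."""
    return [baseline - drop * (1 - t / T) for t in range(T)]

def steep(drop, T, baseline=100.0):
    """Front-loaded reconstruction investment: the output gap closes
    quadratically fast over the same T periods."""
    return [baseline - drop * (1 - t / T) ** 2 for t in range(T)]

base = loss(linear(30, 12))      # no intervention
shorter = loss(linear(30, 11))   # recovery one period shorter
steeper = loss(steep(30, 12))    # same duration, steeper path
```

In this toy, steepening the path cuts losses by about a third, while trimming one period off a linear recovery saves under a tenth, which mirrors the abstract's conclusion that the shape of the time-path matters more than its endpoint.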