


Title: Developing Distributed High-performance Computing Capabilities of an Open Science Platform for Robust Epidemic Analysis
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and the scientific community's broad response to it forged new relationships among domain experts, mathematical modelers, and scientific computing specialists. Computationally, however, the response also revealed critical gaps in the ability of researchers to exploit advanced computing systems. These challenging areas include gaining access to scalable computing systems, porting models and workflows to new systems, sharing data of varying sizes, and producing results that can be reproduced and validated by others. Informed by our team's work supporting public health decision makers during the COVID-19 pandemic and by the capability gaps identified in applying high-performance computing (HPC) to the modeling of complex social systems, we present the goals, requirements, and initial implementation of OSPREY, an open science platform for robust epidemic analysis. The prototype implementation demonstrates an integrated, algorithm-driven HPC workflow architecture that coordinates tasks across federated HPC resources, with robust, secure, and automated access to each resource. We demonstrate scalable and fault-tolerant task execution, an asynchronous API to support fast time-to-solution algorithms, an inclusive, multi-language approach, and efficient wide-area data management. The example OSPREY code is made available in a public repository.
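As a rough illustration of the asynchronous, algorithm-driven task loop the abstract describes, the Python sketch below keeps a bounded pool of simulation tasks in flight and consumes each result as soon as it completes, using that result to propose the next task. All function and parameter names here are hypothetical placeholders chosen for illustration; they are not taken from the OSPREY repository or its API.

```python
# Minimal sketch of an asynchronous, algorithm-driven task loop (illustrative only).
# run_epidemic_model, propose_parameters, and calibrate are hypothetical names,
# not the OSPREY API; a local process pool stands in for federated HPC resources.
from concurrent.futures import ProcessPoolExecutor, wait, FIRST_COMPLETED
import random

def run_epidemic_model(params):
    """Stand-in for a simulation task dispatched to a remote HPC resource."""
    # A real federated setup would hand this to a remote scheduler or
    # function-execution service rather than a local process pool.
    return {"params": params, "score": -abs(params["beta"] - 0.3)}

def propose_parameters(history):
    """Toy proposal step: sample a candidate transmission rate."""
    return {"beta": random.uniform(0.0, 1.0)}

def calibrate(budget=32, max_in_flight=8):
    history = []
    with ProcessPoolExecutor(max_workers=max_in_flight) as pool:
        pending = {pool.submit(run_epidemic_model, propose_parameters(history))
                   for _ in range(max_in_flight)}
        submitted = max_in_flight
        while pending:
            # Asynchronous consumption: act on each result as it arrives instead
            # of waiting for a whole batch, keeping the pipeline full.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                history.append(fut.result())
                if submitted < budget:
                    pending.add(pool.submit(run_epidemic_model,
                                            propose_parameters(history)))
                    submitted += 1
    return max(history, key=lambda r: r["score"])

if __name__ == "__main__":
    print(calibrate())
```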
Award ID(s):
2200234
NSF-PAR ID:
10442387
Author(s) / Creator(s):
Date Published:
Journal Name:
2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Page Range / eLocation ID:
868 to 877
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Yang, Chaowei (Ed.)
    In response to the soaring need for human mobility data, especially during disaster events such as the COVID-19 pandemic, and the associated big data challenges, we develop a scalable online platform for extracting, analyzing, and sharing multi-source, multi-scale human mobility flows. Within the platform, an origin-destination-time (ODT) data model is proposed to work with scalable query engines to handle heterogeneous mobility data in large volumes with extensive spatial coverage, allowing efficient extraction, query, and aggregation of billion-level origin-destination (OD) flows in parallel on the server side. An interactive spatial web portal, ODT Flow Explorer, is developed to allow users to explore multi-source mobility datasets at user-defined spatiotemporal scales. To promote reproducibility and replicability, we further develop ODT Flow REST APIs that give researchers the flexibility to access the data programmatically via workflows, code, and programs. Demonstrations illustrate the potential of the APIs to integrate with scientific workflows and with the Jupyter Notebook environment. We believe the platform, coupled with the derived multi-scale mobility data, can assist human mobility monitoring and analysis during disaster events such as the ongoing COVID-19 pandemic and benefit both scientific communities and the general public in understanding human mobility dynamics.
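    As a rough illustration of programmatic access to such a flow API, the sketch below queries a REST endpoint for aggregated OD flows over a date range. The base URL, query parameter names, and response layout are illustrative assumptions, not the documented ODT Flow REST API.

```python
# Hypothetical sketch of querying an origin-destination (OD) flow REST API.
# The endpoint, parameters, and JSON layout are placeholders for illustration.
import requests

BASE_URL = "https://example.org/odt-flow/api/flows"  # placeholder endpoint

def fetch_daily_flows(origin, destination, start, end, scale="county"):
    """Request aggregated OD flows between two places over a date range."""
    resp = requests.get(
        BASE_URL,
        params={
            "origin": origin,          # e.g., a county FIPS code
            "destination": destination,
            "start_date": start,       # ISO dates, e.g. "2020-03-01"
            "end_date": end,
            "scale": scale,            # user-defined spatiotemporal scale
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # assumed shape: list of {"date": ..., "flow": ...} records

if __name__ == "__main__":
    records = fetch_daily_flows("17031", "36061", "2020-03-01", "2020-03-31")
    print("Total March 2020 flow:", sum(r["flow"] for r in records))
```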
  2.
    Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions that are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
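    The modeling idea can be made concrete with a small simulation: when ranks synchronize, each workload interval lasts as long as its slowest rank, so the quantity of interest is the maximum over per-rank runtimes. The Python sketch below is an illustrative toy under assumed noise distributions, not the paper's MPI-based implementation; it shows how that maximum grows with rank count even when per-rank statistics stay fixed.

```python
# Toy sketch of the "maximum workload interval" view of performance variation.
# Each synchronized interval completes only when its slowest rank finishes,
# so the interval length is the maximum over per-rank runtimes.
import numpy as np

rng = np.random.default_rng(0)

def simulate_intervals(n_ranks=1024, n_intervals=10_000,
                       base=1.0, noise_scale=0.05):
    # Per-rank runtimes: a nominal workload plus right-skewed noise standing in
    # for OS jitter, contention, and other variation sources (assumed model).
    runtimes = base + rng.gamma(shape=2.0, scale=noise_scale,
                                size=(n_intervals, n_ranks))
    return runtimes.max(axis=1)  # slowest rank sets each interval's length

interval_lengths = simulate_intervals()
print("mean interval:", interval_lengths.mean())
print("99th percentile:", np.quantile(interval_lengths, 0.99))

# The maximum grows with rank count even though per-rank statistics are fixed,
# which is the scaling effect a model of this kind has to capture.
for p in (64, 256, 1024, 4096):
    print(p, simulate_intervals(n_ranks=p, n_intervals=2000).mean())
```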
    Rapid and widespread testing of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is essential for an effective public health response aimed at containing and mitigating the coronavirus disease 2019 (COVID-19) pandemic. Successful health policy implementation relies on early identification of infected individuals and extensive contact tracing. However, rural communities, where resources for testing are sparse or simply absent, face distinctive challenges to achieving this success. Accordingly, we report the development of an academic, public land-grant university laboratory-based detection assay for the identification of SARS-CoV-2 in samples from various clinical specimens that can be readily deployed in areas where access to testing is limited. The test, which is a quantitative reverse transcription polymerase chain reaction (RT-qPCR)-based procedure, was validated on samples provided by the state laboratory and submitted for FDA Emergency Use Authorization. Our test exhibits sensitivity comparable to, and specificity and inclusivity values exceeding, those of other molecular assays. Additionally, this test can be reconfigured to accommodate supply chain shortages, modified to meet scale-up demands, and is amenable to several clinical specimen types. Test development also involved 3D-engineering critical supplies and formulating a stable collection medium that allowed samples to be transported for hours across a dispersed rural region without the need for a cold chain. These two elements were critical when shortages impacted testing and when personnel needed to reach areas geographically isolated from the testing center. Overall, using a robust, easy-to-adapt methodology, we show that an academic laboratory can supplement COVID-19 testing needs and help local health departments assess and manage outbreaks. This additional testing capacity is particularly germane for smaller cities and rural regions that would otherwise be unable to meet the testing demand.
  4. The management of security credentials (e.g., passwords, secret keys) for computational science workflows is a burden for scientists and information security officers. Problems with credentials (e.g., expiration, privilege mismatch) cause workflows to fail to fetch needed input data or store valuable scientific results, distracting scientists from their research by requiring them to diagnose the problems, re-run their computations, and wait longer for their results. SciTokens introduces a capabilities-based authorization infrastructure for distributed scientific computing, to help scientists manage their security credentials more reliably and securely. SciTokens uses IETF-standard OAuth JSON Web Tokens for capability-based secure access to remote scientific data. These access tokens convey the specific authorizations needed by the workflows, rather than general-purpose authentication impersonation credentials, to address the risks of scientific workflows running on distributed infrastructure including NSF resources (e.g., LIGO Data Grid, Open Science Grid, XSEDE) and public clouds (e.g., Amazon Web Services, Google Cloud, Microsoft Azure). By improving the interoperability and security of scientific workflows, SciTokens 1) enables use of distributed computing for scientific domains that require greater data protection and 2) enables use of more widely distributed computing resources by reducing the risk of credential abuse on remote systems. In this extended abstract, we present the results over the past year of our open source implementation of the SciTokens model and its deployment in the Open Science Grid, including new OAuth support added in the HTCondor 8.8 release series. 
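    The capability-token idea can be sketched with a generic JWT library: a token carries only the path-scoped operations a workflow needs (e.g., read:/data), and the storage service checks the requested operation against those scopes. The example below uses PyJWT with placeholder issuer, audience, and key values and a deliberately simplified scope check; it is a sketch in the spirit of the SciTokens model, not the SciTokens reference implementation.

```python
# Minimal sketch of capability-style authorization with a JSON Web Token.
# Uses the generic PyJWT library; issuer, audience, and signing key are
# placeholders, and the scope check is simplified for illustration.
import time
import jwt  # PyJWT

SECRET = "demo-signing-key"  # real deployments use asymmetric keys published via JWKS

def issue_token(subject, scopes, lifetime=3600):
    """Mint a short-lived token carrying only the capabilities a job needs."""
    now = int(time.time())
    claims = {
        "iss": "https://issuer.example.org",   # placeholder issuer
        "aud": "https://storage.example.org",  # placeholder audience
        "sub": subject,
        "iat": now,
        "exp": now + lifetime,
        "scope": " ".join(scopes),             # e.g. "read:/data write:/results"
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def authorize(token, operation, path):
    """Check that the token grants the requested operation on the path."""
    claims = jwt.decode(token, SECRET, algorithms=["HS256"],
                        audience="https://storage.example.org")
    for cap in claims["scope"].split():
        op, _, prefix = cap.partition(":")
        if op == operation and path.startswith(prefix):
            return True
    return False

tok = issue_token("workflow-1234", ["read:/data", "write:/results"])
print(authorize(tok, "write", "/results/run42.h5"))  # True
print(authorize(tok, "write", "/data/raw.h5"))       # False
```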
  5.
    The COVID-19 public health emergency caused widespread economic shutdown and unemployment. The resulting surge in Unemployment Insurance claims threatened to overwhelm the legacy systems state workforce agencies rely on to collect, process, and pay claims. In Rhode Island, we developed a scalable cloud solution to collect Pandemic Unemployment Assistance claims as part of a new program created under the Coronavirus Aid, Relief, and Economic Security Act to extend unemployment benefits to independent contractors and gig-economy workers not covered by traditional Unemployment Insurance. Our new system was developed, tested, and deployed within 10 days of the passage of the Act, making Rhode Island the first state in the nation to collect, validate, and pay Pandemic Unemployment Assistance claims. A cloud-enhanced interactive voice response system was deployed a week later to handle the corresponding surge in weekly certifications for continuing unemployment benefits. Cloud solutions can augment legacy systems by offloading processes that are more efficiently handled in modern scalable systems, reserving the limited resources of legacy systems for the functions they were originally designed to perform. This agile use of combined technologies allowed Rhode Island to deliver timely Pandemic Unemployment Assistance benefits with an estimated cost savings of $502,000 (representing a 411% return on investment).