skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Developing Distributed High-performance Computing Capabilities of an Open Science Platform for Robust Epidemic Analysis
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among domain experts, mathematical modelers, and scientific computing specialists. Computationally, however, it also revealed critical gaps in the ability of researchers to exploit advanced computing systems. These challenging areas include gaining access to scalable computing systems, porting models and workflows to new systems, sharing data of varying sizes, and producing results that can be reproduced and validated by others. Informed by our team’s work in supporting public health decision makers during the COVID-19 pandemic and by the identified capability gaps in applying high-performance computing (HPC) to the modeling of complex social systems, we present the goals, requirements, and initial implementation of OSPREY, an open science platform for robust epidemic analysis. The prototype implementation demonstrates an integrated, algorithm-driven HPC workflow architecture, coordinating tasks across federated HPC resources, with robust, secure and automated access to each of the resources. We demonstrate scalable and fault-tolerant task execution, an asynchronous API to support fast time-to-solution algorithms, an inclusive, multi-language approach, and efficient wide-area data management. The example OSPREY code is made available on a public repository.  more » « less
Award ID(s):
2200234
PAR ID:
10442387
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Date Published:
Journal Name:
2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Page Range / eLocation ID:
868 to 877
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Yang, Chaowei (Ed.)
    In response to the soaring needs of human mobility data, especially during disaster events such as the COVID-19 pandemic, and the associated big data challenges, we develop a scalable online platform for extracting, analyzing, and sharing multi-source multi-scale human mobility flows. Within the platform, an origin-destination-time (ODT) data model is proposed to work with scalable query engines to handle heterogenous mobility data in large volumes with extensive spatial coverage, which allows for efficient extraction, query, and aggregation of billion-level origin-destination (OD) flows in parallel at the server-side. An interactive spatial web portal, ODT Flow Explorer, is developed to allow users to explore multi-source mobility datasets with user-defined spatiotemporal scales. To promote reproducibility and replicability, we further develop ODT Flow REST APIs that provide researchers with the flexibility to access the data programmatically via workflows, codes, and programs. Demonstrations are provided to illustrate the potential of the APIs integrating with scientific workflows and with the Jupyter Notebook environment. We believe the platform coupled with the derived multi-scale mobility data can assist human mobility monitoring and analysis during disaster events such as the ongoing COVID-19 pandemic and benefit both scientific communities and the general public in understanding human mobility dynamics. 
    more » « less
  2. null (Ed.)
    Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates their impacts at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions which are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions. 
    more » « less
  3. null (Ed.)
    The digital divide—and, in particular, the homework gap— have been exacerbated by the COVID-19 pandemic, laying bare not only the inequities in broadband Internet access but also how these inequities ultimately affect citizens’ ability to learn, work, and play. Addressing these inequities ultimately requires having holistic, “full stack” data on the nature of the gaps in infrastructure and uptake—from the physical infrastructure (e.g., fiber, cable) to speed and application performance to affordability and neighborhood effects that ultimately affect whether a technology is adopted. This paper surveys how various existing datasets can (and cannot) shed light on these gaps, the limitations of these datasets, what we know from existing data about how the Internet responded to shifts in traffic during COVID-19, and—importantly for the future—what data we need to better understand these problems moving forward and how the research community, policymakers, and the public might gain access to various data. Keywords: digital divide,iInternet, mapping, performance 
    more » « less
  4. null (Ed.)
    We describe the design motivation, architecture, deployment, and early operations of Expanse, a 5 Petaflop, heterogenous HPC system that entered production as an NSF-funded resource in December 2020 and will be operated on behalf of the national community for five years. Expanse will serve a broad range of computational science and engineering through a combination of standard batch-oriented services, and by extending the system to the broader CI ecosystem through science gateways, public cloud integration, support for high throughput computing, and composable systems. Expanse was procured, deployed, and put into production entirely during the COVID-19 pandemic, adhering to stringent public health guidelines throughout. Nevertheless, the planned production date of October 1, 2020 slipped by only two months, thanks to thorough planning, a dedicated team of technical and administrative experts, collaborative vendor partnerships, and a commitment to getting an important national computing resource to the community at a time of great need. 
    more » « less
  5. null (Ed.)
    The COVID-19 public health emergency caused widespread economic shutdown and unemployment. The resulting surge in Unemployment Insurance claims threatened to overwhelm the legacy systems state workforce agencies rely on to collect, process, and pay claims. In Rhode Island, we developed a scalable cloud solution to collect Pandemic Unemployment Assistance claims as part of a new program created under the Coronavirus Aid, Relief and Economic Security Act to extend unemployment benefits to independent contractors and gig-economy workers not covered by traditional Unemployment Insurance. Our new system was developed, tested, and deployed within 10 days following the passage of the Coronavirus Aid, Relief and Economic Security Act, making Rhode Island the first state in the nation to collect, validate, and pay Pandemic Unemployment Assistance claims. A cloud-enhanced interactive voice response system was deployed a week later to handle the corresponding surge in weekly certifications for continuing unemployment benefits. Cloud solutions can augment legacy systems by offloading processes that are more efficiently handled in modern scalable systems, reserving the limited resources of legacy systems for what they were originally designed. This agile use of combined technologies allowed Rhode Island to deliver timely Pandemic Unemployment Assistance benefits with an estimated cost savings of $502,000 (representing a 411% return on investment). 
    more » « less