Summary Large scientific facilities provide researchers with instrumentation, data, and data products that can accelerate scientific discovery. However, increasing data volumes coupled with limited local computational power prevents researchers from taking full advantage of what these facilities can offer. Many researchers looked into using commercial and academic cyberinfrastructure (CI) to process these data. Nevertheless, there remains a disconnect between large facilities and CI that requires researchers to be actively part of the data processing cycle. The increasing complexity of CI and data scale necessitates new data delivery models, those that can autonomously integrate large‐scale scientific facilities and CI to deliver real‐time data and insights. In this paper, we present our initial efforts using the Ocean Observatories Initiative project as a use case. In particular, we present a subscription‐based data streaming service for data delivery that leverages the Apache Kafka data streaming platform. We also show how our solution can automatically integrate large‐scale facilities with CI services for automated data processing.
more »
« less
Supporting Data-Driven Workflows Enabled by Large Scale Observatories
Large scale observatories are shared-use resources that provide open access to data from geographically distributed sensors and instruments. This data has the potential to accelerate scientific discovery. However, seamlessly integrating the data into scientific workflows remains a challenge. In this paper, we summarize our ongoing work in supporting data-driven and data-intensive workflows and outline our vision for how these observatories can improve large-scale science. Specifically, we present programming abstractions and runtime management services to enable the automatic integration of data in scientific workflows. Further, we show how approximation techniques can be used to address network and processing variations by studying constraint limitations and their associated latencies. We use the Ocean Observatories Initiative (OOI) as a driving use case for this work.
more »
« less
- PAR ID:
- 10077376
- Date Published:
- Journal Name:
- 2017 IEEE 13th International Conference on e-Science (e-Science)
- Page Range / eLocation ID:
- 592 to 595
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Scientific simulation workflows executing on very large scale computing systems are essential modalities for scientific investigation. The increasing scales and resolution of these simulations provide new opportunities for accurately modeling complex natural and engineered phenomena. However, the increasing complexity necessitates managing, transporting, and processing unprecedented amounts of data, and as a result, researchers are increasingly exploring data-staging and in-situ workflows to reduce data movement and data-related overheads. However, as these workflows become more dynamic in their structures and behaviors, data staging and in-situ solutions must evolve to support new requirements. In this paper, we explore how the service-oriented concept can be applied to extreme-scale in-situ workflows. Specifically, we explore persistent data staging as a service and present the design and implementation of DataSpaces as a Service, a service-oriented data staging framework. We use a dynamically coupled fusion simulation workflow to illustrate the capabilities of this framework and evaluate its performance and scalability.more » « less
-
Large scientific facilities are unique and complex infrastructures that have become fundamental instruments for enabling high quality, world-leading research to tackle scientific problems at unprecedented scales. Cyberinfrastructure (CI) is an essential component of these facilities, providing the user community with access to data, data products, and services with the potential to transform data into knowledge. However, the timely evolution of the CI available at large facilities is challenging and can result in science communities requirements not being fully satisfied. Furthermore, integrating CI across multiple facilities as part of a scientific workflow is hard, resulting in data silos. In this paper, we explore how science gateways can provide improved user experiences and services that may not be offered at large facility datacenters. Using a science gateway supported by the Science Gateway Community Institute, which provides subscription-based delivery of streamed data and data products from the NSF Ocean Observatories Initiative (OOI), we propose a system that enables streaming-based capabilities and workflows using data from large facilities, such as the OOI, in a scalable manner. We leverage data infrastructure building blocks, such as the Virtual Data Collaboratory, which provides data and comput- ing capabilities in the continuum to efficiently and collaboratively integrate multiple data-centric CIs, build data-driven workflows, and connect large facilities data sources with NSF-funded CI, such as XSEDE. We also introduce architectural solutions for running these workflows using dynamically provisioned federated CI.more » « less
-
In this paper, we describe how we extended the Pegasus Workflow Management System to support edge-to-cloud workflows in an automated fashion. We discuss how Pegasus and HTCondor (its job scheduler) work together to enable this automation. We use HTCondor to form heterogeneous pools of compute resources and Pegasus to plan the workflow onto these resources and manage containers and data movement for executing workflows in hybrid edge-cloud environments. We then show how Pegasus can be used to evaluate the execution of workflows running on edge only, cloud only, and edge-cloud hybrid environments. Using the Chameleon Cloud testbed to set up and configure an edge-cloud environment, we use Pegasus to benchmark the executions of one synthetic workflow and two production workflows: CASA-Wind and the Ocean Observatories Initiative Orcasound workflow, all of which derive their data from edge devices. We present the performance impact on workflow runs of job and data placement strategies employed by Pegasus when configured to run in the above three execution environments. Results show that the synthetic workflow performs best in an edge only environment, while the CASA - Wind and Orcasound workflows see significant improvements in overall makespan when run in a cloud only environment. The results demonstrate that Pegasus can be used to automate edge-to-cloud science workflows and the workflow provenance data collection capabilities of the Pegasus monitoring daemon enable computer scientists to conduct edge-to-cloud research.more » « less
-
Scientific workflows drive most modern large-scale science breakthroughs by allowing scientists to define their computations as a set of jobs executed in a given order based on their data dependencies. Workflow management systems (WMSs) have become key to automating scientific workflows-executing computational jobs and orchestrating data transfers between those jobs running on complex high-performance computing (HPC) platforms. Traditionally, WMSs use files to communicate between jobs: a job writes out files that are read by other jobs. However, HPC machines face a growing gap between their storage and compute capabilities. To address that concern, the scientific community has adopted a new approach called in situ, which bypasses costly parallel filesystem I/O operations with faster in-memory or in-network communications. When using in situ approaches, communication and computations can be interleaved. In this work, we leverage the Decaf in situ dataflow framework to accelerate task-based scientific workflows managed by the Pegasus WMS, by replacing file communications with faster MPI messaging. We propose a new execution engine that uses Decaf to manage communications within a sub-workflow (i.e., set of jobs) to optimize inter-job communications. We consider two workflows in this study: (i) a synthetic workflow that benchmarks and compares file- and MPI-based communication; and (ii) a realistic bioinformatics workflow that computes mu-tational overlaps in the human genome. Experiments show that in situ communication can improve the bioinformatics workflow execution time by 22% to 30% compared with file communication. Our results motivate further opportunities and challenges for bridging traditional WMSs with in situ frameworks.more » « less
An official website of the United States government

