Many scientific applications are expressed as high-throughput workflows that consist of large graphs of data assets and tasks to be executed on large parallel and distributed systems. A challenge in executing these workflows is managing data: both datasets and software must be efficiently distributed to cluster nodes; intermediate data must be conveyed between tasks; output data must be delivered to its destination. Scaling problems result when these actions are performed in an uncoordinated manner on a shared filesystem. To address this problem, we introduce TaskVine: a system for exploiting the aggregate local storage and network capacity of a large cluster. TaskVine tracks the lifetime of data in a workflow, from archival sources to final outputs, making use of local storage to distribute and re-use data wherever possible. We describe the architecture and novel capabilities of TaskVine, and demonstrate its use with applications in genomics, high energy physics, molecular dynamics, and machine learning.
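As a concrete illustration of this model, a minimal TaskVine driver in Python might look like the sketch below. It follows the published ndcctools.taskvine examples; the file names are hypothetical, and the exact signatures should be checked against the TaskVine documentation for the installed version.

```python
# A minimal sketch following the published ndcctools.taskvine examples;
# file names are hypothetical.
import ndcctools.taskvine as vine

m = vine.Manager(9123)              # listen for workers on port 9123
print(f"manager listening on port {m.port}")

# Declare data once: TaskVine caches declared files on worker-local
# storage and re-uses them across tasks instead of re-reading them
# from a shared filesystem.
data = m.declare_file("input.dat")  # hypothetical input dataset
out = m.declare_file("result.txt")

t = vine.Task("wc -l input.dat > result.txt")
t.add_input(data, "input.dat")
t.add_output(out, "result.txt")
m.submit(t)

while not m.empty():
    t = m.wait(5)
    if t:
        print(f"task {t.id} exited with status {t.exit_code}")
```

Because files are declared to the manager rather than opened ad hoc, TaskVine knows each file's producers and consumers and can keep it on worker-local disks for as long as the workflow needs it.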
Maximizing Data Utility for HPC Python Workflow Execution
Large-scale HPC workflows are increasingly implemented in dynamic languages such as Python, which allow for more rapid development than traditional techniques. However, the cost of executing Python applications at scale is often dominated by the distribution of common datasets and complex software dependencies. As the application scales up, data distribution becomes a limiting factor that prevents scaling beyond a few hundred nodes. To address this problem, we present the integration of Parsl (a Python-native parallel programming library) with TaskVine (a data-intensive workflow execution engine). Instead of relying on a shared filesystem to provide data to tasks on demand, Parsl is able to express advance data needs to TaskVine, which then performs efficient data distribution at runtime. This combination provides a performance speedup of 1.48x over the typical method of on-demand paging from the shared filesystem, along with an average task speedup of 1.79x with 2048 tasks on 256 nodes.
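In practice, this integration is exposed through Parsl's TaskVine executor. A minimal sketch, assuming parsl.executors.taskvine.TaskVineExecutor with default options, might look like the following; verify names against the installed Parsl version.

```python
# A minimal sketch of running Parsl apps on TaskVine workers; executor
# options are left at defaults and should be tuned for a real cluster.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors.taskvine import TaskVineExecutor

# Each @python_app call becomes a TaskVine task; the serialized function
# and its Python environment are distributed and cached on worker-local
# storage rather than paged on demand from the shared filesystem.
parsl.load(Config(executors=[TaskVineExecutor(label="taskvine")]))

@python_app
def simulate(seed):
    import random
    random.seed(seed)
    return sum(random.random() for _ in range(10_000))

# (On a cluster, TaskVine workers are typically started separately,
# e.g. with vine_worker; details depend on the deployment.)
futures = [simulate(i) for i in range(2048)]
print(max(f.result() for f in futures))
```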
- Award ID(s): 1931348
- PAR ID: 10567833
- Publisher / Repository: ACM
- Date Published:
- ISBN: 9798400707858
- Page Range / eLocation ID: 637 to 640
- Format(s): Medium: X
- Location: Denver, CO, USA
- Sponsoring Org: National Science Foundation
More Like this
- High energy physics experiments produce petabytes of data annually that must be reduced to gain insight into the laws of nature. Early-stage reduction executes long-running high-throughput workflows across thousands of nodes spanning multiple facilities to produce shared datasets. Later stages are typically written by individuals or small groups and must be refined and re-run many times for correctness. Reducing iteration times of later stages is key to accelerating discovery. We demonstrate our experience reshaping late-stage analysis applications on thousands of nodes. It is not enough merely to increase scale: it is necessary to make changes throughout the stack, including storage systems, data management, task scheduling, and application design. We demonstrate these changes when applied to two analysis applications built on open source data analysis frameworks (Coffea, Dask, TaskVine). We evaluate the performance of the applications on opportunistic campus clusters, showing effective scaling up to 7200 cores, thus producing significant speedup.
- Modern scientific workflows need to mix several different computing modalities: self-contained computational tasks, data-intensive transformations, and serverless function calls. To date, these modalities have required distinct system architectures with different scheduling objectives and constraints. In this paper, we describe how TaskVine, a new workflow execution platform, combines these modalities into an execution platform with shared abstractions. We demonstrate results of the system executing a machine learning workflow with combined standalone tasks and serverless functions. (A hedged sketch of this mixed-modality usage appears after this list.)
- Distributed data management systems often operate on “elastic” clusters that can scale up or down on demand. These systems face numerous challenges, including data fragmentation, replication, and cluster sizing. Unfortunately, these challenges have traditionally been treated independently, leaving administrators with little insight on how the interplay of these decisions affects query performance. This paper introduces NashDB, an adaptive data distribution framework that relies on an economic model to automatically balance the supply and demand of data fragments, replicas, and cluster nodes. NashDB adapts its decisions to query priorities and shifting workloads, while avoiding underutilized cluster nodes and redundant replicas. This paper introduces and evaluates NashDB’s model, as well as a suite of optimization techniques designed to efficiently identify data distribution schemes that match workload demands and transition the system to this new scheme with minimum data transfer overhead. Experimentally, we show that NashDB is often Pareto dominant compared to other solutions.
- Parallel filesystems (PFSs) are one of the most critical high-availability components of High Performance Computing (HPC) systems. Most HPC workloads depend on the availability of a POSIX-compliant parallel filesystem that provides a globally consistent view of data to all compute nodes of an HPC system. Because of this central role, failure or performance degradation events in the PFS can impact every user of an HPC resource. There is typically insufficient information available to users, and even to many HPC staff, to identify the causes of these PFS events, impeding the implementation of timely and targeted remedies to PFS issues. The relevant information is distributed across PFS servers; however, access to these servers is highly restricted due to the sensitive role they play in the operations of an HPC system. Additionally, the information is challenging to aggregate and interpret, relegating diagnosis and treatment of PFS issues to a select few experts with privileged system access. To democratize this information, we are developing an open-source and user-facing Parallel FileSystem TRacing and Analysis SErvice (PFSTRASE) that analyzes the requisite data to establish causal relationships between PFS activity and events detrimental to stability and performance. We are implementing the service for the open-source Lustre filesystem, which is the most commonly used PFS at large-scale HPC sites. Server loads for specific PFS I/O operations (IOPs) will be measured and aggregated by the service to automatically estimate an effective load generated by every client, job, and user. (A toy sketch of this aggregation idea appears after this list.) The infrastructure provides a real-time, user-accessible text-based interface and a publicly accessible web interface displaying both real-time and historical data.
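For the TaskVine mixed-modality abstract above, a minimal sketch of combining a standalone task with serverless function calls in one manager might look like the following. The call names follow the TaskVine serverless documentation (create_library_from_functions, install_library, FunctionCall), but treat the exact signatures as assumptions to be checked against the installed ndcctools version.

```python
# A hedged sketch of mixing execution modalities under one TaskVine manager.
import ndcctools.taskvine as vine

m = vine.Manager(9123)

# Modality 1: a self-contained command-line task.
m.submit(vine.Task("echo preprocessing done"))

# Modality 2: serverless calls into a library of Python functions that
# is installed once per worker and then invoked many times cheaply.
def predict(x):
    return 2 * x + 1  # stand-in for a model-inference step

lib = m.create_library_from_functions("ml_library", predict)
m.install_library(lib)

for x in range(10):
    m.submit(vine.FunctionCall("ml_library", "predict", x))

while not m.empty():
    t = m.wait(5)
    if t:
        print(t.output)  # stdout for the command task, return value for the calls
```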
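And for the PFSTRASE abstract above, the per-client effective-load estimate could be illustrated by the toy aggregation below. The sampled counts and cost weights are invented for the example; this is not PFSTRASE's actual code.

```python
# A toy illustration of estimating an effective per-client load from
# per-operation server counts: weight each IOP type by an assumed
# server-side cost and sum over observations.
from collections import defaultdict

# (client/job/user, operation, count) as might be read from server counters
samples = [
    ("node17/job42/alice", "open", 1200),
    ("node17/job42/alice", "read", 90000),
    ("node03/job07/bob", "write", 45000),
    ("node03/job07/bob", "open", 300),
]

# assumed relative server-side cost of each I/O operation type
op_cost = {"open": 5.0, "read": 1.0, "write": 2.0}

# effective load = sum over operations of (cost * observed count)
load = defaultdict(float)
for client, op, count in samples:
    load[client] += op_cost[op] * count

for client, score in sorted(load.items(), key=lambda kv: -kv[1]):
    print(f"{client}: effective load {score:,.0f}")
```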