TaskVine: Managing In-Cluster Storage for High-Throughput Data Intensive Workflows

Sly-Delgado, Barry; Phung, Thanh Son; Thomas, Colin; Simonetti, David; Hennessee, Andrew; Tovar, Ben; Thain, Douglas

doi:10.1145/3624062.3624277

Many scientific applications are expressed as high-throughput workflows that consist of large graphs of data assets and tasks to be executed on large parallel and distributed systems. A challenge in executing these workflows is managing data: both datasets and software must be efficiently distributed to cluster nodes; intermediate data must be conveyed between tasks; output data must be delivered to its destination. Scaling problems result when these actions are performed in an uncoordinated manner on a shared filesystem. To address this problem, we introduce TaskVine: a system for exploiting the aggregate local storage and network capacity of a large cluster. TaskVine tracks the lifetime of data in a workflow, from archival sources to final outputs, making use of local storage to distribute and re-use data wherever possible. We describe the architecture and novel capabilities of TaskVine, and demonstrate its use with applications in genomics, high energy physics, molecular dynamics, and machine learning.

Award ID(s): 1931348

PAR ID: 10567835

Author(s) / Creator(s): Sly-Delgado, Barry; Phung, Thanh Son; Thomas, Colin; Simonetti, David; Hennessee, Andrew; Tovar, Ben; Thain, Douglas

Publisher / Repository: ACM

Date Published: 2023-11-12

ISBN: 9798400707858

Page Range / eLocation ID: 1978 to 1988

Format(s): Medium: X

Location: Denver CO USA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1145/3624062.3624277