Pasta: A Cost-Based Optimizer for Generating Pipelining Schedules for Dataflow DAGs

Liu, Xiaozhen (ORCID: 0009-0006-5346-7028); Huang, Yicong (ORCID: 0000-0002-1186-4803); Lin, Xinyuan (ORCID: 0000-0001-7935-0035); Kumar, Avinash (ORCID: 0009-0006-9327-3906); Alsudais, Sadeem (ORCID: 0000-0003-3928-690X); Li, Chen (ORCID: 0000-0001-8015-6870)

doi:10.1145/3698832

Data analytics tasks are often formulated as data workflows represented as directed acyclic graphs (DAGs) of operators. The recent trend of adopting machine learning (ML) techniques in workflows results in increasingly complicated DAGs with many operators and edges. Compared to the operator-at-a-time execution paradigm, pipelined execution reduces the materialization cost of intermediate results and allows operators to produce results early, both of which are critical in iterative analysis on large data volumes. Correctly scheduling a workflow DAG for pipelined execution is non-trivial due to the richer semantics of operators and the increasing complexity of DAGs. Several existing data systems adopt simple heuristics to solve the problem without considering costs such as materialization sizes. In this paper, we systematically study the problem of scheduling a workflow DAG for pipelined execution, and develop a novel cost-based optimizer called Pasta for generating a high-quality schedule. The Pasta optimizer is not only general and applicable to a wide variety of cost functions, but also capable of utilizing properties inherent in a broad class of cost functions to improve its performance significantly. We conducted a thorough evaluation of the developed techniques on real-world workflows and show the efficiency and efficacy of these solutions.

Award ID(s): 2107150

PAR ID: 10642247

Author(s) / Creator(s): Liu, Xiaozhen; Huang, Yicong; Lin, Xinyuan; Kumar, Avinash; Alsudais, Sadeem; Li, Chen

Publisher / Repository: ACM

Date Published: 2024-12-18

Journal Name: Proceedings of the ACM on Management of Data

Volume: 2

Issue: 6

ISSN: 2836-6573

Page Range / eLocation ID: 1 to 26

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Journal Article:
https://doi.org/10.1145/3698832