- Given the fundamental tradeoff between run-time and recovery performance, current distributed systems often build application-specific recovery strategies to minimize overheads. However, it is increasingly common for different applications to be composed into heterogeneous pipelines. Implementing multiple interoperable recovery techniques in the same system is rare and difficult. Thus, today's users must choose between: (1) building on a single system and facing a fixed choice of performance vs. recovery overheads, or (2) the challenging task of stitching together multiple systems that can offer application-specific tradeoffs. We present ExoFlow, a universal workflow system that enables a flexible choice of recovery vs. performance tradeoffs, even within the same application. The key insight behind our solution is to decouple execution from recovery and to provide exactly-once semantics as a separate layer from execution. For generality, workflow tasks can return references that capture arbitrary inter-task communication. To enable the workflow system, and therefore the end user, to take control of recovery, we design task annotations that specify execution semantics such as nondeterminism. ExoFlow generalizes recovery for existing workflow applications ranging from ETL pipelines to stateful serverless workflows, while enabling further optimizations in task communication and recovery.
- Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models, workloads that are difficult to execute efficiently with traditional collective communication, by up to 7.8x, 3.9x, and 3.3x, respectively.