NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Scaling Inference-Efficient Language Models

Bian, Song; Yan, Minghao; Venkataraman, Shivaram (July 2025, Proceedings of the 42nd International Conference on Machine Learning)

Free, publicly-accessible full text available July 13, 2026
Decoding Speculative Decoding

https://doi.org/10.18653/v1/2025.naacl-long.328

Yan, Minghao; Agarwal, Saurabh; Venkataraman, Shivaram (April 2025, Association for Computational Linguistics)

Free, publicly-accessible full text available April 29, 2026
Eva: Cost-Efficient Cloud-Based Cluster Scheduling

https://doi.org/10.1145/3689031.3717483

Chang, Tzu-Tao; Venkataraman, Shivaram (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
Quake: Adaptive Indexing for Vector Search

Mohoney, Jason; Sarda, Devesh; Tang, Mengze; Chowdhury, Shihabur Rahman; Pacaci, Anil; Ilyas, Ihab F; Rekatsinas, Theodoros; Venkataraman, Shivaram (July 2025, 19th USENIX Symposium on Operating Systems Design and Implementation)

Free, publicly-accessible full text available July 9, 2026
TUNA: Tuning Unstable and Noisy Cloud Applications

https://doi.org/10.1145/3689031.3717480

Freischuetz, Johannes; Kanellis, Konstantinos; Kroth, Brian; Venkataraman, Shivaram (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters

Jain, Rutwik; Tran, Brandon; Chen, Keting; Sinclair, Matthew D; Venkataraman, Shivaram (November 2024, ACM)

Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and growing use of ML, including in some scientific applications, optimizing these clusters for ML workloads is particularly important. However, recent work has demonstrated that accelerators in these clusters can suffer from performance variability, and this variability can lead to resource under-utilization and load imbalance. In this work, we focus on how cluster schedulers, which are used to share accelerator-rich clusters across many concurrent ML jobs, can embrace performance variability to mitigate its effects. Our key insight to address this challenge is to characterize which applications are more likely to suffer from performance variability and take that into account while placing jobs on the cluster. We design a novel cluster scheduler, PAL, which uses performance variability measurements and application-specific profiles to improve job performance and resource utilization. PAL also balances performance variability with locality to ensure jobs are spread across as few nodes as possible. Overall, PAL significantly improves GPU-rich cluster scheduling: across traces for six ML workload applications spanning image, language, and vision models with a variety of variability profiles, PAL improves geomean job completion time by 42%, cluster utilization by 28%, and makespan by 47% over existing state-of-the-art schedulers.
Full Text Available
Does compressing activations help model parallel training?

Bian, Song; Li, Dacheng; Wang, Hongyi; Xing, Eric P; Venkataraman, Shivaram (May 2024, Seventh Annual Conference on Machine Learning and Systems)

Foundation models (FMs) have superior performance across a wide array of machine learning tasks. The training of these models typically involves model parallelism (MP) to navigate the constraints of GPU memory capacity. However, MP strategies involve transmitting model activations between GPUs, which can hinder training speed in large clusters. Previous research has examined gradient compression in data-parallel contexts, but its applicability in MP settings remains largely unexplored. In this paper, we investigate the unique characteristics of compression in MP and study why strategies from gradient compression might not be directly applicable to MP scenarios. Subsequently, to systematically understand the capabilities and limitations of Model Parallelism Compression, we present a benchmarking framework, MCBench. MCBench not only includes four major categories of compression algorithms but also includes several widely used models spanning language and vision tasks on a well-established distributed training framework, Megatron-LM. We initiate the first comprehensive empirical study by using MCBench. Our empirical study encompasses both the fine-tuning and pre-training of FMs. We probe over 200 unique training configurations and present results using 10 widely used datasets. To comprehend the scalability of compression advantages with the expansion of model size and cluster size, we propose a novel cost model designed specifically for training with MP compression. The insights derived from our findings can help direct the future development of new MP compression algorithms for distributed training. Our code is available at https://github.com/uw-mad-dash/MCBench
Full Text Available
Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Ding, Qiyang; Zheng, Pengfei; Kudari, Shreyas; Venkataraman, Shivaram; Zhang, Zhao (November 2023, SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis)

Full Text Available
Bagpipe: Accelerating Deep Recommendation Model Training

https://doi.org/10.1145/3600006.3613142

Agarwal, Saurabh; Yan, Chengpo; Zhang, Ziyi; Venkataraman, Shivaram (October 2023, ACM)
Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

https://doi.org/10.1145/3581784.3607042

Ding, Qiyang; Zheng, Pengfei; Kudari, Shreyas; Venkataraman, Shivaram; Zhang, Zhao (November 2023, ACM)

Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and QoS for services that are deployed in production. To mitigate the issues from interruption, we propose the design of a proactive provisioner and investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, xgboost, Deep Q-Network, and policy gradient. Using production job traces from three GPU clusters, we train each model using a subset of the trace and then evaluate their generality using the remaining validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate ML methods. Our experiments show that Mirage can reduce interruption by 17%-100% and safeguard 23%-76% of jobs with zero interruption across varying load levels on the three clusters.
