Exploiting ML Task Correlation in the Minimization of Capital Expense for GPU Data Centers

Subramaniyan, Srinivasan; Wang, Xiaorui

Efficiently scheduling ML training tasks in a GPU data center presents a significant research challenge. Existing solutions commonly schedule such tasks based on their demanded GPU utilization, but assume that the GPU utilization of each task can be approximated as a constant (e.g., its peak value), even though the GPU utilization of ML training tasks commonly varies significantly over time. Scheduling tasks with a constant value can overestimate the number of GPUs needed and, therefore, lead to a high capital expense for GPU purchases. To address this, we design CorrGPU, a correlation-aware GPU scheduling algorithm that considers the utilization correlation among different tasks to minimize the number of GPUs needed in a data center. CorrGPU is based on a key observation from the analysis of real ML traces: different tasks do not reach their peak GPU utilization at exactly the same time. As a result, if the correlations among tasks are considered in scheduling, more tasks can be scheduled onto the same GPUs without extending the training duration beyond the desired due time. For a GPU data center to be constructed based on an estimated ML workload, CorrGPU can help operators purchase fewer GPUs, thus minimizing their capital expense. Our hardware testbed results demonstrate CorrGPU’s potential to reduce the number of GPUs needed. Our simulation results on real-world ML traces also show that CorrGPU outperforms several state-of-the-art solutions, reducing capital expense by 20.88%. This work was published in the 44th IEEE International Performance Computing and Communications Conference (IPCCC 2025) in November 2025. Our paper received the Best Paper Runner-up Award from IPCCC.
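
The gap between constant-value scheduling and time-aware scheduling can be illustrated with a small sketch. This is not the CorrGPU algorithm, whose details are not given here. It is a toy first-fit packer, and the trace format, the function names (`fits_by_peak`, `fits_by_trace`, `first_fit`), the 100% capacity limit, and the synthetic traces are all assumptions made for illustration. It also ignores training due times. It compares admitting a task by the sum of per-task peaks against admitting it by the peak of the summed utilization traces.

```python
from typing import Callable, List

Trace = List[float]  # assumed format: GPU utilization (%) per time step


def fits_by_peak(gpu: List[Trace], task: Trace, cap: float = 100.0) -> bool:
    """Constant-value admission: the sum of per-task peaks must fit."""
    return sum(max(t) for t in gpu) + max(task) <= cap


def fits_by_trace(gpu: List[Trace], task: Trace, cap: float = 100.0) -> bool:
    """Time-aware admission: the peak of the summed traces must fit."""
    combined = [sum(step) for step in zip(task, *gpu)]
    return max(combined) <= cap


def first_fit(
    tasks: List[Trace],
    fits: Callable[[List[Trace], Trace], bool],
) -> List[List[Trace]]:
    """Place each task on the first GPU that admits it, else open a new GPU."""
    gpus: List[List[Trace]] = []
    for task in tasks:
        for gpu in gpus:
            if fits(gpu, task):
                gpu.append(task)
                break
        else:
            gpus.append([task])
    return gpus


# Synthetic traces (not from the paper): each task peaks at a different step.
tasks = [
    [70, 10, 10, 10],
    [10, 70, 10, 10],
    [10, 10, 70, 10],
    [10, 10, 10, 70],
]

print(len(first_fit(tasks, fits_by_peak)))   # 4 GPUs
print(len(first_fit(tasks, fits_by_trace)))  # 1 GPU
```

Because the four synthetic tasks never peak together, their summed utilization never exceeds 100%, so the time-aware check places them on one GPU, while the peak-based check needs one GPU per task. Real traces are noisier, and a practical scheduler would have to predict utilization rather than read it from a known trace.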
