<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcq="http://purl.org/dc/terms/"><records count="1" morepages="false" start="1" end="1"><record rownumber="1"><dc:product_type>Conference Paper</dc:product_type><dc:title>Exploiting ML Task Correlation in the Minimization of Capital Expense for GPU Data Centers</dc:title><dc:creator>Subramaniyan, Srinivasan; Wang, Xiaorui</dc:creator><dc:corporate_author/><dc:editor/><dc:description>Efficiently scheduling ML training tasks in a GPU data center presents a significant research challenge. Existing solutions commonly schedule such tasks based on their demanded GPU utilization, but simply assume that the GPU utilization of each task can be approximated as a constant number (e.g., by using the peak value), even though the ML training tasks commonly have their GPU utilization varying significantly over time. Using a constant number to schedule tasks can result in an overestimation of the needed GPU count and, therefore, a high capital expense for GPU purchases. 

To address this, we design CorrGPU, a correlation-aware GPU scheduling algorithm that considers the utilization correlation among different tasks to minimize the number of needed GPUs in a data center. CorrGPU is designed based on a key observation from the analysis of real ML traces that different tasks do not have their GPU utilization peak at exactly the same time. As a result, if the correlations among tasks are considered in scheduling, more tasks can be scheduled onto the same GPUs, without extending the training duration beyond the desired due time. For a GPU data center to be constructed based on an estimated ML workload, CorrGPU can help the operators purchase a smaller number of GPUs, thus minimizing their capital expense. Our hardware testbed results demonstrate CorrGPU’s potential to reduce the number of GPUs needed. Our simulation results on real-world ML traces also show that CorrGPU outperforms several state-of-the-art solutions by reducing capital expense by 20.88%. This work was published in the 44th IEEE International Performance Computing and Communications Conference (IPCCC 2025) in November 2025. Our paper received the Best Paper Runner-up Award from IPCCC.</dc:description><dc:publisher>The 44th IEEE International Performance Computing and Communications Conference (IPCCC 2025), Austin, Texas, November 2025.</dc:publisher><dc:date>2025-11-21</dc:date><dc:nsf_par_id>10652009</dc:nsf_par_id><dc:journal_name/><dc:journal_volume/><dc:journal_issue/><dc:page_range_or_elocation/><dc:issn/><dc:isbn/><dc:doi>https://doi.org/</dc:doi><dcq:identifierAwardId>2336886</dcq:identifierAwardId><dc:subject/><dc:version_number/><dc:location/><dc:rights/><dc:institution/><dc:sponsoring_org>National Science Foundation</dc:sponsoring_org></record></records></rdf:RDF>