-
As mobile apps are used extensively in our daily lives, their responsiveness has become an important factor in the user experience. A mobile app's long response time can be caused by a variety of reasons, including soft hang bugs or prolonged user interface APIs (UI-APIs). While hang bugs have been researched extensively, our investigation of UI-APIs in today's mobile OSes finds that the recursive construction of the UI view hierarchy can often be time-consuming, due to the complexity of today's UI views. To accelerate UI processing, such complex views can be pre-processed and cached before the user even visits them. However, pre-caching every view in a mobile app is infeasible due to the incurred overheads in time, energy, and cache space. In this paper, we propose MAPP, a framework for Mobile App Predictive Pre-caching. MAPP has two main modules, 1) deep learning-based UI view prediction and 2) UI-API pre-caching, which coordinate to improve the responsiveness of mobile apps. MAPP adopts a per-user and per-app prediction model that is tailored based on the analysis of collected user traces, such as location, time, and the sequence of previously visited views. A dynamic feature ranking and model selection algorithm judiciously filters out less relevant features, improving the prediction accuracy with less computation overhead. MAPP is evaluated with 61 real-world traces from 18 volunteers over 30 days to show that it can shorten the response time of mobile apps by 59.84% on average with an average cache hit rate of 92.55%.
Free, publicly accessible full text available July 2, 2026.
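As a rough illustration of the pre-caching idea described above (not the authors' implementation), the Python sketch below pairs a simple sequence-based next-view predictor with a bounded pre-cache. All names (ViewPredictor, PreCache, the view labels) are hypothetical, and the frequency counter over recent navigation stands in for MAPP's deep-learning model with dynamic feature ranking over location and time features.

```python
# Minimal sketch of predictive UI pre-caching, assuming a sequence-only predictor.
from collections import Counter, deque

class ViewPredictor:
    """Per-user, per-app next-view predictor over a short navigation history."""
    def __init__(self, history_len=2):
        self.history = deque(maxlen=history_len)
        self.transitions = Counter()          # (history tuple, next view) -> count

    def observe(self, view):
        if len(self.history) == self.history.maxlen:
            self.transitions[(tuple(self.history), view)] += 1
        self.history.append(view)

    def predict(self, top_k=2):
        ctx = tuple(self.history)
        scores = {v: c for (h, v), c in self.transitions.items() if h == ctx}
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

class PreCache:
    """Bounded cache of pre-constructed view hierarchies (strings stand in here)."""
    def __init__(self, capacity=4):
        self.capacity, self.cache = capacity, {}

    def warm(self, views, build_fn):
        for v in views:
            if v not in self.cache and len(self.cache) < self.capacity:
                self.cache[v] = build_fn(v)   # pre-construct the view hierarchy

    def lookup(self, view):
        return self.cache.pop(view, None)     # hit -> reuse; miss -> None

# Usage: replay a navigation trace, then pre-cache the predicted next views.
predictor, precache = ViewPredictor(), PreCache()
for v in ["home", "feed", "detail", "home", "feed", "detail", "home", "feed"]:
    predictor.observe(v)
precache.warm(predictor.predict(), build_fn=lambda v: f"<hierarchy for {v}>")
print(precache.lookup("detail"))              # cache hit on the predicted view
```

The bounded capacity in the sketch reflects the abstract's point that pre-caching every view is infeasible; only the top predicted views are warmed.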
-
Today's data centers often need to run various machine learning (ML) applications with stringent SLO (Service-Level Objective) requirements, such as inference latency. To that end, data centers prefer to 1) over-provision the number of servers used for inference processing and 2) isolate them from other servers that run ML training, despite both use GPUs extensively, to minimize possible competition of computing resources. Those practices result in a low GPU utilization and thus a high capital expense. Hence, if training and inference jobs can be safely co-located on the same GPUs with explicit SLO guarantees, data centers could flexibly run fewer training jobs when an inference burst arrives and run more afterwards to increase GPU utilization, reducing their capital expenses. In this paper, we propose GPUColo, a two-tier co-location solution that provides explicit ML inference SLO guarantees for co-located GPUs. In the outer tier, we exploit GPU spatial sharing to dynamically adjust the percentage of active GPU threads allocated to spatially co-located inference and training processes, so that the inference latency can be guaranteed. Because spatial sharing can introduce considerable overheads and thus cannot be conducted at a fine time granularity, we design an inner tier that puts training jobs into periodic sleep, so that the inference jobs can quickly get more GPU resources for more prompt latency control. Our hardware testbed results show that GPUColo can precisely control the inference latency to the desired SLO, while maximizing the throughput of the training jobs co-located on the same GPUs. Our large-scale simulation with a 57-day real-world data center trace (6500 GPUs) also demonstrates that GPUColo enables latency-guaranteed inference and training co-location. Consequently, it allows 74.9% of GPUs to be saved for a much lower capital expense.more » « less
-
Model predictive control (MPC) has drawn a considerable amount of attention in automotive applications during the last decade, partially due to its systematic capacity for handling system constraints. Despite this broad recognition, this optimization-based control strategy still has two intrinsic shortcomings, namely the extensive online computation burden and the complex tuning process, which hinder MPC from being applied more widely. To tackle these two drawbacks, different methods have been proposed; nevertheless, the majority of these approaches treat the two issues independently. However, parameter tuning in fact affects both the controller performance and the real-time computational burden. Due to the lack of theoretical tools for globally analyzing the complex conflicts among MPC parameter tuning, controller performance optimization, and computational burden reduction, a look-up table-based online parameter selection method is proposed in this paper to help a vehicle track its reference path under both stability and computational capacity constraints. Joint MATLAB-CarSim simulations show the effectiveness of the proposed strategy.
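As a hedged sketch of how look-up table-based online parameter selection could work (not the paper's method or its tuned values), the Python snippet below maps illustrative speed and curvature bins to MPC horizon lengths that are assumed to have been tuned offline for both stability and the per-step computation budget.

```python
# Minimal sketch of table-driven MPC parameter selection; all bins and values are
# illustrative placeholders, not the paper's tuned parameters.

# (speed_max_m_per_s, curvature_max_1_per_m) -> (prediction_horizon, control_horizon)
PARAM_TABLE = [
    ((10.0, 0.02), (10, 3)),   # low speed, gentle curve: short horizon is cheap and stable
    ((10.0, 0.10), (15, 4)),   # low speed, sharp curve: longer preview
    ((25.0, 0.02), (20, 5)),   # high speed, gentle curve
    ((25.0, 0.10), (30, 6)),   # high speed, sharp curve: longest horizon within budget
]

def select_mpc_params(speed_m_per_s, curvature_1_per_m):
    """Return (N_p, N_c) for the first table row whose bounds cover the current state."""
    for (v_max, k_max), params in PARAM_TABLE:
        if speed_m_per_s <= v_max and abs(curvature_1_per_m) <= k_max:
            return params
    return PARAM_TABLE[-1][1]   # fall back to the most conservative setting

# Usage inside the control loop: pick parameters before solving the MPC problem.
N_p, N_c = select_mpc_params(speed_m_per_s=18.0, curvature_1_per_m=0.05)
print(N_p, N_c)                 # (30, 6) under these illustrative bins
```

The point of the table is that the expensive trade-off analysis happens offline, so the online step is a cheap lookup rather than re-tuning.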