
Title: License Forecasting and Scheduling for HPC
This work focuses on forecasting future license usage for high-performance computing environments and using such predictions to improve the effectiveness of job scheduling. Specifically, we propose a model that carries out both short-term and long-term license usage forecasting, and a method of using the forecasts to improve job scheduling. Our long-term forecasting model achieves a Mean Absolute Percentage Error (MAPE) as low as 0.26 for a 12-month forecast of daily peak license usage. Our job scheduling experiments also indicate that, during periods of high license usage, our proposed license-aware scheduler reduces wasted work from jobs with insufficient licenses by up to 92% without increasing the average completion time of license-using jobs.
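The abstract's headline forecasting metric can be made concrete. Below is a minimal sketch of the MAPE computation it reports; the sample license counts are illustrative assumptions, not data from the paper.

```python
# Sketch: Mean Absolute Percentage Error (MAPE) for a license-usage forecast.
# The daily peak counts below are illustrative, not the paper's data.

def mape(actual, forecast):
    """MAPE over paired series; actual values must be nonzero."""
    return sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

# Illustrative daily peak license counts vs. a model's predictions.
actual_peaks = [40, 52, 47, 60, 55]
predicted_peaks = [38, 50, 50, 58, 57]

print(round(mape(actual_peaks, predicted_peaks), 3))
```

A lower MAPE means forecasts track actual peaks more closely; the paper's reported value of 0.26 is on this same scale.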
Award ID(s):
1822923
PAR ID:
10552537
Author(s) / Creator(s):
; ; ; ; ; ; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-1948-4
Page Range / eLocation ID:
1 to 8
Format(s):
Medium: X
Location:
Stony Brook, NY, USA
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Motivated by recent work on scheduling with predicted job sizes, we consider the performance of scheduling algorithms with minimal advice, namely a single bit. Besides demonstrating the power of very limited advice, such schemes are quite natural. In the prediction setting, one bit of advice can be used to model a simple prediction as to whether a job is “large” or “small”; that is, whether a job is above or below a given threshold. Further, one-bit advice schemes can correspond to mechanisms that tell whether to put a job at the front or the back of the queue, a limitation which may be useful in many implementation settings. Finally, queues with a single bit of advice have a simple enough state that they can be analyzed in the limiting mean-field analysis framework for the power of two choices. Our work follows in the path of recent work by showing that even small amounts of possibly inaccurate information can greatly improve scheduling performance.
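A one-bit advice queue like the one described can be sketched in a few lines: the bit predicts whether a job exceeds a size threshold, and the queue places predicted-small jobs at the front and predicted-large jobs at the back. The threshold, job sizes, and helper functions below are illustrative assumptions, not the paper's model.

```python
from collections import deque

THRESHOLD = 5.0  # illustrative size threshold separating "small" from "large"

def advice_bit(predicted_size, threshold=THRESHOLD):
    """The single bit of advice: True means the job is predicted 'large'."""
    return predicted_size > threshold

def enqueue(queue, job_size, predicted_size):
    """Place a job using only its one-bit advice."""
    if advice_bit(predicted_size):
        queue.append(job_size)       # predicted large: back of the queue
    else:
        queue.appendleft(job_size)   # predicted small: front of the queue

q = deque()
for size, pred in [(8.0, 8.0), (2.0, 2.0), (9.0, 9.0), (1.0, 1.0)]:
    enqueue(q, size, pred)
print(list(q))  # small jobs end up ahead of large ones
```

Even when the prediction is occasionally wrong, the bit still tends to move short jobs forward, which is the intuition behind the paper's result.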
  2. Accurately forecasting stress may enable people to make behavioral changes that could improve their future health. For example, accurate stress forecasting might inspire people to make changes to their schedule to get more sleep or exercise, in order to reduce excessive stress tomorrow night. In this paper, we examine how accurately the previous N days of multi-modal data can forecast tomorrow evening’s high/low binary stress levels using long short-term memory neural network models (LSTM), logistic regression (LR), and support vector machines (SVM). Using a total of 2,276 days, with 1,231 overlapping 8-day sequences of data from 142 participants (including physiological signals, mobile phone usage, location, and behavioral surveys), we find the LSTM significantly outperforms LR and SVM, with the best results reaching 83.6% accuracy using 7 days of prior data. Using time-series models improves the forecasting of stress even when considering only subsets of the multi-modal data set, e.g., using only physiology data. In particular, the LSTM model reaches 81.4% accuracy using only objective and passive data, i.e., not including subjective reports from a daily survey.
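The overlapping 8-day windows this abstract describes (7 input days plus the next evening's label) can be sketched as a sliding-window transform, the shape a sequence model such as an LSTM would consume. The feature values, labels, and `make_sequences` helper are illustrative assumptions.

```python
# Sketch: build overlapping N-day input sequences, each paired with the
# following day's binary stress label. Values below are illustrative.

def make_sequences(days, labels, n=7):
    """Pair each run of n consecutive days with the following day's label."""
    return [(days[i:i + n], labels[i + n]) for i in range(len(days) - n)]

# One feature value per day (e.g., a physiology summary) plus high/low labels.
daily_features = [0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.8, 0.6, 0.7, 0.2]
stress_labels  = [0,   1,   0,   1,   0,   0,   1,   1,   1,   0]

sequences = make_sequences(daily_features, stress_labels, n=7)
print(len(sequences))  # number of overlapping 8-day (7 input + 1 target) windows
```

Because consecutive windows share n-1 days, 2,276 days of data can yield on the order of a thousand such training sequences, as in the paper.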
  3. This work presents a framework for estimating job wait times in High-Performance Computing (HPC) scheduling queues, leveraging historical job scheduling data and real-time system metrics. Using machine learning techniques, specifically Random Forest and Multi-Layer Perceptron (MLP) models, we demonstrate high accuracy in predicting wait times, achieving 94.2% reliability within a 10-minute error margin. The framework incorporates key features such as requested resources, queue occupancy, and system utilization, with ablation studies revealing the significance of these features. Additionally, the framework offers users wait time estimates for different resource configurations, enabling them to select optimal resources, reduce delays, and accelerate computational workloads. Our approach provides valuable insights for both users and administrators to optimize job scheduling, contributing to more efficient resource management and faster time to scientific results.
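The reliability metric this abstract reports (the fraction of predictions whose absolute error falls within a 10-minute margin) can be sketched directly; the sample wait times below are illustrative assumptions, not the paper's data.

```python
# Sketch: "reliability within an error margin" for wait-time predictions.
# Sample values are illustrative, not from the paper.

def reliability(actual_waits, predicted_waits, margin_minutes=10.0):
    """Fraction of predictions within margin_minutes of the actual wait."""
    hits = sum(1 for a, p in zip(actual_waits, predicted_waits)
               if abs(a - p) <= margin_minutes)
    return hits / len(actual_waits)

actual    = [12.0, 45.0, 3.0, 120.0, 30.0]   # minutes
predicted = [15.0, 40.0, 9.0, 150.0, 28.0]
print(reliability(actual, predicted))
```

On this metric, the paper's models reach 0.942 with a 10-minute margin; a hypothetical regressor's predictions would be scored exactly as above.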
  4. Due to a growing interest in deep learning applications [5], compute-intensive and long-running (hours to days) training jobs have become a significant component of datacenter workloads. A large fraction of these jobs is often exploratory, with the goal of determining the best model structure (e.g., the number of layers and channels in a convolutional neural network), hyperparameters (e.g., the learning rate), and data augmentation strategies for the target application. Notably, training jobs are often terminated early if their learning metrics (e.g., training and validation accuracy) are not converging, with only a few completing successfully. For this motivating application, we consider the problem of scheduling a set of jobs that can be terminated at predetermined checkpoints with known probabilities estimated from historical data. We prove that, in order to minimize the time to complete the first K successful jobs on a single server, optimal scheduling does not require preemption (even when preemption overhead is negligible) and provide an optimal policy; advantages of this policy are quantified through simulation.
    Related Work. While job scheduling has been investigated extensively in many scenarios (see [6] and [2] for surveys of recent results), most policies require that the cost of waiting times of each job be known at scheduling time; in contrast, in our setting the scheduler does not know which job will be the K-th successful job, and sojourn times of subsequent jobs do not contribute to the target metric. For example, [4, 3] minimize makespan (i.e., the time to complete all jobs) for known execution times and waiting time costs; similarly, the Gittins index [1] and SR rank [7] minimize the expected sojourn time of all jobs, i.e., both successfully completed jobs and jobs terminated early.
Unfortunately, scheduling policies not distinguishing between these two types of jobs may favor jobs where the next stage is short and leads to early termination with high probability, which is an undesirable outcome in our applications of interest. 
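The non-preemptive setting can be sketched for the K = 1 case: run each job to its next checkpoint in some order, stop at the first success, and compare orders by expected elapsed time. The durations, success probabilities, and brute-force search below are illustrative assumptions, not the paper's optimal policy.

```python
from itertools import permutations

# Sketch: jobs run to a checkpoint of known duration and succeed there with a
# known probability; we seek the non-preemptive order minimizing the expected
# time until the FIRST success. Values are illustrative.

def expected_time_to_first_success(jobs):
    """jobs: sequence of (duration, success_probability), run in order."""
    total, prob_all_failed = 0.0, 1.0
    for duration, p in jobs:
        total += prob_all_failed * duration  # job runs only if all prior failed
        prob_all_failed *= (1.0 - p)
    return total

jobs = [(4.0, 0.9), (1.0, 0.2), (3.0, 0.5)]
best = min(permutations(jobs), key=expected_time_to_first_success)
print([duration for duration, _ in best])
```

Note the best order here is not simply shortest-first: a long job with a high success probability can be worth running first, which is exactly why policies ignoring termination probabilities can perform poorly in this setting.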
  5. Today’s high-performance computing (HPC) platforms are still dominated by batch jobs. Accordingly, effective batch job scheduling is crucial to obtain high system efficiency. Existing HPC batch job schedulers typically leverage heuristic priority functions to prioritize and schedule jobs. But once configured and deployed by experts, such priority functions can hardly adapt to changes in job loads, optimization goals, or system settings, potentially leading to degraded system efficiency when changes occur. To address this fundamental issue, we present RLScheduler, an automated HPC batch job scheduler built on reinforcement learning. RLScheduler relies on minimal manual intervention or expert knowledge, but can learn high-quality scheduling policies via its own continuous ‘trial and error’. We introduce a new kernel-based neural network structure and trajectory filtering mechanism in RLScheduler to improve and stabilize the learning process. Through extensive evaluations, we confirm that RLScheduler can learn high-quality scheduling policies for various workloads and various optimization goals with relatively low computation cost. Moreover, we show that the learned models perform stably even when applied to unseen workloads, making them practical for production use.
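The kind of hand-tuned heuristic priority function that RLScheduler aims to replace can be sketched as a weighted score over job features; the feature names and weights below are illustrative assumptions, not any real scheduler's configuration.

```python
# Sketch: a hand-tuned linear priority function of the sort heuristic HPC
# schedulers use. Feature names and weights are illustrative assumptions;
# a learned policy would replace these fixed weights.

WEIGHTS = {"wait_time": 1.0, "requested_nodes": -0.5, "requested_runtime": -0.1}

def priority(job):
    """Higher score = scheduled sooner. Rewards waiting, penalizes big requests."""
    return sum(WEIGHTS[k] * job[k] for k in WEIGHTS)

pending = [
    {"name": "a", "wait_time": 10.0, "requested_nodes": 4,  "requested_runtime": 60.0},
    {"name": "b", "wait_time": 30.0, "requested_nodes": 16, "requested_runtime": 120.0},
    {"name": "c", "wait_time": 5.0,  "requested_nodes": 1,  "requested_runtime": 10.0},
]

next_job = max(pending, key=priority)
print(next_job["name"])
```

The abstract's point is that such fixed weights are brittle: when workloads or goals shift, the weights must be re-tuned by hand, whereas a reinforcement-learned policy can adapt through continued training.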