NSF PAR Search | NSF Public Access Repository


Orchestrating a DNN training job using an iScheduler Framework: a use case

Vallabhajosyula, Manikya Swathi; Budhya, Sandeep Satish; Ramnath, Rajiv (July 2024, Practice and Experience in Research Computing Conference Series (PEARC 2024))

Full Text Available
Reference Implementation of Smart Scheduler: A CI-Aware, AI-Driven Scheduling Framework for HPC Workloads

Vallabhajosyula, Manikya Swathi; Budhya, Sandeep Satish; Ramnath, Rajiv (July 2024, 2024 Practice and Experience in Research Computing Conference Series (PEARC 2024))

Full Text Available
Insights from the HARP Framework: Using an AI-Driven Approach for Efficient Resource Allocation in HPC Scientific Workflows

Vallabhajosyula, Swathi; Ramnath, Rajiv (July 2023, 2023 Practice and Experience in Research Computing Conference Series (PEARC 2023))

Full Text Available
Towards Characterizing DNNs to Estimate Training Time using HARP (HPC Application Resource (runtime) Predictor)

Vallabhajosyula, Manikya Swathi; Ramnath, Rajiv (July 2023, 2023 Practice and Experience in Research Computing Conference Series (PEARC 2023))

Full Text Available
Establishing a Generalizable Framework for Generating Cost-Aware Training Data and Building Unique Context-Aware Walltime Prediction Regression Models

Swathi Vallabhajosyula; Rajiv Ramnath (December 2022, 20th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA 2022))

This paper describes a generalizable framework for creating context-aware wall-time prediction models for HPC applications. This framework: (a) cost-effectively generates comprehensive application-specific training data, (b) provides an application-independent machine learning pipeline that trains different regression models over the training datasets, and (c) establishes context-aware criteria for model selection. We explain how most of the training data can be generated on commodity or contention-free cyberinfrastructure and how the predictive models can be scaled to the production environment with the help of a limited number of resource-intensive generated runs (we show almost seven-fold cost reductions along with better performance). Our machine learning pipeline performs feature transformation and dimensionality reduction, then reduces sampling bias induced by data imbalance. Our context-aware model selection algorithm chooses, for a given target application, the regression model that reduces the number of underpredictions while minimizing overestimation errors. Index Terms: AI4CI, Data Science Workflow, Custom ML Models, HPC, Data Generation, Scheduling, Resource Estimations
Full Text Available
Towards Practical, Generalizable Machine-Learning Training Pipelines to build Regression Models for Predicting Application Resource Needs on HPC Systems

Swathi Vallabhajosyula; Rajiv Ramnath (July 2022, Practice and Experience in Advanced Research Computing (PEARC))

This paper explores the potential for cost-effectively developing generalizable and scalable machine-learning-based regression models for predicting the approximate execution time of an HPC application given its input data and parameters. This work examines: (a) to what extent models can be trained on scaled-down datasets on commodity environments and adapted to production environments, (b) to what extent models built for specific applications can generalize to other applications within a family, and (c) how the most appropriate model may change based on the type of data and its mix. As part of this work, we also describe and show the use of an automatable pipeline for generating the necessary training data and building the model. CCS Concepts: • Software and its engineering→Designing software; • Computing methodologies→Cost-sensitive learning. Additional Key Words and Phrases: automated data generation, ML, execution time, model scalability, model transferability
Full Text Available
