NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs

Carver, Benjamin; Zhang, Jingyuan; Wang, Haoliang; Mahadik, Kanak; Cheng, Yue

Citation Details

Interactive notebook programming is universal in modern ML and AI workflows, with interactive deep learning training (IDLT) emerging as a dominant use case. To ensure responsiveness, platforms like Jupyter and Colab reserve GPUs for long-running notebook sessions, despite their intermittent and sporadic GPU usage, leading to extremely low GPU utilization and prohibitively high costs. In this paper, we introduce NotebookOS, a GPU-efficient notebook platform tailored for the unique requirements of IDLT. NotebookOS employs replicated notebook kernels with Raft-synchronized replicas distributed across GPU servers. To optimize GPU utilization, NotebookOS oversubscribes server resources, leveraging high inter-arrival times in IDLT workloads, and allocates GPUs only during active cell execution. It also supports replica migration and automatic cluster scaling under high load. Altogether, this design enables interactive training with minimal delay. In evaluation on production workloads, NotebookOS saved over 1,187 GPU hours in 17.5 hours of real-world IDLT, while significantly improving interactivity. more »

Award ID(s):: 2411009 2322860

PAR ID:: 10632601

Author(s) / Creator(s):: Carver, Benjamin; Zhang, Jingyuan; Wang, Haoliang; Mahadik, Kanak; Cheng, Yue

Publisher / Repository:: The ACM International Conference on Architectural Support for Programming Languages and Operating Systems

Date Published:: 2026-03-22

Subject(s) / Keyword(s):: Jupyter Notebook Interactive Deep Learning Training GPU Scheduling Systems for AI

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Conference Paper:
The DOI is not currently available.

More Like this