NSF PAR Search | NSF Public Access Repository

Nitro: Boosting Distributed Reinforcement Learning with Serverless Computing

https://doi.org/10.14778/3696435.3696441

Yu, Hanfei; Carter, Jacob; Wang, Hao; Tiwari, Devesh; Li, Jian; Park, Seung-Jong (September 2025, Proceedings of the VLDB Endowment)

Deep reinforcement learning (DRL) has demonstrated significant potential in various applications, including gaming AI, robotics, and system scheduling. DRL algorithms produce, sample, and learn from training data online through a trial-and-error process, demanding considerable time and computational resources. To address this, distributed DRL algorithms and paradigms have been developed to expedite training using extensive resources. Through carefully designed experiments, we are the first to observe that strategically increasing the actor-environment interactions by spawning more concurrent actors at certain training rounds within ephemeral time frames can significantly enhance training efficiency. Yet, current distributed DRL solutions, which are predominantly server-based (or serverful), fail to capitalize on these opportunities due to their long startup times, limited adaptability, and cumbersome scalability. This paper proposes Nitro, a generic training engine for distributed DRL algorithms that enforces timely and effective boosting with concurrent actors instantaneously spawned by serverless computing. With serverless functions, Nitro adjusts data sampling strategies dynamically according to the DRL training demands. Nitro seizes the opportunity of real-time boosting by accurately and swiftly detecting an empirical metric. To achieve cost efficiency, we design a heuristic actor scaling algorithm to guide Nitro for cost-aware boosting budget allocation. We integrate Nitro with state-of-the-art DRL algorithms and frameworks and evaluate them on AWS EC2 and Lambda. Experiments with Mujoco and Atari benchmarks show that Nitro improves the final rewards (i.e., training quality) by up to 6× and reduces training costs by up to 42%.
Free, publicly-accessible full text available September 1, 2026
Stellaris: Staleness-Aware Distributed Reinforcement Learning with Serverless Computing

https://doi.org/10.1109/SC41406.2024.00045

Yu, Hanfei; Wang, Hao; Tiwari, Devesh; Li, Jian; Park, Seung-Jong (November 2024, IEEE)

Free, publicly-accessible full text available November 17, 2025
Freyr+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning

https://doi.org/10.1109/TPDS.2024.3462294

Yu, Hanfei; Wang, Hao; Li, Jian; Yuan, Xu; Park, Seung-Jong (November 2024, IEEE Transactions on Parallel and Distributed Systems)

Free, publicly-accessible full text available November 1, 2025
RainbowCake: Mitigating Cold-starts in Serverless with Layer-wise Container Caching and Sharing

https://doi.org/10.1145/3617232.3624871

Yu, Hanfei; Basu Roy, Rohan; Fontenot, Christian; Tiwari, Devesh; Li, Jian; Zhang, Hong; Wang, Hao; Park, Seung-Jong (April 2024, ACM)

Full Text Available
Libra: Harvesting Idle Resources Safely and Timely in Serverless Clusters

https://doi.org/10.1145/3588195.3592996

Yu, Hanfei; Fontenot, Christian; Wang, Hao; Li, Jian; Yuan, Xu; Park, Seung-Jong (August 2023, ACM)

Serverless computing has been favored by users and infrastructure providers from various industries, including online services and scientific computing. Users enjoy its auto-scaling and ease-of-management, and providers own more control to optimize their service. However, existing serverless platforms still require users to pre-define resource allocations for their functions, leading to frequent misconfiguration by inexperienced users in practice. Besides, functions' varying input data further escalate the gap between their dynamic resource demands and static allocations, leaving functions either over-provisioned or under-provisioned. This paper presents Libra, a safe and timely resource harvesting framework for multi-node serverless clusters. Libra makes precise harvesting decisions to accelerate function invocations with harvested resources and jointly improve resource utilization by profiling dynamic resource demands and availability proactively. Experiments on OpenWhisk clusters with real-world workloads show that Libra reduces response latency by 39% and achieves 3X resource utilization compared to state-of-the-art solutions.
Full Text Available
Accelerating Serverless Computing by Harvesting Idle Resources

https://doi.org/10.1145/3485447.3511979

Yu, Hanfei; Wang, Hao; Li, Jian; Yuan, Xu; Park, Seung-Jong (April 2022, Proceedings of the ACM Web Conference 2022)

Serverless computing automates fine-grained resource scaling and simplifies the development and deployment of online services with stateless functions. However, it is still non-trivial for users to allocate appropriate resources due to various function types, dependencies, and input sizes. Misconfiguration of resource allocations leaves functions either under-provisioned or over-provisioned and leads to continuous low resource utilization. This paper presents Freyr, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from over-provisioned functions to under-provisioned functions. Freyr monitors each function’s resource utilization in real-time, detects over-provisioning and under-provisioning, and learns to harvest idle resources safely and accelerates functions efficiently by applying deep reinforcement learning algorithms along with a safeguard mechanism. We have implemented and deployed a Freyr prototype in a 13-node Apache OpenWhisk cluster. Experimental results show that 38.8% of function invocations have idle resources harvested by Freyr, and 39.2% of invocations are accelerated by the harvested resources. Freyr reduces the 99th-percentile function response latency by 32.1% compared to the baseline RMs.
Full Text Available
A hybrid and scalable error correction algorithm for indel and substitution errors of long reads

https://doi.org/10.1186/s12864-019-6286-9

Das, Arghya Kusum; Goswami, Sayan; Lee, Kisung; Park, Seung-Jong (December 2019, BMC Genomics)

Full Text Available
ParLECH: Parallel Long-Read Error Correction with Hadoop

https://doi.org/10.1109/BIBM.2018.8621549

Das, Arghya Kusum; Lee, Kisung; Park, Seung-Jong (December 2018, 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM))

Full Text Available
GPU-Accelerated Large-Scale Genome Assembly

https://doi.org/10.1109/IPDPS.2018.00091

Goswami, Sayan; Lee, Kisung; Shams, Shayan; Park, Seung-Jong (May 2018, 2018 IEEE International Parallel and Distributed Processing Symposium)

Full Text Available
Deep Generative Breast Cancer Screening and Diagnosis

https://doi.org/10.1007/978-3-030-00934-2_95

Shams, Shayan; Platania, Richard; Zhang, Jian; Kim, Joohyun; Lee, Kisung; Park, Seung-Jong (September 2018, International Conference on Medical Image Computing and Computer-Assisted Intervention)

Full Text Available
