Search for: All records

Award ID contains: 1816850

Holmes: SMT Interference Diagnosis and CPU Scheduling for Job Co-location

https://doi.org/10.1145/3502181.3531464

Pi, Aidi; Zhou, Xiaobo; Xu, Chengzhong (June 2022, HPDC '22: Proceedings of the 31st ACM International Symposium on High-Performance Parallel and Distributed Computing)

Full Text Available
Elastic Parameter Server: Accelerating ML Training With Scalable Resource Scheduling

https://doi.org/10.1109/TPDS.2021.3104242

Wang, Shaoqi; Pi, Aidi; Zhou, Xiaobo (May 2022, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
Memory at your service: fast memory allocation for latency-critical services

https://doi.org/10.1145/3464298.3493394

Pi, Aidi; Zhao, Junxian; Wang, Shaoqi; Zhou, Xiaobo (December 2021, Middleware '21: Proceedings of the 22nd International Middleware Conference)

Full Text Available
Overlapping Communication With Computation in Parameter Server for Scalable DL Training

https://doi.org/10.1109/TPDS.2021.3062721

Wang, Shaoqi; Pi, Aidi; Zhou, Xiaobo; Wang, Jun; Xu, Cheng-Zhong (September 2021, IEEE Transactions on Parallel and Distributed Systems)
Full Text Available
FlashByte: Improving Memory Efficiency with Lightweight Native Storage

https://doi.org/10.1109/CCGrid51090.2021.00016

Zhao, Junxian; Pi, Aidi; Wang, Shaoqi; Zhou, Xiaobo (May 2021, 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid))

Full Text Available
An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems

https://doi.org/10.1109/SC41405.2020.00094

Wang, Shaoqi; Gonzalez, Oscar J.; Zhou, Xiaobo; Williams, Thomas; Friedman, Brian D.; Havemann, Martin; Woo, Thomas (November 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis)
Full Text Available
OS-Augmented Oversubscription of Opportunistic Memory with a User-Assisted OOM Killer

https://doi.org/10.1145/3361525.3361534

Chen, Wei; Pi, Aidi; Wang, Shaoqi; Zhou, Xiaobo (December 2019, ACM Middleware '19: Proceedings of the 20th International Middleware Conference)

Exploiting opportunistic memory by oversubscription is an appealing approach to improving cluster utilization and throughput. In this paper, we find the efficacy of memory oversubscription depends on whether or not the oversubscribed tasks can be killed by an OutOfMemory (OOM) killer in a timely manner to avoid significant memory thrashing upon memory pressure. However, current approaches in modern cluster schedulers are actually unable to unleash the power of opportunistic memory because their user space OOM killers are unable to timely deliver a task killing signal to terminate the oversubscribed tasks. Our experiments observe that a user space OOM killer fails to do that because of lacking the memory pressure knowledge from OS while the kernel space Linux OOM killer is too conservative to relieve memory pressure. In this paper, we design a user-assisted OOM killer (namely UA killer) in kernel space, an OS augmentation for accurate thrashing detection and agile task killing. To identify a thrashing task, UA killer features a novel mechanism, constraint thrashing. Upon UA killer, we develop Charon, a cluster scheduler for oversubscription of opportunistic memory in an on-demand manner. We implement Charon upon Mercury, a state-of-the-art opportunistic cluster scheduler. Extensive experiments with a Google trace in a 26-node cluster show that Charon can: (1) achieve agile task killing, (2) improve the best-effort job throughput by 3.5X over Mercury while prioritizing the production jobs, and (3) improve the 90th job completion time of production jobs over Kubernetes opportunistic scheduler by 62%.
Full Text Available
Pufferfish: Container-driven Elastic Memory Management for Data-intensive Applications

https://doi.org/10.1145/3357223.3362730

Chen, Wei; Pi, Aidi; Wang, Shaoqi; Zhou, Xiaobo (November 2019, ACM SoCC '19: Proceedings of the ACM Symposium on Cloud Computing)

Data-intensive applications often suffer from significant memory pressure, resulting in excessive garbage collection (GC) and out-of-memory (OOM) errors, harming system performance and reliability. In this paper, we demonstrate how lightweight virtualization via OS containers opens up opportunities to address memory pressure and realize memory elasticity: 1) tasks running in a container can be set to a large heap size to avoid OutOfMemory (OOM) errors, and 2) tasks that are under memory pressure and incur significant swapping activities can be temporarily "suspended" by depriving resources from the hosting containers, and be "resumed" when resources are available. We propose and develop Pufferfish, an elastic memory manager, that leverages containers to flexibly allocate memory for tasks. Memory elasticity achieved by Pufferfish can be exploited by a cluster scheduler to improve cluster utilization and task parallelism. We implement Pufferfish on the cluster scheduler Apache YARN. Experiments with Spark and MapReduce on real-world traces show Pufferfish is able to avoid OOM errors, improve cluster memory utilization by 2.7x and the median job runtime by 5.5x compared to a memory over-provisioning solution.
Full Text Available