Search for: All records

Award ID contains: 1717763

« Prev Next »

Total Resources

6

Resource Type
Conference Paper

5

Conference Proceeding

0

Dataset

0

Journal Article

1

Workshop Report

0

Availability
Full Text / Resource Available

6

Citation Only

0

Save Results
Excel (limit 2000)
CSV (limit 5000)
XML (limit 5000)

Have feedback or suggestions for a way to improve these results?
!

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

DRAS-CQSim: A reinforcement learning based framework for HPC cluster scheduling

https://doi.org/https://doi.org/10.1016/j.simpa.2021.100077

Fan, Yuping ; Lan, Zhiling ( April 2021 , Software impacts)
null (Ed.)
Full Text Available
Union: An Automatic Workload Manager for Accelerating Network Simulation

https://doi.org/10.1109/IPDPS47924.2020.00089

Wang, Xin ; Mubarak, Misbah ; Kang, Yao ; Ross, Robert B. ; Lan, Zhiling ( May 2020 , 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

With the rapid growth of the machine learning applications, the workloads of future HPC systems are anticipated to be a mix of scientific simulation, big data analytics, and machine learning applications. Simulation is a great research vehicle to understand the performance implications of co-running scientific applications with big data and machine learning workloads on large-scale systems. In this paper, we present Union, a workload manager that provides an automatic framework to facilitate hybrid workload simulation in CODES. Furthermore, we use Union, along with CODES, to investigate various hybrid workloads composed of traditional simulation applications and emerging learning applications on two dragonfly systems. The experiment results show that both message latency and communication time are important performance metrics to evaluate network interference. Network interference on HPC applications is more reflected by the message latency variation, whereas ML application performance depends more on the communication time.
more » « less
Full Text Available
Scheduling Beyond CPUs for HPC

https://doi.org/10.1145/3307681.3325401

Fan, Yuping ; Lan, Zhiling ; Rich, Paul ; Allcock, William ; Papka, Michael ; Austin, Brian ; Paul, David ( July 2019 , HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing)

High performance computing (HPC) is undergoing significant changes. The emerging HPC applications comprise both compute- and data-intensive applications. To meet the intense I/O demand from emerging data-intensive applications, burst buffers are deployed in production systems. Existing HPC schedulers are mainly CPU-centric. The extreme heterogeneity of hardware devices, combined with workload changes, forces the schedulers to consider multiple resources (e.g., burst buffers) beyond CPUs, in decision making. In this study, we present a multi-resource scheduling scheme named BBSched that schedules user jobs based on not only their CPU requirements, but also other schedulable resources such as burst buffer. BBSched formulates the scheduling problem into a multi-objective optimization (MOO) problem and rapidly solves the problem using a multi-objective genetic algorithm. The multiple solutions generated by BBSched enables system managers to explore potential tradeoffs among various resources, and therefore obtains better utilization of all the resources. The trace-driven simulations with real system workloads demonstrate that BBSched improves scheduling performance by up to 41% compared to existing methods, indicating that explicitly optimizing multiple resources beyond CPUs is essential for HPC scheduling.
more » « less
Full Text Available
The Effect of System Utilization on Application Performance Variability

Li, Boyang ; Chunduri, Sudheer ; Harms, Kevin ; Fan, Yuping ; Lan, Zhiling ( June 2019 , ACM Digital Library)

Application performance variability caused by network contention is a major issue on dragonfly based systems. This work-in-progress study makes two contributions. First, we analyze real workload logs and conduct application experiments on the production system Theta at Argonne to evaluate application performance variability. We find a strong correlation between system utilization and performance variability where a high system utilization (e.g., above 95%) can cause up to 21% degradation in application performance. Next, driven by this key finding, we investigate a scheduling policy to mitigate workload interference by leveraging the fact that production systems often exhibit diurnal utilization behavior and not all users are in a hurry for job completion. Preliminary results show that this scheduling design is capable of improving system productivity (measured by scheduling makespan) as well as improving user-level scheduling metrics such as user wait time and job slowdown.
more » « less
Full Text Available
Modeling and Analysis of Application Interference on Dragonfly+

Kang, Yao ; Wang, Xin ; McGlohon, Neil ; Mubarak, Misbah ; Chunduri, Sudheer ; Lan, Zhiling ( June 2019 , ACM SIGSIM PADS)

Dragonfly class of networks are considered as promising interconnects for next-generation supercomputers. While Dragonfly+ networks offer more path diversity than the original Dragonfly design, they are still prone to performance variability due to their hierarchical architecture and resource sharing design. Event-driven network simulators are indispensable tools for navigating complex system design. In this study, we quantitatively evaluate a variety of application communication interactions on a 3,456-node Dragonfly+ system by using the CODES toolkit. This study looks at the impact of communication interference from a user’s perspective. Specifically, for a given application submitted by a user, we examine how this application will behave with the existing workload running in the system under different job placement policies. Our simulation study considers hundreds of experiment configurations including four target applications with representative communication patterns under a variety of network traffic conditions. Our study shows that intra-job interference can cause severe performance degradation for communication-intensive applications. Inter-job interference can generally be reduced for applications with one-toone or one-to-many communication patterns through job isolation. Application with one-to-all communication pattern is resilient to network interference.
more » « less
Full Text Available
Trade-Off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System

Wang, Xin ; Mubarak, Misbah ; Yang, Xu ; Ross, Rob ; Lan, Zhiling ( May 2018 , IPDPS .... [proceedings])

Dragonfly networks are being widely adopted in high-performance computing systems. On these networks, however, interference caused by resource sharing can lead to significant network congestion and performance variability. We present a comparative analysis exploring the trade-off between localizing communication and balancing network traffic. We conduct trace-based simulations for applications with different communication patterns, using multiple job placement policies and routing mechanisms. We perform an in-depth performance analysis on representative applications individually and show that different applications have distinct preferences regarding localized communication and balanced network traffic. We further demonstrate the effect of external network interference by introducing background traffic and show that localized communication can help reduce the application performance variation caused by network sharing.
more » « less
Full Text Available