NSF PAR Search | NSF Public Access Repository

ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems

https://doi.org/10.1109/HiPC62374.2024.00030

Xu, Yiheng; Sivaraman, Pranav; Devarajan, Hariharan; Mohror, Kathryn; Bhatele, Abhinav (December 2024, Proceedings of the IEEE International Conference on High Performance Computing. IEEE.)

Full Text Available
Understanding Highly Configurable Storage for Diverse Workloads

https://doi.org/10.1109/CLUSTERWorkshops61563.2024.00023

Kogiou, Olga; Devarajan, Hariharan; Wang, Chen; Yu, Weikuan; Mohror, Kathryn (September 2024, IEEE)

Full Text Available
A Visual Comparison of Silent Error Propagation

https://doi.org/10.1109/TVCG.2022.3230636

Li, Zhimin; Menon, Harshitha; Mohror, Kathryn; Liu, Shusen; Guo, Luanzheng; Bremer, Peer-Timo; Pascucci, Valerio (July 2024, IEEE Transactions on Visualization and Computer Graphics)

Full Text Available
DFMan: A Graph-based Optimization of Dataflow Scheduling on High-Performance Computing Systems

https://doi.org/10.1109/IPDPS53621.2022.00043

Chowdhury, Fahim; Di Natale, Francesco; Moody, Adam; Mohror, Kathryn; Yu, Weikuan (May 2022, 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

Scientific research and development campaigns are materialized by workflows of applications executing on high-performance computing (HPC) systems. These applications consist of tasks that can have inter- or intra-application flows of data to achieve the research goals successfully. These dataflows create dependencies among the tasks and cause resource contention on shared storage systems, thus limiting the aggregated I/O bandwidth achieved by the workflow. However, these I/O performance issues are often solved by tedious and manual efforts that demand holistic knowledge about the data dependencies in the workflow and the information about the infrastructure being utilized. Taking this into consideration, we design DFMan, a graph-based dataflow management and optimization framework for maximizing I/O bandwidth by leveraging the powerful storage stack on HPC systems to manage data sharing optimally among the tasks in the workflows. In particular, we devise a graph-based optimization algorithm that can leverage an intuitive graph representation of dataflow- and system-related information, and automatically carry out co-scheduling of task and data placement. According to our experiments, DFMan optimizes a wide variety of scientific workflows such as Hurricane 3D on Cloud Model 1 (CM1), Montage Carina Nebula (NGC3372), and an emulated dataflow kernel of the Multiscale Machine-learned Modeling Infrastructure (MuMMI I/O) on the Lassen supercomputer, and improves their aggregated I/O bandwidth by up to 5.42x, 2.12x, and 1.29x, respectively, compared to the baseline bandwidth.
Full Text Available
O(1) Communication for Distributed SGD through Two-Level Gradient Averaging

https://doi.org/10.1109/Cluster48925.2021.00054

Bhattacharya, Subhadeep; Yu, Weikuan; Chowdhury, Fahim Tahmid; Mohror, Kathryn (September 2021, 2021 IEEE International Conference on Cluster Computing (CLUSTER))

Full Text Available
Understanding a program's resiliency through error propagation

https://doi.org/10.1145/3437801.3441589

Li, Zhimin; Menon, Harshitha; Mohror, Kathryn; Bremer, Peer-Timo; Livnat, Yarden; Pascucci, Valerio (February 2021, PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)

Full Text Available
Understanding HPC Application I/O Behavior Using System Level Statistics

https://doi.org/10.1109/HiPC50609.2020.00034

Paul, Arnab K.; Faaland, Olaf; Moody, Adam; Gonsiorowski, Elsa; Mohror, Kathryn; Butt, Ali R. (December 2020, IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC))

Full Text Available
SpotSDC: Revealing the Silent Data Corruption Propagation in High-performance Computing Systems

https://doi.org/10.1109/TVCG.2020.2994954

Li, Zhimin; Menon, Harshitha; Maljovec, Dan; Livnat, Yarden; Liu, Shusen; Mohror, Kathryn; Bremer, Peer-Timo; Pascucci, Valerio (October 2020, IEEE Transactions on Visualization and Computer Graphics)
Full Text Available
Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning

Dey, Tonmoy; Sato, Kento; Nicolae, Bogdan; Guo, Jian; Domke, Jens; Yu, Weikuan; Cappello, Franck; Mohror, Kathryn (May 2020, The IEEE International Workshop on High-Performance Storage)

With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to gain efficiency. However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multi-level checkpointing efficiently, it is important to optimize checkpoint/restart configurations. Current approaches, namely modeling and simulation, are either inaccurate or slow in determining the optimal configuration for a large-scale system. In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve the performance in optimizing checkpoint configurations.
Full Text Available
Improving I/O Performance of HPC Application Using Intra-Job Scheduling

Paul, Arnab K.; Moody, Adam; Gonsiorowski, Elsa; Mohror, Kathryn; Butt, Ali R. (November 2019, Proceedings of the 4th Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISC'19) in conjunction with SC'19)
Full Text Available
