Search for: All records

Creators/Authors contains: "Mohror, Kathryn"

A Visual Comparison of Silent Error Propagation

https://doi.org/10.1109/TVCG.2022.3230636

Li, Zhimin ; Menon, Harshitha ; Mohror, Kathryn ; Liu, Shusen ; Guo, Luanzheng ; Bremer, Peer-Timo ; Pascucci, Valerio ( December 2022 , IEEE Transactions on Visualization and Computer Graphics)

DFMan: A Graph-based Optimization of Dataflow Scheduling on High-Performance Computing Systems

https://doi.org/10.1109/IPDPS53621.2022.00043

Chowdhury, Fahim ; Di Natale, Francesco ; Moody, Adam ; Mohror, Kathryn ; Yu, Weikuan ( May 2022 , 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

Scientific research and development campaigns are materialized by workflows of applications executing on high-performance computing (HPC) systems. These applications consist of tasks that can have inter- or intra-application flows of data to achieve the research goals successfully. These dataflows create dependencies among the tasks and cause resource contention on shared storage systems, thus limiting the aggregated I/O bandwidth achieved by the workflow. However, these I/O performance issues are often solved by tedious and manual efforts that demand holistic knowledge about the data dependencies in the workflow and the information about the infrastructure being utilized. Taking this into consideration, we design DFMan, a graph-based dataflow management and optimization framework for maximizing I/O bandwidth by leveraging the powerful storage stack on HPC systems to manage data sharing optimally among the tasks in the workflows. In particular, we devise a graph-based optimization algorithm that can leverage an intuitive graph representation of dataflow- and system-related information, and automatically carry out co-scheduling of task and data placement. According to our experiments, DFMan optimizes a wide variety of scientific workflows such as Hurricane 3D on Cloud Model 1 (CM1), Montage Carina Nebula (NGC3372), and an emulated dataflow kernel of the Multiscale Machine-learned Modeling Infrastructure (MuMMI I/O) on the Lassen supercomputer, and improves their aggregated I/O bandwidth by up to 5.42x, 2.12x, and 1.29x, respectively, compared to the baseline bandwidth.
O(1) Communication for Distributed SGD through Two-Level Gradient Averaging

https://doi.org/10.1109/Cluster48925.2021.00054

Bhattacharya, Subhadeep ; Yu, Weikuan ; Chowdhury, Fahim Tahmid ; Mohror, Kathryn ( September 2021 , 2021 IEEE International Conference on Cluster Computing (CLUSTER))

Understanding a program's resiliency through error propagation

https://doi.org/10.1145/3437801.3441589

Li, Zhimin ; Menon, Harshitha ; Mohror, Kathryn ; Bremer, Peer-Timo ; Livnat, Yarden ; Pascucci, Valerio ( February 2021 , PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)

Understanding HPC Application I/O Behavior Using System Level Statistics

https://doi.org/10.1109/HiPC50609.2020.00034

Paul, Arnab K. ; Faaland, Olaf ; Moody, Adam ; Gonsiorowski, Elsa ; Mohror, Kathryn ; Butt, Ali R. ( December 2020 , IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC))

SpotSDC: Revealing the Silent Data Corruption Propagation in High-performance Computing Systems

https://doi.org/10.1109/TVCG.2020.2994954

Li, Zhimin ; Menon, Harshitha ; Maljovec, Dan ; Livnat, Yarden ; Liu, Shusen ; Mohror, Kathryn ; Bremer, Peer-Timo ; Pascucci, Valerio ( October 2020 , IEEE Transactions on Visualization and Computer Graphics)
Optimizing Asynchronous Multi-Level Checkpoint/Restart Configurations with Machine Learning

Dey, Tonmoy ; Sato, Kento ; Nicolae, Bogdan ; Guo, Jian ; Domke, Jens ; Yu, Weikuan ; Cappello, Franck ; Mohror, Kathryn ( May 2020 , The IEEE International Workshop on High-Performance Storage)

With the emergence of versatile storage systems, multi-level checkpointing (MLC) has become a common approach to gain efficiency. However, multi-level checkpoint/restart can cause enormous I/O traffic on HPC systems. To use multi-level checkpointing efficiently, it is important to optimize checkpoint/restart configurations. Current approaches, namely modeling and simulation, are either inaccurate or slow in determining the optimal configuration for a large-scale system. In this paper, we show that machine learning models can be used in combination with accurate simulation to determine the optimal checkpoint configurations. We also demonstrate that more advanced techniques such as neural networks can further improve the performance in optimizing checkpoint configurations.
Improving I/O Performance of HPC Application Using Intra-Job Scheduling

Paul, Arnab K. ; Moody, Adam ; Gonsiorowski, Elsa ; Mohror, Kathryn ; Butt, Ali R. ( November 2019 , Proceedings of the 4th Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISC'19) in conjunction with SC'19)
Efficient User-Level Storage Disaggregation for Deep Learning

https://doi.org/10.1109/CLUSTER.2019.8891023

Zhu, Yue ; Yu, Weikuan ; Jiao, Bing ; Mohror, Kathryn ; Moody, Adam ; Chowdhury, Fahim ( September 2019 , 2019 IEEE International Conference on Cluster Computing (CLUSTER))

On large-scale high performance computing (HPC) systems, applications are provisioned with aggregated resources to meet their peak demands for brief periods. This results in resource underutilization because application requirements vary considerably during execution. This problem is particularly pronounced for deep learning applications that are running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need to load many small samples randomly for successful training. We have designed a specialized Deep Learning File System (DLFS) that provides a thin set of APIs. In particular, we design the metadata management of DLFS through an in-memory tree-based sample directory and its file services through the user-level SPDK protocol that can disaggregate the capabilities of NVM Express (NVMe) devices to parallel training tasks. Our experimental results show that DLFS can dramatically improve the throughput of training for deep neural networks on NVMe over Fabric, compared with the kernel-based Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization.
I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning

https://doi.org/10.1145/3337821.3337902

Chowdhury, Fahim ; Zhu, Yue ; Heer, Todd ; Paredes, Saul ; Moody, Adam ; Goldstone, Robin ; Mohror, Kathryn ; Yu, Weikuan ( August 2019 , ICPP 2019: Proceedings of the 48th International Conference on Parallel Processing)

Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications impose new I/O and storage requirements on HPC systems with batched input of small random files. This requires PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has attracted the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, metadata and DL application benchmarks. In particular, we have utilized AlexNet and ResNet-50 models for the classification of the ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and an ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides useful documentation on how to leverage BeeGFS for the emerging DL applications.