Throughout the cyberinfrastructure community there is a wide range of resources available to train faculty and young scholars in the successful utilization of computational resources for research. The challenge the community faces is that training materials abound, but they can be difficult to find and often carry little information about the quality or relevance of the offerings. Building on existing software technology, we propose a way for the community to better share and find training and education materials through a federated training repository. In this scenario, organizations and authors retain physical and legal ownership of their materials by sharing only catalog information, organizations can refine local portals to use the best and most appropriate materials from both local and remote sources, and learners can take advantage of materials that are reviewed and described more clearly. In this paper, we introduce the HPC ED pilot project, a federated training repository designed to allow resource providers, campus portals, schools, and other institutions both to incorporate training from multiple sources into their own familiar interfaces and to publish their local training materials.
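To make the federation model concrete: providers publish only descriptive catalog records while the materials themselves stay on the providers' own sites, and a portal merges records from its local catalog with those harvested from remote catalogs before filtering them for a learner. The minimal Python sketch below illustrates this flow; the field names, URLs, and ratings are invented for illustration and are not the actual HPC ED schema or API.

    # Hypothetical catalog records: only descriptive metadata is shared;
    # each material itself remains at the owning provider's URL.
    local_catalog = [
        {"title": "Introduction to Batch Scheduling",
         "provider": "Campus HPC Center",
         "url": "https://hpc.example.edu/training/batch-intro",
         "audience": "beginner", "rating": 4.6},
    ]
    remote_catalogs = [
        [{"title": "GPU Programming Fundamentals",
          "provider": "National Resource Provider",
          "url": "https://training.example.org/gpu-fundamentals",
          "audience": "intermediate", "rating": 4.8}],
    ]

    def merged_view(local, remotes, audience=None, min_rating=0.0):
        """Combine local and harvested remote catalog records, then filter and rank them."""
        records = list(local)
        for catalog in remotes:
            records.extend(catalog)
        keep = [r for r in records
                if (audience is None or r["audience"] == audience)
                and r["rating"] >= min_rating]
        return sorted(keep, key=lambda r: r["rating"], reverse=True)

    # A portal would render this merged view in its own familiar interface.
    for record in merged_view(local_catalog, remote_catalogs, min_rating=4.5):
        print(record["title"], "-", record["provider"], "-", record["url"])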
Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads
- Award ID(s): 2103986
- PAR ID: 10651483
- Publisher / Repository: ACM
- Date Published:
- Page Range / eLocation ID: 2245 to 2256
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons™ Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization and communication scheduling, enabling overall >10× (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O and network behaviour to parameterize extended roofline performance models in future rounds.
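As a simplified illustration of the end-to-end accounting emphasized above (this is not MLPerf HPC's actual measurement harness), the sketch below separates data-staging time from the training phase and reports both alongside time-to-convergence; the staging step, training loop, and numbers are placeholders.

    import time

    def stage_dataset():
        # Placeholder for copying the dataset to near-compute (e.g., node-local) storage.
        time.sleep(0.1)

    def train_until_converged(target_quality=0.9):
        # Placeholder training loop that stops once a target quality metric is reached.
        quality, epochs = 0.0, 0
        while quality < target_quality:
            epochs += 1
            quality += 0.15  # stand-in for the quality gained in one epoch
        return epochs

    t0 = time.perf_counter()
    stage_dataset()
    t_staged = time.perf_counter()
    epochs = train_until_converged()
    t_end = time.perf_counter()

    # End-to-end time-to-train includes staging, not just the compute phase.
    print(f"staging:    {t_staged - t0:.2f} s")
    print(f"training:   {t_end - t_staged:.2f} s over {epochs} epochs")
    print(f"end-to-end: {t_end - t0:.2f} s")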
HPC networks and campus networks are beginning to leverage various levels of network programmability, ranging from programmable network configuration (e.g., NETCONF/YANG, SNMP, OF-CONFIG) to software-based controllers (e.g., OpenFlow Controllers) to dynamic function placement via network function virtualization (NFV). While programmable networks offer new capabilities, they also make the network more difficult to debug. When applications experience unexpected network behavior, there is no established method to investigate the cause in a programmable network, and many of the conventional troubleshooting and debugging tools (e.g., ping and traceroute) can turn out to be completely useless. This absence of troubleshooting tools that support programmability is a serious challenge for researchers trying to understand the root cause of their networking problems. This paper explores the challenges of debugging an all-campus science DMZ network that leverages SDN-based network paths for high-performance flows. We propose Flow Tracer, a light-weight, data-plane-based debugging tool for SDN-enabled networks that allows end users to dynamically discover how the network is handling their packets. In particular, we focus on solving the problem of identifying an SDN path by using actual packets from the flow being analyzed, as opposed to existing expensive approaches where either probe packets are injected into the network or actual packets are duplicated for tracing purposes. Our simulation experiments show that Flow Tracer has negligible impact on the performance of monitored flows. Moreover, our tool can be extended to obtain further information about the actual switch behavior, topology, and other flow information without privileged access to the SDN control plane.
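The core idea of following an actual packet hop by hop, rather than injecting probes, can be sketched with a toy flow-table model; the topology, rule format, and matching logic below are invented for illustration and are not Flow Tracer's actual data-plane mechanism.

    # Toy model: each switch holds flow rules tried in priority order; a rule applies
    # when all of its match fields equal the packet's header fields.
    packet = {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.7", "dst_port": 443}

    switches = {
        "s1": [{"match": {"dst_ip": "10.0.2.7"}, "next_hop": "s2"}],
        "s2": [{"match": {"dst_ip": "10.0.2.7", "dst_port": 443}, "next_hop": "s3"},
               {"match": {"dst_ip": "10.0.2.7"}, "next_hop": "s4"}],
        "s3": [{"match": {"dst_ip": "10.0.2.7"}, "next_hop": None}],  # delivered locally
    }

    def rule_matches(rule, pkt):
        return all(pkt.get(field) == value for field, value in rule["match"].items())

    def trace_path(pkt, ingress):
        # Follow the packet using each switch's own forwarding state, the way an
        # actual packet from the monitored flow would be forwarded.
        path, current = [], ingress
        while current is not None:
            path.append(current)
            rule = next((r for r in switches[current] if rule_matches(r, pkt)), None)
            current = rule["next_hop"] if rule else None
        return path

    print(" -> ".join(trace_path(packet, "s1")))  # s1 -> s2 -> s3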
Open OnDemand is an open source project designed to lower the barrier to HPC use across many diverse disciplines. Here we describe the main features of the platform, give several use cases of Open OnDemand, and discuss how we measure success. We end the paper with a discussion of the future project roadmap. Pre-conference paper submitted to the ISC19 Workshop on Interactive High-Performance Computing.