 NSFPAR ID:
 10320005
 Date Published:
 Journal Name:
 Proceedings of the VLDB Endowment
 Volume:
 14
 Issue:
 8
 ISSN:
2150-8097
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation

Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.
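The core idea of block-sparse binary tensor contraction can be illustrated with a minimal sketch (the tile size, storage layout, and function names below are illustrative assumptions, not the paper's PaRSEC implementation): each operand stores only its nonzero dense tiles, keyed by block coordinates, and the contraction touches only pairs of stored tiles with matching inner block indices.

```python
import numpy as np

BS = 2  # tile edge length (illustrative)

def block_sparse_contract(A, B):
    """C[i,k] += A[i,j] @ B[j,k], iterating over stored (nonzero) tiles only."""
    C = {}
    for (i, j), a in A.items():
        for (j2, k), b in B.items():
            if j == j2:  # inner block indices must match to contribute
                if (i, k) not in C:
                    C[(i, k)] = np.zeros((BS, BS))
                C[(i, k)] += a @ b
    return C

# Two 2x2-tile operands, each with structurally zero tiles omitted.
A = {(0, 0): np.arange(4.0).reshape(BS, BS), (1, 1): np.eye(BS)}
B = {(0, 0): np.ones((BS, BS)), (1, 0): 2 * np.eye(BS)}
C = block_sparse_contract(A, B)  # only blocks (0,0) and (1,0) are produced
```

A task-based runtime such as PaRSEC would schedule each `a @ b` tile product as an independent task rather than looping sequentially, which is what makes the irregular structure scalable.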

Tensor contractions are ubiquitous in computational chemistry and physics, where tensors generally represent states or operators and contractions express the algebra of these quantities. In this context, the states and operators often preserve physical conservation laws, which are manifested as group symmetries in the tensors. These group symmetries imply that each tensor has block sparsity and can be stored in a reduced form. For nontrivial contractions, the memory footprint and cost are lowered, respectively, by a linear and a quadratic factor in the number of symmetry sectors. State-of-the-art tensor contraction software libraries exploit this opportunity by iterating over blocks or using general block-sparse tensor representations. Both approaches entail overhead in performance and code complexity. With intuition aided by tensor diagrams, we present a technique, irreducible representation alignment, which enables efficient handling of Abelian group symmetries via only dense tensors, by using contraction-specific reduced forms. This technique yields a general algorithm for arbitrary group symmetric contractions, which we implement in Python and apply to a variety of representative contractions from quantum chemistry and tensor network methods. As a consequence of relying on only dense tensor contractions, we can easily make use of efficient batched matrix multiplication via Intel's MKL and distributed tensor contraction via the Cyclops library, achieving good efficiency and parallel scalability on up to 4096 Knights Landing cores of a supercomputer.
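How an Abelian group symmetry induces block sparsity can be sketched in a few lines (a toy Z_2 example, not the paper's irreducible-representation-alignment algorithm): when each index carries a conserved charge, a symmetric matrix operator is block-diagonal in the symmetry basis, so it can be stored as one dense block per sector and contracted sector by sector.

```python
import numpy as np

# charge -> dimension of that symmetry sector (illustrative sizes)
sectors = {0: 3, 1: 2}

rng = np.random.default_rng(0)
# A charge-conserving operator is nonzero only where q(row) == q(col),
# so its reduced form is just one dense block per charge.
A = {q: rng.random((d, d)) for q, d in sectors.items()}
B = {q: rng.random((d, d)) for q, d in sectors.items()}

# The contraction pairs only blocks of equal charge. For fixed total
# dimension, cost drops roughly from (sum d)^3 to sum d^3 -- the
# quadratic saving in the number of sectors mentioned above.
C = {q: A[q] @ B[q] for q in sectors}
```

Each `A[q] @ B[q]` is an ordinary dense multiply, which is why reduced forms let symmetric contractions be delegated to dense libraries such as MKL batched GEMM.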

The relational data model was designed to facilitate large-scale data management and analytics. We consider the problem of how to differentiate computations expressed relationally. We show experimentally that a relational engine running an auto-differentiated relational algorithm can easily scale to very large datasets, and is competitive with state-of-the-art, special-purpose systems for large-scale distributed machine learning.
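The premise of differentiating relational computations can be sketched with a toy example (the relation, loss, and update rule are illustrative assumptions, not the paper's system): when a loss is a relational aggregate over a table of tuples, its derivative is itself a relational aggregate over the same table, so a relational engine can evaluate both with the same scale-out machinery.

```python
# Relation R(x, y): training pairs for a one-parameter linear model w*x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

def loss(w):
    # SELECT SUM((w*x - y)^2) FROM R
    return sum((w * x - y) ** 2 for x, y in data)

def grad(w):
    # Differentiating the aggregate term by term yields another
    # aggregate over the same relation:
    # SELECT SUM(2*(w*x - y)*x) FROM R
    return sum(2 * (w * x - y) * x for x, y in data)

# One gradient-descent step, expressed entirely as relational aggregates.
w = 0.0
w -= 0.01 * grad(w)
```

Auto-differentiation mechanizes the step done by hand in `grad`: it rewrites the loss's relational plan into a plan for the gradient, which the engine then distributes like any other query.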

Poole, Steve; Hernandez, Oscar; Baker, Matthew; Curtis, Tony (Eds.) SHMEM-ML is a domain-specific library for distributed array computations and machine learning model training & inference. Like other projects at the intersection of machine learning and HPC (e.g., Dask, Arkouda, Legate NumPy), SHMEM-ML aims to leverage the performance of the HPC software stack to accelerate machine learning workflows. However, it differs in a number of ways. First, SHMEM-ML targets the full machine learning workflow, not just model training. It supports a general-purpose ndarray abstraction commonly used in Python machine learning applications, and efficiently distributes transformation and manipulation of this ndarray across the full system. Second, SHMEM-ML uses OpenSHMEM as its underlying communication layer, enabling high performance networking across hundreds or thousands of distributed processes. While most past work in high performance machine learning has leveraged HPC message passing communication models as a way to efficiently exchange model gradient updates, SHMEM-ML's focus on the full machine learning lifecycle means that a more flexible and adaptable communication model is needed to support both fine- and coarse-grain communication. Third, SHMEM-ML works to interoperate with the broader Python machine learning software ecosystem. While some frameworks aim to rebuild that ecosystem from scratch on top of the HPC software stack, SHMEM-ML is built on top of Apache Arrow, an in-memory standard for data formatting and data exchange between libraries. This enables SHMEM-ML to share data with other libraries without creating copies of data. This paper describes the design, implementation, and evaluation of SHMEM-ML, demonstrating a general-purpose system for data transformation and manipulation while achieving up to a 38× speedup in distributed training performance relative to the industry standard Horovod framework without a regression in model metrics.
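The zero-copy interoperability that motivates building on Apache Arrow can be illustrated with a minimal sketch (using NumPy views over one shared buffer purely for illustration; SHMEM-ML itself exchanges Arrow-formatted buffers, not NumPy arrays): two libraries that agree on an in-memory layout can view the same bytes, so handing data across the library boundary requires no serialization and no copy.

```python
import numpy as np

# One shared 32-byte buffer holding eight float32 values.
buf = bytearray(8 * 4)

# "Library A" and "library B" each wrap the same memory as an array.
view_a = np.frombuffer(buf, dtype=np.float32)
view_b = np.frombuffer(buf, dtype=np.float32)

view_a[:] = [1, 2, 3, 4, 5, 6, 7, 8]
# B observes A's write immediately: same memory, zero copies made.
```

A standardized layout such as Arrow's generalizes this from one process's views to exchange between independently developed libraries, which is what lets SHMEM-ML avoid copying data in and out of the Python ecosystem.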