LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.
more »
« less
High-performance symbolic-numerics via multiple dispatch
As mathematical computing becomes more democratized in high-level languages, high-performance symbolic-numeric systems are necessary for domain scientists and engineers to get the best performance out of their machine without deep knowledge of code optimization. Naturally, users need different term types either to have different algebraic properties for them, or to use efficient data structures. To this end, we developed Symbolics.jl, an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs. In this work we detail an underlying abstract term interface which allows for speed without sacrificing generality. We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system without changing the pre-existing term rewriters. We showcase how this can be used to optimize term construction and give a 113x acceleration on general symbolic transformations. Further, we show that such a generic API allows for complementary term-rewriting implementations. Exploiting this feature, we demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers. We illustrate how this symbolic system improves numerical computing tasks by showcasing an e-graph ruleset which minimizes the number of CPU cycles during expression evaluation, and demonstrate how it simplifies a real-world reaction-network simulation to halve the runtime. Additionally, we show a reaction-diffusion partial differential equation solver which is able to be automatically converted into symbolic expressions via multiple dispatch tracing, which is subsequently accelerated and parallelized to give a 157x simulation speedup. Together, this presents Symbolics.jl as a next-generation symbolic-numeric computing environment geared towards modeling and simulation.
more »
« less
- Award ID(s):
- 1835443
- PAR ID:
- 10387777
- Date Published:
- Journal Name:
- ACM Communications in Computer Algebra
- Volume:
- 55
- Issue:
- 3
- ISSN:
- 1932-2240
- Page Range / eLocation ID:
- 92 to 96
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, which include the recent efforts targeting the GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on the GPUs. Finally, for the numeric factorization phase, we increase the parallelism degree by removing the memory limits for large matrices as compared to the existing implementation approaches. Experimental results show that compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13--32.65X. Further, our out-of-core implementation achieves a speedup of 1.2--2.2 over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.more » « less
-
Modern machine learning frameworks support very large models by incorporating parallelism and optimization techniques. Yet, these very techniques add new layers of complexity in ensuring the correctness of the computation. An incorrect implementation of these techniques might lead to compile-time or runtime errors that can easily be observed and fixed, but it might also lead to silent errors that will result in incorrect computations in training or inference, which do not exhibit any obvious symptom until the model is used later. These subtle errors not only waste computation resources, but involve significant developer effort to detect and diagnose. In this work, we propose Aerify, a framework to automatically expose silent errors by verifying semantic equivalence of models with equality saturation. Aerify constructs equivalence graphs (e-graphs) from intermediate representations of tensor programs, and incrementally applies rewriting rules---derived from generic templates and refined via domain-specific analysis---to prove or disprove equivalence at scale. When discrepancies remain unproven, Aerify pinpoints the corresponding graph segments and maps them back to source code, simplifying debugging and reducing developer overhead. Our preliminary results show strong potentials of Aerify in detecting real-world silent errors.more » « less
-
Roy, Sudeepa; Kara, Ahmet (Ed.)Equality saturation is an emerging technique for program and query optimization developed in the programming language community. It performs term rewriting over an E-graph, a data structure that compactly represents a program space. Despite its popularity, the theory of equality saturation lags behind the practice. In this paper, we define a fixpoint semantics of equality saturation based on tree automata and uncover deep connections between equality saturation and the chase. We characterize the class of chase sequences that correspond to equality saturation. We study the complexities of terminations of equality saturation in three cases: single-instance, all-term-instance, and all-E-graph-instance. Finally, we define a syntactic criterion based on acyclicity that implies equality saturation termination.more » « less
-
The majority of computer algebra systems (CAS) support symbolic integration using a combination of heuristic algebraic and rule-based (integration table) methods. In this paper, we present a hybrid (symbolic-numeric) method to calculate the indefinite integrals of univariate expressions. Our method is broadly similar to the Risch-Norman algorithm. The primary motivation for this work is to add symbolic integration functionality to a modern CAS (the symbolic manipulation packages of SciML, the Scientific Machine Learning ecosystem of the Julia programming language), which is designed for numerical and machine learning applications. The symbolic part of our method is based on the combination of candidate terms generation (ansatz generation using a methodology borrowed from the Homotopy operators theory) combined with rule-based expression transformations provided by the underlying CAS. The numeric part uses sparse regression, a component of the Sparse Identification of Nonlinear Dynamics (SINDy) technique, to find the coefficients of the candidate terms. We show that this system can solve a large variety of common integration problems using only a few dozen basic integration rules.more » « less
An official website of the United States government

