<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra</title></titleStmt>
			<publicationStmt>
				<publisher>Conference on Neural Information Processing Systems (NeurIPS) 2023</publisher>
				<date>12/10/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10535867</idno>
					<idno type="doi"></idno>
					
					<author>Andres Potapczynski</author><author>Marc Finzi</author><author>Geoff Pleiss</author><author>Andrew Wilson</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Many areas of machine learning and science involve large linear algebra problems, such as eigendecompositions, solving linear systems, computing matrix exponentials, and trace estimation. The matrices involved often have Kronecker, convolutional, block diagonal, sum, or product structure. In this paper, we propose a simple but general framework for large-scale linear algebra problems in machine learning, named CoLA (Compositional Linear Algebra). By combining a linear operator abstraction with compositional dispatch rules, CoLA automatically constructs memory and runtime efficient numerical algorithms. Moreover, CoLA provides memory efficient automatic differentiation, low precision computation, and GPU acceleration in both JAX and PyTorch, while also accommodating new objects, operations, and rules in downstream packages via multiple dispatch. CoLA can accelerate many algebraic operations, while making it easy to prototype matrix structures and algorithms, providing an appealing drop-in tool for virtually any computational effort that requires linear algebra. We showcase its efficacy across a broad range of applications, including partial differential equations, Gaussian processes, equivariant model construction, and unsupervised learning.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The framework of automatic differentiation has revolutionized machine learning. Although the rules that govern derivatives have long been known, automatically computing derivatives was a nontrivial process that required (1) efficient implementations of base-case primitive derivatives, (2) software abstractions (autograd and computation graphs) to compose these primitives into complex computations, and (3) a mechanism for users to modify or extend compositional rules to new functions. Once libraries such as PyTorch, Chainer, Tensorflow, JAX, and others <ref type="bibr">[1,</ref><ref type="bibr">8,</ref><ref type="bibr">30,</ref><ref type="bibr">31,</ref><ref type="bibr">38,</ref><ref type="bibr">47]</ref> figured out the correct abstractions, the impact was enormous. Efforts that previously went into deriving and implementing gradients could be repurposed into developing new models.</p><p>In this paper, we automate another notorious bottleneck for ML methods: performing large-scale linear algebra (e.g. matrix solves, eigenvalue problems, nullspace computations). These ubiquitous operations are at the heart of principal component analysis, Gaussian processes, normalizing flows, equivariant neural networks, and many other applications <ref type="bibr">[2,</ref><ref type="bibr">12,</ref><ref type="bibr">13,</ref><ref type="bibr">17,</ref><ref type="bibr">18,</ref><ref type="bibr">27,</ref><ref type="bibr">28,</ref><ref type="bibr">34,</ref><ref type="bibr">37,</ref><ref type="bibr">39]</ref>. Modeling assumptions frequently manifest themselves as algebraic structure, such as diagonal dominance, sparsity, or a low-rank factorization. Given a structure (e.g., the sum of low-rank plus diagonal matrices) and a linear algebraic operation (e.g., linear solves), there is often a computational routine (e.g. 
the linear-time Woodbury inversion formula) with lower computational complexity than a general-purpose routine (e.g., the cubic-time Cholesky decomposition).</p><p>Table <ref type="table">1</ref>: Many structures have explicit composition rules to exploit. The table marks, for each combination of matrix structure (columns) and linear algebraic operation (rows), where a dispatch rule exists that accelerates the operation beyond what is possible with the dense and iterative base cases. Many combinations are accelerated automatically as a consequence of other rules, since, for example, Eigs and Diag are used inside other routines. In the absence of a rule, the operation falls back to the iterative or dense base case. Columns are basic linear operator types (D: Diagonal, T: Triangular, P: Permutation, C: Convolution, S: Sparse, Pr: Projection) and composition operators (sum, product, Kronecker product, block diagonal, and concatenation). All compositional rules can be mixed and matched and are implemented through multiple dispatch.</p><p>However, exploiting structure for faster computation is often an intensive implementation process. Rather than having an object A in code that represents a low-rank-plus-diagonal matrix and simply calling solve(A, b), a practitioner must instead store the low-rank factor F as a matrix and the diagonal d as a vector, and implement the Woodbury formula from scratch. Implementing structure-aware routines in machine learning models is often seen as a major research undertaking. For example, a nontrivial portion of the Gaussian process literature is devoted to deriving specialty inference algorithms for structured kernel matrices [e.g. 
<ref type="bibr">7,</ref><ref type="bibr">11,</ref><ref type="bibr">19,</ref><ref type="bibr">25,</ref><ref type="bibr">29,</ref><ref type="bibr">46,</ref><ref type="bibr">52,</ref><ref type="bibr">53,</ref><ref type="bibr">24]</ref>.</p><p>As with automatic differentiation, structure-aware linear algebra is ripe for automation. We introduce a general numerical framework that dramatically simplifies implementation effort while achieving a high degree of computational efficiency. In code, we represent structured matrices as LinearOperator objects, which adhere to the same API as standard dense matrices. For example, a user can call A⁻¹b or eig(A) on any LinearOperator A, and under the hood our framework derives a computationally efficient algorithm built from our set of compositional dispatch rules (see Table <ref type="table">1</ref>). If little is known about A, the derived algorithm reverts to a general-purpose base case (e.g. Gaussian elimination or GMRES for linear solves). Conversely, if A is known to be the Kronecker product of a lower triangular matrix and a positive definite Toeplitz matrix, for example, the derived algorithm uses specialty algorithms for Kronecker, triangular, and positive definite matrices. Through this compositional pattern matching, our framework can match or outperform special-purpose implementations across numerous applications despite relying on only a small number of base LinearOperator types.</p><p>Furthermore, our framework offers additional novel functionality that is necessary for ML applications (see Table <ref type="table">2</ref>). In particular, we automatically compute gradients, diagonals, transposes, and adjoints of linear operators, and we modify classic iterative algorithms to ensure numerical stability in low precision. 
We also support specialty algorithms, such as SVRG <ref type="bibr">[23]</ref> and a novel variation of Hutchinson's diagonal estimator <ref type="bibr">[22]</ref>, which exploit implicit structure common to matrices in machine learning applications (namely, the ability to express matrices as large-scale sums amenable to stochastic approximations). Moreover, our framework is easily extensible in both directions: a user can implement a new linear operator (i.e. one column in Table <ref type="table">1</ref>), or a new linear algebraic operation (i.e. one row in Table <ref type="table">1</ref>). Finally, our routines benefit from GPU and TPU acceleration and apply to symmetric and non-symmetric operators for both real and complex numbers.</p><p>We term our framework CoLA (Compositional Linear Algebra), which we package in a library that supports both PyTorch and JAX. We showcase the extraordinary versatility of CoLA with a broad range of applications in Section 3.2 and Section 4, including: PCA, spectral clustering, multi-task Gaussian processes, equivariant models, neural PDEs, random Fourier features, and PDEs like minimal surface problems or the Schr&#246;dinger equation. Not only does CoLA provide performance competitive with specialized packages, it also provides significant speedups, especially in applications with compositional structure (Kronecker, block diagonal, product, etc.). Our package is available at <ref type="url">https://github.com/wilson-labs/cola</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and Related Work</head><p>Structured matrices Structure appears throughout machine learning applications, either occurring naturally through properties of the data, or artificially as a constraint to reduce complexity. 
A nonexhaustive list of examples includes: (1) low-rank matrices, which admit efficient solves and determinants [54]; (2) sparse matrices, which admit fast methods for linear solves and eigenvalue problems [14, 44]; (3) Kronecker-factorizable matrices, which admit efficient spectral decompositions; (4) Toeplitz or circulant matrices, which admit fast matrix-vector products. See Section 3 and Section 4 for applications that use these structures. Beyond these explicit types, we also consider implicit structures, such as matrices with clustered eigenvalues or matrices with simple unbiased estimates. Though these implicit structures do not always fall into straightforward categorizations, it is possible to design algorithms that exploit their inherent properties (see Section 3.3).</p><p>Iterative matrix-free algorithms Unlike direct methods, which typically require dense instantiations of matrices, matrix-free algorithms only access matrices through routines that perform matrix-vector multiplies (MVMs) [e.g. 44]. The most common matrix-free algorithms, such as conjugate gradients, GMRES, Lanczos, and Arnoldi iteration, fall under the category of Krylov subspace methods, which iteratively apply MVMs to refine a solution until a desired error tolerance is achieved. Though the rate of convergence depends on the conditioning or spectrum of the matrix, the number of iterations required is often much less than the size of the matrix. These algorithms often provide significant computational speedups for structured matrices that admit sub-quadratic MVMs (e.g. sparse, circulant, Toeplitz, etc.) or when using accelerated hardware (GPUs or TPUs) designed for efficient parallel MVMs [e.g. 10, 20, 51].</p><p>Multiple dispatch Popularized by Julia [6], multiple dispatch is a functional programming paradigm for defining type-specific behaviors. Under this paradigm, a given function (e.g. 
solve) can have multiple definitions, each of which is specific to a particular set of input types. A base-case definition solve[LinearOperator] would use a generic matrix-vector solve algorithm (e.g. Gaussian elimination or GMRES), while a type-specific definition (e.g. solve[Sum], for sums of matrices) would use a special-purpose algorithm that makes use of the subclass's structure (e.g. SVRG, see Section 3.3). When a user calls solve(A, b) at runtime, the dispatcher determines which definition of solve to use based on the types of A and b. Crucially, dispatch rules can be written for compositional patterns of types. For example, a solve[Sum[LowRank, Diagonal]] function will apply the Woodbury formula to a Sum operator that composes LowRank and Diagonal matrices. (In contrast, under an inheritance paradigm, one would need to define a specific SumOfLowRankAndDiagonal sub-class that uses the Woodbury formula, rather than relying on the composition of general-purpose types.)</p><p>Existing frameworks for exploiting structure Achieving fast computations with structured matrices is often a manual effort. Consider, for example, second-order or natural-gradient optimization, which requires matrix solves with (potentially large) Hessian or Fisher matrices.</p><p>Researchers have proposed tackling these solves with matrix-free methods <ref type="bibr">[33]</ref>, diagonal approximations [e.g. 4], low-rank approximations [e.g. 42], or Kronecker-factorizable approximations <ref type="bibr">[34]</ref>. Despite their commonality (relying on structure for fast solves), all of these methods currently require different implementations, reducing interoperability and adding overhead to experimenting with new structured approximations. 
As an alternative, there are existing libraries like SciPy Sparse <ref type="bibr">[50]</ref>, Spot <ref type="bibr">[49]</ref>, PyLops <ref type="bibr">[41]</ref>, or GPyTorch <ref type="bibr">[20]</ref>, which offer a unified interface for using matrix-free algorithms with any type of structured matrix. A user provides an efficient MVM function for a given matrix and then chooses the appropriate iterative method (e.g. conjugate gradients or GMRES) to perform the desired operation (e.g. a linear solve). With these libraries, a user can adapt to different structures simply by changing the MVM routine. However, this increased interoperability comes at the cost of efficiency, as the iterative routines are not optimal for every type of structure. (For example, Kronecker products admit efficient inverses that are asymptotically faster than conjugate gradients; see Figure 1.) Moreover, these libraries often lack modern features (e.g. GPU acceleration or automatic differentiation) or are specific to certain types of matrices (see Table 2).</p><p>Table 2: Comparison of scalable linear algebra libraries in terms of GPU support, autograd, support for non-symmetric matrices and complex numbers, randomized algorithms, and composition rules. PyLops only supports propagating gradients through vectors but not through the linear operator's parameters. Moreover, PyLops has limited GPU support through CuPy, but lacks support for PyTorch, JAX, or TensorFlow, which are necessary for modern machine learning applications.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">CoLA: Compositional Linear Algebra</head><p>We now discuss all the components that make up CoLA. In Section 3.1 we first describe the core MVM-based LinearOperator abstraction, and in Section 3.2 we discuss our core compositional framework for identifying and automatically exploiting structure for fast computations. In Section 3.3, we highlight how CoLA exploits structure frequently encountered in ML applications beyond well-known analytic formulae (e.g. the Woodbury identity). 
Finally, in Section 3.4 we present CoLA's machine-learning-specific features, like automatic differentiation, support for low-precision computation, and hardware acceleration.</p></div>
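<p>To make the multiple-dispatch pattern of Section 2 concrete, the sketch below implements a tiny dispatch table for solve in plain Python with NumPy. All names here (LinearOperator, Diagonal, register_solve) are illustrative stand-ins, not the CoLA API.</p>

```python
import numpy as np

# Minimal sketch of type-based dispatch for linear solves.
# Class and function names are illustrative, not the CoLA API.

class LinearOperator:
    """Base case: wraps a dense array; MVM is an ordinary matrix-vector product."""
    def __init__(self, A):
        self.A = np.asarray(A, dtype=float)
    def mvm(self, v):
        return self.A @ v

class Diagonal(LinearOperator):
    """Diagonal operator stored as a 1-D array; O(N) MVM, no dense storage."""
    def __init__(self, d):
        self.d = np.asarray(d, dtype=float)
    def mvm(self, v):
        return self.d * v

_SOLVE_RULES = {}  # dispatch table: operator type -> specialized solve routine

def register_solve(op_type):
    def decorator(fn):
        _SOLVE_RULES[op_type] = fn
        return fn
    return decorator

@register_solve(Diagonal)
def _solve_diagonal(A, b):
    return b / A.d  # O(N) special case: invert the diagonal elementwise

def solve(A, b):
    # Walk the MRO so subclasses pick up the most specific rule;
    # fall back to a dense base case when no rule matches.
    for cls in type(A).__mro__:
        if cls in _SOLVE_RULES:
            return _SOLVE_RULES[cls](A, b)
    return np.linalg.solve(A.A, b)

x = solve(Diagonal(np.array([2.0, 4.0, 8.0])), np.ones(3))  # hits the O(N) rule
```

<p>Under this pattern, adding a new operator type or a new specialized rule is a single registration, with no changes to existing classes; true multiple dispatch (as in Julia) additionally dispatches on several arguments and on parametric types.</p>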
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Deriving Linear Algebraic Operations Through Fast MVMs</head><p>Borrowing from existing frameworks like SciPy Sparse, the central object of our framework is the LinearOperator: a linear function on a finite-dimensional vector space, defined by how it acts on vectors via a matrix-vector multiply MVM_A : v → Av. While this function has a matrix representation for a given basis, we do not need to store or compute this matrix to perform an MVM. Avoiding the dense representation of the operator saves memory and often compute.</p><p>Some basic examples of LinearOperators are: unstructured Dense matrices, which are represented by a 2-dimensional array and use the standard MVM routine [Av]_i = Σ_{j=1}^N A_{ij} v_j; Sparse matrices, which can be represented by key/value arrays of the nonzero entries with the standard CSR-sparse MVM routine; Diagonal matrices, which are represented by a 1-dimensional array of the diagonal entries and where the MVM is given by [Diag(d)v]_i = d_i v_i; Convolution operators, which are represented by a convolutional filter array and where the MVM is given by Conv(a)v = a ∗ v; or JVP operators (the Jacobian represented implicitly through an autograd Jacobian-vector product), which are represented by a function f and an input x and where the MVM is given by Jacobian(f, x)v = JVP(f, x, v). In CoLA, each of these examples is a sub-class of the LinearOperator superclass.</p><p>Through the LinearOperator's MVM, it is possible to derive other linear algebraic operations. As a simple example, we obtain the dense representation of the LinearOperator by calling MVM(e_1), ..., MVM(e_N) on each unit vector e_i. 
We now describe several key operations supported by our framework, some well-established and others novel to CoLA.</p><p>Solves, eigenvalue problems, determinants, and functions of matrices As a base case for larger matrices, CoLA uses Krylov subspace methods (Section 2, Appendix C) for many matrix operations. Specifically, we use GMRES <ref type="bibr">[43]</ref> for matrix solves and Arnoldi <ref type="bibr">[3]</ref> for finding eigenvalues, determinants, and functions of matrices. Both of these algorithms can be applied to any non-symmetric and/or complex linear operator. When LinearOperators are annotated with additional structure (e.g. self-adjoint, positive semi-definite) we use more efficient Krylov algorithms like MINRES, conjugate gradients, and Lanczos (see Section 3.2). As stated in Section 2, these algorithms are matrix-free (and thus memory efficient), amenable to GPU acceleration, and asymptotically faster than dense methods. See Section C.2 for a full list of Krylov methods used by CoLA.</p><p>Table <ref type="table">3</ref>: CoLA selects the best rates for each operation and structure combination. The table lists asymptotic runtimes resulting from dispatch rules on compositional linear operators in our framework, for matrix-vector multiplies, linear solves, and eigendecompositions. Here ϵ denotes the error tolerance. For a given operator of size N × N, we denote by τ its MVM cost, s its linear solve cost, E its eigendecomposition cost, and κ its condition number. A subscript indicates to which matrix the quantity belongs.</p><p>Transposes and complex conjugations In alternative frameworks like SciPy Sparse, a user must manually define a transposed MVM v → A⊺v for linear operator objects. In contrast, CoLA uses a novel autograd trick to derive the transpose from the core MVM routine. We note that A⊺v is the vector-Jacobian product (VJP) of the vector v and the Jacobian ∂(Aw)/∂w. 
Thus, the function transpose(A) returns a LinearOperator object that uses VJP(MVM_A, 0, v) as its MVM. We extend this idea to Hermitian conjugates, using the fact that the Hermitian conjugate is the complex conjugate of the transpose, so that A*v = conj(A⊺ conj(v)).</p><p>Other operations In Section 3.3 we outline how to stochastically compute diagonals and traces of operators with MVMs, and in Section 3.4 we discuss a novel approach for computing memory-efficient derivatives of iterative methods through MVMs.</p></div>
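<p>As a minimal illustration of deriving operations from the MVM alone, the NumPy sketch below (hypothetical names, not CoLA's implementation) materializes the dense representation of an operator by probing it with unit vectors, exactly as described above.</p>

```python
import numpy as np

def densify(mvm, n):
    """Recover the dense matrix of a linear map from its MVM alone,
    by applying it to each unit vector e_1, ..., e_n (one column per probe)."""
    eye = np.eye(n)
    return np.stack([mvm(eye[:, i]) for i in range(n)], axis=1)

# A diagonal operator kept as a 1-D array: the MVM costs O(N) and the
# dense matrix is never stored unless explicitly requested.
d = np.array([1.0, 2.0, 3.0])
diag_mvm = lambda v: d * v

D = densify(diag_mvm, 3)  # materializes np.diag(d) column by column
```

<p>The same probing idea yields any column or sub-block on demand; the point of the abstraction is that such dense materialization is a last resort, used only when no cheaper derived routine applies.</p>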
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Implementation</head><p>CoLA implements all operations (solve, eig, logdet, transpose, conjugate, etc.) following a functional programming paradigm rather than as methods of the LinearOperator object. This is not a minor implementation detail: as we demonstrate in the next section, it is crucial for the efficiency and compositional power of our framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Beyond Fast MVMs: Exploiting Explicit Structure Using Composition Rules</head><p>While the GMRES algorithm can compute solves more efficiently than corresponding dense methods such as the Cholesky decomposition, especially with GPU parallelization and preconditioning, it is not the most efficient algorithm for many LinearOperators. For example, if A = Diag(a), then we know that A⁻¹ = Diag(a⁻¹) without needing to solve a linear system. Similarly, systems with triangular matrices can be solved efficiently through back substitution, and solves with circulant matrices can be computed efficiently in the Fourier domain, since Conv(a) = F⁻¹ Diag(Fa) F (where F is the Fourier transform linear operator). We offer more examples in Table <ref type="table">1</ref> (left).</p><p>As described in Section 2, we use multiple dispatch to implement these special-case methods. For example, we implement the solve[Diagonal], solve[Triangular], and solve[Circulant] dispatch rules using the efficient routines described above. If a specific LinearOperator subclass does not have a solve dispatch rule, we default to the base-case solve rule using GMRES. This behavior also applies to other operations, such as logdet, eig, diagonal, etc.</p><p>The dispatch framework makes it easy to implement one-off rules for the basic LinearOperator sub-classes described in Section 3.1. However, its true power lies in the use of compositional rules, which we describe below.</p><p>Compositional Linear Operators In addition to the base LinearOperator sub-classes (e.g. Sparse, Diagonal, Convolution), our framework provides mechanisms to compose multiple LinearOperators together. 
Some frequently used compositional structures are Sum, Product, Kronecker product, block diagonal, and concatenation operators.</p><p>Each of these compositional LinearOperators is defined by (1) the base LinearOperator objects to be composed, and (2) a corresponding MVM routine, which is typically written in terms of the MVMs of the composed LinearOperators. For example, MVM_Sum = v → Σ_i MVM_i(v), where MVM_i are the MVM routines for the component LinearOperators.</p><p>Dispatch rules for compositional operators are especially powerful. For example, consider Kronecker products, where we have the rule (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹. Though simple, this rule yields highly efficient routines for numerous structures. For example, suppose we want to solve (A ⊗ B ⊗ C)x = b where A is dense, B is diagonal, and C is triangular. Under this rule, the solve is split over the factors, using GMRES for A, diagonal inversion for B, and forward substitution for C. This breakdown is much more efficient than the base case (GMRES with MVM_Kron).</p><p>When exploited to their full potential, these composition rules provide both asymptotic speedups (shown in Table <ref type="table">3</ref>) as well as runtime improvements on real problems across practical sizes (shown in Figure <ref type="figure">1</ref>). Splitting up the problem with composition rules yields speedups in surprising ways, even in the fully iterative case. To illustrate, consider one large CG solve with the matrix power B = Aⁿ; in general, the runtime is upper-bounded by O(nτ√(κⁿ) log(1/ϵ)), where τ is the time for an MVM with A, κ is the condition number of A, and ϵ is the desired error tolerance. However, splitting the product via a composition rule into a sequence of solves has a much smaller upper bound of O(nτ√κ log(n/ϵ)). 
We observe this speedup in solving the Bi-Poisson PDE shown in Figure <ref type="figure">1(b)</ref>.</p><p>Additional flexibility and efficiency via parametric typing A crucial advantage of multiple dispatch is the ability to write simple special rules for compositions of specific operators. While a general-purpose solve[Sum] method (SVRG; see the next section) yields efficiency gains over the GMRES base case, it is not the most efficient algorithm when the Sum operator combines a LowRank and a Diagonal operator. In this case, the Woodbury formula would be far more efficient. To account for this, CoLA allows for dispatch rules on parametric types; that is, the user defines a solve[Sum[LowRank, Diagonal]] dispatch rule that is used if the Sum operator is specifically combining a LowRank and a Diagonal linear operator. Coding these rules without multiple dispatch would require defining specialty sub-classes like LowRankPlusDiagonal on top of the LinearOperator object, increasing complexity and hampering extensibility.</p><p>Decoration/annotation operators Finally, we include several decorator types that annotate existing LinearOperators with additional structure. For example, we define SelfAdjoint (Hermitian/symmetric), Unitary (orthonormal), and PSD (positive semi-definite) operators, each of which wraps an existing LinearOperator object. None of these decorators defines a specialty MVM; however, these decorators can be used to define dispatch rules for increased efficiency.</p><p>Taken together Our framework defines 16 base linear operators, 5 compositional linear operators, 6 decoration linear operators, and roughly 70 specialty dispatch rules for solve, eig, and other operations. (See Table <ref type="table">1</ref> for a short summary and Appendix A for a complete list of rules.) 
We note that these numbers are relatively small compared with existing solutions, yet, as we demonstrate in Section 4, these operators and dispatch rules are sufficient to match or exceed the performance of specialty implementations in numerous applications. Finally, we note that CoLA is extensible by users in both directions: a user can either (1) define a new LinearOperator and special dispatch rules for it, or (2) define a new algebraic operation for all LinearOperators; crucially, neither requires changes to the original implementation.</p></div>
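<p>As an illustration of the Kronecker composition rule, the identity (A ⊗ B)⁻¹ = A⁻¹ ⊗ B⁻¹ reduces one large solve to two small ones. The NumPy sketch below is illustrative rather than CoLA's implementation; it relies on the row-major identity kron(A, B) vec(X) = vec(A X Bᵀ).</p>

```python
import numpy as np

def kron_solve(A, B, b):
    """Solve kron(A, B) x = b using only solves with the small factors.
    With row-major vec, np.kron(A, B) @ X.ravel() == (A @ X @ B.T).ravel(),
    so x = vec(A^{-1} Y B^{-T}) where Y is b reshaped to (n, q)."""
    n, q = A.shape[0], B.shape[0]
    Y = b.reshape(n, q)
    X = np.linalg.solve(A, Y)        # left-multiply by A^{-1}
    X = np.linalg.solve(B, X.T).T    # right-multiply by B^{-T}
    return X.ravel()

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)  # shifted to be well-conditioned
B = rng.normal(size=(2, 2)) + 2.0 * np.eye(2)
b = rng.normal(size=6)
x = kron_solve(A, B, b)  # two tiny solves instead of one 6x6 solve
```

<p>For n × n and q × q factors this costs two small solves plus reshapes instead of one (nq) × (nq) solve, mirroring the asymptotic gains in Table 3.</p>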
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Exploiting Implicit Structure in Machine Learning Applications</head><p>So far we have discussed explicit matrix structures and composition rules for which there are simple analytic formulas easily found in well-known references [e.g. <ref type="bibr">21,</ref><ref type="bibr">44,</ref><ref type="bibr">48]</ref>. However, current large systems, especially those found in machine learning, often have implicit structure and special properties that yield additional efficiencies. In particular, many ML problems give rise to linear operators composed of large summations which are amenable to stochastic algorithms. Below we outline two impactful general-purpose algorithms used in CoLA to exploit this implicit structure.</p><p>Accelerating iterative algorithms on large sums with SVRG Stochastic gradient descent (SGD) is widely used for optimizing problems with very large or infinite sums to avoid having to traverse the full dataset per iteration. Like Monte Carlo estimation, SGD is very quick to converge to a few decimal places but very slow to converge to higher accuracies. When an exact solution is required on a problem with a finite sum, the stochastic variance reduced gradient (SVRG) algorithm <ref type="bibr">[23]</ref> is much more compelling, converging on strongly convex problems (and many others) at an exponential rate, with runtime O((1 + κ/M) log(1/ϵ)), where κ is the condition number and ϵ is the desired accuracy.</p><p>Figure 1 (b, c): (b) Solving the Bi-Poisson PDE, the Laplacian Δ composed with itself, on grids of sizes up to N = 1000². We use CG with a multi-grid αSA preconditioner <ref type="bibr">[9]</ref> to solve the linear system required in this application. (c) Finding the nullspace of an equivariant MLP, a linear operator having block diagonal structure. Here, NullF refers to the iterative nullspace finder algorithm detailed in <ref type="bibr">[16]</ref>. We ran a 5-node symmetric operator S(5) as done in <ref type="bibr">[16]</ref> with MLP sizes up to 15K. See Appendix D for further details.</p><p>Figure 2: (a) Eigenvalue convergence criterion against number of MVMs for computing the first principal component on Buzz (N = 430K, D = 77) using VR-PCA <ref type="bibr">[45]</ref>. (b) Solve relative residual against number of MVMs for a random Fourier features (RFF) approximation <ref type="bibr">[40]</ref> to an RBF kernel with J = 1K features on Elevators (N = 12.5K, D = 18). (c) Solve relative residual against number of MVMs when applying Neural-IVP <ref type="bibr">[17]</ref> to the 2-dimensional wave equation as done in <ref type="bibr">[17]</ref>. See Appendix D for further details.</p><p>When the condition number and the number of elements in the sum are large, SVRG becomes a desirable alternative even to classical deterministic iterative algorithms such as CG or Lanczos, whose runtimes are bounded by O(√κ log(1/ϵ)). Figure <ref type="figure">2</ref> shows the impact of using SVRG to exploit the structure of different linear operators that are composed of large sums.</p><p>Stochastic diagonal and trace estimation with reduced variance Another case where we exploit implicit structure is when estimating the trace or the diagonal of a linear operator. While collecting the diagonal of a dense matrix is a trivial task, it is costly for an arbitrary LinearOperator defined only through its MVM: it requires computing Diag(A) = Σ_{i=1}^N e_i ⊙ Ae_i, where ⊙ is the Hadamard (elementwise) product. If we need merely an approximation or an unbiased estimate of the diagonal (or of its sum, the trace), we can instead perform stochastic diagonal estimation <ref type="bibr">[22]</ref>: Diag(A) ≈ (1/n) Σ_{j=1}^n z_j ⊙ Az_j, where the z_j ∈ R^N are any randomly sampled probe vectors with covariance I. 
We extend this randomized estimator to use randomization both in the probes and in random draws from the sum when A = Σ_{i=1}^M A_i.</p><p>Figure 3: For sufficiently large problems, switching from dense to iterative algorithms provides consistent runtime reductions, especially on a GPU, where matrix multiplies can be effectively parallelized. We plot the runtime of a linear algebra operation using CoLA or PyTorch on different hardware (CPU and GPU) divided by the runtime of PyTorch on CPU. For the linear solves, we use the Matrix Market sparse operator Trefethen; for the eigenvalue estimation, the Matrix Market sparse operator mhd4800b; and for the log determinant computation, the Matrix Market sparse operator bcsstk18. We provide additional details in Section D.4.</p><p>In Section B.1 we derive the variance of this estimator and show that it converges faster than the base Hutchinson estimator when applied to Sum structures. We validate this analysis empirically in Figure <ref type="figure">5</ref>.</p></div>
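<p>A minimal NumPy sketch of the baseline stochastic diagonal estimator described above (function names are illustrative; the variance-reduced variant for Sum structures additionally subsamples the terms of the sum, see Section B.1):</p>

```python
import numpy as np

def diag_estimate(mvm, n_dim, n_probes, rng):
    """Hutchinson-style stochastic diagonal estimate:
    (1/n) * sum_j z_j * (A z_j) with i.i.d. Rademacher probes
    (entries +/- 1, so E[z z^T] = I as required)."""
    est = np.zeros(n_dim)
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n_dim)
        est += z * mvm(z)
    return est / n_probes

A = np.array([[3.0, 1.0],
              [1.0, 5.0]])
rng = np.random.default_rng(0)
d_hat = diag_estimate(lambda v: A @ v, 2, 4000, rng)  # approximates [3., 5.]
```

<p>A convenient sanity check: for a diagonal operator, Rademacher probes make the estimator exact for any number of probes, since z ⊙ z = 1 entrywise.</p>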
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Automatic Differentiation and Machine Learning Readiness</head><p>Memory-efficient auto-differentiation In ML applications, we want to backpropagate through operations like A⁻¹, Eigs(A), Tr(A), exp(A), and log det(A). To achieve this, in CoLA we define a novel concept of the gradient of a LinearOperator, which we detail in Appendix B. For routines like GMRES, SVRG, and Arnoldi, we utilize a custom backward pass that does not require backpropagating through the iterations of these algorithms. This custom backward pass results in substantial memory savings (the computation graph does not have to store the intermediate iterations of these algorithms), which we demonstrate in Appendix B (Figure <ref type="figure">6</ref>).</p><p>Low-precision linear algebra By default, all routines in CoLA support the standard float32 and float64 precisions. Moreover, many CoLA routines also support float16 and bfloat16 half precision, using algorithmic modifications for increased stability. In particular, we use variants of the GMRES, Arnoldi, and Lanczos iterations that are less susceptible to the instabilities that arise through orthogonalization <ref type="bibr">[44,</ref><ref type="bibr">Ch. 6</ref>] and we use the half-precision variant of conjugate gradients introduced by Maddox et al. <ref type="bibr">[32]</ref>. See Appendix C for further details.</p><p>Multi-framework support and GPU/TPU acceleration CoLA is compatible with both PyTorch and JAX. This compatibility not only makes our framework plug-and-play with existing implemented models, but also adds GPU/TPU support, differentiating it from existing solutions (see Table <ref type="table">2</ref>). CoLA's iterative algorithms are the class of linear algebra algorithms that benefit most from hardware accelerators, as the main bottleneck of these algorithms is the MVMs executed at each iteration, which can easily be parallelized on hardware such as GPUs. 
Figure <ref type="figure">3</ref> empirically shows the additional impact of hardware accelerators across different datasets and linear algebra operations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Applications</head><p>We now apply CoLA to an extensive list of applications, showing the impact, value, and broad applicability of our numerical linear algebra framework, as illustrated in Figure <ref type="figure">4</ref>. This list of applications encompasses PCA, linear regression, Gaussian processes, spectral clustering, and partial differential equations such as the Schr&#246;dinger equation or minimal surface problems. In contrast to Section 3 (Figure <ref type="figure">1</ref> &amp; Figure <ref type="figure">2</ref>), the applications presented here have a basic structure (sparse, vector-product, etc.) but not a compositional structure (Kronecker, product, block diagonal, etc.). We choose these applications due to their popularity and heterogeneity (the linear operators have different properties: self-adjoint, positive definite, symmetric, and non-symmetric), and to show that CoLA performs well in any application.</p><p>Figure 4: CoLA is easily applied to numerous applications with competitive performance. Here sk: sklearn, GP: GPyTorch, and the tuple (N, D) denotes dataset size and dimensionality. (a): Runtime for PCA decomposition on Buzz (437.4K, 77). (b): Linear regression runtime on Song (386.5K, 90), where we run CoLA on both GPU and CPU. (c): Training efficiency (measured in epochs) of exact GP inference on Elevators (14K, 18) and Kin (20K, 8) on GPU. (d): Spectral clustering runtime on a citation graph (cit-HepPh) consisting of 34.5K nodes and 842K edges. sk(L) denotes sklearn's implicitly restarted Lanczos implementation and sk(B) denotes sklearn's LOBPCG with an algebraic multigrid preconditioner (PyAMG) [5, 26]. CoLA(L) denotes our Lanczos implementation and CoLA(B) our LOBPCG implementation. (e): Runtimes for finding the smallest eigenfunctions of a Schr&#246;dinger equation on an expanding finite difference grid. (f): Runtimes for solving the minimal surface equation via root finding on expanding grids. Here SciPy utilizes the ARPACK package, a highly optimized Fortran implementation of the Arnoldi iteration, while SciPy (JAX) (the SciPy version integrated with JAX) and CoLA utilize Python Arnoldi implementations. Appendix D expands on the experimental details.</p><p>We compare against several well-known libraries, sometimes providing runtime improvements and other times performing on par. This is remarkable, as our numerical framework neither specializes in any of these applications (like GPyTorch) nor relies on Fortran implementations of high-level algorithms (like sklearn or SciPy). Below we describe each of the applications found in Figure <ref type="figure">4</ref>.</p><p>Principal Component Analysis PCA is a classical ML technique that finds the directions in the data that capture the most variance. PCA can be performed by computing the right singular vectors of X &#8712; R^{N &#215; D}. 
When the number of data points N is very large, stochastic methods like SVRG in VR-PCA <ref type="bibr">[45]</ref> can accelerate finding the eigenvectors over SVD or Lanczos, as shown in Figure <ref type="figure">2(a)</ref>.</p><p>Spectral Clustering Spectral clustering <ref type="bibr">[36]</ref> finds clusters of nodes in a graph by analyzing the graph Laplacian L = D - W, where D denotes a diagonal matrix containing the degrees of the nodes and W the weights on the edges between nodes. This problem requires finding the smallest k eigenvectors of L. We run this experiment on the high energy physics arXiv paper citation graph (cit-HepPh).</p><p>Gaussian processes GPs are flexible nonparametric probabilistic models where inductive biases are expressed through a covariance (kernel) function. At its core, training a GP involves computing and taking gradients of the log determinant log |K| of a kernel matrix K and of a quadratic term y^T K^{-1} y (where y is the vector of observations).</p></div>
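To make the two core GP quantities concrete, the following NumPy sketch computes log |K| and y^T K^{-1} y via a Cholesky factorization for a toy RBF kernel. This is an illustrative stand-in, not CoLA's or GPyTorch's implementation, and the kernel hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy RBF kernel matrix plus likelihood noise, standing in for a GP covariance K.
X = rng.standard_normal((100, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists) + 0.1 * np.eye(100)
y = rng.standard_normal(100)

# The two quantities at the core of GP training.
L = np.linalg.cholesky(K)
logdet = 2.0 * np.log(np.diag(L)).sum()  # log |K| from the Cholesky factor
quad = y @ np.linalg.solve(K, y)         # y^T K^{-1} y

assert np.isclose(logdet, np.linalg.slogdet(K)[1])
assert quad > 0.0  # K is positive definite
```

At scale, CoLA replaces the dense Cholesky with iterative routines, but the quantities being computed are the same.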
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Schr&#246;dinger Equation</head><p>In this problem we characterize the spectrum of an atom or molecule by finding the eigenspectrum of a PDE operator in a Schr&#246;dinger equation H&#968; = E&#968;. After discretizing &#968; on a grid, we compute the smallest eigenvalues and eigenvectors of the operator H, which for this experiment is non-symmetric as we perform a compactifying transform.</p><p>Minimal Surface Here we solve a set of nonlinear PDEs with the objective of finding the surface that locally minimizes its area under given boundary constraints. When applied to the graph of a function, the PDE can be expressed as f(z) = (1 + z_x^2)z_{yy} - 2 z_x z_y z_{xy} + (1 + z_y^2)z_{xx} = 0 and solved by root finding on a discrete grid. Applying Newton-Raphson, we iteratively solve the non-symmetric linear system z &#8592; z - J^{-1} f(z), where J is the Jacobian of the PDE operator.</p></div>
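The Newton-Raphson loop above can be sketched on a toy two-dimensional nonlinear system standing in for the discretized minimal surface operator; the system, its Jacobian, and the starting point are illustrative assumptions, not the paper's actual PDE.

```python
import numpy as np

# Newton-Raphson on a toy system F(z) = 0: z <- z - J^{-1} F(z).
def F(z):
    x, y = z
    return np.array([x**2 + y**2 - 4.0, x - y])

def J(z):  # analytic Jacobian of F
    x, y = z
    return np.array([[2.0 * x, 2.0 * y], [1.0, -1.0]])

z = np.array([1.0, 0.5])
for _ in range(20):
    # At CoLA scale this solve would itself be iterative (e.g. GMRES).
    z = z - np.linalg.solve(J(z), F(z))
    if np.linalg.norm(F(z)) < 1e-12:
        break

assert np.allclose(z, [np.sqrt(2.0), np.sqrt(2.0)])
```

The inner linear solve is exactly where the non-symmetric system of the minimal surface experiment arises.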
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Bi-Poisson Equation</head><p>The Bi-Poisson equation &#8710;^2 u = &#961; is a linear boundary value PDE relevant in continuum mechanics, where &#8710; is the Laplacian. When discretized using a grid, the result is a large symmetric system to be solved. We show speedups from the product structure in Figure <ref type="figure">1(b)</ref>.</p><p>Neural PDEs Neural networks show promise for solving high dimensional PDEs. One approach for initial value problems requires advancing an ODE on the neural network parameters &#952;, where d&#952;/dt = M(&#952;)^{-1} F(&#952;), where M is an operator defined from the Jacobian of the neural network which decomposes as the sum over data points M = (1/N) \sum_i M_i, and where F is determined by the governing dynamics of the PDE <ref type="bibr">[15,</ref><ref type="bibr">17]</ref>. By leveraging the sum structure with SVRG, we provide further speedups over Finzi et al. <ref type="bibr">[17]</ref> as shown in Figure <ref type="figure">2(c)</ref>.</p><p>Equivariant Neural Network Construction As shown in <ref type="bibr">[16]</ref>, constructing the equivariant layers of a neural network for a given data type and symmetry group is equivalent to finding the nullspace of a large linear equivariance constraint Cv = 0, where the constraint matrix C is highly structured, being a block diagonal matrix of concatenated Kronecker products and Kronecker sums of sparse matrices. In Figure <ref type="figure">1</ref>(c) we show the empirical benefits of exploiting this structure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>We have presented the CoLA framework for structure-aware linear algebraic operations in machine learning applications and beyond. Building on top of dense and iterative algorithms, we leverage explicit composition rules via multiple dispatch to achieve algorithmic speedups across a wide variety of practical applications. Algorithms like SVRG and a novel variation of Hutchinson's diagonal estimator exploit implicit structure common to large-scale machine learning problems. Finally, CoLA supports many features necessary for machine learning research and development, including memory efficient automatic differentiation, multi-framework support of both JAX and PyTorch, hardware acceleration, and lower precision computation.</p><p>While structure exploiting methods are used across different application domains, domain knowledge often does not cross between communities. We hope that our framework brings these disparate communities and ideas together, enabling rapid development and reducing the burden of deploying fast methods for linear algebra at scale. Much like how automatic differentiation simplified and accelerated the training of machine learning models, with custom autograd functions the exception rather than the rule, CoLA has the potential to streamline scalable linear algebra.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix Outline</head><p>This Appendix is organized as follows:</p><p>&#8226; In Appendix A we describe the various dispatch rules, including the base rules, the composition rules, and rules derived from other rules. &#8226; In Appendix B we provide an extended discussion of several noteworthy features of CoLA, such as doubly stochastic estimators and the memory-efficient autograd implementation. &#8226; In Appendix C we include pseudo-code for several of the iterative methods incorporated in CoLA and discuss modifications to improve lower precision performance. &#8226; In Appendix D we expand on the details of the experiments in the main text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Dispatch Rules</head><p>We now present the linear algebra identities that we use to exploit structure in CoLA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 Core Functions</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1.1 Inverses</head><p>We incorporate several identities for the compositional operators: product, Kronecker product, block diagonal, and sum. For product we have (AB)^{-1} = B^{-1}A^{-1} and for Kronecker product we have (A &#8855; B)^{-1} = A^{-1} &#8855; B^{-1}.</p><p>In terms of block compositions, the inverse of a block diagonal operator is the block diagonal of the inverses: BlockDiag(A, B)^{-1} = BlockDiag(A^{-1}, B^{-1}).</p><p>Finally, for sum we have the Woodbury identity and its variants. Namely, for Woodbury we have (A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}, the Kailath variant where (A + BC)^{-1} = A^{-1} - A^{-1}B(I + CA^{-1}B)^{-1}CA^{-1}, and the rank one update via the Sherman-Morrison formula (A + uv^*)^{-1} = A^{-1} - (A^{-1}uv^*A^{-1})/(1 + v^*A^{-1}u).</p><p>Besides the compositional operators, we have rules for some special operators. For example, for A = Diag(a) we have A^{-1} = Diag(1/a).</p></div>
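As a sanity check, the Woodbury and Sherman-Morrison identities can be verified numerically. The NumPy sketch below is illustrative and independent of CoLA's dispatch implementation; the matrix sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Well-conditioned base matrix plus a low-rank update A + U C V.
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))

# Woodbury: (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
Ainv = np.linalg.inv(A)
inner = np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U)
woodbury = Ainv - Ainv @ U @ inner @ V @ Ainv
assert np.allclose(woodbury, np.linalg.inv(A + U @ C @ V))

# Sherman-Morrison rank-one special case: (A + u v^T)^{-1}
u = rng.standard_normal(n)
v = rng.standard_normal(n)
sm = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)
assert np.allclose(sm, np.linalg.inv(A + np.outer(u, v)))
```

The computational payoff comes when k is much smaller than n: the only inversion of size n is of A itself, which may be cheap (e.g. diagonal) in a compositional operator.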
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1.2 Eigendecomposition</head><p>We now assume that the matrices in this section are diagonalizable. That is, A = V_A &#923;_A V_A^{-1}.</p><p>In terms of the compositional operators, there is no general rule for product or sum. However, for the Kronecker product we have Eigs(A &#8855; B) = &#923;_A &#8855; &#923;_B, V_A &#8855; V_B, and for the Kronecker sum we have Eigs(A &#8853; B) = &#923;_A &#8855; I + I &#8855; &#923;_B, V_A &#8855; V_B.</p><p>Finally, for block diagonal the eigenvalues are the concatenation of the eigenvalues of each block and the eigenvectors are the block diagonal of the eigenvectors: Eigs(BlockDiag(A, B)) = Concat(&#923;_A, &#923;_B), BlockDiag(V_A, V_B).</p></div>
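The Kronecker eigendecomposition rule can be checked numerically on small symmetric matrices; this NumPy sketch is illustrative and is not CoLA's dispatch implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Symmetric A and B so the eigendecompositions are real and well behaved.
A = rng.standard_normal((3, 3)); A = A + A.T
B = rng.standard_normal((4, 4)); B = B + B.T

lam_A, V_A = np.linalg.eigh(A)
lam_B, V_B = np.linalg.eigh(B)

# Rule: Eigs(A ⊗ B) = Λ_A ⊗ Λ_B, V_A ⊗ V_B.
vals = np.kron(lam_A, lam_B)
vecs = np.kron(V_A, V_B)
K = np.kron(A, B)

# K vecs = vecs diag(vals): each Kronecker-product column is an eigenvector.
assert np.allclose(K @ vecs, vecs * vals)
```

The rule reduces a 12x12 eigenproblem to a 3x3 and a 4x4 one; for large Kronecker factors this is the source of the asymptotic savings.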
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1.3 Diagonal</head><p>As a base case, if we need to compute Diag(A) for a general matrix A, we may compute each diagonal element as e_i^T A e_i. Additionally, if A is large enough we switch to randomized estimation Diag(A) &#8776; (Z &#8857; AZ)1/N with Z &#8764; N(0, 1)^{d &#215; N}, where N is the number of samples used to approximate the diagonal. In terms of compositional operators, we have that for sum Diag(A + B) = Diag(A) + Diag(B).</p></div>
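The randomized diagonal estimator above only needs matrix-vector multiplies, which is why it suits matrix-free operators. A minimal NumPy sketch (illustrative, not CoLA's implementation; probe count chosen arbitrarily):

```python
import numpy as np

def diag_estimate(matvec, d, num_probes=2000, rng=None):
    """Hutchinson-style estimator: Diag(A) ≈ mean_j z_j ⊙ (A z_j)."""
    rng = rng or np.random.default_rng(0)
    Z = rng.standard_normal((d, num_probes))
    return (Z * matvec(Z)).mean(axis=1)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
# Only a matvec is needed, so A never has to be materialized in general.
est = diag_estimate(lambda X: A @ X, 50, num_probes=20000, rng=rng)

# The estimate converges to the true diagonal at an O(1/sqrt(num_probes)) rate.
assert np.max(np.abs(est - np.diag(A))) < 0.5
```

The `e_i^T A e_i` base case is exact but needs d matvecs; the randomized version trades exactness for a probe count independent of d.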
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1.4 Transpose / Adjoint</head><p>As explained in Section 3.1, as a base case we have an automatic procedure to compute the transpose or adjoint of any operator A via autodiff. However, we also incorporate the following rules. For sum we have (A + B)^T = A^T + B^T.</p><p>In terms of block composition, we transpose each block and swap the off-diagonal blocks; in particular, BlockDiag(A, B)^T = BlockDiag(A^T, B^T).</p><p>Finally, for the annotated operators we have rules such as A^* = A when A is annotated as self-adjoint.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1.5 Pseudo-inverse</head><p>As a base case, if we need to compute A^+, we may use SVD(A) = U, &#931;, V and therefore set A^+ = V&#931;^+U^*, where &#931;^+ inverts the nonzero diagonal scalars. If the size of A is too large, then we may use randomized SVD. Yet, it is uncommon to simply want A^+; usually we want to solve a least-squares problem, and therefore we can use solvers that are not as expensive to run as SVD. For the compositional operators we have the following identities. For product we have (AB)^+ = (A^+AB)^+(ABB^+)^+ and for Kronecker product we have (A &#8855; B)^+ = A^+ &#8855; B^+.</p><p>For block diagonal we have BlockDiag(A, B)^+ = BlockDiag(A^+, B^+).</p><p>Finally, we have some identities that are mathematically trivial but that are necessary when recursively exploiting structure, as they save computation. For example, if Q is unitary we know that Q^+ = Q^*, and similarly we can read off A^+ from the eigendecomposition when A is annotated as symmetric and PSD.</p></div>
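The SVD-based base case can be checked against NumPy's reference pseudo-inverse; the sketch below is illustrative (arbitrary sizes, explicit rank deficiency) rather than CoLA's routine.

```python
import numpy as np

rng = np.random.default_rng(2)
# A rank-deficient 6x8 matrix (rank at most 4).
A = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 8))

U, s, Vh = np.linalg.svd(A, full_matrices=False)
cutoff = 1e-10 * s.max()
# Invert only the nonzero singular values.
s_pinv = np.array([1.0 / x if x > cutoff else 0.0 for x in s])
A_pinv = Vh.T @ np.diag(s_pinv) @ U.T  # A^+ = V Σ^+ U^*

assert np.allclose(A_pinv, np.linalg.pinv(A, rcond=1e-10))
# Moore-Penrose property: A A^+ A = A.
assert np.allclose(A @ A_pinv @ A, A)
```

Note the ordering: A = UΣV^* gives A^+ = VΣ^+U^*, with the factors swapped relative to the forward decomposition.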
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Derived Functions</head><p>Interestingly, the previous core functions allow us to derive multiple additional rules.</p><p>To illustrate, for a function f applied to a diagonalizable matrix we have f(A) = V_A f(&#923;_A) V_A^{-1}, and if A is both symmetric and PSD then f(A) = V_A f(&#923;_A) V_A^T, where in both cases we used Eigs(A) = &#923;_A, V_A. Some example functions for PSD matrices are Sqrt(A) = V_A &#923;_A^{1/2} V_A^T and Log(A) = V_A log(&#923;_A) V_A^T.</p><p>These rules also allow us to define LogDet(A) = Tr(Log(A)).</p></div>
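The derived rules f(A) = V_A f(Λ_A) V_A^T and LogDet(A) = Tr(Log(A)) can be verified on a small symmetric PD matrix; this NumPy sketch is illustrative, not CoLA's dispatch code.

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5.0 * np.eye(5)  # symmetric positive definite

lam, V = np.linalg.eigh(A)

# f(A) = V f(Λ) V^T for symmetric A; here f = log.
logA = V @ np.diag(np.log(lam)) @ V.T
# LogDet(A) = Tr(Log(A)) agrees with the standard slogdet.
assert np.isclose(np.trace(logA), np.linalg.slogdet(A)[1])

# f = sqrt: the matrix square root squared recovers A.
sqrtA = V @ np.diag(np.sqrt(lam)) @ V.T
assert np.allclose(sqrtA @ sqrtA, A)
```

In the iterative setting the same identities are applied through Lanczos-based approximations rather than a dense `eigh`.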
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Other matrix identities</head><p>We emphasize that there are a myriad more matrix identities that we intentionally do not include, such as Tr(A + B) = Tr(A) + Tr(B) or Tr(AB) = Tr(BA) when A and B are square. These additional cases are not part of our dispatch rules, as either they are automatically computed from other rules (as in the first example) or they do not yield any computational savings (as in the second example).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Features in CoLA</head><p>B.1 Doubly stochastic diagonal and trace estimation</p><p>Singly Stochastic Trace Estimator Consider the traditional stochastic trace estimator Tr[Base] = (1/n) \sum_{j=1}^{n} z_j^T A z_j with each z_j &#8764; N(0, I), with probe variables shared across elements of the sum.</p><p>Consider the quadratic form Q := z^T A z, which for Gaussian random variables has a cumulant generating function of K_Q(t) = -(1/2) log det(I - 2tA).</p><p>From the generating function we can derive the mean and variance of this estimator: E[Q] = Tr(A) and Var[Q] = 2 Tr(A^2). Since Tr[Base] is a sum of n independent random draws of Q, we see E[Tr[Base]] = Tr(A) and Var[Tr[Base]] = 2 Tr(A^2)/n.</p><p>Doubly Stochastic Trace Estimator For the doubly stochastic estimator applied to A = \sum_{i=1}^{m} A_i, we choose probe variables which are sampled independently for each element of the sum: Tr[DS] = (1/n) \sum_{j=1}^{n} \sum_{i=1}^{m} z_{ij}^T A_i z_{ij}. Separating out the elements of the sum, we can write the estimator as a sum of independent quadratic forms Q_i := z^T A_i z. Taking derivatives of the corresponding generating functions, we find that E[Tr[DS]] = Tr(A) and Var[Tr[DS]] = (2/n) \sum_{i=1}^{m} Tr(A_i^2).</p><p>As the error of the estimator can be bounded by the square root of the variance, the error for Tr[Base] is O(1/√n) (even when applied to sum structures), whereas, assuming bounded moments on the A_i, the relative error for Tr[DS] is O(1/√(nm)), a significant asymptotic variance reduction.</p><p>The related stochastic diagonal estimator Diag[DS] = (1/n) \sum_j \sum_i z_{ij} &#8857; (A_i z_{ij}) achieves the same O(1/√(nm)) convergence rate, though we omit this derivation for brevity as it follows the same steps.</p><p>In Figure <ref type="figure">5</ref> we show empirically how our doubly stochastic diagonal estimator outperforms the standard Hutchinson estimator.</p><p>Figure 6: For two different linear algebra operations A_&#952;^{-1} b and log |A_&#952;|, we show the runtime and peak memory utilization required to compute the derivatives as we increase the size of the problem. In all plots, we compare CoLA's autograd rules against the autograd default of backpropagating through each iteration of the solver (unrolled autodiff). 
Notably, using the custom autograd rules allows us to save substantial memory and runtime when performing the backwards pass.</p></div>
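The variance reduction from resampling probes per summand (Section B.1) can be observed in a small NumPy simulation. Everything here is an illustrative toy: the summands are random matrices of similar scale, and the sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, n = 20, 10, 400

# A sum structure A = sum_i A_i with symmetric terms of similar scale.
As = [np.eye(d) + 0.05 * (G + G.T)
      for G in (rng.standard_normal((d, d)) for _ in range(m))]
A = sum(As)
true_trace = np.trace(A)

def base_estimate(rng):
    z = rng.standard_normal(d)       # one probe shared across the whole sum
    return z @ A @ z

def doubly_stochastic_estimate(rng):
    total = 0.0
    for Ai in As:
        z = rng.standard_normal(d)   # fresh probe for each term of the sum
        total += z @ Ai @ z
    return total

base = np.array([base_estimate(rng) for _ in range(n)])
ds = np.array([doubly_stochastic_estimate(rng) for _ in range(n)])

# Both estimators are unbiased, but resampling per term cuts the variance.
assert abs(base.mean() - true_trace) < 20.0
assert abs(ds.mean() - true_trace) < 20.0
assert ds.var() < base.var()
```

This matches the analysis above: the shared-probe variance scales with Tr(A^2), while the doubly stochastic variance scales with the much smaller sum of Tr(A_i^2).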
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Autograd rules for iterative algorithms</head><p>For machine learning applications, we want to seamlessly interweave linear algebra operations with automatic differentiation. The most basic strategy is to simply let the autograd engine trace through the operations and backpropagate accordingly. However, when using iterative methods like conjugate gradients or Lanczos, this naive approach is extremely memory inefficient and, for problems with many iterations, the cost can be prohibitive (as seen in Figure <ref type="figure">6</ref>). However, the linear algebra operations corresponding to inverses, eigendecomposition and trace estimation have simple closed form derivatives which we can implement to avoid the prohibitive memory consumption and reduce runtime.</p><p>Simply put, for an operation like f = CGSolve, with CGSolve(A, b) = A^{-1}b, we must define the vector-Jacobian products v^T &#8706;f/&#8706;A and v^T &#8706;f/&#8706;b. However, for matrix-free linear operators, we cannot afford to store the dense matrix A, and thus neither can we store the gradients with respect to each of its elements! Instead we must (recursively) consider how the linear operator was constructed in terms of its differentiable arguments. In other words, we must flatten the tree structure of possibly nested differentiable arguments into a vector: &#952; = flatten(A).</p><p>From this perspective, we consider A as a container or tree of its arguments &#952;, and define v^T &#8706;f/&#8706;A := unflatten[v^T &#8706;f/&#8706;&#952;], which coincides with the usual definition for dense matrices. Applying this to inverses, we can now write a simple VJP: v^T &#8706;f/&#8706;&#952; = -v^T (A^{-1})^T (&#8706;_&#952; A_&#952;) A^{-1} b, and we will adopt this notation below for brevity. Doing so gives a memory cost which is constant in the number of solver iterations, and proportional to the memory used in the forward pass. 
Below we list the autograd rules for some of the iterative routines that we implement in CoLA with their VJP definitions.</p><p>In Figure <ref type="figure">6</ref> we show the practical benefits of our autograd rules. We take gradients of different linear solves A_&#952;^{-1} b computed using conjugate gradients (CG), where each solve required an increasing number of CG iterations.</p></div>
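The closed-form VJP for a linear solve can be sketched as a PyTorch custom autograd function. This is a sketch, not CoLA's implementation: `torch.linalg.solve` stands in for an iterative solver like CG, and only the solution (not the solver's iterates) is saved for the backward pass, which is exactly the source of the memory savings.

```python
import torch

class MemEfficientSolve(torch.autograd.Function):
    """f(A, b) = A^{-1} b with a closed-form backward pass."""

    @staticmethod
    def forward(ctx, A, b):
        x = torch.linalg.solve(A, b)
        ctx.save_for_backward(A, x)  # constant memory in solver iterations
        return x

    @staticmethod
    def backward(ctx, v):
        A, x = ctx.saved_tensors
        grad_b = torch.linalg.solve(A.T, v)  # A^{-T} v
        grad_A = -torch.outer(grad_b, x)     # -(A^{-T} v) x^T
        return grad_A, grad_b

torch.manual_seed(0)
M = torch.randn(5, 5, dtype=torch.float64)
A = (M @ M.T + 5.0 * torch.eye(5, dtype=torch.float64)).requires_grad_(True)
b = torch.randn(5, dtype=torch.float64, requires_grad=True)

# gradcheck compares the closed-form VJP against numerical differentiation.
assert torch.autograd.gradcheck(MemEfficientSolve.apply, (A, b))
```

In CoLA the dense `grad_A` is never materialized; instead the VJP is pushed through the flattened arguments θ of the operator, as described above.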
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Algorithmic Details</head><p>In this section we expand upon three different points introduced in the main paper. For the first point, we argue why SVRG leads to gradients with reduced variance. For the second point, we display all the iterative methods that we use as base algorithms in CoLA. Finally, for the third point, we expand upon CoLA's strategy for dealing with the different numerical precisions that we support.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 SVRG</head><p>In its simplest form, SVRG <ref type="bibr">[23]</ref> performs gradient descent with the variance-reduced gradient w &#8592; w - &#951;(g_i(w) - g_i(w_0) + g(w_0)), where g_i represents the stochastic gradient evaluated at only a single element or minibatch of the sum, and g(w_0) is the full batch gradient evaluated at the anchor point w_0, which is recomputed at the end of each epoch with an updated anchor.</p><p>With different loss functions, we can use this update rule to solve symmetric or non-symmetric linear systems, to compute the top eigenvectors, or even to find the nullspace of a matrix. Despite the fact that the corresponding objectives are not strongly convex in the last two cases, it has been shown that running such noisy optimization algorithms can still recover the desired solutions.</p><p>Numerical precision also matters when solving linear algebra problems, where round-off error can lead us to poor solutions. Thus, it is an active area of research in NLA to derive routines which utilize lower precisions than float64, or that mix precisions, in order to achieve better runtimes without a complete degradation of the quality of the solution.</p><p>In CoLA we take a two-pronged approach to deal with lower precisions in our NLA routines. First, we incorporate additional variants of well-known algorithms that propagate less round-off error at the expense of requiring more computation, as seen in Figure <ref type="figure">7</ref>. Second, we integrate novel variants of algorithms that are designed to be used in lower precisions, such as the CG modification found in Maddox et al. <ref type="bibr">[32]</ref>. We now discuss the first approach.</p><p>As discussed in Section C.2, there are two algorithms that are key for eigendecompositions. The first is Arnoldi (applicable to any operator), and the second is Lanczos (for symmetric operators), where Lanczos can be viewed as a simplified version of Arnoldi. 
Central to these algorithms is the use of an orthogonalization step, which is well known to be a source of numerical instability. One approach to aggressively ameliorate the propagation of round-off error during orthogonalization is to use Householder projectors, which is the strategy that we use in CoLA. Given a unit vector u, a Householder projector (or Householder reflector) is defined as the operator R = I - 2uu^*. When applied to a vector x, the result Rx is a reflection of x across the hyperplane orthogonal to u. To easily visualize this, suppose that x &#8712; R^2 and u = e_1. Hence, Rx = (I - 2e_1e_1^T)x = (-x_1, x_2)^T, which is exactly the reflection of the vector across the axis generated by e_2. Most notably, R is unitary (RR^* = I), which can be easily verified from the definition. Being unitary is crucial, as under the usual round-off error model, applying R to another matrix A does not worsen the already accumulated error E. Mathematically, &#8741;R(A + E) - RA&#8741; = &#8741;RE&#8741; = &#8741;E&#8741;, where the last equality results from basic properties of unitary matrices. We use Arnoldi as an example of how Householder projectors are used during orthogonalization. In Figure <ref type="figure">7</ref> we show two different variants of Arnoldi present in CoLA. The implementations are notably different, and it is easy to see that Algorithm 2 is more expensive than Algorithm 1. First, note that for Algorithm 2 we have two for loops (line 6 and line 8), whereas for Algorithm 1 we only have one (lines 4-6). Moreover, the two for loops in Algorithm 2 require more flops than the single for loop in Algorithm 1. Note that we do not always favor the more expensive but robust implementation of an algorithm, as in some cases, like when running GMRES, the round-off error is not as impactful to the quality of the solution, and shorter runtimes are more desirable.</p></div>
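The two key properties of the Householder reflector used above, unitarity and the reflection behavior, can be verified in a few lines of NumPy; this is an illustrative check, not CoLA's orthogonalization routine.

```python
import numpy as np

rng = np.random.default_rng(6)
u = rng.standard_normal(5)
u = u / np.linalg.norm(u)              # unit vector
R = np.eye(5) - 2.0 * np.outer(u, u)   # Householder reflector R = I - 2uu^T

# R is unitary (orthogonal in the real case) ...
assert np.allclose(R @ R.T, np.eye(5))
# ... so applying it cannot amplify an accumulated error E: ||RE|| = ||E||.
E = 1e-8 * rng.standard_normal((5, 5))
assert np.isclose(np.linalg.norm(R @ E), np.linalg.norm(E))

# The 2-D picture from the text: u = e1 reflects across the e2 axis.
R2 = np.eye(2) - 2.0 * np.outer([1.0, 0.0], [1.0, 0.0])
assert np.allclose(R2 @ np.array([3.0, 4.0]), [-3.0, 4.0])
```

It is exactly the norm-preservation property that makes Householder-based orthogonalization attractive in half precision.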
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Experimental Details</head><p>In this section we expand upon the details of all the experiments run in the paper. Such details include the datasets that were used, the hyperparameters of the different algorithms, and the specific choices of algorithms used both for CoLA and for the alternatives. We run each of the experiments 3 times and compute the mean, dropping the first observation (as the first run usually contains some compilation overhead). We do not display the standard deviations as those numbers are imperceptible at the scale of each experiment. In terms of hardware, the CPU experiments were run on an Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz and the GPU experiments were run on an NVIDIA GeForce RTX 2080 Ti.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1 Datasets</head><p>Below we enumerate the datasets that we used in the various applications. Most of the datasets are sourced from the University of California at Irvine (UCI) Machine Learning Repository, which can be found here: <ref type="url">https://archive.ics.uci.edu/ml/datasets.php</ref>. Also, a community repo hosting these UCI benchmarks can be found here: <ref type="url">https://github.com/treforevans/uci_datasets</ref> (we have no affiliation).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>1.</head><p>Elevators. This dataset is a modified version of the Ailerons dataset, where the goal is to predict the control action on the ailerons of the aircraft. This UCI dataset consists of N = 14K observations and has D = 18 dimensions.</p><p>Algorithm 1 (Arnoldi iteration). Inputs: A, q_0 = &#957;_0/&#8741;&#957;_0&#8741; where possibly &#957;_0 &#8764; N(0, I), maximum number of iterations T, and tolerance &#1013; &#8712; (0, 1). For j = 0 to T - 1: set &#957;_{j+1} &#8592; Aq_j; for i = 0 to j: set h_{i,j} &#8592; q_i^T &#957;_{j+1} and &#957;_{j+1} &#8592; &#957;_{j+1} - h_{i,j} q_i; set h_{j+1,j} &#8592; &#8741;&#957;_{j+1}&#8741;; if h_{j+1,j} &lt; &#1013; then stop, else q_{j+1} &#8592; &#957;_{j+1}/h_{j+1,j}. Return H, Q = (q_0 | . . . | q_T). (Algorithm 2, the Householder variant, instead builds each column as h_j = R_j &#957;_j from Householder reflectors constructed by GET_HOUSEHOLDER_VEC(w, k), which forms a reflector from the trailing entries of w, with u_i = 0 for i &lt; k.)</p><p>2. Kin40K. The full name of this UCI dataset is Statlog (Shuttle) Data Set. This dataset contains information about NASA shuttle flights, and we used a subset that consists of N = 40K observations and has D = 8 dimensions.</p><p>3. Buzz. The full name of this UCI dataset is Buzz in social media. This dataset consists of examples of buzz events from Twitter and Tom's Hardware. We used a subset consisting of N = 430K observations with D = 77 dimensions.</p><p>4. Song. The full name of this UCI dataset is YearPredictionMSD. This dataset consists of N = 386.5K observations and has D = 90 audio features, such as 12 timbre average features and 78 timbre covariance features.</p><p>5. cit-HepPh. This dataset is based on arXiv's HEP-PH (high energy physics phenomenology) citation graph and can be found here: <ref type="url">https://snap.stanford.edu/data/cit-HepPh.html</ref>. 
The dataset covers all the citations from January 1993 to April 2003 of |V | = 34, 549 papers, ultimately containing |E| = 421, 578 directed edges.</p><p>The notion of relationship that we used in our spectral clustering experiment creates a connection between two papers when at least one cites another (undirected symmetric graph). Therefore the dataset that we used has the same number of nodes but instead |E| = 841, 798 undirected edges.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Compositional experiments</head><p>This section pertains to the experiments of Section 3.2 displayed in Figure <ref type="figure">1</ref>. We now elaborate on each of Figure <ref type="figure">1</ref>'s panels.</p><p>(a) The multi-task GP problem exploits the structure of the Kronecker operator K_T &#8855; K_X, where K_T is a kernel matrix containing the correlations between the tasks and K_X is an RBF kernel on the data. For this experiment, we used a synthetic Gaussian dataset where the train data x_i &#8764; N(0, I_D), with dimension D = 33, N = 1K, and T = 11 tasks (where the number of tasks sets the size of K_T). We used conjugate gradients (CG) as the iterative method, with a tolerance of 10^{-6} and a maximum number of iterations of 1K. We used the exact same hyperparameters for CoLA.</p><p>(b) For the Bi-Poisson problem we set the maximum grid to N = 1000^2. Since this PDE problem involves solving a symmetric linear system, we used CG as the iterative method with a tolerance of 10^{-11} and a maximum number of iterations of 10K. The same parameters apply for CoLA. We note that PDE problems are usually solved to tighter tolerances, as the numerical error compounds as we advance the PDE.</p><p>(c) For the EMLP experiment we consider solving the equivariance constraints to find the equivariant linear layers of a graph neural network with 5 nodes. To solve this problem, we need to find the nullspace of a large structured constraint matrix. We use the uniform channel allocation heuristic from <ref type="bibr">[16]</ref>, which distributes the N channels across tensors of different orders. We consider our approach, which exploits the block diagonal structure by separating the nullspaces into blocks, as opposed to the direct iterative approach exploiting only the fast MVMs of the constraint matrix. We use a tolerance of 10^{-5}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 Sum structure experiments</head><p>This section pertains to the experiments of Section 3.3 contained in Figure <ref type="figure">2</ref>. We now elaborate on each of Figure <ref type="figure">2</ref>'s panels.</p><p>(a) In this experiment we computed the first principal component of the Buzz dataset. For the iterative method we used power iteration with a maximum number of iterations of 300 and a stopping tolerance of 10^{-7}. CoLA used SVRG with the same stopping tolerance and maximum number of iterations. Additionally, we set SVRG's batch size to 10K and the learning rate to 0.0008. We note that a single power iteration contains roughly 43/2 = 21.5 times more MVMs than a single iteration of SVRG. In this particular case, the length of the sum is given by the number of observations, and therefore SVRG uses 430/10 = 43 times fewer elements per iteration, where 10 comes from the 10K batch size. Finally, the 2 is explained by noting that SVRG incurs a full sum update on every epoch.</p><p>(b) In this experiment we trained a GP by estimating the RBF covariance kernel with J = 1K random Fourier features (RFFs). The hyperparameters for the RBF kernel are the following: length scale (&#8467; = 0.1), output scale (a = 1), and likelihood noise (&#963;^2 = 0.1). Moreover, we used CG as the iterative solver with a tolerance of 10^{-8} and 100 as the maximum number of iterations (convergence took far fewer iterations than the maximum). For SVRG we used the same tolerance but set the maximum number of iterations to 10K, a batch size of 100, and a learning rate of 0.004. We note that a single CG iteration contains roughly 10/2 = 5 times more MVMs than a single iteration of SVRG. 
In this particular case, the length of the sum is given by the number of RFFs, and therefore SVRG uses 1000/100 = 10 times fewer elements per iteration, where 100 comes from the batch size.</p><p>(c) In this experiment we implemented the Neural-IVP method from Finzi et al. <ref type="bibr">[17]</ref>. We consider the time evolution of a wave equation in two spatial dimensions. At each integrator step, a linear system M(&#952;) d&#952;/dt = F(&#952;) must be solved to find d&#952;/dt, for a d = 12K &#215; 12K dimensional matrix. While Finzi et al. <ref type="bibr">[17]</ref> use conjugate gradients to solve the linear system, we demonstrate the advantages of using SVRG, as M(&#952;) = (1/m) \sum_{i=1}^{m} M_i(&#952;) is a sum over the evaluations at m = 50K distinct sample locations within the domain. In this experiment we use a batch size of 500 for SVRG, and employ rank 250 randomized Nystr&#246;m preconditioning for both SVRG and the iterative CG baseline.</p></div>
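The SVRG update rule from Section C.1 can be sketched on a toy least-squares sum structure; the problem size, step size, and batch size below are illustrative choices, not those of the experiments above.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 1000, 10
X = rng.standard_normal((N, D))
w_true = rng.standard_normal(D)
y = X @ w_true  # realizable least-squares problem, i.e. a linear system

def full_grad(w):
    return X.T @ (X @ w - y) / N

def batch_grad(w, idx):
    Xb = X[idx]
    return Xb.T @ (Xb @ w - y[idx]) / len(idx)

# SVRG: w <- w - eta * (g_i(w) - g_i(w0) + g(w0)), anchor refreshed per epoch.
w = np.zeros(D)
eta, batch = 0.1, 50
for epoch in range(50):
    w0 = w.copy()
    g0 = full_grad(w0)  # full batch gradient at the anchor point
    for _ in range(N // batch):
        idx = rng.choice(N, size=batch, replace=False)
        w = w - eta * (batch_grad(w, idx) - batch_grad(w0, idx) + g0)

assert np.linalg.norm(w - w_true) < 1e-3
```

Each inner step touches only one minibatch of the sum, which is the source of the MVM savings counted in panels (a) to (c) above.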
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.4 Hardware speed-up comparisons</head><p>This section pertains to the experiments of Figure <ref type="figure">3</ref>. For all these experiments we computed the runtime reduction as the ratio between the time that it takes CoLA to run some linear algebra operation and the time that PyTorch takes using the same hardware. As an example, assume that PyTorch takes 200 seconds to compute a solve using a CPU and 100 seconds to compute the same solve using a GPU. Moreover, assume that CoLA's iterative algorithm takes 100 seconds to compute the same solve on a CPU and 40 seconds on a GPU. Then the runtime reduction would be 100/200 = 0.5 for the CPU column and 40/100 = 0.4 for the GPU column.</p><p>1. Solves. In this experiment we calculated the runtime reduction when running torch.linalg.solve on the Trefethen N = 20K matrix market sparse operator. In this experiment, CG was run with a tolerance of 10^{-11} and a maximum number of iterations equal to the operator size.</p><p>For the minimal surface experiment, each inner step involves a linear solve of a non-symmetric operator. We compare against SciPy's GMRES implementation as well as JAX's integrated version of SciPy. The main difference between the two is that SciPy calls the fast and highly-optimized ARPACK library, whereas SciPy (JAX) has its own Python implementation of GMRES which only uses JAX's primitives (just as is done in CoLA). The tolerance for this experiment was 5e-3. We see that CoLA's GMRES implementation is competitive with SciPy (JAX) but does not yet beat ARPACK, mostly due to the faster runtime of a lower-level GMRES implementation.</p></div></body>
		</text>
</TEI>
